A New Exact State Reconstruction Strategy for Conjugate Gradient Methods with Arbitrary Preconditioners
With growing numbers of nodes in large-scale parallel computers, the likelihood of unanticipated node failures increases. Furthermore, global reduction operations become major bottlenecks due to their limited parallel scalability. The Preconditioned Conjugate Gradient (PCG) method, an important iterative solver for large sparse linear systems, faces both challenges. The negative impact of global reduction operations on scalability can be reduced by using a preconditioner that significantly reduces the number of iterations, by overlapping communication with computation (communication-hiding variants of PCG), or by reducing synchronization points (communication-avoiding variants of PCG). However, efficient algorithm-based resilience to unanticipated node failures that does not impact the convergence of the solver has so far been studied only for a single scalable variant of PCG, but not for arbitrary preconditioners. To address both challenges in combination, we present variants of standard PCG and communication-hiding PCG that are resilient to node failures. By exploiting algorithm-specific properties of PCG, the overhead of storing redundant information during the failure-free phase can be made very small. Efficient recovery from multiple node failures is based on adapting an exact state reconstruction (ESR) strategy. Existing ESR strategies are not applicable to all preconditioners because they require the preconditioner matrix to be explicitly available. We extend the ESR approach to work efficiently with arbitrary preconditioners for both standard PCG and communication-hiding PCG methods. Experiments on the Vienna Scientific Cluster (VSC) illustrate very low runtime overheads compared to the non-resilient methods.
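For orientation, the following is a minimal serial sketch of textbook PCG in Python with NumPy. It is not the resilient or communication-hiding variant presented in the paper, and it contains no ESR logic. The function name `pcg` and the callback `apply_prec` are assumptions made for illustration; the callback stands in for an arbitrary preconditioner that is available only as an operation, not as an explicit matrix. The comments mark the inner products that would become global reductions in a distributed-memory run.

```python
import numpy as np

def pcg(A, b, apply_prec, tol=1e-10, max_iter=1000):
    """Textbook PCG for a symmetric positive definite matrix A.

    apply_prec(r) is a hypothetical callback that applies the
    preconditioner to a residual vector; no preconditioner matrix is needed.
    Returns the approximate solution and the number of iterations performed.
    """
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                # initial residual
    z = apply_prec(r)            # preconditioned residual
    p = z.copy()                 # initial search direction
    rz = r @ z                   # inner product: a global reduction when distributed
    b_norm = np.linalg.norm(b)
    for k in range(max_iter):
        # Convergence check on the relative residual norm (another reduction).
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
        Ap = A @ p
        alpha = rz / (p @ Ap)    # inner product: global reduction
        x += alpha * p
        r -= alpha * Ap
        z = apply_prec(r)
        rz_new = r @ z           # inner product: global reduction
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x, max_iter
```

As a usage example under the same assumptions, a Jacobi preconditioner for a matrix `A` can be passed as `lambda r: r / np.diag(A)`. Each iteration synchronizes all processes at the inner products, which is the bottleneck that communication-hiding and communication-avoiding variants target.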
- Mayer, Viktoria
- Gansterer, Wilfried
Category: Paper in Conference Proceedings or in Workshop Proceedings (Poster)
Event Title: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Divisions: Theory and Applications of Algorithms
Subjects: Parallel Data Processing
Event Location: San Francisco
Event Type: Workshop
Event Dates: May 27-31, 2024
Date: May 2024