On the Resilience of Conjugate Gradient and Multigrid Methods to Node Failures
In this paper, we examine the inherent resilience of multigrid (MG) and conjugate gradient (CG) methods in the search for algorithm-based approaches to deal with node failures in large parallel HPC systems. In previous work, silent data corruption has been modeled as the perturbation of values in the work arrays of a MG solver. It was concluded that MG recovers fast from errors of this type. We explore how fast MG and CG methods recover from the loss of a contiguous section of their working memory, modeling a node failure. Since MG and CG methods differ in their convergence rates, we propose a methodology to compare their resilience: Time is represented as a fraction of the iterations required to reach a certain target precision, and failures are introduced when the residual norm reaches a certain threshold. We use the two solvers on a linear system that represents a model elliptic partial differential equation, and we experimentally evaluate the overhead caused by the introduced faults. Additionally, we observe the behavior of the conjugate gradient solver under node failures for additional test problems. Approximating the lost values of the solution using interpolation reduces the overhead for MG, but the effect on the CG solver is minimal. We conclude that the methods also have the inherent ability to recover from node failures. However, we illustrate that the relative overhead caused by node failures is significant.
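The comparison methodology described in the abstract can be illustrated with a small sketch. This is not the authors' code or experimental setup. The 2D Poisson matrix, the random right-hand side, the thresholds, the zero-fill of the lost entries, and the restart of CG after the failure are all assumptions made for illustration, and the names `poisson_2d` and `cg_with_failure` are hypothetical.

```python
import numpy as np

def poisson_2d(m):
    """Dense 5-point Laplacian on an m x m grid (assumed model elliptic problem)."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    I = np.eye(m)
    return np.kron(I, T) + np.kron(T, I)

def cg_with_failure(A, b, tol=1e-8, fail_at=None, lost=slice(0, 0), max_iter=10_000):
    """Plain CG. If fail_at is given, the entries x[lost] are wiped once, the
    first time the relative residual drops to fail_at or below.
    Returns (x, number of iterations)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    failed = fail_at is None  # no failure requested: treat it as already handled
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        rel = np.sqrt(rs_new) / b_norm
        if rel <= tol:
            return x, k
        if not failed and rel <= fail_at:
            # Simulated node failure: a contiguous block of the iterate is lost.
            # Assumption: lost entries are zero-filled and CG is restarted.
            x[lost] = 0.0
            r = b - A @ x
            p = r.copy()
            rs = r @ r
            failed = True
            continue
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

m = 16
A = poisson_2d(m)
b = np.random.default_rng(0).standard_normal(m * m)

# Failure-free reference run: its iteration count defines the unit of time.
x_ref, it_ref = cg_with_failure(A, b)

# Run that loses the first quarter of the iterate at relative residual 1e-4.
x_fail, it_fail = cg_with_failure(A, b, fail_at=1e-4, lost=slice(0, m * m // 4))

# Relative overhead, as a fraction of the failure-free iteration count.
overhead = (it_fail - it_ref) / it_ref
print(f"reference: {it_ref} iterations, with failure: {it_fail}, overhead: {overhead:.2f}")
```

Expressing the overhead as a fraction of the failure-free iteration count is what makes solvers with different convergence rates comparable. Replacing the zero-fill with values interpolated from surviving neighbors would correspond to the interpolation-based recovery mentioned in the abstract.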
- Pachajoa, Carlos
- Gansterer, Wilfried
Category | Paper in Conference Proceedings or in Workshop Proceedings (Paper)
Event Title | Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28 - September 1, 2017, Proceedings
Divisions | Theory and Applications of Algorithms
Subjects | Computer Science (General); Parallel Data Processing
Event Location | Santiago de Compostela, Spain
Event Type | Workshop
Event Dates | August 28 - September 1
Date | 28 August 2017