On the Resilience of Conjugate Gradient and Multigrid Methods to Node Failures

Abstract

In this paper, we examine the inherent resilience of multigrid (MG) and conjugate gradient (CG) methods in the search for algorithm-based approaches to deal with node failures in large parallel HPC systems. In previous work, silent data corruption has been modeled as the perturbation of values in the work arrays of a MG solver. It was concluded that MG recovers fast from errors of this type. We explore how fast MG and CG methods recover from the loss of a contiguous section of their working memory, modeling a node failure. Since MG and CG methods differ in their convergence rates, we propose a methodology to compare their resilience: Time is represented as a fraction of the iterations required to reach a certain target precision, and failures are introduced when the residual norm reaches a certain threshold. We use the two solvers on a linear system that represents a model elliptic partial differential equation, and we experimentally evaluate the overhead caused by the introduced faults. Additionally, we observe the behavior of the conjugate gradient solver under node failures for additional test problems. Approximating the lost values of the solution using interpolation reduces the overhead for MG, but the effect on the CG solver is minimal. We conclude that the methods also have the inherent ability to recover from node failures. However, we illustrate that the relative overhead caused by node failures is significant.
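
The failure model described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and every specific in it is an assumption made for illustration: the 1D model operator tridiag(-1, 2 + SHIFT, -1), the problem size, the failure threshold, the position of the lost block, and the recovery strategy (zero the lost entries of the iterate, recompute the residual, and restart CG). The function names are hypothetical. The sketch runs CG once without faults and once with a single simulated node failure that is triggered when the relative residual norm reaches a threshold, then reports the overhead as a fraction of the fault-free iteration count.

```python
import math

SHIFT = 0.1  # assumed reaction term; keeps the toy system well conditioned

def apply_operator(v):
    """Matrix-free product with tridiag(-1, 2 + SHIFT, -1) (assumed model problem)."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append((2.0 + SHIFT) * v[i] - left - right)
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def true_residual(b, x):
    return [bi - ai for bi, ai in zip(b, apply_operator(x))]

def cg_with_failure(b, tol=1e-8, fail_at=None, lost=None, max_iter=10000):
    """Plain CG from x = 0. If fail_at is given, the entries x[lost[0]:lost[1]]
    are wiped once, at the first iteration whose relative residual is <= fail_at.
    Returns (x, number of iterations performed)."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    b_norm = math.sqrt(dot(b, b))
    failure_pending = fail_at is not None
    for k in range(1, max_iter + 1):
        Ap = apply_operator(p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rr_new = dot(r, r)
        rel = math.sqrt(rr_new) / b_norm
        if rel <= tol:
            return x, k
        if failure_pending and rel <= fail_at:
            # Simulated node failure: a contiguous block of the iterate is lost.
            lo, hi = lost
            for i in range(lo, hi):
                x[i] = 0.0
            # Assumed recovery: recompute the residual and restart CG from x.
            r = true_residual(b, x)
            p = list(r)
            rr = dot(r, r)
            failure_pending = False
            continue
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter

n = 200
b = [1.0] * n
x_base, base_iters = cg_with_failure(b)
x_fault, faulty_iters = cg_with_failure(b, fail_at=1e-4, lost=(80, 120))

# Overhead expressed as a fraction of the fault-free iteration count.
overhead = (faulty_iters - base_iters) / base_iters
print(base_iters, faulty_iters, overhead)
```

Measuring the overhead relative to the fault-free iteration count follows the abstract's idea of representing time as a fraction of the iterations needed to reach the target precision, which makes solvers with different convergence rates comparable. Replacing the zero fill with an interpolation of the lost entries from their surviving neighbors would be the natural next variation to try.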

Authors
  • Pachajoa, Carlos
  • Gansterer, Wilfried
Shortfacts
Category
Paper in Conference Proceedings or in Workshop Proceedings (Paper)
Event Title
Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28 - September 1, 2017, Proceedings
Divisions
Theory and Applications of Algorithms
Subjects
Computer Science (General)
Parallel Data Processing
Event Location
Santiago de Compostela, Spain
Event Type
Workshop
Event Dates
August 28 - September 1
Date
28 August 2017