Node-Failure-Resistant Preconditioned Conjugate Gradient Method without Replacement Nodes

Abstract

As HPC systems grow in scale to meet increased computational demands, the incidence of faults in a given window of time is expected to grow. The scientific community addresses this issue with research on solutions in every computational layer. In this paper, we explore strategies for fault tolerance at the algorithmic level. We propose a node-failure-tolerant preconditioned conjugate gradient method that efficiently recovers from node failures without the use of extra spare nodes, i.e., without any overhead in terms of available hardware. For purposes of load balancing, we redistribute the surviving and reconstructed solver data. The objective is to reconstruct the system either as it was before the node failure or as an equivalent, permuted version, and then to continue the execution of the solver only on the surviving nodes. In our experimental evaluations, the recovery stage of the solver typically takes around 10% or less of the solver runtime, including the time to retrieve the problem-defining static data from the hard disk. When using a suitable preconditioner, the average solver runtime overhead is 3.5% over that of a resilient solver that uses a replacement node. We investigate the influence of the preconditioner on a trade-off between load balancing and communication cost in the recovery phase. The obtained solutions are correct, and our method is thus a feasible way to recover from a node failure and continue the execution of the solver only on the surviving nodes.
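
The redistribution step mentioned in the abstract can be illustrated with a small sketch. It is not the authors' scheme: it assumes a contiguous block-row distribution of the linear system, and the function names `balanced_partition` and `repartition_after_failure` are hypothetical. It only shows how row ownership could be rebalanced over the surviving nodes once one node is lost; it does not reconstruct lost solver data or apply a permutation.

```python
def balanced_partition(n_rows, n_nodes):
    """Split row indices 0..n_rows-1 into n_nodes contiguous, near-equal blocks."""
    base, extra = divmod(n_rows, n_nodes)
    blocks, start = [], 0
    for rank in range(n_nodes):
        # The first `extra` blocks get one additional row each.
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def repartition_after_failure(n_rows, n_nodes, failed_rank):
    """Map each surviving rank to its new row block (assumed scheme, no spare node)."""
    survivors = [r for r in range(n_nodes) if r != failed_rank]
    new_blocks = balanced_partition(n_rows, len(survivors))
    return dict(zip(survivors, new_blocks))


# Example: 12 rows on 4 nodes; node 1 fails and the 3 survivors share all rows.
old_blocks = balanced_partition(12, 4)
new_blocks = repartition_after_failure(12, 4, failed_rank=1)
for rank, rows in new_blocks.items():
    print(rank, list(rows))
```

In this toy setting each survivor ends up owning four rows instead of three, which reflects the load-balancing side of the trade-off the abstract describes; moving those rows between nodes is the communication cost on the other side.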

Authors
  • Pachajoa, Carlos
  • Pacher, Christina
  • Gansterer, Wilfried
Shortfacts
Category
Paper in Conference Proceedings or in Workshop Proceedings (Paper)
Event Title
2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
Divisions
Theory and Applications of Algorithms
Subjects
Computer Science (General)
Parallel Data Processing
Event Location
Denver, CO, USA
Event Type
Workshop
Event Dates
22 Nov. 2019
Series Name
FTXS 2019: Fault Tolerance for HPC at eXtreme Scale Workshop
Page Range
pp. 31-40
Date
2019