Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method

Abstract

As computers reach exascale and beyond, the incidence of faults will increase, and solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Reducing the frequency at which redundant information is stored lowers the runtime overhead. However, after a node failure, the solver must then restart from the last iteration for which redundant information was stored, which increases the recovery overhead. This formulation highlights the method's similarities to checkpoint-restart (CR). Thus, this method, which we call ESR with periodic storage (ESRP), can be considered a form of algorithm-based checkpoint-restart: the state is stored implicitly, by exploiting redundancy inherent to the algorithm, rather than explicitly as in CR. We also reduce the amount of data to be stored and retrieved compared to CR, but additional computation is required to reconstruct the solver's state. In this paper, we describe the modifications needed to convert ESR into ESRP and evaluate the method experimentally, comparing ESRP with the existing ESR method and with application-level in-memory CR. Our results confirm that the overhead of ESR is reduced significantly, both in the failure-free case and when node failures are introduced. In the failure-free case, the overhead of ESRP is usually lower than that of CR; however, CR is faster when node failures occur. We claim that these differences can be alleviated by the implementation of more appropriate preconditioners.
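
The trade-off described in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes, purely for illustration, that redundant information is stored at every iteration that is a multiple of a storage period, and the function names `last_stored_iteration` and `iterations_to_redo` are hypothetical. It shows only how many iterations would have to be repeated after a failure, not how the solver state is reconstructed.

```python
def last_stored_iteration(failure_iter: int, period: int) -> int:
    """Most recent iteration at or before failure_iter with stored redundancy.

    Assumption (illustrative only): storage happens at iterations
    0, period, 2 * period, ...
    """
    if period < 1 or failure_iter < 0:
        raise ValueError("period must be >= 1 and failure_iter >= 0")
    return (failure_iter // period) * period


def iterations_to_redo(failure_iter: int, period: int) -> int:
    """Iterations repeated after rolling back to the last stored iteration."""
    return failure_iter - last_stored_iteration(failure_iter, period)


if __name__ == "__main__":
    # Storing less often (larger period) means more repeated work on failure.
    for period in (1, 20, 50):
        print(period, iterations_to_redo(57, period))
```

With a period of 1 (redundant information stored in every iteration), no iterations are repeated; larger periods trade lower failure-free overhead for more repeated work after a failure.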

Authors
  • Pachajoa, Carlos
  • Pacher, Christina
  • Levonyak, Markus
  • Gansterer, Wilfried
Shortfacts
Category: Paper in Conference Proceedings or in Workshop Proceedings (Paper)
Event Title: 49th International Conference on Parallel Processing (ICPP 2020)
Divisions: Theory and Applications of Algorithms
Subjects: Parallel Data Processing
Event Location: Edmonton, Alberta, Canada
Event Type: Conference
Event Dates: 17-20 Aug 2020
Series Name: Proceedings of the 49th International Conference on Parallel Processing
Date: 17 August 2020