Coordinated Checkpointing with Sender-based Logging and Asynchronous Recovery from Failure

Authors

  • Alexey A. Bondarenko
  • Pavel A. Lyakhov
  • Mikhail V. Yakobovskiy

Abstract

The growing number of components in supercomputers leads HPC specialists to unfavorable estimates for future systems: “the range of the mean time between failures will be from 1 hour to 9 hours.” Such failure rates make long-running calculations on supercomputers problematic. In this paper, we propose a failure recovery method that does not require all processes to roll back. This method can reduce overhead costs for some computational algorithms. The standard fault tolerance method consists of two phases: coordinated checkpointing and, in the case of a failure, rollback of all processes to the last checkpoint. The proposed method combines coordinated checkpointing with sender-based logging and asynchronous recovery, in which most processes wait while several processes recalculate the lost data. We developed parallel programs that solve the problem of heat transfer in a thin plate, whose computational algorithm requires only a small amount of data to be logged. In these programs, failures are injected by calling the function raise(SIGKILL), and coordinated or asynchronous recovery is performed using ULFM functions. To obtain theoretical estimates of overhead costs, we propose a simulation model of program execution with failures. The model assumes that failures can strike during computation, checkpointing, and recovery. We compared the recovery methods at different failure rates. The comparison showed that asynchronous recovery reduces overhead costs by 22% to 40% according to theoretical estimates and by 13% to 53% in computational experiments.
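
To make the idea of a simulation model with failures more concrete, the following is a minimal Monte Carlo sketch. It is not the authors' model. It assumes exponentially distributed times between failures and introduces a hypothetical parameter, rework_factor, for the cost of recomputing lost work relative to the original computation: 1.0 stands for a rollback of all processes, and a smaller value stands for asynchronous recovery in which only the failed processes recompute. All function names and numeric values are illustrative assumptions.

```python
import random


def simulate_run(work, interval, ckpt_cost, recovery_cost, mtbf, rework_factor, rng):
    """Wall-clock time (hours) to finish `work` hours of computation.

    Failures may strike during computation, checkpointing, and recovery.
    `rework_factor` is a hypothetical knob, not taken from the paper.
    """
    clock = 0.0
    done = 0.0  # work already protected by a checkpoint
    next_failure = rng.expovariate(1.0 / mtbf)
    while done < work:
        segment = min(interval, work - done)
        progress = 0.0  # part of this segment computed before the last failure
        while True:
            replay = rework_factor * progress  # time to recompute lost data
            attempt = replay + (segment - progress) + ckpt_cost
            if clock + attempt <= next_failure:
                clock += attempt  # segment computed and checkpointed
                break
            elapsed = next_failure - clock
            if elapsed > replay:
                # New work past the replay phase counts as progress to redo.
                progress = min(segment, progress + (elapsed - replay))
            clock = next_failure
            next_failure = clock + rng.expovariate(1.0 / mtbf)
            # A failure during recovery restarts the recovery procedure.
            while next_failure < clock + recovery_cost:
                clock = next_failure
                next_failure = clock + rng.expovariate(1.0 / mtbf)
            clock += recovery_cost
        done += segment
    return clock


def mean_overhead(work, interval, ckpt_cost, recovery_cost, mtbf,
                  rework_factor, runs=2000, seed=0):
    """Average relative overhead (extra time / useful work) over many runs."""
    rng = random.Random(seed)
    total = sum(
        simulate_run(work, interval, ckpt_cost, recovery_cost, mtbf,
                     rework_factor, rng)
        for _ in range(runs)
    )
    return (total / runs - work) / work


if __name__ == "__main__":
    # Illustrative values only: 24 h of work, hourly checkpoints, MTBF of 3 h.
    for label, factor in (("rollback of all processes", 1.0),
                          ("asynchronous recovery (assumed factor)", 0.5)):
        print(label, mean_overhead(24.0, 1.0, 0.05, 0.05, 3.0, factor))
```

With rework_factor set to 1.0 the sketch reduces to the standard scheme, where a failure discards everything computed since the last checkpoint. The figures it prints depend entirely on the assumed parameters and are not comparable to the results reported in the paper.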

Author Biographies

  • Alexey A. Bondarenko
    researcher
  • Pavel A. Lyakhov
    postgraduate student
  • Mikhail V. Yakobovskiy
    deputy director, associate member of the Russian Academy of Sciences, doctor of physical and mathematical sciences, professor

Published

2019-06-13

Section

Supercomputer Modeling