Coordinated Checkpointing with Sender-based Logging and Asynchronous Recovery from Failure

Alexey Alekseevich Bondarenko; Pavel Aleksandrovich Lyakhov; Mikhail Vladimirovich Yakobovskiy

Authors

Alexey A. Bondarenko
Pavel A. Lyakhov
Mikhail V. Yakobovskiy

Abstract

The steady growth in the number of supercomputer components leads HPC specialists to unfavorable estimates for future systems: “the range of the mean time between failures will be from 1 hour to 9 hours.” Such failure rates make long-running computations on supercomputers problematic. The standard fault tolerance method consists of two phases: coordinated checkpointing and, in the case of a failure, rollback of all processes to the last checkpoint. In this paper, we propose a method of recovery from failure that does not require rollback of all processes and can reduce overhead costs for some computational algorithms. The proposed method combines coordinated checkpointing with sender-based logging and asynchronous recovery, in which most processes wait while several processes recalculate the lost data. We developed parallel programs that solve the problem of heat transfer in a thin plate, a problem whose computational algorithm has a small amount of data to log. In these programs, failures are injected by calling the function raise(SIGKILL), and coordinated or asynchronous recovery is performed using ULFM functions. To obtain theoretical estimates of overhead costs, we propose a simulation model of program execution with failures. This model assumes that failures can strike during computation, checkpointing, and recovery. We compared the recovery methods at different failure rates. The comparison showed that asynchronous recovery reduces overhead costs by 22% to 40% according to theoretical estimates and by 13% to 53% in computational experiments.
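The simulation idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' model: it only covers the standard scheme (coordinated checkpointing with full rollback), and it assumes exponentially distributed times between failures. The function name `run_with_failures` and all of its parameters are hypothetical and chosen for illustration only.

```python
import random


def run_with_failures(total_work, interval, ckpt_cost, restart_cost,
                      mtbf, seed=0):
    """Simulate one run and return its wall-clock time.

    Assumptions (not taken from the paper): failures arrive with
    exponentially distributed gaps of mean `mtbf`; a failure during
    computation or checkpointing discards all progress since the last
    completed checkpoint; a failure during recovery restarts the recovery.
    """
    rng = random.Random(seed)
    clock = 0.0   # elapsed wall-clock time
    saved = 0.0   # useful work protected by a completed checkpoint
    next_failure = rng.expovariate(1.0 / mtbf)

    while saved < total_work:
        chunk = min(interval, total_work - saved)
        segment = chunk + ckpt_cost  # compute, then write a checkpoint

        if clock + segment <= next_failure:
            # The whole segment finishes before the next failure.
            clock += segment
            saved += chunk
            continue

        # A failure strikes during computation or checkpointing.
        clock = next_failure
        next_failure = clock + rng.expovariate(1.0 / mtbf)

        # Recovery can itself be interrupted by further failures.
        while clock + restart_cost > next_failure:
            clock = next_failure
            next_failure = clock + rng.expovariate(1.0 / mtbf)
        clock += restart_cost
        # Rollback: `saved` is unchanged, so the lost segment is redone.

    return clock
```

The relative overhead of a run is then `run_with_failures(...) / total_work - 1`. Averaging it over many seeds for several values of `mtbf` gives a rough picture of how overhead grows with the failure rate under these assumptions.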

Author Biographies

Alexey A. Bondarenko
researcher

Pavel A. Lyakhov
postgraduate student

Mikhail V. Yakobovskiy
deputy director, associate member of the Russian Academy of Sciences, doctor of physical and mathematical sciences, professor
