Checkpointing and rollback recovery

Ajay D. Kshemkalyani; Mukesh Singhal

doi:10.1017/CBO9780511805318.014

13 - Checkpointing and rollback recovery

Published online by Cambridge University Press: 05 June 2012


Ajay D. Kshemkalyani, University of Illinois, Chicago
Mukesh Singhal, University of Kentucky


Summary

Introduction

Distributed systems today are ubiquitous and enable many applications, including client–server systems, transaction processing, the World Wide Web, and scientific computing. Distributed systems are not inherently fault-tolerant, and their vast computing potential is often hampered by their susceptibility to failures. Many techniques have been developed to add reliability and high availability to distributed systems, including transactions, group communication, and rollback recovery, each with different tradeoffs and focus. This chapter covers rollback recovery protocols, which restore the system to a consistent state after a failure.

Rollback recovery treats a distributed system application as a collection of processes that communicate over a network. It achieves fault tolerance by periodically saving the state of a process during failure-free execution, so that the process can restart from a saved state after a failure and thereby reduce the amount of lost work. The saved state is called a checkpoint, and the procedure of restarting from a previously checkpointed state is called rollback recovery. A checkpoint can be saved on either stable or volatile storage, depending on the failure scenarios to be tolerated.
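The checkpoint-and-restart idea for a single process can be illustrated with a minimal sketch. Everything named here (the class CheckpointedCounter, its checkpoint_every parameter, and the in-memory copy standing in for stable storage) is an assumption made for illustration and is not taken from the chapter.

```python
import copy


class CheckpointedCounter:
    """Hypothetical process whose state is a step count and a running total."""

    def __init__(self, checkpoint_every=3):
        self.state = {"step": 0, "total": 0}
        self.checkpoint_every = checkpoint_every
        # The initial state serves as the first checkpoint.
        # A deep copy in memory stands in for stable storage (assumption).
        self.checkpoint = copy.deepcopy(self.state)

    def step(self, value):
        """Do one unit of work, saving a checkpoint every few steps."""
        self.state["step"] += 1
        self.state["total"] += value
        if self.state["step"] % self.checkpoint_every == 0:
            self.checkpoint = copy.deepcopy(self.state)

    def recover(self):
        """Roll back: discard the current state and restart from the checkpoint."""
        self.state = copy.deepcopy(self.checkpoint)
        return self.state["step"]


proc = CheckpointedCounter(checkpoint_every=3)
for v in [1, 2, 3, 4, 5]:
    proc.step(v)

# Simulate a failure after step 5: the work of steps 4 and 5 is lost,
# but steps 1 to 3 do not have to be repeated.
resumed_from = proc.recover()
```

In this sketch the process resumes from step 3 rather than from the beginning, which is the sense in which checkpointing reduces the amount of lost work.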

In distributed systems, rollback recovery is complicated because messages induce inter-process dependencies during failure-free operation. When one or more processes in a system fail, these dependencies may force some of the processes that did not fail to roll back as well, an effect commonly called rollback propagation.
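Rollback propagation can be sketched with a small model. This is a simplified illustration under stated assumptions, not one of the protocols the chapter presents. Each process numbers its checkpoints from 0, and interval i is the execution that follows checkpoint i. Restoring a process to checkpoint c undoes every interval numbered c or higher. A message whose send is undone while its receive is kept forces the receiver to roll back too. The function recovery_line and the tuple layout of messages are hypothetical names.

```python
def recovery_line(latest, messages, failed):
    """Return, per process, the checkpoint index it must restart from.

    latest:   dict mapping process name to its most recent checkpoint index
    messages: list of (sender, send_interval, receiver, recv_interval)
    failed:   name of the process that failed
    A value of latest[p] + 1 means "p keeps its current state".
    """
    # Surviving processes start by keeping their current state.
    line = {p: c + 1 for p, c in latest.items()}
    # The failed process restarts from its latest checkpoint.
    line[failed] = latest[failed]

    changed = True
    while changed:  # repeat until no further rollback is forced
        changed = False
        for sender, s, receiver, r in messages:
            send_undone = s >= line[sender]
            recv_kept = r < line[receiver]
            if send_undone and recv_kept:
                # The receiver depends on a send that no longer happened,
                # so it rolls back to the checkpoint before the receive.
                line[receiver] = r
                changed = True
    return line


# Three processes, each with checkpoints 0 and 1.
latest = {"P": 1, "Q": 1, "R": 1}
messages = [
    ("Q", 1, "R", 0),  # Q sends after its checkpoint 1; R receives before its checkpoint 1
    ("P", 1, "Q", 1),  # P sends after its checkpoint 1; Q receives after its checkpoint 1
]
line = recovery_line(latest, messages, failed="P")
```

Here only P fails, yet Q must return to its checkpoint 1 and R must return all the way to its checkpoint 0, because each holds a received message whose send has been undone.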

Type: Chapter
Information: Distributed Computing: Principles, Algorithms, and Systems, pp. 456-509

DOI: https://doi.org/10.1017/CBO9780511805318.014

Publisher: Cambridge University Press

Print publication year: 2008
