Book contents
- Frontmatter
- Contents
- Preface
- 1 Introduction
- 2 A model of distributed computations
- 3 Logical time
- 4 Global state and snapshot recording algorithms
- 5 Terminology and basic algorithms
- 6 Message ordering and group communication
- 7 Termination detection
- 8 Reasoning with knowledge
- 9 Distributed mutual exclusion algorithms
- 10 Deadlock detection in distributed systems
- 11 Global predicate detection
- 12 Distributed shared memory
- 13 Checkpointing and rollback recovery
- 14 Consensus and agreement algorithms
- 15 Failure detectors
- 16 Authentication in distributed systems
- 17 Self-stabilization
- 18 Peer-to-peer computing and overlay graphs
- Index
13 - Checkpointing and rollback recovery
Published online by Cambridge University Press: 05 June 2012
Summary
Introduction
Distributed systems today are ubiquitous and enable many applications, including client–server systems, transaction processing, the World Wide Web, and scientific computing. Distributed systems are not inherently fault-tolerant, and their vast computing potential is often hampered by their susceptibility to failures. Many techniques have been developed to add reliability and high availability to distributed systems, including transactions, group communication, and rollback recovery; each has a different focus and different tradeoffs. This chapter covers rollback recovery protocols, which restore the system to a consistent state after a failure.
Rollback recovery treats a distributed application as a collection of processes that communicate over a network. It achieves fault tolerance by periodically saving the state of each process during failure-free execution, so that after a failure the process can restart from a saved state, which reduces the amount of lost work. The saved state is called a checkpoint, and the procedure of restarting from a previously checkpointed state is called rollback recovery. A checkpoint can be saved on either stable storage or volatile storage, depending on the failure scenarios to be tolerated.
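The checkpoint-and-restart idea for a single process can be illustrated with a minimal Python sketch. It is not taken from the chapter: the class `CheckpointedProcess`, its fields, and the fixed checkpoint interval are hypothetical names chosen for illustration, and an in-memory copy stands in for stable storage.

```python
import copy


class CheckpointedProcess:
    """Toy process that periodically saves its state and can roll back.

    Hypothetical illustration: 'saved' stands in for stable storage;
    a real system would write the checkpoint to disk or a remote store.
    """

    def __init__(self, checkpoint_interval):
        self.state = {"steps_done": 0, "total": 0}
        self.checkpoint_interval = checkpoint_interval
        # The initial state counts as the first checkpoint.
        self.saved = copy.deepcopy(self.state)

    def step(self, value):
        """Do one unit of work, checkpointing every 'checkpoint_interval' steps."""
        self.state["steps_done"] += 1
        self.state["total"] += value
        if self.state["steps_done"] % self.checkpoint_interval == 0:
            self.take_checkpoint()

    def take_checkpoint(self):
        # Deep copy so later updates to 'state' do not alter the checkpoint.
        self.saved = copy.deepcopy(self.state)

    def rollback(self):
        """Restart from the last checkpoint; return the number of lost steps."""
        lost = self.state["steps_done"] - self.saved["steps_done"]
        self.state = copy.deepcopy(self.saved)
        return lost
```

With an interval of 3, a process that fails after its seventh step restarts from the checkpoint taken after step 6 and loses one step of work. A shorter interval loses less work on failure but costs more checkpoints during failure-free execution.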
In distributed systems, rollback recovery is complicated because messages induce inter-process dependencies during failure-free operation. When one or more processes in a system fail, these dependencies may force some of the processes that did not fail to roll back as well, an effect commonly called rollback propagation.
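To make rollback propagation concrete, here is a small Python sketch under simplifying assumptions that are not from the chapter: each process numbers its local events 1, 2, 3, and so on; a checkpoint at count `c` saves the state after `c` events (0 is the initial state); and a message is recorded as a hypothetical tuple `(sender, send_event, receiver, recv_event)`. If a rollback undoes the send of a message whose receipt is still reflected in the receiver's state, the receiver must roll back too, and the loop repeats this check until no such message remains.

```python
def latest_checkpoint(checkpoint_counts, limit):
    """Latest checkpoint at or before 'limit' (0, the initial state, always qualifies)."""
    return max(c for c in checkpoint_counts if c <= limit)


def rollback_points(checkpoints, positions, messages, failed):
    """Return the event count each process ends up at after 'failed' fails.

    checkpoints: process -> list of event counts with a checkpoint (includes 0)
    positions:   process -> number of events executed when the failure occurs
    messages:    list of (sender, send_event, receiver, recv_event)
    failed:      name of the process that fails
    """
    points = dict(positions)
    # The failed process loses everything after its last checkpoint.
    points[failed] = latest_checkpoint(checkpoints[failed], positions[failed])

    changed = True
    while changed:  # terminates: points only ever decrease and are bounded by 0
        changed = False
        for sender, send_ev, receiver, recv_ev in messages:
            send_undone = send_ev > points[sender]
            recv_kept = recv_ev <= points[receiver]
            if send_undone and recv_kept:
                # Receiver must return to a checkpoint taken before the receipt.
                points[receiver] = latest_checkpoint(
                    checkpoints[receiver], recv_ev - 1
                )
                changed = True
    return points
```

For example, with `checkpoints = {"P": [0, 2], "Q": [0, 3], "R": [0, 1]}`, `positions = {"P": 5, "Q": 5, "R": 4}`, and `messages = [("P", 4, "Q", 4), ("Q", 5, "R", 3)]`, a failure of `P` sends it back to its checkpoint at 2. That undoes the send that `Q` received, so `Q` returns to 3, which in turn undoes the send that `R` received, so `R` returns to 1, even though neither `Q` nor `R` failed.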
- Type: Chapter
- Information: Distributed Computing: Principles, Algorithms, and Systems, pp. 456-509. Publisher: Cambridge University Press. Print publication year: 2008