Cluster Recovery Process¶
FoundationDB is designed to recover quickly from any component failure. This guide explains the multi-phase recovery process that ensures no committed data is lost while minimizing downtime.
Advanced Content
This builds on Architecture. Understanding the transaction system components is essential.
Recovery Philosophy¶
FDB embraces failure as inevitable and designs for fast recovery:
"FDB assumes that failures will happen and focuses on recovering quickly rather than preventing failures."
Key principles:
- Fail fast - Detect failures quickly, don't wait
- Recovery over prevention - Optimize recovery time, not mean time between failures (MTBF)
- Generation-based - Replace entire transaction system atomically
- Guaranteed durability - No committed transactions are ever lost
What Triggers Recovery¶
Recovery is triggered when any critical component fails:
graph TB
subgraph "Transaction System (Generation N)"
Master[Master]
Proxy[Proxies]
Resolver[Resolvers]
TLog[TLogs]
end
Failure[/"Component Failure"/]
Master --> Failure
Proxy --> Failure
Resolver --> Failure
TLog --> Failure
Failure --> NewGen["Recovery: Create Generation N+1"]

| Trigger | Description |
|---|---|
| Master failure | Master process crashes or becomes unreachable |
| TLog failure | A transaction log process crashes or becomes unreachable |
| Resolver failure | Resolver process crashes |
| Proxy failure | Commit proxy crashes |
| Network partition | Components can't communicate |
| Configuration change | Redundancy mode, storage engine changes |
Recovery Phases¶
Recovery proceeds through four distinct phases:
stateDiagram-v2
[*] --> CoordinatorLock: Failure Detected
CoordinatorLock: Phase 1 — Lock Coordinators (~5-20 ms)
MasterRecruit: Phase 2 — Recruit Master (~10-50 ms)
TransactionRecovery: Phase 3 — Transaction System Recovery (~50-200 ms)
AcceptCommits: Phase 4 — Resume Operations
CoordinatorLock --> MasterRecruit: Lock acquired
MasterRecruit --> TransactionRecovery: Master elected
TransactionRecovery --> AcceptCommits: Recovery complete
AcceptCommits --> [*]: Cluster operational

Phase 1: Lock Coordinators¶
The Cluster Controller detects failure and begins recovery:
- Failure detection - Heartbeat timeout or explicit notification
- Coordinator lock - CC acquires lock on coordinators (Paxos)
- Prevent split-brain - Old generation cannot make progress
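The split-brain protection above relies on generation comparison: a coordinator grants the lock only for a generation newer than any it has already granted, and the Cluster Controller proceeds once a majority of coordinators agree. The Python sketch below illustrates only this rule; the class and function names are invented for the example, and this is not FoundationDB's coordinator implementation.

```python
# Simplified illustration of generation-based coordinator locking.
# Not FDB's implementation; names are hypothetical.

class Coordinator:
    def __init__(self):
        self.highest_generation = 0  # highest generation ever granted

    def try_lock(self, generation: int) -> bool:
        # Grant the lock only for a strictly newer generation, so an
        # old transaction system can never regain it.
        if generation > self.highest_generation:
            self.highest_generation = generation
            return True
        return False


def lock_coordinators(coordinators, new_generation: int) -> bool:
    """Return True if a majority of coordinators grant the new generation."""
    grants = sum(c.try_lock(new_generation) for c in coordinators)
    return grants > len(coordinators) // 2


# Example: three coordinators, recovering from generation 7 to generation 8.
coords = [Coordinator() for _ in range(3)]
for c in coords:
    c.try_lock(7)                        # the old generation held the lock
assert lock_coordinators(coords, 8)      # the new generation wins a majority
assert not lock_coordinators(coords, 7)  # the old generation can no longer lock
```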
sequenceDiagram
participant CC as Cluster Controller
participant C1 as Coordinator 1
participant C2 as Coordinator 2
participant C3 as Coordinator 3
Note over CC: Failure detected
par Lock coordinators
CC->>C1: Lock (generation N+1)
CC->>C2: Lock (generation N+1)
CC->>C3: Lock (generation N+1)
end
C1-->>CC: Locked ✓
C2-->>CC: Locked ✓
C3-->>CC: Locked ✓
Note over CC: Majority locked - proceed

Phase 2: Recruit Master¶
The Cluster Controller recruits a new Master:
- Select candidate - Choose healthy process for Master role
- Master startup - New Master initializes
- Version handoff - New Master gets last committed version
Phase 3: Transaction System Recovery¶
The most complex phase—rebuilding the transaction system:
flowchart TB
subgraph "Recovery Steps"
A[Recruit new TLogs, Resolvers, Proxies]
B[Read recovery state from old TLogs]
C[Determine recovery version]
D[Copy uncommitted data to new TLogs]
E[Accept new transactions]
end
A --> B --> C --> D --> E

Determining Recovery Version¶
The new Master must determine which transactions committed:
graph LR
subgraph "Old TLogs"
TL1["TLog 1<br/>Versions: 100-150"]
TL2["TLog 2<br/>Versions: 100-148"]
TL3["TLog 3<br/>Versions: 100-145"]
end
RecoveryPoint["Recovery Version: 148<br/>(Quorum agreement)"]
TL1 --> RecoveryPoint
TL2 --> RecoveryPoint
TL3 --> RecoveryPoint

The recovery version is the highest version that:
- Was acknowledged by a quorum of TLogs
- Was marked as committed by the old Master
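A minimal sketch of that quorum rule, matching the diagram above: each old TLog reports the highest version it has made durable, and the recovery version is the highest version persisted on at least a quorum of them. The function name and parameter are invented for illustration; this is not FoundationDB's actual recovery-version logic, which also takes the old Master's committed version into account.

```python
# Illustrative only: highest version durable on a quorum of old TLogs.
def recovery_version(tlog_max_versions, quorum):
    """tlog_max_versions: highest durable version reported by each old TLog."""
    # The k-th largest value is the highest version that k TLogs all persisted.
    ordered = sorted(tlog_max_versions, reverse=True)
    return ordered[quorum - 1]

# Example from the diagram: TLogs durable to 150, 148, and 145; quorum of 2.
print(recovery_version([150, 148, 145], quorum=2))  # -> 148
```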
Phase 4: Resume Operations¶
Once the new transaction system is ready:
- New TLogs ready - Accepting writes
- Resolvers ready - Conflict detection active
- Proxies ready - Accepting client requests
- Master announces - New generation is live
graph TB
subgraph "Generation N (Failed)"
Old[Old Master<br/>Old TLogs<br/>Old Proxies]
style Old fill:#ffcccc
end
subgraph "Generation N+1 (Active)"
New[New Master<br/>New TLogs<br/>New Proxies]
style New fill:#ccffcc
end
Old -->|Recovery| New
Client[Clients] --> New

Recovery Guarantees¶
FDB's recovery provides strong guarantees:
| Guarantee | Description |
|---|---|
| No data loss | All committed transactions survive recovery |
| No phantoms | Uncommitted transactions don't appear after recovery |
| Consistency | Database remains consistent throughout |
| Bounded time | Recovery typically < 1 second |
How Durability Is Maintained¶
graph LR
subgraph "Before Failure"
T1["Txn 1: Committed ✓<br/>(Quorum TLogs)"]
T2["Txn 2: Committed ✓<br/>(Quorum TLogs)"]
T3["Txn 3: In-flight<br/>(Not all TLogs)"]
end
subgraph "After Recovery"
R1["Txn 1: Preserved ✓"]
R2["Txn 2: Preserved ✓"]
R3["Txn 3: Lost ✗<br/>(Client will retry)"]
end
T1 --> R1
T2 --> R2
T3 -.->|Not durable| R3

Availability During Recovery¶
FDB remains partially available during recovery:
| Operation | During Recovery |
|---|---|
| Reads (recent) | Available if storage server has data |
| Reads (old) | May fail if version too old |
| Writes | Blocked until recovery completes |
| New transactions | Blocked while read version (GRV) requests are unavailable |
Client Behavior¶
Clients automatically handle recovery:
- Retry loop - Built into client bindings
- Backoff - Exponential backoff during unavailability
- Reconnect - Discover new proxies via coordinators
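For example, with the Python binding the @fdb.transactional decorator reruns the function automatically when a commit fails with a retryable error, which covers the brief write unavailability during a recovery; a transaction timeout can bound how long the retry loop waits. This is a sketch of ordinary client code, not recovery-specific logic, and the API version shown (710) should match your installed client.

```python
import fdb

fdb.api_version(710)   # select the API version matching the installed client
db = fdb.open()        # uses the default cluster file

# Optional: bound retries so an unusually long recovery surfaces as an error
# instead of blocking indefinitely (timeout is in milliseconds).
db.options.set_transaction_timeout(5000)

@fdb.transactional
def set_value(tr, key, value):
    # If the cluster is mid-recovery, the commit fails with a retryable
    # error; the decorator backs off and reruns this function.
    tr[key] = value

set_value(db, b'hello', b'world')
```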
Recovery Timing¶
Typical recovery phases:
gantt
title Recovery Timeline
dateFormat X
axisFormat %L ms
section Phase 1
Lock Coordinators :0, 20
section Phase 2
Recruit Master :20, 50
section Phase 3
Recruit TLogs :50, 100
Read Old TLogs :100, 200
Determine Recovery :200, 250
section Phase 4
Resume Service :250, 300

| Phase | Typical Duration | Factors |
|---|---|---|
| Coordinator lock | 5-20 ms | Coordinator latency |
| Master recruitment | 10-30 ms | Process availability |
| TLog recovery | 50-200 ms | Log size, network |
| Total | 100-500 ms | Cluster size, load |
Failure Scenarios¶
Single TLog Failure¶
flowchart LR
Fail["TLog 3 fails"] --> Detect["Master detects failure"]
Detect --> Recover["Full recovery<br/>(new generation)"]
Recover --> Resume["Resume with<br/>remaining TLogs"]

Master Failure¶
The Cluster Controller detects and initiates recovery:
- CC notices Master heartbeat missing
- CC triggers full transaction system recovery
- New Master recruited, new generation started
Coordinator Failure¶
Coordinators use Paxos for fault tolerance:
- Majority available - Cluster continues
- Minority lost - No cluster impact
- Majority lost - Cluster unavailable until a coordinator majority is restored or the coordinator set is changed
With five coordinators, for example, the cluster tolerates two coordinator failures but not three.
Monitoring Recovery¶
Key recovery metrics:
| Metric | Description |
|---|---|
| Recoveries | Count of recovery events |
| RecoveryDuration | Time spent in recovery |
| RecoveryState | Current recovery phase |
Monitor recovery in fdbcli:
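For example (field names in the JSON output vary by FoundationDB version, so treat the recovery_state path as indicative rather than exact):

```
fdb> status            # cluster summary, reports when the database is recovering
fdb> status details    # per-process detail
fdb> status json       # machine-readable output; see cluster.recovery_state
```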
Source Code References¶
- masterserver.actor.cpp - Master and recovery coordination
- ClusterController.actor.cpp - Cluster controller logic
- Coordination.actor.cpp - Coordinator implementation
Further Reading¶
- Architecture Overview - Component roles
- Architecture Deep Dive - Transaction processing
- Simulation Testing - How recovery is tested
- SIGMOD Paper, Section 5 - Recovery design