Architecture Deep Dive¶
This guide explores FoundationDB's internal architecture—how transactions are processed, how consensus is reached, and how the system recovers from failures.
Prerequisites
Read the Architecture overview first for component basics.
System Overview¶
FDB's architecture is "unbundled"—each concern is handled by a separate subsystem:
graph TB
subgraph "Control Plane"
CC[Cluster Controller<br/>Singleton]
Coord[Coordinators<br/>Paxos]
end
subgraph "Transaction System"
direction TB
Proxy[Commit Proxies<br/>Transaction Ordering]
GRV[GRV Proxies<br/>Read Versions]
Resolver[Resolvers<br/>Conflict Detection]
TLog[Transaction Logs<br/>Durability]
end
subgraph "Storage System"
SS1[Storage Server 1]
SS2[Storage Server 2]
SS3[Storage Server 3]
end
Client --> GRV
Client --> Proxy
Coord --> CC
CC --> Proxy
CC --> Resolver
CC --> TLog
Proxy --> Resolver
Proxy --> TLog
TLog --> SS1
TLog --> SS2
TLog --> SS3
Transaction Processing¶
Read Path¶
Reads go directly to storage servers:
sequenceDiagram
participant C as Client
participant GRV as GRV Proxy
participant SS as Storage Server
C->>GRV: Get Read Version
GRV-->>C: Version 12345
C->>SS: Read key at v12345
SS-->>C: Value
- Get Read Version (GRV) - Client asks proxy for current committed version
- Read at Version - Client reads from storage servers at that version
- Snapshot Isolation - Reads see consistent snapshot
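The snapshot-isolation behavior of the read path can be pictured with a toy multi-version store; `VersionedStore` and the version numbers below are an illustrative model, not FDB's actual data structures:

```cpp
// Toy model of versioned reads, not FDB code: each key maps to a history of
// (version, value) pairs, and a read at a read version returns the newest
// value committed at or before that version.
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

using Version = int64_t;

struct VersionedStore {
    std::map<std::string, std::vector<std::pair<Version, std::string>>> history;

    // Mimics a storage server answering "read key at version v".
    std::optional<std::string> read(const std::string& key, Version readVersion) const {
        auto it = history.find(key);
        if (it == history.end()) return std::nullopt;
        std::optional<std::string> result;
        for (const auto& [version, value] : it->second) {
            if (version <= readVersion) result = value;  // newest value at or before readVersion
        }
        return result;
    }
};

int main() {
    VersionedStore store;
    store.history["k"] = {{12300, "old"}, {12346, "new"}};
    Version readVersion = 12345;                // obtained from a GRV proxy
    auto value = store.read("k", readVersion);  // sees "old": 12346 > 12345
    (void)value;
}
```

Because storage servers retain recent versions, a read at version 12345 never observes the later write at 12346, which is the consistent-snapshot property described above.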
Write Path¶
Writes are buffered locally, then committed atomically:
sequenceDiagram
participant C as Client
participant Proxy as Commit Proxy
participant Res as Resolver
participant TLog as Transaction Logs
C->>Proxy: Commit(reads, writes)
Proxy->>Res: Check conflicts
Res-->>Proxy: No conflict
Proxy->>TLog: Write to logs (parallel)
TLog-->>Proxy: Durable
Proxy-->>C: Committed @ version 12346
- Buffer locally - Writes accumulate in client memory
- Submit to proxy - Send reads (for conflict check) and writes
- Conflict detection - Resolver checks for overlapping writes
- Write to logs - Parallel write to transaction logs
- Respond to client - Transaction committed
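A rough sketch of the client side of this flow, using invented names (`BufferedTransaction`, `CommitRequest`) rather than FDB's real client API:

```cpp
// Toy sketch of client-side write buffering, not FDB's client API.
#include <cstdint>
#include <map>
#include <set>
#include <string>

using Version = int64_t;

struct CommitRequest {
    Version readVersion;                        // used by the resolver for conflict checking
    std::set<std::string> readKeys;             // keys this transaction read
    std::map<std::string, std::string> writes;  // buffered writes, applied atomically
};

struct BufferedTransaction {
    Version readVersion = 0;
    std::set<std::string> readKeys;
    std::map<std::string, std::string> writes;

    void set(const std::string& key, const std::string& value) {
        writes[key] = value;  // buffered in client memory; nothing is sent yet
    }

    CommitRequest buildCommit() const {
        // Everything is shipped to a commit proxy in one request at commit time;
        // the proxy assigns the commit version.
        return CommitRequest{readVersion, readKeys, writes};
    }
};
```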
Conflict Detection¶
FDB uses optimistic concurrency control with read-write conflict detection:
// Simplified conflict resolution logic
bool hasConflict(Transaction tr, Version commitVersion) {
for (auto& readRange : tr.readRanges) {
// Check if any committed transaction modified this range
// between our read version and commit version
if (wasModified(readRange, tr.readVersion, commitVersion)) {
return true;
}
}
return false;
}
Resolvers maintain a sliding window of recent commits to detect conflicts efficiently.
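A minimal sketch of that sliding window, assuming the resolver keeps recently committed write ranges ordered by commit version (illustrative types, not the real `Resolver` state):

```cpp
// Toy resolver state: recently committed write ranges, ordered by commit version.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Version = int64_t;

struct KeyRange { std::string begin, end; };  // half-open range [begin, end)

inline bool overlaps(const KeyRange& a, const KeyRange& b) {
    return a.begin < b.end && b.begin < a.end;
}

struct ResolverWindow {
    // commit version -> write ranges committed at that version
    std::map<Version, std::vector<KeyRange>> recentWrites;

    // A transaction conflicts if anything it read was overwritten after its read version.
    bool hasConflict(Version readVersion, const std::vector<KeyRange>& readRanges) const {
        for (auto it = recentWrites.upper_bound(readVersion); it != recentWrites.end(); ++it)
            for (const auto& written : it->second)
                for (const auto& read : readRanges)
                    if (overlaps(read, written)) return true;
        return false;
    }

    // Entries older than the oldest possible read version can be discarded,
    // which is what keeps the window "sliding" and memory bounded.
    void expireBefore(Version oldestReadVersion) {
        recentWrites.erase(recentWrites.begin(), recentWrites.lower_bound(oldestReadVersion));
    }
};
```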
Consensus & Coordination¶
Coordinators¶
Coordinators run a Paxos-based consensus protocol: Source
- Cluster discovery - The cluster file lists coordinator addresses, so clients know where to find the cluster
- Cluster controller election - Elect the cluster controller
- Configuration changes - Store cluster configuration
# fdb.cluster file points to coordinators
my_cluster:xyz123@10.0.0.1:4500,10.0.0.2:4500,10.0.0.3:4500
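The format is `description:ID@address,address,...`. A hypothetical parser (not part of any FDB library, and assuming well-formed input) makes the pieces explicit:

```cpp
// Hypothetical parser for the fdb.cluster format shown above; not FDB library code.
#include <sstream>
#include <string>
#include <vector>

struct ClusterFile {
    std::string description;                // "my_cluster"
    std::string id;                         // "xyz123"
    std::vector<std::string> coordinators;  // "10.0.0.1:4500", ...
};

ClusterFile parseClusterFile(const std::string& text) {
    ClusterFile cf;
    auto colon = text.find(':');
    auto at = text.find('@');
    cf.description = text.substr(0, colon);
    cf.id = text.substr(colon + 1, at - colon - 1);
    std::stringstream addrs(text.substr(at + 1));
    for (std::string addr; std::getline(addrs, addr, ',');)
        cf.coordinators.push_back(addr);
    return cf;
}
```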
Cluster Controller¶
The cluster controller (singleton) manages: Source
- Role assignment - Which processes serve which roles
- Failure detection - Monitor process health
- Reconfiguration - Respond to failures and changes
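Failure detection can be pictured as heartbeat bookkeeping; the sketch below uses an invented `WorkerMonitor` and a made-up timeout, and is only a simplification of the cluster controller's monitoring:

```cpp
// Toy heartbeat-based failure detector, a simplification of the cluster
// controller's worker monitoring.
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

struct WorkerMonitor {
    std::chrono::seconds timeout{2};  // made-up value, not an FDB default
    std::map<std::string, Clock::time_point> lastHeartbeat;

    void onHeartbeat(const std::string& worker) {
        lastHeartbeat[worker] = Clock::now();
    }

    // Workers that have not reported recently are treated as failed, and their
    // roles become candidates for re-recruitment.
    std::vector<std::string> failedWorkers() const {
        std::vector<std::string> failed;
        auto now = Clock::now();
        for (const auto& [worker, seen] : lastHeartbeat)
            if (now - seen > timeout) failed.push_back(worker);
        return failed;
    }
};
```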
Transaction Logs¶
Transaction logs provide durability with synchronous replication: Source
graph LR
Proxy --> TLog1[TLog 1]
Proxy --> TLog2[TLog 2]
Proxy --> TLog3[TLog 3]
TLog1 --> SS[Storage Servers]
TLog2 --> SS
TLog3 --> SS
- Parallel writes - Commit waits for quorum of TLogs
- Streaming to storage - Storage servers tail the logs
- Bounded memory - Old log data is garbage collected
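A sketch of the parallel write-and-wait step, with an invented `TLogStub` interface standing in for the real TLog RPCs:

```cpp
// Toy sketch of a commit proxy fanning a commit out to TLogs in parallel and
// waiting for durability acknowledgements before reporting the commit durable.
#include <cstddef>
#include <future>
#include <vector>

struct Mutation { /* key, value, mutation type... */ };

struct TLogStub {
    bool appendDurably(const std::vector<Mutation>& mutations, long version) {
        // A real TLog makes the mutations durable on disk before acknowledging.
        (void)mutations; (void)version;
        return true;
    }
};

bool commitToLogs(std::vector<TLogStub>& tlogs, const std::vector<Mutation>& mutations,
                  long version, std::size_t acksNeeded) {
    std::vector<std::future<bool>> acks;
    for (auto& tlog : tlogs)  // parallel writes to every TLog
        acks.push_back(std::async(std::launch::async,
            [&tlog, &mutations, version] { return tlog.appendDurably(mutations, version); }));

    std::size_t acked = 0;
    for (auto& ack : acks)
        if (ack.get()) ++acked;      // wait for durability acknowledgements
    return acked >= acksNeeded;      // only then is the commit reported durable
}
```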
Recovery¶
Recovery is FDB's most carefully designed subsystem. When failures occur:
stateDiagram-v2
[*] --> Recruiting: Cluster Controller Elected
Recruiting --> Recovery: Transaction System Recruited
Recovery --> Recovered: Log Recovery Complete
Recovered --> [*]
Recruiting --> Recruiting: Process Failure
Recovery --> Recruiting: Process Failure
Recovery Phases¶
- Coordinator Election - Elect new cluster controller via Paxos
- Role Recruitment - CC assigns new TLogs, Resolvers, Proxies
- Log Recovery - Determine recovery point, replay logs
- Resume Service - Accept new transactions
Recovery Guarantees¶
FDB guarantees:
- No committed data loss - All committed transactions survive
- No false positives - Uncommitted transactions don't appear
- Bounded recovery time - Typically seconds
// Simplified recovery logic
ACTOR Future<Void> recover(Reference<MasterData> self) {
// Find the recovery point
state Version recoveryVersion = wait(getRecoveryVersion(self));
// Recruit new transaction system
wait(recruitTransactionSystem(self, recoveryVersion));
// Resume serving transactions
return Void();
}
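`getRecoveryVersion` above is where the recovery point is chosen. As a greatly simplified illustration (assuming, only for this sketch, that every commit was written to every TLog of the previous generation), any version durable on all of the logs that can still be reached is safe to roll forward to; this is not FDB's actual algorithm:

```cpp
// Greatly simplified recovery-point selection. Assuming every commit reached
// every old-generation TLog, a commit was only acknowledged once all logs had
// it durably, so the minimum durable version across the reachable logs is at
// least as new as any acknowledged commit. Illustration only.
#include <algorithm>
#include <cstdint>
#include <vector>

using Version = int64_t;

Version chooseRecoveryVersion(const std::vector<Version>& durableVersions) {
    // Each entry: the highest version one reachable old-generation TLog has made durable.
    return *std::min_element(durableVersions.begin(), durableVersions.end());
}
```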
Data Distribution¶
FDB automatically distributes data across storage servers:
Sharding¶
Data is divided into shards (key ranges):
graph TB
subgraph "Key Space"
A[a-h] --> SS1[Storage 1]
B[h-p] --> SS2[Storage 2]
C[p-z] --> SS3[Storage 3]
end
Shard Movement¶
The data distributor manages shard placement:
- Balance load - Move hot shards
- Handle failures - Re-replicate lost shards
- Add capacity - Distribute to new servers
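A toy model of the shard map and a shard move, with invented types and team IDs rather than the data distributor's real structures:

```cpp
// Toy shard map: each entry maps the shard beginning at that key to the team of
// storage servers currently holding it. Illustration only.
#include <iterator>
#include <map>
#include <string>
#include <vector>

using Team = std::vector<std::string>;  // storage server IDs

struct ShardMap {
    // begin key of each shard -> team currently holding it
    std::map<std::string, Team> shards{
        {"a", {"ss1"}}, {"h", {"ss2"}}, {"p", {"ss3"}}};

    // Team responsible for a key: the shard with the greatest begin key <= key.
    // Assumes key is at or above the smallest shard begin key.
    const Team& teamFor(const std::string& key) const {
        auto it = shards.upper_bound(key);
        return std::prev(it)->second;
    }

    // Moving a shard: the destination team copies the data, then the mapping flips.
    void moveShard(const std::string& beginKey, Team destination) {
        shards[beginKey] = std::move(destination);
    }
};
```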
Further Reading¶
- SIGMOD Paper - Complete architecture description
- Transaction Processing Code - Implementation details
- Storage Engines - How data is persisted
Key Source Files¶
| Component | Source |
|---|---|
| Commit Proxy | CommitProxyServer.actor.cpp |
| GRV Proxy | GrvProxyServer.actor.cpp |
| Resolver | Resolver.actor.cpp |
| Data Distributor | DataDistribution.actor.cpp |
| Storage Server | storageserver.actor.cpp |
| Recovery | ClusterRecovery.actor.cpp |