Architecture Deep Dive¶
This guide explores FoundationDB's internal architecture—how transactions are processed, how consensus is reached, and how the system recovers from failures.
Prerequisites
Read the Architecture overview first for component basics.
System Overview¶
FDB's architecture is "unbundled"—each concern is handled by a separate subsystem:
graph TB
subgraph "Control Plane"
CC[Cluster Controller<br/>Singleton]
Coord[Coordinators<br/>Paxos]
end
subgraph "Transaction System"
direction TB
Proxy[Commit Proxies<br/>Transaction Ordering]
GRV[GRV Proxies<br/>Read Versions]
Resolver[Resolvers<br/>Conflict Detection]
TLog[Transaction Logs<br/>Durability]
end
subgraph "Storage System"
SS1[Storage Server 1]
SS2[Storage Server 2]
SS3[Storage Server 3]
end
Client --> GRV
Client --> Proxy
Coord --> CC
CC --> Proxy
CC --> Resolver
CC --> TLog
Proxy --> Resolver
Proxy --> TLog
TLog --> SS1
TLog --> SS2
TLog --> SS3
Transaction Processing¶
Read Path¶
Reads go directly to storage servers:
sequenceDiagram
participant C as Client
participant GRV as GRV Proxy
participant SS as Storage Server
C->>GRV: Get Read Version
GRV-->>C: Version 12345
C->>SS: Read key at v12345
SS-->>C: Value
- Get Read Version (GRV) - Client asks proxy for current committed version
- Read at Version - Client reads from storage servers at that version
- Snapshot Isolation - Reads see consistent snapshot
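The snapshot-isolation behavior of the read path can be pictured with a toy multi-version store; `VersionedStore` and the version numbers below are an illustrative model, not FDB's actual data structures:

```cpp
// Toy model of versioned reads, not FDB code: each key maps to a history of
// (version, value) pairs, and a read at a read version returns the newest
// value committed at or before that version.
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

using Version = int64_t;

struct VersionedStore {
    std::map<std::string, std::vector<std::pair<Version, std::string>>> history;

    // Mimics a storage server answering "read key at version v".
    std::optional<std::string> read(const std::string& key, Version readVersion) const {
        auto it = history.find(key);
        if (it == history.end()) return std::nullopt;
        std::optional<std::string> result;
        for (const auto& [version, value] : it->second) {
            if (version <= readVersion) result = value;  // newest value at or before readVersion
        }
        return result;
    }
};

int main() {
    VersionedStore store;
    store.history["k"] = {{12300, "old"}, {12346, "new"}};
    Version readVersion = 12345;                // obtained from a GRV proxy
    auto value = store.read("k", readVersion);  // sees "old": 12346 > 12345
    (void)value;
}
```

Because storage servers retain recent versions, a read at version 12345 never observes the later write at 12346, which is the consistent-snapshot property described above.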
Write Path¶
Writes are buffered locally, then committed atomically:
sequenceDiagram
participant C as Client
participant Proxy as Commit Proxy
participant Res as Resolver
participant TLog as Transaction Logs
C->>Proxy: Commit(reads, writes)
Proxy->>Res: Check conflicts
Res-->>Proxy: No conflict
Proxy->>TLog: Write to logs (parallel)
TLog-->>Proxy: Durable
Proxy-->>C: Committed @ version 12346
- Buffer locally - Writes accumulate in client memory
- Submit to proxy - Send reads (for conflict check) and writes
- Conflict detection - Resolver checks for overlapping writes
- Write to logs - Parallel write to transaction logs
- Respond to client - Transaction committed
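A rough sketch of the client side of this flow, using invented names (`BufferedTransaction`, `CommitRequest`) rather than FDB's real client API:

```cpp
// Toy sketch of client-side write buffering, not FDB's client API.
#include <cstdint>
#include <map>
#include <set>
#include <string>

using Version = int64_t;

struct CommitRequest {
    Version readVersion;                        // used by the resolver for conflict checking
    std::set<std::string> readKeys;             // keys this transaction read
    std::map<std::string, std::string> writes;  // buffered writes, applied atomically
};

struct BufferedTransaction {
    Version readVersion = 0;
    std::set<std::string> readKeys;
    std::map<std::string, std::string> writes;

    void set(const std::string& key, const std::string& value) {
        writes[key] = value;  // buffered in client memory; nothing is sent yet
    }

    CommitRequest buildCommit() const {
        // Everything is shipped to a commit proxy in one request at commit time;
        // the proxy assigns the commit version.
        return CommitRequest{readVersion, readKeys, writes};
    }
};
```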
Conflict Detection¶
FDB uses optimistic concurrency control with read-write conflict detection:
// Simplified conflict resolution logic
bool hasConflict(Transaction tr, Version commitVersion) {
for (auto& readRange : tr.readRanges) {
// Check if any committed transaction modified this range
// between our read version and commit version
if (wasModified(readRange, tr.readVersion, commitVersion)) {
return true;
}
}
return false;
}
Resolvers maintain a sliding window of recent commits to detect conflicts efficiently.
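A minimal sketch of that sliding window, assuming the resolver keeps recently committed write ranges ordered by commit version (illustrative types, not the real `Resolver` state):

```cpp
// Toy resolver state: recently committed write ranges, ordered by commit version.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Version = int64_t;

struct KeyRange { std::string begin, end; };  // half-open range [begin, end)

inline bool overlaps(const KeyRange& a, const KeyRange& b) {
    return a.begin < b.end && b.begin < a.end;
}

struct ResolverWindow {
    // commit version -> write ranges committed at that version
    std::map<Version, std::vector<KeyRange>> recentWrites;

    // A transaction conflicts if anything it read was overwritten after its read version.
    bool hasConflict(Version readVersion, const std::vector<KeyRange>& readRanges) const {
        for (auto it = recentWrites.upper_bound(readVersion); it != recentWrites.end(); ++it)
            for (const auto& written : it->second)
                for (const auto& read : readRanges)
                    if (overlaps(read, written)) return true;
        return false;
    }

    // Entries older than the oldest possible read version can be discarded,
    // which is what keeps the window "sliding" and memory bounded.
    void expireBefore(Version oldestReadVersion) {
        recentWrites.erase(recentWrites.begin(), recentWrites.lower_bound(oldestReadVersion));
    }
};
```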
Consensus & Coordination¶
Coordinators¶
Coordinators run a Paxos-based consensus protocol: Source
- Cluster discovery - The cluster file lists coordinator addresses, so clients know where to find the cluster
- Cluster controller election - Elect the cluster controller
- Configuration changes - Store cluster configuration
# fdb.cluster file points to coordinators
my_cluster:xyz123@10.0.0.1:4500,10.0.0.2:4500,10.0.0.3:4500
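The format is `description:ID@address,address,...`. A hypothetical parser (not part of any FDB library, and assuming well-formed input) makes the pieces explicit:

```cpp
// Hypothetical parser for the fdb.cluster format shown above; not FDB library code.
#include <sstream>
#include <string>
#include <vector>

struct ClusterFile {
    std::string description;                // "my_cluster"
    std::string id;                         // "xyz123"
    std::vector<std::string> coordinators;  // "10.0.0.1:4500", ...
};

ClusterFile parseClusterFile(const std::string& text) {
    ClusterFile cf;
    auto colon = text.find(':');
    auto at = text.find('@');
    cf.description = text.substr(0, colon);
    cf.id = text.substr(colon + 1, at - colon - 1);
    std::stringstream addrs(text.substr(at + 1));
    for (std::string addr; std::getline(addrs, addr, ',');)
        cf.coordinators.push_back(addr);
    return cf;
}
```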
Cluster Controller¶
The cluster controller (singleton) manages: Source
- Role assignment - Which processes serve which roles
- Failure detection - Monitor process health
- Reconfiguration - Respond to failures and changes
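Failure detection can be pictured as heartbeat bookkeeping; the sketch below uses an invented `WorkerMonitor` and a made-up timeout, and is only a simplification of the cluster controller's monitoring:

```cpp
// Toy heartbeat-based failure detector, a simplification of the cluster
// controller's worker monitoring.
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

struct WorkerMonitor {
    std::chrono::seconds timeout{2};  // made-up value, not an FDB default
    std::map<std::string, Clock::time_point> lastHeartbeat;

    void onHeartbeat(const std::string& worker) {
        lastHeartbeat[worker] = Clock::now();
    }

    // Workers that have not reported recently are treated as failed, and their
    // roles become candidates for re-recruitment.
    std::vector<std::string> failedWorkers() const {
        std::vector<std::string> failed;
        auto now = Clock::now();
        for (const auto& [worker, seen] : lastHeartbeat)
            if (now - seen > timeout) failed.push_back(worker);
        return failed;
    }
};
```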
Transaction Logs¶
Transaction logs provide durability with synchronous replication: Source
graph LR
Proxy --> TLog1[TLog 1]
Proxy --> TLog2[TLog 2]
Proxy --> TLog3[TLog 3]
TLog1 --> SS[Storage Servers]
TLog2 --> SS
TLog3 --> SS
- Parallel writes - Commit waits for quorum of TLogs
- Streaming to storage - Storage servers tail the logs
- Bounded memory - Old log data is garbage collected
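A sketch of the parallel write-and-wait step, with an invented `TLogStub` interface standing in for the real TLog RPCs:

```cpp
// Toy sketch of a commit proxy fanning a commit out to TLogs in parallel and
// waiting for durability acknowledgements before reporting the commit durable.
#include <cstddef>
#include <future>
#include <vector>

struct Mutation { /* key, value, mutation type... */ };

struct TLogStub {
    bool appendDurably(const std::vector<Mutation>& mutations, long version) {
        // A real TLog makes the mutations durable on disk before acknowledging.
        (void)mutations; (void)version;
        return true;
    }
};

bool commitToLogs(std::vector<TLogStub>& tlogs, const std::vector<Mutation>& mutations,
                  long version, std::size_t acksNeeded) {
    std::vector<std::future<bool>> acks;
    for (auto& tlog : tlogs)  // parallel writes to every TLog
        acks.push_back(std::async(std::launch::async,
            [&tlog, &mutations, version] { return tlog.appendDurably(mutations, version); }));

    std::size_t acked = 0;
    for (auto& ack : acks)
        if (ack.get()) ++acked;      // wait for durability acknowledgements
    return acked >= acksNeeded;      // only then is the commit reported durable
}
```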
Recovery¶
Recovery is FDB's most carefully designed subsystem. When failures occur:
stateDiagram-v2
[*] --> Recruiting: Cluster Controller Elected
Recruiting --> Recovery: Transaction System Recruited
Recovery --> Recovered: Log Recovery Complete
Recovered --> [*]
Recruiting --> Recruiting: Process Failure
Recovery --> Recruiting: Process Failure
Recovery Phases¶
- Coordinator Election - Elect new cluster controller via Paxos
- Role Recruitment - CC assigns new TLogs, Resolvers, Proxies
- Log Recovery - Determine recovery point, replay logs
- Resume Service - Accept new transactions
Recovery Guarantees¶
FDB guarantees:
- No committed data loss - All committed transactions survive
- No false positives - Uncommitted transactions don't appear
- Bounded recovery time - Typically seconds
// Simplified recovery logic
ACTOR Future<Void> recover(Reference<MasterData> self) {
// Find the recovery point
state Version recoveryVersion = wait(getRecoveryVersion(self));
// Recruit new transaction system
wait(recruitTransactionSystem(self, recoveryVersion));
// Resume serving transactions
return Void();
}
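`getRecoveryVersion` above is where the recovery point is chosen. As a greatly simplified illustration (assuming, only for this sketch, that every commit was written to every TLog of the previous generation), any version durable on all of the logs that can still be reached is safe to roll forward to; this is not FDB's actual algorithm:

```cpp
// Greatly simplified recovery-point selection. Assuming every commit reached
// every old-generation TLog, a commit was only acknowledged once all logs had
// it durably, so the minimum durable version across the reachable logs is at
// least as new as any acknowledged commit. Illustration only.
#include <algorithm>
#include <cstdint>
#include <vector>

using Version = int64_t;

Version chooseRecoveryVersion(const std::vector<Version>& durableVersions) {
    // Each entry: the highest version one reachable old-generation TLog has made durable.
    return *std::min_element(durableVersions.begin(), durableVersions.end());
}
```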
Data Distribution¶
FDB automatically distributes data across storage servers:
Sharding¶
Data is divided into shards (key ranges):
graph TB
subgraph "Key Space"
A[a-h] --> SS1[Storage 1]
B[h-p] --> SS2[Storage 2]
C[p-z] --> SS3[Storage 3]
end
Shard Movement¶
The data distributor manages shard placement:
- Balance load - Move hot shards
- Handle failures - Re-replicate lost shards
- Add capacity - Distribute to new servers
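A toy model of the shard map and a shard move, with invented types and team IDs rather than the data distributor's real structures:

```cpp
// Toy shard map: each entry maps the shard beginning at that key to the team of
// storage servers currently holding it. Illustration only.
#include <iterator>
#include <map>
#include <string>
#include <vector>

using Team = std::vector<std::string>;  // storage server IDs

struct ShardMap {
    // begin key of each shard -> team currently holding it
    std::map<std::string, Team> shards{
        {"a", {"ss1"}}, {"h", {"ss2"}}, {"p", {"ss3"}}};

    // Team responsible for a key: the shard with the greatest begin key <= key.
    // Assumes key is at or above the smallest shard begin key.
    const Team& teamFor(const std::string& key) const {
        auto it = shards.upper_bound(key);
        return std::prev(it)->second;
    }

    // Moving a shard: the destination team copies the data, then the mapping flips.
    void moveShard(const std::string& beginKey, Team destination) {
        shards[beginKey] = std::move(destination);
    }
};
```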
Further Reading¶
- SIGMOD Paper - Complete architecture description
- Transaction Processing Code - Implementation details
- Storage Engines - How data is persisted
Key Source Files¶
| Component | Source |
|---|---|
| Commit Proxy | CommitProxyServer.actor.cpp |
| GRV Proxy | GrvProxyServer.actor.cpp |
| Resolver | Resolver.actor.cpp |
| Data Distributor | DataDistribution.actor.cpp |
| Storage Server | storageserver.actor.cpp |
| Recovery | ClusterRecovery.actor.cpp |