Architecture¶
FoundationDB's architecture is designed around a principle of decoupling: different processes handle different responsibilities, allowing the system to scale each component independently. This design enables horizontal scaling, fault tolerance, and the strict ACID guarantees that define FoundationDB.
Design Philosophy¶
FoundationDB separates the transaction processing system (write path) from the storage system (read path). This separation allows:
- Independent scaling: Add storage servers for capacity, proxies for throughput
- Fault isolation: A storage server failure doesn't affect transaction processing
- Flexible deployment: Components can be placed on different hardware tiers
```mermaid
graph TB
    subgraph "Clients"
        C1[Client 1]
        C2[Client 2]
        C3[Client N]
    end
    subgraph "Transaction System"
        GRVProxy[GRV Proxies<br/>Read versions]
        CommitProxy[Commit Proxies<br/>Transaction commits]
        Resolver[Resolvers<br/>Conflict detection]
        TLog[Transaction Logs<br/>Durability]
        Master[Master<br/>Coordination]
    end
    subgraph "Storage System"
        SS1[Storage Server 1]
        SS2[Storage Server 2]
        SS3[Storage Server N]
    end
    subgraph "Control Plane"
        Coord[Coordinators<br/>Paxos consensus]
        CC[Cluster Controller<br/>Singleton manager]
        DD[Data Distributor<br/>Shard management]
        RK[Ratekeeper<br/>Admission control]
    end
    C1 & C2 & C3 --> GRVProxy
    C1 & C2 & C3 --> CommitProxy
    C1 & C2 & C3 --> SS1 & SS2 & SS3
    CommitProxy --> Resolver
    CommitProxy --> TLog
    TLog --> SS1 & SS2 & SS3
    Coord --> CC
    CC --> Master
    CC --> DD
    CC --> RK
```

Core Components¶
Coordinators¶
Coordinators form the stable anchor for cluster discovery and leader election. They run the Paxos consensus algorithm to maintain a small amount of critical shared state.
| Responsibility | Details |
|---|---|
| Cluster discovery | Clients and servers connect using coordinator addresses |
| Leader election | Elect the Cluster Controller |
| Configuration storage | Store minimal cluster metadata |
Coordinator Placement
For production, deploy an odd number of coordinators (3 or 5) across failure domains. A majority must be available for the cluster to operate.
Cluster Controller¶
The Cluster Controller is a singleton process elected by the coordinators. It's the central orchestrator for the cluster:
- Detects process failures
- Recruits processes into roles
- Distributes cluster configuration
Master¶
The Master coordinates the transaction system's lifecycle:
- Provides commit versions to proxies
- Monitors transaction system health
- Triggers recovery when components fail
The master, proxies, resolvers, and transaction logs form a generation. If any component fails, the entire generation is replaced through a recovery process.
GRV Proxies (Get Read Version)¶
Proxy Split (7.0+)
In FoundationDB 7.0, the original "Proxy" role was split into GRV Proxies and Commit Proxies. This separation reduces GRV tail latency by allowing independent scaling of read version requests.
GRV Proxies provide read versions to clients:
```mermaid
sequenceDiagram
    participant Client
    participant GRVProxy
    participant Master
    participant TLogs
    Client->>GRVProxy: Get read version
    GRVProxy->>Master: Latest committed version?
    GRVProxy->>TLogs: Are you alive?
    TLogs-->>GRVProxy: Yes
    Master-->>GRVProxy: Version 123456
    GRVProxy-->>Client: Read version: 123456
```

The GRV proxy:
- Gets the latest committed version from the master
- Verifies transaction logs are still responsive (prevents stale reads after recovery)
- Applies rate limiting from Ratekeeper
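For illustration, here is a minimal sketch of requesting a read version explicitly through the Python bindings (the API version and cluster file are assumptions; in normal use, the first read in a transaction triggers this request for you):

```python
import fdb

fdb.api_version(710)   # assumes a 7.1+ client library
db = fdb.open()        # assumes a default cluster file is available

tr = db.create_transaction()
# This request is served by a GRV proxy, which returns the latest committed
# version after confirming that the transaction logs are still responsive.
read_version = tr.get_read_version().wait()
print("read version:", read_version)
```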
Commit Proxies¶
Commit Proxies handle transaction commits:
- Get a commit version from the master
- Send read/write conflict ranges to resolvers
- If no conflicts, send mutations to transaction logs
- Return success/failure to client
Commit proxies also maintain the key-location cache—a mapping of key ranges to storage servers that clients use to route reads.
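One way to see this routing from a client is the locality API in the Python bindings; a sketch (the key is illustrative):

```python
import fdb

fdb.api_version(710)   # assumes a 7.1+ client library
db = fdb.open()        # assumes a default cluster file is available

@fdb.transactional
def where_is(tr, key):
    # Returns the public addresses of the storage servers currently
    # responsible for the key, resolved via the key-location mapping.
    return fdb.locality.get_addresses_for_key(tr, key).wait()

print(where_is(db, b'some/key'))
```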
Resolvers¶
Resolvers determine if transactions conflict:
```mermaid
graph LR
    subgraph "Resolver"
        History[Last 5 seconds<br/>of commits]
        Check{Conflict?}
    end
    TX[Transaction<br/>Read Set + Commit Version] --> Check
    History --> Check
    Check -->|No conflict| Commit[✓ Proceed]
    Check -->|Conflict| Abort[✗ Abort & Retry]
```

For each incoming transaction, the resolver:
- Compares the transaction's read set against recent commits
- If any read key was modified after the transaction's read version, there's a conflict
- Maintains ~5 seconds of commit history in memory
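The rule itself is simple. Here is a conceptual sketch in Python (not FoundationDB's actual resolver, which is written in Flow; the keys, versions, and data structure are illustrative):

```python
# Conceptual sketch of the resolver's conflict check. `recent_commits` stands in
# for the ~5-second window of commit history the resolver keeps in memory.
def has_conflict(read_version, read_keys, recent_commits):
    """True if any key the transaction read was committed after its read version."""
    return any(recent_commits.get(key, 0) > read_version for key in read_keys)

# The transaction read b'x' at version 100, but another transaction committed
# b'x' at version 105 -> conflict, so the transaction is aborted and retried.
assert has_conflict(100, {b'x'}, {b'x': 105})
assert not has_conflict(110, {b'x'}, {b'x': 105})
```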
Transaction Logs¶
Transaction Logs (TLogs) provide durability:
- Receive mutations from commit proxies in version order
- Write to append-only log files with `fsync`
- Forward mutations to storage servers
- Only return success after data is durable
Transaction Logs Are the Durability Boundary
A transaction is considered committed only after the TLogs acknowledge the write. This is the synchronous replication that guarantees durability.
Storage Servers¶
Storage Servers are the workhorses of the cluster:
| Responsibility | Details |
|---|---|
| Key-value storage | Each server owns a range of keys (a "shard") |
| Serving reads | Clients read directly from storage servers |
| Applying mutations | Pull from transaction logs, apply to storage |
| Version history | Keep ~5 seconds of MVCC versions in memory |
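That ~5-second window is visible from the client: a transaction whose read version has aged out of it fails with error 1007 (transaction_too_old) and must retry with a fresh read version. A sketch with the Python bindings (the key and the sleep are illustrative):

```python
import time
import fdb

fdb.api_version(710)   # assumes a 7.1+ client library
db = fdb.open()        # assumes a default cluster file is available

tr = db.create_transaction()
tr.get_read_version().wait()     # pin a read version now
time.sleep(6)                    # outlive the ~5-second MVCC window
try:
    tr[b'some/key'].present()    # forcing the read fails: the version is gone
except fdb.FDBError as e:
    print(e.code)                # 1007: transaction_too_old
```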
Storage servers use one of several storage engines:
- Redwood (`ssd-redwood-1`): B-tree optimized for SSDs with lower write amplification (recommended for 7.3+)
- SQLite (`ssd-2`): Original SSD engine, still available
- Memory Engine: In-memory with append-only log (for testing)
Storage Engine Evolution
Redwood was introduced in 7.0 as `ssd-redwood-1-experimental`. In 7.3, it was renamed to `ssd-redwood-1` (production ready). See Storage Engines for migration guidance.
Data Distributor¶
The Data Distributor manages data placement:
- Monitors storage server health and capacity
- Splits/merges key ranges (shards) as data grows
- Rebalances data across storage servers
- Handles storage server failures by re-replicating data
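A client can observe the result of this shard management through the locality API in the Python bindings; a sketch (the key range covers the user keyspace):

```python
import fdb

fdb.api_version(710)   # assumes a 7.1+ client library
db = fdb.open()        # assumes a default cluster file is available

# Each boundary key marks the start of a shard (a contiguous key range assigned
# to a set of storage servers) created by the Data Distributor's splits/merges.
boundaries = list(fdb.locality.get_boundary_keys(db, b'', b'\xff'))
print("shards in the user keyspace:", len(boundaries))
```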
Ratekeeper¶
Ratekeeper prevents cluster overload:
- Monitors storage servers, proxies, and logs
- Calculates sustainable transaction rate
- Throttles GRV proxies when cluster is saturated
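Throttling is applied where clients request read versions, so background work can opt into a lower priority and be throttled first. A sketch with the Python bindings (the key is illustrative):

```python
import fdb

fdb.api_version(710)   # assumes a 7.1+ client library
db = fdb.open()        # assumes a default cluster file is available

@fdb.transactional
def background_read(tr, key):
    # Batch priority marks this transaction as background work: when the cluster
    # is saturated, Ratekeeper throttles its read-version requests before
    # default-priority traffic.
    tr.options.set_priority_batch()
    return tr[key]

background_read(db, b'background/key')
```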
Transaction Flow¶
Read Path¶
```mermaid
sequenceDiagram
    participant Client
    participant GRVProxy
    participant StorageServer
    Client->>GRVProxy: 1. Get read version
    GRVProxy-->>Client: Version: 123456
    Client->>StorageServer: 2. Read key X at version 123456
    StorageServer-->>Client: Value: "hello"
    Client->>StorageServer: 3. Read key Y at version 123456
    StorageServer-->>Client: Value: "world"
```

Reads are served directly by storage servers—no transaction system involvement. Clients:
- Get a read version (consistent snapshot point)
- Read directly from storage servers owning the relevant keys
- All reads see the database as of the read version
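A minimal read transaction with the Python bindings (key names and the API version are assumptions):

```python
import fdb

fdb.api_version(710)   # assumes a 7.1+ client library
db = fdb.open()        # assumes a default cluster file is available

@fdb.transactional
def read_pair(tr):
    # The first read acquires a read version from a GRV proxy; the second read
    # reuses it. Both are answered directly by the storage servers that own the
    # keys and see the same snapshot of the database.
    x = tr[b'key/x']
    y = tr[b'key/y']
    return x, y

x, y = read_pair(db)
```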
Write Path¶
```mermaid
sequenceDiagram
    participant Client
    participant CommitProxy
    participant Master
    participant Resolver
    participant TLog
    participant StorageServer
    Client->>CommitProxy: 1. Commit transaction<br/>(reads + writes)
    CommitProxy->>Master: 2. Get commit version
    Master-->>CommitProxy: Version: 123457
    CommitProxy->>Resolver: 3. Check conflicts
    Resolver-->>CommitProxy: No conflicts ✓
    CommitProxy->>TLog: 4. Write mutations
    TLog->>TLog: fsync to disk
    TLog-->>CommitProxy: Durable ✓
    CommitProxy-->>Client: 5. Committed!
    TLog-)StorageServer: 6. (async) Forward mutations
    StorageServer->>StorageServer: Apply to storage
```

The critical path (steps 1-5) determines commit latency. Step 6 happens asynchronously.
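And a minimal write transaction (key names and the API version are assumptions):

```python
import fdb

fdb.api_version(710)   # assumes a 7.1+ client library
db = fdb.open()        # assumes a default cluster file is available

@fdb.transactional
def set_pair(tr):
    # Writes are buffered in the client library until the function returns; the
    # client then sends the mutations and conflict ranges to a commit proxy
    # (steps 1-5 above). If the resolvers report a conflict, @fdb.transactional
    # retries the whole function with a fresh read version.
    tr[b'key/x'] = b'hello'
    tr[b'key/y'] = b'world'

set_pair(db)   # returns only after the TLogs have made the commit durable
```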
Fault Tolerance¶
Storage Server Failure¶
When a storage server fails:
- Data Distributor detects the failure
- Remaining replicas continue serving reads
- Data Distributor re-replicates data to healthy servers
- Transaction logs buffer mutations for the dead server (temporarily)
Transaction System Failure¶
When any transaction system component fails:
- Master detects the failure and terminates
- Cluster Controller elects a new Master
- New Master recruits fresh proxies, resolvers, and logs
- Recovery process ensures no committed transactions are lost
```mermaid
stateDiagram-v2
    [*] --> Running: Cluster Healthy
    Running --> Recovery: Component Failure
    Recovery --> Running: New Generation Ready
    state Recovery {
        [*] --> LockCoordinators
        LockCoordinators --> StopOldLogs
        StopOldLogs --> RecruitNew
        RecruitNew --> RecoverState
        RecoverState --> [*]
    }
```

Recovery typically completes in 100-500 milliseconds.
Redundancy Modes¶
| Mode | Data Copies | Fault Tolerance | Use Case |
|---|---|---|---|
| `single` | 1 | None | Development |
| `double` | 2 | 1 failure | Small clusters |
| `triple` | 3 | 2 failures | Production |
| `three_datacenter` | 6 | 1 datacenter | Multi-DC |
| `three_data_hall` | 3 | 1 data hall | Large facilities |
Scaling¶
Horizontal Scaling¶
FoundationDB scales by adding processes:
| Bottleneck | Solution |
|---|---|
| Storage capacity | Add storage servers |
| Read throughput | Add storage servers |
| Write throughput | Add commit proxies + resolvers |
| Durability throughput | Add transaction logs |
Performance Characteristics¶
| Metric | Typical Value | Scaling |
|---|---|---|
| Read latency | 1-2 ms | Constant |
| Write latency | 5-15 ms | Constant |
| Throughput | 100K+ ops/sec per proxy | Linear with proxies |
| Storage | Petabytes | Linear with storage servers |
Architecture Highlights¶
Deterministic Simulation¶
FoundationDB is built using a custom deterministic simulation framework that allows testing years of cluster-time in hours, including all failure modes.
Flow Programming Language¶
The database is written in Flow, a custom C++ dialect with actor-model concurrency, enabling deterministic simulation.
Separation of Concerns¶
The architecture cleanly separates:
- Control plane: Coordinators, Cluster Controller (slow-changing)
- Transaction plane: Master, Proxies, Resolvers, TLogs (per-generation)
- Storage plane: Storage Servers (long-lived, stateful)
Further Reading¶
- ACID Guarantees: How the architecture delivers ACID
- Transactions: Using transactions effectively
- FoundationDB Technical Overview: Official architecture docs
- SIGMOD 2021 Paper: Academic deep dive into architecture