Simulation Testing

FoundationDB's deterministic simulation testing is its most innovative feature—a technique that has found thousands of bugs that would be nearly impossible to detect with traditional testing.

Will Wilson, former FoundationDB engineer

"We've run the equivalent of trillions of hours of simulation testing. Every bug we've ever found in production, we could reproduce in simulation."

The Problem with Testing Distributed Systems

Traditional testing approaches fail for distributed systems:

Approach	Problem
Unit tests	Don't catch distributed bugs
Integration tests	Can't cover all failure modes
Chaos testing	Non-deterministic, hard to reproduce
Formal verification	Doesn't test implementation

Distributed bugs are:

Rare - May only appear under specific timing
Non-deterministic - Hard to reproduce
Catastrophic - Can cause data loss

Deterministic Simulation

FoundationDB's solution: simulate the entire distributed system in a single process with fake time.

graph LR
    subgraph Real["Real Deployment"]
        N1[Node 1] <--> N2[Node 2]
        N2 <--> N3[Node 3]
        N3 <--> N1
    end

    subgraph "Simulation"
        S[Single Process<br/>Simulated Network<br/>Fake Time<br/>All Nodes]
    end

    Real -->|Same Code| S

Key Properties

Same code runs in production and simulation - No separate test implementation
Fake time - Simulate years of operation in minutes
Controlled randomness - Every random decision uses seeded PRNG
Full network simulation - All messages are simulated, with delays and failures
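
The properties above can be illustrated with a toy event loop. This is a minimal sketch in plain C++, not FoundationDB's simulator: `Simulator`, `after`, `runUntil`, and `countHeartbeats` are hypothetical names chosen for illustration. Time is a counter that jumps straight to the next scheduled event, and every random choice comes from one seeded generator.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <random>
#include <utility>
#include <vector>

// Hypothetical single-threaded simulator: simulated time jumps to the next
// scheduled event, so no real waiting ever happens.
class Simulator {
public:
    explicit Simulator(uint64_t seed) : rng_(seed) {}

    double now() const { return now_; }

    // Schedule `fn` to run `seconds` of simulated time from now.
    void after(double seconds, std::function<void()> fn) {
        assert(seconds >= 0.0);
        events_.push(Event{now_ + seconds, nextId_++, std::move(fn)});
    }

    // Uniform value in [0, 1) drawn from the seeded generator.
    double random01() { return (rng_() >> 11) * (1.0 / 9007199254740992.0); }

    // Run events in timestamp order until none remain at or before `limit`.
    void runUntil(double limit) {
        while (!events_.empty() && events_.top().time <= limit) {
            Event e = events_.top();
            events_.pop();
            now_ = e.time;
            e.fn();
        }
    }

private:
    struct Event {
        double time;
        uint64_t id;  // tie-breaker keeps the ordering deterministic
        std::function<void()> fn;
        bool operator>(const Event& o) const {
            return time != o.time ? time > o.time : id > o.id;
        }
    };
    std::mt19937_64 rng_;
    double now_ = 0.0;
    uint64_t nextId_ = 0;
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> events_;
};

// Run a heartbeat with random jitter for `duration` simulated seconds and
// return how many beats fired. The same seed always gives the same count.
uint64_t countHeartbeats(uint64_t seed, double duration) {
    Simulator sim(seed);
    uint64_t beats = 0;
    std::function<void()> beat = [&] {
        ++beats;
        sim.after(0.5 + sim.random01(), beat);  // next beat in 0.5 to 1.5 s
    };
    sim.after(0.0, beat);
    sim.runUntil(duration);
    return beats;
}
```

Calling `countHeartbeats(7, 86400.0)` steps through a simulated day of heartbeats without sleeping, and calling it again with the same seed returns the same count.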

How It Works

The Flow Language

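FDB is written in Flow, a C++ extension that adds actor-style asynchronous programming and is translated to plain C++ before compilation. Because all waiting, networking, and disk I/O go through Flow's abstractions, the simulator can substitute deterministic implementations: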

C++

// Every async operation goes through the simulator
ACTOR Future<Void> simulatableOperation(Database db, Key key) {
    // This wait is intercepted by the simulator
    wait(delay(1.0));  // In simulation: controlled fake time

    state Transaction tr(db);
    // Network calls go through simulated network
    Optional<Value> result = wait(tr.get(key));

    return Void();
}

Failure Injection

The simulator can inject failures at almost any point where the code touches the network, disk, or clock:

graph TB
    subgraph "Failure Injection Points"
        A[Network Partitions]
        B[Machine Deaths]
        C[Disk Failures]
        D[Power Loss]
        E[Clock Skew]
        F[Message Reordering]
        G[Memory Corruption]
    end

    Sim[Simulator] --> A
    Sim --> B
    Sim --> C
    Sim --> D
    Sim --> E
    Sim --> F
    Sim --> G

The simulator randomly injects combinations of:

Network failures - Partitions, delays, reordering, duplication
Process failures - Crashes, restarts, hangs
Disk failures - I/O errors, corruption, full disks
Clock issues - Skew, jumps, NTP failures
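
The network failures in this list can be sketched with a small fault-injecting message queue. This is an illustrative model only, not FoundationDB's simulated network: `FaultConfig`, `SimulatedNetwork`, `arrivalOrder`, and the probabilities are assumptions made for the example. Each message may be dropped, duplicated, or delayed, and the random delays are what produce reordering.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Hypothetical fault model; the probabilities are illustrative only.
struct FaultConfig {
    double dropProbability = 0.1;
    double duplicateProbability = 0.05;
    double maxDelaySeconds = 2.0;
};

struct Delivery {
    double arrivalTime;
    uint64_t sequence;  // position in the original send order
    std::string payload;
};

class SimulatedNetwork {
public:
    SimulatedNetwork(uint64_t seed, FaultConfig cfg) : rng_(seed), cfg_(cfg) {}

    // "Send" a message at simulated time `now`; it may be lost, duplicated,
    // or delayed by a random amount.
    void send(double now, const std::string& payload) {
        uint64_t seq = nextSeq_++;
        if (random01() < cfg_.dropProbability) return;  // message lost
        int copies = random01() < cfg_.duplicateProbability ? 2 : 1;
        for (int i = 0; i < copies; ++i) {
            inFlight_.push_back({now + random01() * cfg_.maxDelaySeconds, seq, payload});
        }
    }

    // Hand back every in-flight message in arrival order. Because delays are
    // random, arrival order can differ from send order.
    std::vector<Delivery> deliverAll() {
        std::stable_sort(inFlight_.begin(), inFlight_.end(),
                         [](const Delivery& a, const Delivery& b) {
                             return a.arrivalTime < b.arrivalTime;
                         });
        std::vector<Delivery> out;
        out.swap(inFlight_);
        return out;
    }

private:
    // Uniform value in [0, 1) from the seeded generator.
    double random01() { return (rng_() >> 11) * (1.0 / 9007199254740992.0); }

    std::mt19937_64 rng_;
    FaultConfig cfg_;
    uint64_t nextSeq_ = 0;
    std::vector<Delivery> inFlight_;
};

// Send `count` messages 0.1 simulated seconds apart and return their
// sequence numbers in the order they arrive.
std::vector<uint64_t> arrivalOrder(uint64_t seed, int count) {
    assert(count >= 0);
    SimulatedNetwork net(seed, FaultConfig{});
    for (int i = 0; i < count; ++i) net.send(i * 0.1, "msg");
    std::vector<uint64_t> order;
    for (const Delivery& d : net.deliverAll()) order.push_back(d.sequence);
    return order;
}
```

Since the only source of randomness is the seeded generator, `arrivalOrder(seed, n)` returns the same drops, duplicates, and reorderings every time it is called with the same arguments.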

Buggification

"Buggify" macros introduce rare code paths that simulate real-world edge cases:

C++

if (BUGGIFY) {
    // This path is taken randomly in simulation
    // but never in production
    wait(delay(deterministicRandom()->random01() * 10));
}
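
One way such a macro could behave is sketched below in plain C++. This is a hypothetical stand-in, not FoundationDB's implementation: `Buggifier`, `SKETCH_BUGGIFY`, `countInjectedDelays`, and both 25% probabilities are assumptions for illustration. The sketch switches each call site on or off once per run and lets an active site fire on a fraction of evaluations, while always returning false outside simulation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <random>
#include <string>

// Hypothetical stand-in for a BUGGIFY-style hook.
class Buggifier {
public:
    Buggifier(bool inSimulation, uint64_t seed) : enabled_(inSimulation), rng_(seed) {}

    // Returns true only in simulation. A call site is activated or not the
    // first time it is evaluated; an active site then fires at random.
    bool fire(const std::string& site) {
        if (!enabled_) return false;  // production: branch is never taken
        auto it = siteActive_.find(site);
        if (it == siteActive_.end()) {
            it = siteActive_.emplace(site, random01() < kSiteActivation).first;
        }
        return it->second && random01() < kFireProbability;
    }

private:
    static constexpr double kSiteActivation = 0.25;   // assumed value
    static constexpr double kFireProbability = 0.25;  // assumed value

    // Uniform value in [0, 1) from the seeded generator.
    double random01() { return (rng_() >> 11) * (1.0 / 9007199254740992.0); }

    bool enabled_;
    std::mt19937_64 rng_;
    std::map<std::string, bool> siteActive_;
};

// Identify each call site by source file and line number.
#define SKETCH_BUGGIFY(b) \
    ((b).fire(std::string(__FILE__) + ":" + std::to_string(__LINE__)))

// Count how often one buggified branch is taken over `iterations` passes.
int countInjectedDelays(bool inSimulation, uint64_t seed, int iterations) {
    assert(iterations >= 0);
    Buggifier b(inSimulation, seed);
    int taken = 0;
    for (int i = 0; i < iterations; ++i) {
        if (SKETCH_BUGGIFY(b)) ++taken;  // stands in for an injected delay
    }
    return taken;
}
```

With `inSimulation` set to false the count is always zero, which mirrors the guarantee that the buggified path never runs in production. In simulation, the count depends only on the seed.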

Testing Methodology

Continuous Simulation

FDB runs thousands of simulation tests continuously:

Seed-based tests - Each test run has a unique seed
Failure injection - Random failures throughout execution
Invariant checking - Verify correctness properties constantly
Regression testing - Save failing seeds, replay to reproduce
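
The loop described above (sweep seeds, inject failures, check an invariant, keep the failing seeds for replay) can be sketched on a toy system. Everything here is hypothetical: `runOnce`, `findFailingSeeds`, the account model, and the planted bug are assumptions for illustration and have nothing to do with FoundationDB's real workloads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Toy system under test: random transfers between four accounts. When `buggy`
// is true, roughly 1 in 1000 transfers "crashes" between the debit and the
// credit, which loses money. Returns true if the invariant held throughout.
bool runOnce(uint64_t seed, bool buggy) {
    std::mt19937_64 rng(seed);  // the only source of randomness
    std::vector<int64_t> balance(4, 100);
    const int64_t expectedTotal =
        std::accumulate(balance.begin(), balance.end(), int64_t{0});

    for (int step = 0; step < 200; ++step) {
        std::size_t from = rng() % balance.size();
        std::size_t to = rng() % balance.size();
        int64_t amount = static_cast<int64_t>(rng() % 10);

        balance[from] -= amount;
        bool crashed = buggy && (rng() % 1000 == 0);  // injected failure
        if (!crashed) balance[to] += amount;

        // Invariant check after every step: money is never created or lost.
        int64_t total = std::accumulate(balance.begin(), balance.end(), int64_t{0});
        if (total != expectedTotal) return false;
    }
    return true;
}

// Sweep `count` consecutive seeds and collect those that violated the
// invariant, so they can be saved and replayed later.
std::vector<uint64_t> findFailingSeeds(uint64_t firstSeed, uint64_t count, bool buggy) {
    std::vector<uint64_t> failing;
    for (uint64_t seed = firstSeed; seed < firstSeed + count; ++seed) {
        if (!runOnce(seed, buggy)) failing.push_back(seed);
    }
    return failing;
}
```

Any seed returned by `findFailingSeeds` fails again when passed back to `runOnce` with the same arguments. That property is what makes a saved seed usable as a regression test.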

What Gets Tested

Depending on the workload, simulation runs check properties such as:

ACID properties - Transactions are atomic, consistent, isolated, durable
Linearizability - Operations appear to take effect in a single order consistent with real time
Recovery correctness - System recovers from the injected failures
Liveness - System makes progress when possible

Bug Detection

Simulation has detected bugs like:

Race conditions in transaction ordering
Recovery failures after specific crash sequences
Data loss scenarios with particular network partitions
Deadlocks under high contention

Running Simulation Tests

To run simulation tests locally:

Bash

# Configure and build fdbserver (the simulator is part of the same binary)
cmake -DFDB_RELEASE=ON ..
make -j

# Run a simulation workload
bin/fdbserver -r simulation -f tests/fast/AtomicRestore.toml

Test Configuration

Simulation tests are defined in TOML:

TOML

[[test]]
testTitle = "AtomicRestore"
simBackupAgents = "BackupToFile"

[[test.workload]]
testName = "AtomicRestore"

The Philosophy

Test the Real Thing

"Test the code you ship, ship the code you test."

FDB doesn't have a separate test implementation. The same binary runs in simulation and production.

Embrace Failure

Instead of preventing failures, FDB:

Assumes everything can fail
Tests randomly generated failure combinations
Ensures recovery always works
Ships with confidence

Reproducibility

Every simulation run is reproducible: