Simulation Testing¶
FoundationDB's deterministic simulation testing is its most innovative featureāa technique that has found thousands of bugs that would be nearly impossible to detect with traditional testing.
Will Wilson, FoundationDB Founder
"We've run the equivalent of trillions of hours of simulation testing. Every bug we've ever found in production, we could reproduce in simulation."
The Problem with Testing Distributed Systems¶
Traditional testing approaches fail for distributed systems:
| Approach | Problem |
|---|---|
| Unit tests | Don't catch distributed bugs |
| Integration tests | Can't cover all failure modes |
| Chaos testing | Non-deterministic, hard to reproduce |
| Formal verification | Doesn't test implementation |
Distributed bugs are:
- Rare - May only appear under specific timing
- Non-deterministic - Hard to reproduce
- Catastrophic - Can cause data loss
Deterministic Simulation¶
FoundationDB's solution: simulate the entire distributed system in a single process with fake time.
graph LR
subgraph "Real Deployment"
N1[Node 1] <--> N2[Node 2]
N2 <--> N3[Node 3]
N3 <--> N1
end
subgraph "Simulation"
S[Single Process<br/>Simulated Network<br/>Fake Time<br/>All Nodes]
end
Real -->|Same Code| S Key Properties¶
- Same code runs in production and simulation - No separate test implementation
- Fake time - Simulate years of operation in minutes
- Controlled randomness - Every random decision uses seeded PRNG
- Full network simulation - All messages are simulated, with delays and failures
How It Works¶
The Flow Language¶
FDB uses Flow, a custom language compiled to C++, that makes all operations deterministic:
// Every async operation goes through the simulator
ACTOR Future<Void> simulatableOperation(Database db) {
// This wait is intercepted by the simulator
wait(delay(1.0)); // In simulation: controlled fake time
state Transaction tr(db);
// Network calls go through simulated network
Optional<Value> result = wait(tr.get(key));
return Void();
}
Failure Injection¶
The simulator can inject any failure at any point:
graph TB
subgraph "Failure Injection Points"
A[Network Partitions]
B[Machine Deaths]
C[Disk Failures]
D[Power Loss]
E[Clock Skew]
F[Message Reordering]
G[Memory Corruption]
end
Sim[Simulator] --> A
Sim --> B
Sim --> C
Sim --> D
Sim --> E
Sim --> F
Sim --> G The simulator randomly injects combinations of:
- Network failures - Partitions, delays, reordering, duplication
- Process failures - Crashes, restarts, hangs
- Disk failures - I/O errors, corruption, full disks
- Clock issues - Skew, jumps, NTP failures
Buggification¶
"Buggify" macros introduce rare code paths that simulate real-world edge cases:
if (BUGGIFY) {
// This path is taken randomly in simulation
// but never in production
wait(delay(g_random->random01() * 10));
}
Testing Methodology¶
Continuous Simulation¶
FDB runs thousands of simulation tests continuously:
- Seed-based tests - Each test run has a unique seed
- Failure injection - Random failures throughout execution
- Invariant checking - Verify correctness properties constantly
- Regression testing - Save failing seeds, replay to reproduce
What Gets Tested¶
Every simulation run verifies:
- ACID properties - Transactions are atomic, consistent, isolated, durable
- Linearizability - Operations appear to execute in order
- Recovery correctness - System recovers from any failure
- Liveness - System makes progress when possible
Bug Detection¶
Simulation has detected bugs like:
- Race conditions in transaction ordering
- Recovery failures after specific crash sequences
- Data loss scenarios with particular network partitions
- Deadlocks under high contention
Running Simulation Tests¶
To run simulation tests locally:
# Build with simulation enabled
cmake -DFDB_RELEASE=ON ..
make -j
# Run a simulation workload
bin/fdbserver -r simulation -f tests/fast/AtomicRestore.toml
Test Configuration¶
Simulation tests are defined in TOML:
[[test]]
testTitle = "AtomicRestore"
[[test.workload]]
testName = "AtomicRestore"
simBackupAgents = "BackupToFile"
The Philosophy¶
Test the Real Thing¶
"Test the code you ship, ship the code you test."
FDB doesn't have a separate test implementation. The same binary runs in simulation and production.
Embrace Failure¶
Instead of preventing failures, FDB:
- Assumes everything can fail
- Tests every failure combination
- Ensures recovery always works
- Ships with confidence
Reproducibility¶
Every simulation run is reproducible:
# Reproduce a failing test with exact same seed
bin/fdbserver -r simulation -s 12345 -f test.toml
Further Reading¶
- Testing Distributed Systems w/ Deterministic Simulation - Will Wilson's Strange Loop talk
- SIGMOD Paper, Section 6 - Detailed testing methodology
- FDB Simulation Code - Test workloads source