Data Distribution & Shard Management¶
This guide explores how FoundationDB distributes data across storage servers—including keyspace partitioning, team formation, shard splitting and merging, and automatic rebalancing.
Advanced Content
This section covers internal implementation details. Familiarity with Architecture and Storage Engines is recommended.
Overview¶
The Data Distributor is a singleton process that manages how data is placed across storage servers. It ensures:
- Balanced load - Data and workload spread evenly across servers
- Fault tolerance - Data replicated according to redundancy mode
- Automatic healing - Re-replication when servers fail
- Efficient scaling - Data migrates when servers are added/removed
graph TB
subgraph "Data Distributor Responsibilities"
DD[Data Distributor]
DD --> ShardMgmt[Shard Management<br/>Split & Merge]
DD --> TeamBuild[Team Building<br/>Server Groups]
DD --> Movement[Data Movement<br/>Rebalancing]
DD --> Health[Health Monitoring<br/>Failure Detection]
end
subgraph "Storage Servers"
SS1[SS1: a-f]
SS2[SS2: f-m]
SS3[SS3: m-z]
end
DD --> SS1
DD --> SS2
DD --> SS3 Keyspace Partitioning¶
FoundationDB divides the entire key space into shards (also called key ranges). Each shard is a contiguous range of keys assigned to a team of storage servers.
Shard Structure¶
graph LR
subgraph "Key Space"
direction LR
S1["Shard 1<br/>['' - 'customer:1000')"]
S2["Shard 2<br/>['customer:1000' - 'order:')"]
S3["Shard 3<br/>['order:' - 'product:5000')"]
S4["Shard 4<br/>['product:5000' - '')"]
end
S1 --> T1[Team A<br/>SS1, SS2, SS3]
S2 --> T2[Team B<br/>SS2, SS4, SS5]
S3 --> T1
S4 --> T3[Team C<br/>SS1, SS3, SS6] Key properties of shards:
| Property | Details |
|---|---|
| Size target | ~500 MB per shard (configurable) |
| Replication | Each shard replicated to a team of servers |
| Boundaries | Split at key boundaries, not byte boundaries |
| Dynamic | Shards split/merge as data changes |
How Shards Are Tracked¶
The Data Distributor maintains shard metadata in the system keyspace (\xff/):
- Shard map - Key range → team assignment
- Server info - Storage server locations and health
- Move queues - Pending shard movements
Team Building¶
A team is a group of storage servers that together store replicas of a shard. Teams are formed to maximize fault tolerance.
Team Formation Strategy¶
flowchart TB
subgraph "Failure Domains"
DC1[Data Center 1]
DC2[Data Center 2]
subgraph DC1
R1[Rack 1]
R2[Rack 2]
end
subgraph DC2
R3[Rack 3]
R4[Rack 4]
end
end
R1 --> SS1[SS1]
R1 --> SS2[SS2]
R2 --> SS3[SS3]
R2 --> SS4[SS4]
R3 --> SS5[SS5]
R4 --> SS6[SS6]
Team["Optimal Team: SS1, SS3, SS5<br/>(Different racks, different DCs)"] Team building considers:
- Locality - Servers across failure domains (racks, data centers)
- Capacity - Balance data across servers by free space
- Machine diversity - Avoid placing replicas on the same physical machine
- Zone awareness - Respect configured failure zones
Team Health States¶
| State | Meaning |
|---|---|
| Healthy | All servers responsive, data fully replicated |
| Unhealthy | One or more servers unresponsive |
| Missing | Replicas below redundancy requirement |
Shard Splitting & Merging¶
Shards dynamically split and merge to maintain optimal sizes.
Splitting¶
When a shard exceeds the size threshold:
- Data Distributor identifies a split point (typically the median key)
- Creates two new shards from the original range
- Assigns each new shard to teams (may reuse existing team)
graph LR
subgraph "Before Split"
Big["Shard A<br/>['' - 'z')<br/>1.2 GB"]
end
subgraph "After Split"
Small1["Shard A1<br/>['' - 'm')<br/>600 MB"]
Small2["Shard A2<br/>['m' - 'z')<br/>600 MB"]
end
Big --> Small1
Big --> Small2 Merging¶
When adjacent shards are small enough to combine:
- Data Distributor identifies mergeable neighbors
- Combines ranges into a single shard
- Updates team assignments
Merging helps reduce metadata overhead and improve scan performance.
Data Movement & Rebalancing¶
When the Data Distributor needs to move data:
Reasons for Movement¶
| Reason | Trigger |
|---|---|
| Server failure | Re-replicate to maintain redundancy |
| Server added | Distribute data to new capacity |
| Server removed | Migrate data off before decommission |
| Load balancing | Move hot shards to reduce skew |
| Storage rebalancing | Even out disk usage |
Movement Process¶
sequenceDiagram
participant DD as Data Distributor
participant Source as Source SS
participant Dest as Destination SS
participant TLog as Transaction Log
Note over DD: Shard S needs to move
DD->>Dest: 1. Assign shard S
Dest->>Source: 2. Fetch snapshot at version V
Source-->>Dest: Key-value pairs
loop Catch up
TLog->>Dest: 3. Stream mutations > V
Dest->>Dest: Apply mutations
end
DD->>DD: 4. Update shard map
DD->>Source: 5. Remove shard assignment Throttling & Prioritization¶
Data movement is throttled to avoid impacting production:
- Bandwidth limits - Configurable bytes/second for replication
- Priority queue - Re-replication prioritized over rebalancing
- Concurrent moves - Limited parallel shard movements
Monitoring Data Distribution¶
Key metrics to monitor:
| Metric | Description | Warning Threshold |
|---|---|---|
MovingData | Bytes currently being moved | Sustained high values |
MachineNotResponding | Servers not responding | Any > 0 |
TeamCount | Number of healthy teams | Below expected |
HighestPriority | Priority of pending work | > 0 indicates issues |
Use fdbcli to inspect shard state:
# View data distribution status
fdb> status details
# Check specific key range location
fdb> getrangekeys \\x00 \\xff 100
Source Code References¶
Key implementation files:
- DataDistribution.actor.cpp
- Main data distribution logic
- DDTeamCollection.actor.cpp
- Team building and management
- DataDistributorInterface.h
- Interface definitions
Further Reading¶
- Architecture Overview - Component roles
- Storage Engines - How data is stored on disk
- Architecture Deep Dive - Transaction processing
- SIGMOD Paper, Section 4 - Replication architecture