Troubleshooting¶
This guide helps you diagnose and resolve common issues with FoundationDB. It covers cluster status interpretation, diagnostic commands, log analysis, and recovery procedures.
Quick Diagnostics¶
First Steps¶
When experiencing issues, start with these commands:
# Check if service is running
systemctl status foundationdb
# Verify cluster connectivity
fdbcli --exec "status"
# Check detailed status
fdbcli --exec "status details"
# Get JSON status for scripts
fdbcli --exec "status json"
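The same checks can be scripted. A minimal sketch, assuming `fdbcli` is on the PATH; the health-check helper itself is shown against a sample status document, and the `database_available` and `data.state` fields are part of the standard `status json` output:

```python
import json
import subprocess

def cluster_health(status):
    """Summarize health from a parsed `status json` document."""
    cluster = status.get("cluster", {})
    data_state = cluster.get("data", {}).get("state", {})
    return {
        "available": cluster.get("database_available", False),
        "healthy": data_state.get("healthy", False),
        "state": data_state.get("name", "unknown"),
    }

def fetch_status():
    """Run fdbcli and parse its JSON status (requires a reachable cluster)."""
    out = subprocess.check_output(["fdbcli", "--exec", "status json"])
    return json.loads(out)

# Example against a sample status document:
sample = {"cluster": {"database_available": True,
                      "data": {"state": {"healthy": True, "name": "healthy"}}}}
print(cluster_health(sample))
```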
Status Summary Interpretation¶
| Status Message | Meaning | Urgency |
|---|---|---|
| Healthy | Normal operation | None |
| Healthy (Rebalancing) | Moving data between servers | Low |
| Healing | Recovering from failure | Medium |
| Unavailable | Database not accepting requests | Critical |
| Recovery in progress | Cluster recovering | Medium |
Common Issues¶
Cluster Not Reachable¶
Symptoms:
- Clients time out on connection
- fdbcli hangs or reports "Unable to locate..."
- Application errors mentioning connection failures
Diagnostic Steps:
# 1. Check if fdbserver processes are running
ps aux | grep fdbserver
# 2. Verify fdbmonitor is running
systemctl status foundationdb
# 3. Check cluster file
cat /etc/foundationdb/fdb.cluster
# 4. Test coordinator connectivity
for addr in $(grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+' /etc/foundationdb/fdb.cluster); do
nc -zv ${addr%:*} ${addr#*:} 2>&1
done
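The coordinator check above can also be done in Python. A sketch that parses the standard cluster-file format (`description:ID@host:port,host:port,...`) and probes each coordinator with a TCP connect; it assumes plain (non-TLS) addresses:

```python
import socket

def parse_coordinators(cluster_file_contents):
    """Extract (host, port) pairs from cluster file contents."""
    # Format: description:ID@ip:port,ip:port,...
    _, _, coords = cluster_file_contents.strip().partition("@")
    result = []
    for addr in coords.split(","):
        host, _, port = addr.strip().rpartition(":")
        result.append((host, int(port)))
    return result

def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

coords = parse_coordinators("testdb:abc123@10.0.4.1:4500,10.0.4.2:4500")
print(coords)
```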
Solutions:
| Cause | Solution |
|---|---|
| Processes not running | systemctl restart foundationdb |
| Wrong cluster file | Verify file matches coordinators |
| Firewall blocking | Open port 4500 (or configured port) |
| Network partition | Check network connectivity between nodes |
| All coordinators down | Restore coordinator machines first |
Database Unavailable¶
Symptoms:
- status shows "Unavailable"
- Clients receive unavailable errors
- No transactions completing
Diagnostic Steps:
Look for:
- Missing processes
- Insufficient fault tolerance
- Recovery state issues
Common Causes and Solutions:
| Cause | Status Indicator | Solution |
|---|---|---|
| Too few coordinators | "Unable to communicate with quorum" | Restore coordinators or coordinators auto |
| Insufficient machines | "Zone failures exceed configured" | Add machines or reduce redundancy |
| All logs lost | "Recovery requires all logs" | Restore from backup (data loss likely) |
| Network partition | Some processes unreachable | Fix network connectivity |
High Conflict Rate¶
Symptoms:
- Many transaction retries
- Slow perceived performance
- status shows high conflict rate
Diagnostic Steps:
fdb> status
# Check "Conflict rate" in Workload section
# Get detailed metrics (pipe through jq from a shell)
fdbcli --exec "status json" | jq '.cluster.workload.transactions'
Solutions:
- Reduce transaction scope - Read/write fewer keys per transaction
- Avoid hot keys - Distribute load across key ranges
- Use read snapshots - For read-only portions of transactions
- Optimize retry logic - Implement exponential backoff
Example - Using Snapshot Reads:
import fdb
fdb.api_version(710)  # use the API version matching your bindings

@fdb.transactional
def read_with_snapshot(tr, key):
    # Snapshot reads do not add read-conflict ranges
    value = tr.snapshot[key]
    return value
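For the "optimize retry logic" point above, an application-level retry loop might look like the following sketch. The FDB bindings already retry inside `@fdb.transactional`, so this applies to retries your application manages itself; `do_work` and `RetryableError` are placeholders:

```python
import random
import time

class RetryableError(Exception):
    """Placeholder for a conflict or other retryable error."""

def with_backoff(fn, max_attempts=5, base_delay=0.01, max_delay=1.0,
                 sleep=time.sleep):
    """Retry fn with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds

# Example: fails twice, then succeeds.
attempts = {"n": 0}
def do_work():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RetryableError()
    return "committed"

result = with_backoff(do_work, sleep=lambda s: None)  # skip real sleeps in the demo
print(result)
```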
High Latency¶
Symptoms:
- Slow transactions
- Commit latency > 100ms
- Read latency > 50ms
Diagnostic Steps:
fdb> status
# Check the latency_probe section in the JSON status (from a shell)
fdbcli --exec "status json" | jq '.cluster.latency_probe'
Common Causes:
| Cause | Indicator | Solution |
|---|---|---|
| Overloaded storage | High disk usage, queue depth | Add storage servers |
| Network latency | High ping times between nodes | Check network infrastructure |
| Large transactions | Large read/write sets | Break into smaller transactions |
| GRV contention | High GRV queue | Add grv_proxies |
| Commit contention | High commit queue | Add commit_proxies, logs |
Tuning Commands:
# Scale the transaction subsystem (counts are illustrative; tune to your workload)
fdb> configure grv_proxies=4 commit_proxies=6
fdb> configure logs=8
Storage Space Issues¶
Symptoms:
- "Storage is filling up" warnings
- Write failures
- Cluster marked unhealthy
Diagnostic Steps:
fdb> status details
# Check "Storage server" free space
# Per-process disk usage (from a shell)
fdbcli --exec "status json" | jq '.cluster.processes | to_entries[] | {address: .value.address, disk_free: .value.disk.free_bytes}'
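The same per-process check in Python; a sketch that ranks processes by free disk space from a parsed `status json` document (`disk.free_bytes` is a standard per-process field):

```python
def disk_free_by_process(status):
    """Return [(address, free_bytes)] sorted ascending, fullest server first."""
    procs = status.get("cluster", {}).get("processes", {})
    rows = []
    for proc in procs.values():
        free = proc.get("disk", {}).get("free_bytes")
        if free is not None:
            rows.append((proc.get("address", "?"), free))
    return sorted(rows, key=lambda r: r[1])

sample = {"cluster": {"processes": {
    "a": {"address": "10.0.4.5:4500", "disk": {"free_bytes": 5_000_000}},
    "b": {"address": "10.0.4.6:4500", "disk": {"free_bytes": 900_000_000}},
}}}
print(disk_free_by_process(sample)[0])  # the server closest to full
```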
Solutions:
- Add storage capacity - Add new machines with storage class
- Delete data - Remove old data ranges if applicable
- Exclude full machines - Move data off problematic servers
# Exclude a full storage server
fdb> exclude 10.0.4.5:4500
# Wait for data to move, then remove machine
fdb> status details # Verify data moved
Process Failures¶
Symptoms:
- Individual process crashes
- Reduced fault tolerance
- "Zone failures" in status
Diagnostic Steps:
# Check process logs
tail -f /var/log/foundationdb/trace*.xml
# Find crash signatures
grep -i "fatal\|crash\|segfault" /var/log/foundationdb/trace*.xml
# Check recent process restarts
grep "ProcessStart" /var/log/foundationdb/trace*.xml | tail -20
Common Process Issues:
| Error Pattern | Cause | Solution |
|---|---|---|
| Out of memory | Memory limit exceeded | Increase memory in conf |
| Segfault | Bug or corruption | Update to latest version |
| Disk I/O error | Storage failure | Check disk health, replace |
| Too many open files | FD limit | Increase ulimit |
Coordinator Issues¶
Symptoms:
- "Unable to communicate with quorum"
- Slow cluster startup
- Connection instability
Diagnostic Steps:
# List current coordinators
fdb> coordinators
# Check coordinator connectivity from each node
for coord in $(fdbcli --exec "coordinators" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+'); do
echo "Testing $coord"
nc -zv ${coord%:*} ${coord#*:}
done
Solutions:
# Change coordinators to available machines
fdb> coordinators 10.0.4.1:4500 10.0.4.2:4500 10.0.4.3:4500
# Or let FDB choose automatically
fdb> coordinators auto
Warning
Coordinator changes require a majority of current coordinators to be reachable.
Diagnostic Commands Reference¶
fdbcli Status Commands¶
| Command | Description |
|---|---|
status | Human-readable summary |
status details | Per-process details |
status minimal | One-line health check |
status json | Machine-readable status |
fdbcli Investigation Commands¶
# Get a range of system keys (reading the system keyspace requires the option below)
fdb> option on READ_SYSTEM_KEYS
fdb> getrange \xff\x02/processClass/ \xff\x02/processClass0
# Toggle data distribution (off pauses rebalancing)
fdb> datadistribution on
# View the current configuration (from a shell)
fdbcli --exec "status json" | jq '.cluster.configuration'
# Kill and recover a process (simulate failure)
fdb> kill; kill 10.0.4.1:4500; status
# Check version
fdb> getversion
# Profile transactions
fdb> profile client set 1.0 10000
# Throttle tags
fdb> throttle on tag mytag
System Key Ranges¶
| Key Range | Contents |
|---|---|
\xff\x02/processClass/ | Process class assignments |
\xff\x02/conf/ | Cluster configuration |
\xff/serverList/ | Server list |
\xff\xff/status/json | Status JSON (special key) |
Log Analysis¶
Trace File Location¶
| Platform | Path |
|---|---|
| Linux | /var/log/foundationdb/ |
| macOS | /usr/local/var/log/foundationdb/ |
Trace File Format¶
Trace files are XML with entries containing:
- Time - event timestamp
- Type - event type
- Severity - 10 (info), 20 (warning), 30 (warning, always logged), 40 (error)
- Various event-specific fields
Searching Trace Files¶
# Find all errors
grep 'Severity="40"' /var/log/foundationdb/trace*.xml
# Find warnings (Severity 20 and 30)
grep 'Severity="[23]0"' /var/log/foundationdb/trace*.xml
# Find specific event types
grep 'Type="MachineRecovery"' /var/log/foundationdb/trace*.xml
# Recent crashes
grep -i "crash\|fatal\|segfault\|abort" /var/log/foundationdb/trace*.xml
# Connection issues
grep -E 'Type="(ConnectionTimedOut|ConnectionFailed)"' /var/log/foundationdb/trace*.xml
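Beyond grep, trace files can be parsed as XML. A sketch that extracts events at or above a severity threshold; it assumes each file's `<Event .../>` elements sit under a single root element, and is demonstrated on an inline sample:

```python
import xml.etree.ElementTree as ET

def events_at_severity(xml_text, min_severity=30):
    """Return (Type, Severity, Time) tuples for events at or above min_severity."""
    root = ET.fromstring(xml_text)
    out = []
    for ev in root.iter("Event"):
        sev = int(ev.get("Severity", "0"))
        if sev >= min_severity:
            out.append((ev.get("Type"), sev, ev.get("Time")))
    return out

sample = """<Trace>
<Event Severity="10" Time="1.0" Type="ProcessStart"/>
<Event Severity="40" Time="2.5" Type="StorageServerFailure"/>
</Trace>"""
print(events_at_severity(sample))
```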
Important Event Types¶
| Event Type | Meaning |
|---|---|
MachineRecovery | Machine rejoined cluster |
RecoveryComplete | Cluster recovery finished |
SlowTransaction | Transaction exceeded threshold |
ConnectionTimedOut | Network timeout |
StorageServerFailure | Storage server stopped |
CommitProxyTerminated | Proxy process ended |
Log Rotation¶
Control log file size in foundationdb.conf (logsize caps each trace file; maxlogssize caps the total retained before old files are deleted):
[fdbserver]
logsize = 10MiB
maxlogssize = 100MiB
Recovery Procedures¶
Recovering from Machine Failure¶
1. Check current state:
   fdb> status details
2. If fault tolerance > 0: wait for automatic recovery.
3. If fault tolerance = 0, either:
   - Add a replacement machine, or
   - Exclude the failed machine to redistribute its data:
# Exclude failed machine
fdb> exclude 10.0.4.5
# Monitor recovery
fdb> status details
# Wait for "Moving data" to reach 0
# Remove exclusion after machine replaced
fdb> include 10.0.4.5
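Waiting for "Moving data" to reach 0 can be scripted against `status json`; a sketch using the standard `cluster.data.moving_data` fields, shown against sample documents:

```python
def data_movement_done(status):
    """True when no data is queued or in flight for redistribution."""
    moving = status.get("cluster", {}).get("data", {}).get("moving_data", {})
    return (moving.get("in_flight_bytes", 0) == 0
            and moving.get("in_queue_bytes", 0) == 0)

busy = {"cluster": {"data": {"moving_data": {"in_flight_bytes": 123456,
                                             "in_queue_bytes": 0}}}}
done = {"cluster": {"data": {"moving_data": {"in_flight_bytes": 0,
                                             "in_queue_bytes": 0}}}}
print(data_movement_done(busy), data_movement_done(done))
```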
Recovering from Quorum Loss¶
If a majority of coordinators are lost:
1. With a surviving minority:
2. If all coordinators are lost: restore from backup.
Forcing Recovery (Last Resort)¶
Data Loss Warning
Force recovery can result in data loss. Only use when normal recovery fails.
Restoring from Backup¶
See Backup & Recovery for detailed restore procedures.
# Check available backups
fdbbackup describe -d file:///backup/fdb
# Restore to empty cluster
fdbrestore start -r file:///backup/fdb -C /etc/foundationdb/fdb.cluster
Performance Troubleshooting¶
Identifying Bottlenecks¶
# Get status JSON and analyze
fdb> status json
# Key metrics to check:
# - cluster.workload.operations - throughput
# - cluster.latency_probe - latencies
# - cluster.processes.*.disk.busy - disk utilization
# - cluster.processes.*.cpu.usage_cores - CPU usage
# - cluster.qos - throttling state
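A quick way to scan those per-process metrics: a sketch that reports the highest disk utilization and CPU usage across all processes in a parsed `status json` document:

```python
def hottest_resources(status):
    """Max disk-busy fraction and CPU cores used across all processes."""
    procs = list(status.get("cluster", {}).get("processes", {}).values())
    return {
        "max_disk_busy": max((p.get("disk", {}).get("busy", 0.0) for p in procs),
                             default=0.0),
        "max_cpu_cores": max((p.get("cpu", {}).get("usage_cores", 0.0) for p in procs),
                             default=0.0),
    }

sample = {"cluster": {"processes": {
    "a": {"disk": {"busy": 0.95}, "cpu": {"usage_cores": 0.4}},
    "b": {"disk": {"busy": 0.10}, "cpu": {"usage_cores": 1.8}},
}}}
print(hottest_resources(sample))  # disk-bound: busy near 1.0
```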
Storage Server Bottlenecks¶
Indicators:
- High disk queue depth
- disk.busy near 1.0
- High read/write latencies
Solutions:
# Recruit an additional storage server by assigning the storage class
fdb> setclass 10.0.4.6:4500 storage
# Or exclude overloaded servers
fdb> exclude 10.0.4.5:4500
Proxy Bottlenecks¶
Indicators:
- High commit latency
- GRV latency spikes
- Proxy queues backing up
Solutions:
# Increase proxy counts (counts are illustrative; tune to the workload)
fdb> configure grv_proxies=4 commit_proxies=6
Transaction Log Bottlenecks¶
Indicators:
- Slow commits despite low storage load
- Log server disks fully busy
Solutions:
# Add transaction logs (count is illustrative; tune to the workload)
fdb> configure logs=8
Cluster Maintenance¶
Graceful Machine Removal¶
# 1. Exclude the machine
fdb> exclude 10.0.4.5:4500 10.0.4.5:4501
# 2. Monitor data movement
fdb> status
# Wait for "Moving data" to reach 0
# 3. Stop services on excluded machine
sudo systemctl stop foundationdb
# 4. Optionally include again if returning
fdb> include 10.0.4.5:4500 10.0.4.5:4501
Rolling Restart¶
Restart processes one at a time without downtime:
# On each machine in sequence:
sudo systemctl restart foundationdb
# Wait for status to show healthy before proceeding
fdbcli --exec "status minimal"
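The wait-for-healthy step can be automated. A sketch that polls `status minimal` until it looks healthy; the parsing heuristic is shown against sample output, since the exact wording may vary by version, and `fdbcli` must be on the PATH for the polling loop:

```python
import subprocess
import time

def looks_healthy(minimal_output):
    """Heuristic check of `status minimal` output."""
    text = minimal_output.strip().lower()
    return "unavailable" not in text and "available" in text

def wait_until_healthy(poll_seconds=5, attempts=60):
    """Poll the cluster until `status minimal` looks healthy."""
    for _ in range(attempts):
        out = subprocess.check_output(["fdbcli", "--exec", "status minimal"],
                                      text=True)
        if looks_healthy(out):
            return True
        time.sleep(poll_seconds)
    return False

print(looks_healthy("The database is available."))
print(looks_healthy("The database is unavailable."))
```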
Getting Help¶
Self-Service Resources¶
Gathering Information for Support¶
When seeking help, collect:
1. Version information:
   fdbcli --version
2. Relevant logs: recent trace files from /var/log/foundationdb/
3. Configuration: /etc/foundationdb/foundationdb.conf and the cluster file
4. Error messages: exact error text and timestamps
Next Steps¶
- Review Monitoring for proactive issue detection
- Configure Backup & Recovery for disaster preparedness
- See Configuration for performance tuning