Gray Failure Detection¶
A gray failure is a process or network link that is still alive — it answers pings, opens connections, and reports up — but is responding slowly enough or dropping enough traffic that it drags down the rest of the cluster. Simple liveness checks miss this case: the process passes them, so the cluster keeps routing work through it.
FoundationDB ships an in-cluster gray failure detector that has each worker measure its peers and report degraded or disconnected links to the cluster controller, which can then exclude the offending process or trigger a recovery / region failover. The detector is off by default in 7.3 and 7.4, but enabling it in suggest-only mode is recommended for production clusters — see the recommendation block below.
Status in 7.3¶
- Core detection (worker peer monitor + cluster-controller aggregator) is present and knob-gated.
- The `gray_failure` status-JSON section and SS-complaint knobs are not in 7.3 — they ship in 7.4.
Recommendation
For most production clusters, enable the detector in suggest-only mode as a baseline configuration: turn the monitors on so the cluster publishes degradation signal in trace logs, but leave the action triggers off so it can't kill the master or flip regions until you trust the signal.
```ini
[fdbserver]
knob_enable_worker_health_monitor = true
knob_cc_enable_worker_health_monitor = true
knob_cc_health_trigger_recovery = false
knob_cc_health_trigger_failover = false
```
Supporting evidence: as of the May 2026 FoundationDB working group, at least one operator has run the detector in production for roughly one week with no observed stability impact and no false-positive recoveries. Once the suggest-only signal is clean, follow the Recommended rollout ramp to promote CC_HEALTH_TRIGGER_RECOVERY and (later) CC_HEALTH_TRIGGER_FAILOVER.
How it works¶
- Every worker in the transaction system runs a peer health monitor (gated by `ENABLE_WORKER_HEALTH_MONITOR`). Every `WORKER_HEALTH_MONITOR_INTERVAL` (default 60s) it inspects ping latency and connection-failure counts to its peers.
- A peer is flagged degraded when its latency at `PEER_LATENCY_DEGRADATION_PERCENTILE` (default 50th percentile) exceeds `PEER_LATENCY_DEGRADATION_THRESHOLD` (default 50 ms) — or when its ping-timeout fraction exceeds `PEER_TIMEOUT_PERCENTAGE_DEGRADATION_THRESHOLD` (default 10%). It is flagged disconnected when `PEER_DEGRADATION_CONNECTION_FAILURE_COUNT` (default 5) connection failures accumulate. Cross-DC links to the primary satellite use the separate `*_SATELLITE` thresholds.
- Workers send `UpdateWorkerHealthRequest` messages to the cluster controller listing their currently degraded and disconnected peers (and any peers that have recovered).
- The cluster controller (gated by `CC_ENABLE_WORKER_HEALTH_MONITOR`) aggregates complaints. A peer must remain reported across at least `CC_MIN_DEGRADATION_INTERVAL` (default 120s) before it counts; reports older than `CC_DEGRADED_LINK_EXPIRATION_INTERVAL` (default 300s) are dropped. A server must be named by enough complainers (`CC_DEGRADED_PEER_DEGREE_TO_EXCLUDE_MIN` … `CC_DEGRADED_PEER_DEGREE_TO_EXCLUDE`) and the total degraded set must stay under `CC_MAX_EXCLUSION_DUE_TO_HEALTH` (default 2) before it acts.
- If `CC_HEALTH_TRIGGER_RECOVERY` is enabled the controller kills the master to drive recovery and excludes the degraded servers from the new transaction system; otherwise it only logs a warning. If `CC_HEALTH_TRIGGER_FAILOVER` is enabled and degradation is widespread, it can flip primary and remote DC priorities.
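The worker-side decision described above can be sketched roughly as follows. This is an illustrative Python model, not the actual implementation (which lives in `fdbserver/worker.actor.cpp` and uses FDB's sampled latency histograms); the `PeerSample` and `classify_peer` names are invented for this sketch, while the constants are the default knob values quoted in this document.

```python
# Sketch of the per-peer degradation check, using the default knob values
# quoted above. All names here are illustrative, not FDB's real types.
from dataclasses import dataclass

PEER_LATENCY_DEGRADATION_PERCENTILE = 0.50          # check the median
PEER_LATENCY_DEGRADATION_THRESHOLD = 0.050          # 50 ms, in seconds
PEER_TIMEOUT_PERCENTAGE_DEGRADATION_THRESHOLD = 0.10  # 10% of pings
PEER_DEGRADATION_CONNECTION_FAILURE_COUNT = 5

@dataclass
class PeerSample:
    ping_latencies: list   # seconds, one entry per successful ping
    ping_timeouts: int     # pings that timed out in the interval
    connection_failures: int  # failed connect attempts

def classify_peer(sample: PeerSample) -> str:
    """Return 'disconnected', 'degraded', or 'healthy' for one peer."""
    # Repeated connection failures mark the peer disconnected outright.
    if sample.connection_failures >= PEER_DEGRADATION_CONNECTION_FAILURE_COUNT:
        return "disconnected"
    total_pings = len(sample.ping_latencies) + sample.ping_timeouts
    if total_pings == 0:
        return "healthy"
    # Latency at the configured percentile (median by default).
    ordered = sorted(sample.ping_latencies)
    if ordered:
        idx = min(int(PEER_LATENCY_DEGRADATION_PERCENTILE * len(ordered)),
                  len(ordered) - 1)
        if ordered[idx] > PEER_LATENCY_DEGRADATION_THRESHOLD:
            return "degraded"
    # Fraction of pings that timed out.
    if sample.ping_timeouts / total_pings > PEER_TIMEOUT_PERCENTAGE_DEGRADATION_THRESHOLD:
        return "degraded"
    return "healthy"
```

Only peers classified degraded or disconnected end up in the worker's next `UpdateWorkerHealthRequest`; a healthy classification for a previously flagged peer is what produces a recovered report.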
How to enable it¶
Both the per-worker monitor and the cluster-controller aggregator must be turned on. Set these in foundationdb.conf under [fdbserver] or via fdbcli's setknob (transaction-class processes):
| Knob | Default | Purpose |
|---|---|---|
| ENABLE_WORKER_HEALTH_MONITOR | false | Master switch on each worker (collects per-peer measurements). |
| CC_ENABLE_WORKER_HEALTH_MONITOR | false | Master switch on the cluster controller (aggregates complaints). |
| CC_HEALTH_TRIGGER_RECOVERY | false | If true, exclude degraded servers via a recovery. If false, only log warnings — recommended for initial enablement. |
| CC_HEALTH_TRIGGER_FAILOVER | false | If true, allow region failover when degradation is widespread (CC_FAILOVER_DUE_TO_HEALTH_MIN_DEGRADATION … CC_FAILOVER_DUE_TO_HEALTH_MAX_DEGRADATION). |
| CC_ENABLE_REMOTE_LOG_ROUTER_MONITORING | true | Detect degraded log-router connectivity (already on by default once the master switches are on). |
| CC_ENABLE_ENTIRE_SATELLITE_MONITORING | false | Try to detect a fully degraded satellite DC. |
| CC_INVALIDATE_EXCLUDED_PROCESSES | false | Drop complaints from processes already excluded by a gray-failure-triggered recovery. |
| GRAY_FAILURE_ENABLE_TLOG_RECOVERY_MONITORING | true | Run the health monitor during TLog recovery as well. |
Tuning thresholds (PEER_LATENCY_DEGRADATION_*, PEER_TIMEOUT_PERCENTAGE_DEGRADATION_THRESHOLD, PEER_DEGRADATION_CONNECTION_FAILURE_COUNT, CC_DEGRADED_PEER_DEGREE_TO_EXCLUDE*, CC_MAX_EXCLUSION_DUE_TO_HEALTH, CC_MAX_HEALTH_RECOVERY_COUNT, CC_TRACKING_HEALTH_RECOVERY_INTERVAL) all exist as separate knobs; defaults are documented in fdbclient/ServerKnobs.cpp.
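As a hypothetical tuning example — the specific values below are illustrative, not recommendations — threshold knobs follow the same lowercase `knob_` prefix convention as the enablement knobs:

```ini
[fdbserver]
knob_enable_worker_health_monitor = true
knob_cc_enable_worker_health_monitor = true
# Illustrative: tighten the median-latency threshold from 50 ms to 30 ms.
knob_peer_latency_degradation_threshold = 0.030
# Illustrative: require more independent complainers before a server counts.
knob_cc_degraded_peer_degree_to_exclude = 4
```

Verify the values actually took effect against the defaults in `fdbclient/ServerKnobs.cpp` for your exact release before relying on them.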
Don't enable recovery/failover triggers first
Start with CC_HEALTH_TRIGGER_RECOVERY = false and CC_HEALTH_TRIGGER_FAILOVER = false. The detector will still publish trace events so you can watch what it would do without letting it kill the master or flip regions on day one.
What to watch after enabling¶
The detector emits trace events you can grep for in process logs:
| Trace event | Source | Meaning |
|---|---|---|
| HealthMonitorDetectDegradedPeer | worker | A peer crossed the latency / timeout / connection-failure threshold. Useful detail fields include Peer, MedianLatency, CheckedPercentileLatency, PingTimeoutCount, ConnectionFailureCount, Disconnected. |
| HealthMonitorDetectRecoveredPeer | worker | A previously degraded peer is now healthy. |
| HealthMonitorDetectRecentClosedPeer | worker | Reports a recently closed transport peer that the monitor still considers part of the txn system. |
| ClusterControllerUpdateWorkerHealth | cluster controller | A complaint arrived from a worker. Detail fields: WorkerAddress, DegradedPeers, DisconnectedPeers, RecoveredPeers. |
| ClusterControllerHealthMonitor | cluster controller | Per-cycle summary of currently degraded servers. Detail fields: DegradedServers, DisconnectedServers, DegradedSatellite. |
| WorkerPeerHealthRecovered, WorkerAllPeerHealthRecovered | cluster controller | A peer / worker fell out of the degraded set. |
| DegradedServerDetectedAndSuggestRecovery | cluster controller | The controller would have triggered recovery if CC_HEALTH_TRIGGER_RECOVERY were on (SevWarnAlways). |
| DegradedServerDetectedAndTriggerRecovery | cluster controller | The controller is forcing a master failure to exclude a degraded server (SevWarnAlways). |
| DegradedServerDetectedAndSuggestFailover / DegradedServerDetectedAndTriggerFailover | cluster controller | Equivalent pair for region failover; both are emitted at SevWarnAlways. |
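Trace files are XML with one `<Event .../>` element per line, so the events above can be tallied with a short script. The sketch below is illustrative — `degraded_peer_counts` and `scan_log_dir` are invented helper names, and the attribute layout is assumed from the detail fields listed in the table:

```python
# Sketch: tally HealthMonitorDetectDegradedPeer events per peer address from
# FDB trace logs, to spot a consistently noisy host during suggest-only mode.
import glob
import re
from collections import Counter

EVENT_RE = re.compile(r'Type="HealthMonitorDetectDegradedPeer"')
PEER_RE = re.compile(r'Peer="([^"]*)"')

def degraded_peer_counts(lines):
    """Count how often each peer address appears in degraded-peer events."""
    counts = Counter()
    for line in lines:
        if EVENT_RE.search(line):
            m = PEER_RE.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts

def scan_log_dir(logdir="/var/log/foundationdb"):
    # Path is a common package-install default; adjust for your deployment.
    counts = Counter()
    for path in glob.glob(f"{logdir}/trace.*.xml"):
        with open(path) as f:
            counts.update(degraded_peer_counts(f))
    return counts
```

A peer that dominates this tally across many complaining workers is a gray-failure candidate; a tally spread thinly across all peers suggests thresholds are too tight for your network.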
Recommended rollout¶
- Pre-prod first. Enable only `ENABLE_WORKER_HEALTH_MONITOR` and `CC_ENABLE_WORKER_HEALTH_MONITOR` (suggest mode); keep `CC_HEALTH_TRIGGER_RECOVERY` and `CC_HEALTH_TRIGGER_FAILOVER` false.
- Watch for at least a week. Look for `HealthMonitorDetectDegradedPeer` and `ClusterControllerHealthMonitor` events. Confirm the rate matches your idea of cluster health and that you don't see continuous false positives — a noisy NIC, a slow coordinator host, or a single hot pipe will all show up here.
- Promote gradually. Once the suggest-only signal is clean, enable `CC_HEALTH_TRIGGER_RECOVERY`. Watch `DegradedServerDetectedAndTriggerRecovery` events and recovery counts; `CC_MAX_HEALTH_RECOVERY_COUNT` (default 5 in `CC_TRACKING_HEALTH_RECOVERY_INTERVAL` of 1 hour) caps how often gray failure can drive a recovery.
- Failover last. Only enable `CC_HEALTH_TRIGGER_FAILOVER` after you have run with recovery on for some time and trust the signal.
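In config terms, the promotion step looks like this (the comments are the only additions; values match the knobs discussed above):

```ini
[fdbserver]
knob_enable_worker_health_monitor = true
knob_cc_enable_worker_health_monitor = true
# Promoted after a clean suggest-only observation period:
knob_cc_health_trigger_recovery = true
# Failover stays off until the final step of the ramp:
knob_cc_health_trigger_failover = false
```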
What's coming next¶
Newer gray-failure work — including additional integration with cluster health metrics and more aggressive region failover — is on the roadmap for the next major release. See the Roadmap page for an inventory of what is merged into main but not yet shipped in 7.3 or 7.4.
References¶
- Worker health monitor: `fdbserver/worker.actor.cpp`
- Cluster-controller aggregation: `fdbserver/ClusterController.actor.cpp` and `fdbserver/include/fdbserver/ClusterController.actor.h`
- Knob defaults: `fdbclient/ServerKnobs.cpp` and `fdbclient/include/fdbclient/ServerKnobs.h`
- Earlier guard-knob refactor: apple/foundationdb#10848