The cluster is down
Production Risk
Critical. The cluster is unavailable for writes and possibly reads, causing a total service outage.
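This error indicates that a Redis Cluster node cannot process a command because it considers the cluster state to be `fail` rather than `ok`. A node enters this state when it cannot reach a majority of the primary nodes or, with the default `cluster-require-full-coverage yes`, when any hash slot is left without a reachable owner.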
- A network partition is preventing nodes from communicating with each other.
- More than half of the primary nodes in the cluster have crashed or are unreachable.
- A `cluster-node-timeout` value that is set too low is causing nodes to be flagged as failed, and failovers to start, prematurely.
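All three causes surface the same way: the node reports a `cluster_state` other than `ok`. As a minimal triage sketch, the helper below pulls the state and the slot-health counters out of `CLUSTER INFO` output piped in on stdin. The function name `cluster_info_summary` is hypothetical (it is not part of Redis), and the code assumes the usual `field:value` layout of that command.

```shell
#!/bin/sh
# Hypothetical helper: summarize `CLUSTER INFO` output read on stdin.
cluster_info_summary() {
  # redis-cli can emit CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "cluster_state"       { state = $2 }
    $1 == "cluster_slots_ok"    { ok = $2 }
    $1 == "cluster_slots_pfail" { pfail = $2 }
    $1 == "cluster_slots_fail"  { fail = $2 }
    END {
      printf "state=%s slots_ok=%s slots_pfail=%s slots_fail=%s\n", state, ok, pfail, fail
    }
  '
}
```

A typical call would be `redis-cli -h <node-ip> -p <port> CLUSTER INFO | cluster_info_summary`. To rule out the third cause, `redis-cli CONFIG GET cluster-node-timeout` shows the configured timeout in milliseconds (the default is 15000).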
In a 3-primary cluster, two of the primaries go offline. A client connected to the remaining primary tries to run a command.
# Two out of three primaries are down
GET mykey
expected output
(error) CLUSTERDOWN The cluster is down.
Fix 1
Restore failed nodes and fix network partitions
WHEN This is the fundamental solution
# On each node, check the cluster state
redis-cli -c CLUSTER INFO
redis-cli -c CLUSTER NODES
Why this works
The cluster requires a majority of primary nodes to be online to operate (form a quorum). You must restart the failed Redis processes or resolve the network issues that are preventing them from communicating.
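The majority rule can be worked out from `CLUSTER NODES` output: with N slot-serving primaries, at least floor(N/2) + 1 of them must be healthy, so a 3-primary cluster needs 2. The sketch below does that arithmetic on output piped in on stdin. The function name `primary_quorum` is hypothetical, and the code assumes the standard field layout (flags in the third field, slot ranges from the ninth field onward) and treats both `fail` and `fail?` as unhealthy. It reflects only the view of the node that was queried.

```shell
#!/bin/sh
# Hypothetical helper: count healthy slot-serving primaries in
# `CLUSTER NODES` output read on stdin and report whether a majority remains.
primary_quorum() {
  awk '
    {
      # Field 3 is a comma-separated flag list; match flags exactly so that
      # "nofailover" is not mistaken for "fail".
      n = split($3, flags, ",")
      is_master = 0
      failed = 0
      for (i = 1; i <= n; i++) {
        if (flags[i] == "master") is_master = 1
        if (flags[i] == "fail" || flags[i] == "fail?") failed = 1
      }
      # Only primaries that serve at least one slot (fields 9+) are counted.
      if (is_master && NF >= 9) {
        total++
        if (!failed) healthy++
      }
    }
    END {
      needed = int(total / 2) + 1
      status = (healthy >= needed) ? "quorum" : "no-quorum"
      printf "%d/%d primaries healthy, need %d: %s\n", healthy, total, needed, status
    }
  '
}
```

A typical call would be `redis-cli -h <node-ip> -p <port> CLUSTER NODES | primary_quorum`. Running it against several nodes helps distinguish a partition (nodes disagree) from crashed processes (nodes agree on which primaries are gone).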
Fix 2
Use `redis-cli --cluster fix` to attempt an automatic repair
WHEN For simple cases where nodes just need to be correctly rediscovered
redis-cli --cluster fix <any-node-ip>:<port>
Why this works
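`redis-cli` includes a cluster management utility that can repair some slot-level inconsistencies, such as slots left open in a migrating or importing state, or slots that no reachable node claims. It cannot bring back nodes that are down, so it only helps once the surviving nodes can reach each other again.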
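Before running a repair, it can help to see what the utility considers broken. `redis-cli --cluster check <any-node-ip>:<port>` reports problems without changing anything. The sketch below filters that report down to its problem lines. The function name `cluster_check_problems` is hypothetical, and the code assumes uncolored output in which problem lines begin with `[ERR]` or `[WARNING]`; verify this against the output of your `redis-cli` version.

```shell
#!/bin/sh
# Hypothetical helper: print only the problem lines from
# `redis-cli --cluster check` output read on stdin.
# Exit status follows grep: 0 if any problem line was found, 1 if none.
cluster_check_problems() {
  # Assumption: problem lines start with "[ERR]" or "[WARNING]".
  grep -E '^\[(ERR|WARNING)\]'
}
```

A typical call would be `redis-cli --cluster check <any-node-ip>:<port> | cluster_check_problems`. If the only problems listed are open or uncovered slots, `--cluster fix` is a reasonable next step. Keep in mind that reassigning an uncovered slot to another node does not recover the keys held by the node that is still down.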
✕ Immediately start promoting replicas to primaries manually
This can cause a 'split-brain' scenario in which different parts of the cluster disagree on which nodes are the primaries, resulting in data loss. Always let the cluster's failure detection algorithm manage failovers.