Failover is the automatic switching of requests from a failed service instance to a healthy one. It's how your system stays up when components break—and in distributed systems, something is always breaking.
Without failover, every service crash is a user-facing outage. With it, most failures are invisible to users: when a single instance goes down, traffic automatically reroutes to a healthy one, and at worst a few in-flight requests have to be retried.
How failover works
The basic flow is simple: detect → remove → reroute.
Failover mechanisms automatically detect and reroute traffic around failed instances.
Failure detection: The first step
Before you can fail over, you need to know that something failed. There are three main ways to detect it:
Health checks
The load balancer or service mesh regularly pings each instance (e.g., GET /health every 10 seconds). If an instance doesn't respond or returns an error status, it's marked unhealthy and removed from the active pool.
# Example health check response
HTTP/1.1 200 OK
Content-Type: application/json

{ "status": "healthy", "uptime": 12345 }
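The polling loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular load balancer implements it: `HealthChecker`, `http_probe`, and the `unhealthy_threshold` parameter are hypothetical names, and requiring several consecutive failures before removal is an assumption added here to avoid flapping on a single dropped probe (set it to 1 to match the text literally).

```python
import http.client
import urllib.request
from typing import Callable, Dict, List


def http_probe(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or an error status all count as a failed probe.
        return False


class HealthChecker:
    """Tracks consecutive probe failures per instance (hypothetical helper)."""

    def __init__(self, instances: List[str], probe: Callable[[str], bool],
                 unhealthy_threshold: int = 3) -> None:
        self.probe = probe
        self.unhealthy_threshold = unhealthy_threshold  # assumption, not from the text
        self.failures: Dict[str, int] = {name: 0 for name in instances}

    def run_once(self) -> None:
        """Probe every instance once; call this on a timer (e.g. every 10 seconds)."""
        for name in self.failures:
            if self.probe(name):
                self.failures[name] = 0  # any success resets the failure streak
            else:
                self.failures[name] += 1

    def active_pool(self) -> List[str]:
        """Instances still eligible to receive traffic."""
        return [n for n, f in self.failures.items() if f < self.unhealthy_threshold]
```

Passing the probe in as a function keeps the bookkeeping testable without a network; in practice you would pass `http_probe` and call `run_once()` from a scheduler.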
Timeouts
If a request takes longer than expected (say, 5 seconds), it times out. Repeated timeouts indicate the instance is struggling or dead, so it gets flagged.
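As a rough sketch of a client-side timeout, the helper below runs a call in a worker thread and gives up waiting after a deadline. `call_with_timeout` and the shared `_pool` are hypothetical names. One caveat of this approach: Python cannot cancel the worker thread, so the abandoned call keeps running in the background; real HTTP clients usually enforce the timeout at the socket level instead.

```python
import concurrent.futures
import time
from typing import Any, Callable, Tuple

# Shared worker pool for outbound calls (size chosen arbitrarily for the sketch).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(operation: Callable[[], Any],
                      timeout_seconds: float = 5.0) -> Tuple[bool, Any]:
    """Return (True, result) on success, or (False, None) if the deadline passes."""
    future = _pool.submit(operation)
    try:
        return True, future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # The caller stops waiting; the worker thread itself is not interrupted.
        return False, None
```

A caller would count the `(False, None)` results per instance and flag the instance once they repeat.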
Error rates
If an instance returns 5xx errors on a large share of its requests, the system detects the pattern and removes it. Some systems use sliding windows: "if 50% of requests failed in the last minute, mark unhealthy."
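A sliding-window check like the one quoted above might look like the following sketch. `ErrorRateWindow` is a hypothetical class, and `min_requests` is an assumption added so that one failure out of one request does not count as a 100% error rate.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateWindow:
    """Keeps (timestamp, failed) pairs for the last `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0, threshold: float = 0.5,
                 min_requests: int = 10) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests  # assumption: ignore tiny samples
        self.events: Deque[Tuple[float, bool]] = deque()

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def record(self, now: float, failed: bool) -> None:
        self.events.append((now, failed))
        self._evict(now)

    def is_unhealthy(self, now: float) -> bool:
        self._evict(now)
        total = len(self.events)
        if total < self.min_requests:
            return False
        failures = sum(1 for _, failed in self.events if failed)
        return failures / total >= self.threshold
```

Timestamps are passed in explicitly so the logic can be exercised without waiting for real time to pass; in production you would pass `time.monotonic()`.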
The circuit breaker pattern
Once an instance is suspected of failing, you don't want to keep pounding it with requests. The circuit breaker pattern prevents this.
Think of it like an electrical circuit breaker: when too much current flows (errors), the breaker trips and cuts the circuit (stops sending requests).
Three states:
- Closed (normal): Requests flow freely to the instance
- Open (tripped): No requests sent; instance is skipped
- Half-open (recovery): Periodically send test requests to see if the instance recovered
Closed → [5 consecutive errors] → Open → [30 second wait] → Half-open → [test request succeeds] → Closed
Half-open → [test request fails] → Open
This prevents wasting resources on a dead instance and gives it time to recover.
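The state machine above is small enough to sketch directly. This is an illustrative, single-threaded version under stated assumptions: the class and method names are hypothetical, the thresholds mirror the example (5 errors, 30 seconds), and the half-open state here does not limit how many test requests pass through, which a production breaker usually would.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so tests can control time
        self.state = "closed"
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Ask before each call; False means skip this instance."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"  # wait is over: let a test request through
                return True
            return False
        return True

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        # A failed test request in half-open re-opens the circuit immediately.
        if self.state == "half-open" or self.consecutive_failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Callers wrap each request with `allow_request()` and then report the outcome with `record_success()` or `record_failure()`.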
Retries and exponential backoff
When a request fails, you can retry it on another instance. But retrying blindly leads to a retry storm—too many requests hammering the already-stressed system.
The solution: exponential backoff. Each retry waits longer than the last.
Attempt 1: Send the request; it fails
Attempt 2: Wait 1 second, then retry
Attempt 3: Wait 2 seconds, then retry
Attempt 4: Wait 4 seconds, then retry (give up if this attempt also fails)
This throttles the load and gives the failing instance breathing room to recover.
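Here is a minimal sketch of that schedule. `retry_with_backoff` is a hypothetical helper, not a library function. The optional `jitter` parameter is an addition not covered in the text: adding a small random delay is a common way to keep many clients from retrying in lockstep, and it defaults to 0 so the delays match the schedule above exactly.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 1.0, jitter: float = 0.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` up to `max_attempts` times, doubling the wait each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * (2 ** attempt)  # 1 s, 2 s, 4 s, ...
            delay += random.uniform(0, jitter)   # optional spread (assumption)
            sleep(delay)
    raise AssertionError("unreachable")
```

In real code you would catch only errors that are worth retrying (timeouts, connection resets, 503s) rather than every `Exception`, and route each retry to a different instance where possible.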
Idempotency: Critical for safe retries
Retries only work safely if your operations are idempotent—calling them multiple times produces the same result as calling once.
Safe to retry:
GET /user/123 # Reading is always safe
PUT /order/456 { status: "paid" } # Setting the same state twice leaves the same result
Dangerous to retry without idempotency:
POST /charge { amount: 100 } # Might charge twice!
POST /email/send { to: user@x.com } # Might send twice!
For non-idempotent operations, use idempotency keys (a unique request ID that prevents duplicates):
POST /charge
X-Idempotency-Key: req-12345-abc
{ "amount": 100 }
# Server stores result of req-12345-abc
# If same key retried, return cached result instead of charging again
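On the server side, the idea in the comments above can be sketched as a lookup before the side effect. Everything here is hypothetical: `IdempotentHandler` and the `charge` callback are illustrative names, and the in-memory dict stands in for what would need to be a durable store with an atomic check-and-set so that two concurrent retries cannot both charge.

```python
from typing import Callable, Dict


class IdempotentHandler:
    """Runs `charge` at most once per idempotency key (in-memory sketch)."""

    def __init__(self, charge: Callable[[int], dict]) -> None:
        self.charge = charge
        self.results: Dict[str, dict] = {}  # key -> stored response

    def handle(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self.results:
            # Retry of a request we already processed: replay the stored result.
            return self.results[idempotency_key]
        result = self.charge(amount)
        self.results[idempotency_key] = result
        return result
```

A retried `POST /charge` carrying the same key then returns the original response instead of charging again.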
Service mesh: Automatic failover without code changes
Tools like Istio, Linkerd, and Consul run a sidecar proxy on every service instance. The proxy handles retries, timeouts, circuit breaking, and failover transparently—your code doesn't need to know about any of it.
Your code → Sidecar proxy → Load balancer → Other services
            (retries here)  (detects failures here)
This is powerful because you can tune failover behavior globally without redeploying services.
Real-world concerns
Cascading failures
If Service A times out waiting for Service B, don't retry endlessly or you'll overload B further. Use bulkheads: limit the number of concurrent requests to B so A can still serve other users.
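A bulkhead can be as simple as a counting semaphore around calls to one dependency. The sketch below is illustrative only: `Bulkhead` and `BulkheadFullError` are hypothetical names, and rejecting immediately when the limit is reached (rather than queueing) is a design choice made here for brevity.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when the concurrency limit for a dependency is already reached."""


class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation: Callable[[], T]) -> T:
        # Fail fast instead of queueing, so callers are not left waiting on B.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("too many in-flight calls")
        try:
            return operation()
        finally:
            self._slots.release()
```

Service A would keep one `Bulkhead` per downstream dependency, so a slow Service B can exhaust only its own slots and not every worker thread in A.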
Graceful shutdown
When you deploy a new version or scale down, the instance should announce "I'm going down" before it exits: on receiving the shutdown signal, it deregisters from the load balancer or starts failing its health check. The load balancer then stops sending it new requests, and the instance terminates only after its in-flight requests have finished.
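The drain step can be sketched with an in-flight counter. This is a simplified illustration with hypothetical names (`DrainingServer` and its methods); wiring `shutdown()` to a real signal handler such as SIGTERM and to a real HTTP server is left out.

```python
import threading


class DrainingServer:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self.draining = False

    def health_status(self) -> int:
        # Failing the health check tells the load balancer to stop routing here.
        return 503 if self.draining else 200

    def begin_request(self) -> bool:
        with self._lock:
            if self.draining:
                return False  # refuse new work once shutdown has started
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def shutdown(self, timeout: float = 30.0) -> bool:
        """Start draining; return True if all requests finished before the timeout."""
        with self._idle:
            self.draining = True
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout=timeout)
```

If `shutdown()` returns False, the deadline passed with requests still running, and the process has to decide whether to wait longer or exit anyway.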
RTO and RPO
- RTO (Recovery Time Objective): How quickly must the system recover? (e.g., 30 seconds)
- RPO (Recovery Point Objective): How much data loss is acceptable? (e.g., zero—must not lose transactions)
Design your failover to meet these targets.
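One way to sanity-check an RTO is to estimate how long detection alone can take. The function below is a back-of-the-envelope sketch under simplifying assumptions that are not from the text: probes start at a fixed interval, an instance is removed only after `unhealthy_threshold` consecutive failed probes, and each failed probe takes the full `probe_timeout` to fail. The function and parameter names are hypothetical.

```python
def worst_case_detection_seconds(check_interval: float, unhealthy_threshold: int,
                                 probe_timeout: float) -> float:
    """Upper bound on the time from an instance dying to it being marked unhealthy.

    Worst case: the instance dies right after a successful probe, so the next
    `unhealthy_threshold` probes must each start on schedule, and the last one
    must run out its timeout before the failure is recorded.
    """
    return check_interval * unhealthy_threshold + probe_timeout


# 10-second checks, 3 failures required, 2-second probe timeout.
detection = worst_case_detection_seconds(10.0, 3, 2.0)
print(f"worst-case detection: {detection} s")
```

With these example numbers the bound is 32 seconds, which already exceeds a 30-second RTO before any rerouting happens; meeting that target would mean shortening the check interval or lowering the failure threshold.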
Summary
Failover is your safety net in distributed systems. The core pieces are:
- Detect failures via health checks, timeouts, or error rates
- Stop using the bad instance via circuit breakers
- Reroute traffic to healthy instances with retries
- Avoid retry storms with exponential backoff and idempotency
- Monitor and alert so you know when failover happens
Build these into your architecture from the start, and you'll sleep better knowing your system can handle failures gracefully.