Failover in Microservices: Keeping Your System Resilient

Failover is the automatic switching of requests from a failed service instance to a healthy one. It's how your system stays up when components break—and in distributed systems, something is always breaking.

Without failover, every service crash is a user-facing outage. With it, failures become transparent. A single instance going down doesn't affect anyone because traffic automatically reroutes to another healthy instance.

How failover works

The basic flow is simple: detect → remove → reroute.

Failover mechanisms automatically detect and reroute traffic around failed instances.

Failure detection: The first step

Before you can failover, you need to know something failed. There are three main ways:

Health checks

The load balancer or service mesh regularly pings each instance (e.g., GET /health every 10 seconds). If an instance doesn't respond or returns an error status, it's marked unhealthy and removed from the active pool.

# Example health check response
HTTP/1.1 200 OK
{ "status": "healthy", "uptime": 12345 }

Timeouts

If a request takes longer than expected (say, 5 seconds), it times out. Repeated timeouts indicate the instance is struggling or dead, so it gets flagged.

Error rates

If an instance returns 5xx errors on most requests, the system detects the pattern and removes it. Some systems use sliding windows: "if 50% of requests fail in the last 1 minute, mark unhealthy."

The circuit breaker pattern

Once an instance is suspected of failing, you don't want to keep pounding it with requests. The circuit breaker pattern prevents this.

Think of it like an electrical circuit breaker: when too much current flows (errors), the breaker trips and cuts the circuit (stops sending requests).

Three states:

Closed (normal): Requests flow freely to the instance
Open (tripped): No requests sent; instance is skipped
Half-open (recovery): Periodically send test requests to see if the instance recovered

Closed → [5 consecutive errors] → Open → [30 second wait] → Half-open → [test request succeeds] → Closed

This prevents wasting resources on a dead instance and gives it time to recover.

Retries and exponential backoff

When a request fails, you can retry it on another instance. But retrying blindly leads to a retry storm—too many requests hammering the already-stressed system.

The solution: exponential backoff. Each retry waits longer than the last.

Attempt 1: Fails immediately
Attempt 2: Wait 1 second, then retry
Attempt 3: Wait 2 seconds, then retry
Attempt 4: Wait 4 seconds, then retry (give up after 4 retries)

This throttles the load and gives the failing instance breathing room to recover.

Idempotency: Critical for safe retries

Retries only work safely if your operations are idempotent—calling them multiple times produces the same result as calling once.

Safe to retry:

GET /user/123      # Reading is always safe
PUT /order/456 { status: "paid" }   # If idempotent, setting to same state is safe

Dangerous to retry without idempotency:

POST /charge { amount: 100 }        # Might charge twice!
POST /email/send { to: user@x.com } # Might send twice!

For non-idempotent operations, use idempotency keys (a unique request ID that prevents duplicates):

POST /charge
X-Idempotency-Key: req-12345-abc
{ "amount": 100 }
# Server stores result of req-12345-abc
# If same key retried, return cached result instead of charging again

Service mesh: Automatic failover without code changes

Tools like Istio, Linkerd, and Consul run a sidecar proxy on every service instance. The proxy handles retries, timeouts, circuit breaking, and failover transparently—your code doesn't need to know about any of it.

Your code → Sidecar proxy → Load balancer → Other services
             (retries here)  (detects failures here)

This is powerful because you can tune failover behavior globally without redeploying services.

Real-world concerns

Cascading failures

If Service A times out waiting for Service B, don't retry endlessly or you'll overload B further. Use bulkheads: limit the number of concurrent requests to B so A can still serve other users.

Graceful shutdown

When you deploy a new version or scale down, tell the load balancer "I'm going down" first via a shutdown signal. The load balancer stops sending new requests and waits for in-flight requests to finish before terminating the instance.

RTO and RPO

RTO (Recovery Time Objective): How quickly must the system recover? (e.g., 30 seconds)
RPO (Recovery Point Objective): How much data loss is acceptable? (e.g., zero—must not lose transactions)

Design your failover to meet these targets.

Summary

Failover is your safety net in distributed systems. The core pieces are:

Detect failures via health checks, timeouts, or error rates
Stop using the bad instance via circuit breakers
Reroute traffic to healthy instances with retries
Avoid retry storms with exponential backoff and idempotency
Monitor and alert so you know when failover happens

Build these into your architecture from the start, and you'll sleep better knowing your system can handle failures gracefully.