c10r runs a small fleet of microservices alongside a Next.js monolith. Three of them today; more on the way. They're written in different languages, run on different hosts, and we want the monolith to know — in real time — when one of them goes sideways.
This is the story of the health protocol we built, why it's deliberately boring, and the one place where boring wasn't enough.
The constraint: no orchestrator
We deploy to plain Linux boxes via Coolify. No Kubernetes, no Consul, no Istio. The team has three microservices, not thirty. Anything we'd ship as "health infrastructure" had to be simpler than the services it was watching, or we were doing it wrong.
So the prompt was: how does a monolith know if its remote helpers are alive, with no shared infrastructure between them?
The default answer: pull
The monolith polls every registered service on a cadence the service declares. Each service exposes one route, GET /health, that returns 200 OK with a body the monolith can ignore. Latency is measured at the caller. State lives in a Map<slug, HealthState> inside the monolith — no Redis, no DB row, just an in-memory snapshot per process.
State derives entirely from time since last successful check:
| Time since last success | Status |
|---|---|
| < 90 s | running |
| 90 s – 300 s | degraded |
| > 300 s | down |
Three thresholds, one rule. A service that never responds drifts naturally through the states. A service that recovers snaps back the next time a probe succeeds.
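A minimal sketch of that rule, assuming a pure function that takes the number of seconds since the last successful probe. The signature and constant names are guesses for illustration, not the production code.

```typescript
type Status = 'running' | 'degraded' | 'down';

// Thresholds from the table above, in seconds.
const DEGRADED_AFTER_S = 90;
const DOWN_AFTER_S = 300;

// Map "seconds since the last successful probe" to a status.
// Assumed signature: the real function may take a HealthState instead.
function deriveStatus(secondsSinceSuccess: number): Status {
  if (secondsSinceSuccess < DEGRADED_AFTER_S) return 'running';
  if (secondsSinceSuccess <= DOWN_AFTER_S) return 'degraded';
  return 'down';
}
```

A silent service needs no special handling here: the caller keeps passing a growing number, and the status drifts on its own.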
The whole probe loop is fewer than 200 lines of TypeScript, and it has a property I care about a lot: the monolith is the source of truth. If state.status === 'running', that means we observed a healthy response recently. We're not relying on the service to tell us it's alive — that's a story we'd have to believe.
The day pull broke
The trigger was a `docker stop` on the parser container.
- 19:58:31: container killed.
- 19:58:32: next probe sent. Connection refused. The probe fails and sets `lastError`, but `lastSuccessAt` was 19:58:01, so `secondsSinceSuccess = 31`. Still under 90. Status: `running`.
- 19:59:31: a later probe. Still fails. `secondsSinceSuccess = 90`. Status: `degraded`.
- 20:03:31: probe still failing. `secondsSinceSuccess = 330`. Finally: `down`.
The UI shows "Running" for up to 90 seconds after a service has stopped, and "Down" only after five minutes. That's not a bug in our protocol; the time math is doing exactly what we asked. It's a limit of pull-only. Status derives only from time since the last success, so there's no way for the caller to know the difference between "slow response" and "kernel just killed the process" without listening to the kernel.
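The arithmetic in that timeline can be replayed in a few lines. This is a sketch only: the calendar date is arbitrary, and `statusFor` is a stand-in that restates the thresholds from the status table.

```typescript
// Turn a clock time into epoch milliseconds. The date itself is arbitrary.
const at = (hhmmss: string): number => Date.parse(`2024-01-01T${hhmmss}Z`);

// Last successful probe before the container was stopped.
const lastSuccessAt = at('19:58:01');

// The three failed probes called out in the timeline.
const probeTimes = ['19:58:32', '19:59:31', '20:03:31'];

// Thresholds from the status table: under 90 s running, up to 300 s degraded, then down.
const statusFor = (s: number): string =>
  s < 90 ? 'running' : s <= 300 ? 'degraded' : 'down';

const timeline = probeTimes.map((t) => {
  const secondsSinceSuccess = (at(t) - lastSuccessAt) / 1000;
  return { t, secondsSinceSuccess, status: statusFor(secondsSinceSuccess) };
});

console.log(timeline);
```

Every failed probe feeds the same rule. Nothing in the loop knows the container is gone; it only knows how long it has been since something answered.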
The fix: a push lane, kept small
We added one route to the protocol — POST /api/internal/microservices/[slug]/lifecycle — accepting one of two events:
```ts
{ event: 'shutdown', reason?: string }
{ event: 'starting', reason?: string }
```
The service calls this on its way down (via a SIGTERM handler / framework shutdown hook) and on its way up (after its startup is complete). Auth is the same shared secret we already use for catalog-to-service calls, plus a slug match so a service can only update its own state.
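On the receiving side, the checks described above could look roughly like this. It is a sketch under assumptions: `checkLifecyclePush`, the `Verdict` type, and the idea that the caller's slug arrives separately (for example in a header) are all hypothetical, and the real route may identify the caller differently.

```typescript
type LifecycleEvent = { event: 'shutdown' | 'starting'; reason?: string };

type Verdict =
  | { ok: true; event: LifecycleEvent }
  | { ok: false; status: 400 | 401 | 403; error: string };

const isEventName = (v: unknown): v is LifecycleEvent['event'] =>
  v === 'shutdown' || v === 'starting';

// Hypothetical helper: decide whether a lifecycle push should be accepted.
function checkLifecyclePush(input: {
  routeSlug: string; // the [slug] segment of the route
  callerSlug: string | null; // assumption: the slug the caller claims to be
  authorization: string | null; // raw Authorization header value
  body: unknown; // parsed JSON body
  sharedSecret: string;
}): Verdict {
  // Shared-secret check. A production version would compare in constant time.
  if (input.authorization !== `Bearer ${input.sharedSecret}`) {
    return { ok: false, status: 401, error: 'bad or missing bearer token' };
  }
  // Slug match: a service may only update its own state.
  if (input.callerSlug !== input.routeSlug) {
    return { ok: false, status: 403, error: 'slug mismatch' };
  }
  if (typeof input.body !== 'object' || input.body === null) {
    return { ok: false, status: 400, error: 'body must be a JSON object' };
  }
  const { event, reason } = input.body as { event?: unknown; reason?: unknown };
  if (!isEventName(event)) {
    return { ok: false, status: 400, error: 'event must be shutdown or starting' };
  }
  if (reason !== undefined && typeof reason !== 'string') {
    return { ok: false, status: 400, error: 'reason must be a string' };
  }
  return {
    ok: true,
    event: typeof reason === 'string' ? { event, reason } : { event },
  };
}
```

Keeping the decision in a pure function makes the auth and slug rules testable without starting an HTTP server.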
Crucially, push is a hint, not the truth. Pull still runs every 30 seconds, and the time math is still authoritative. Push just lets a service short-circuit the grace window in the two cases where it knows something the probe loop can't observe yet.
The first thing that broke after the fix
We shipped it. Stop the container, status flips to down instantly. Wait 30 seconds. The probe loop runs. Probe fails, but secondsSinceSuccess is only 31 seconds, so deriveStatus returns running. The probe loop overwrites the pushed down back to running. UI now lies again, for a different reason.
The fix was a one-line flag on the state:
```ts
interface HealthState {
  // ...
  forcedDownAt: Date | null;
}
```
markShutdown sets it. The probe loop respects it: on probe failure, if forcedDownAt != null, override derived to down. On probe success, clear it — the service has demonstrably recovered, the push is no longer needed.
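The interplay between the push and the probe loop can be sketched with two pure functions. This is an illustration, not the production code: `HealthState` is trimmed to three fields, and the signatures of `markShutdown` and the hypothetical `applyProbe` are assumptions.

```typescript
type Status = 'running' | 'degraded' | 'down';

// Trimmed-down state: only the fields this sketch needs.
interface HealthState {
  status: Status;
  lastSuccessAt: Date;
  forcedDownAt: Date | null;
}

// Thresholds from the status table.
const deriveStatus = (secondsSinceSuccess: number): Status =>
  secondsSinceSuccess < 90 ? 'running' : secondsSinceSuccess <= 300 ? 'degraded' : 'down';

// Push lane: the service announced its own shutdown, so set the sticky flag.
function markShutdown(state: HealthState, now: Date): HealthState {
  return { ...state, status: 'down', forcedDownAt: now };
}

// Pull lane: fold one probe outcome into the state.
function applyProbe(state: HealthState, probeOk: boolean, now: Date): HealthState {
  if (probeOk) {
    // A successful probe always wins and clears the sticky flag.
    return { status: 'running', lastSuccessAt: now, forcedDownAt: null };
  }
  const seconds = (now.getTime() - state.lastSuccessAt.getTime()) / 1000;
  const derived = deriveStatus(seconds);
  // On failure, a pending shutdown push overrides the derived status.
  return { ...state, status: state.forcedDownAt !== null ? 'down' : derived };
}
```

Without the flag, a failed probe 31 seconds after the last success derives `running` and overwrites the pushed `down`. With it, the state holds at `down` until a probe actually succeeds.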
The invariant: a successful probe always wins. The push can hold the line during the grace window, but it can't override evidence that the service is back.
The symmetric case: starting
Push-down was the obvious case. Push-up was the subtle one.
The first version of markStarting did this:
- Trust the signal, but verify: fire an immediate probe back to the service.
- If the probe succeeds, flip to `running`. If it fails, ignore the push.
This deadlocked. The service was sending starting from inside its startup hook, using a synchronous HTTP client. That client was blocking the service's own event loop. So when the monolith's probe came back to the service one millisecond later, the service couldn't answer. Probe timed out. Push got ignored. Service held at down for the next 30 seconds despite running fine.
The fix was to trust the bearer-authed signal:
```ts
export async function markStarting(config: MicroserviceConfig): Promise<HealthState> {
  const now = new Date();
  // Build a fresh healthy snapshot on the strength of the push alone.
  const next: HealthState = {
    slug: config.slug,
    status: 'running',
    lastCheckedAt: now,
    lastSuccessAt: now, // optimistic: treat the push as a successful check
    lastLatencyMs: null, // no probe ran, so there is no latency to record
    lastError: null,
    selfReportedStatus: 'ok',
    selfReportedMessage: 'starting notified by service',
    selfReportedDetails: null,
    forcedDownAt: null, // clear any sticky flag left by an earlier shutdown push
  };
  setHealthState(config.slug, next);
  await emitChecked(next);
  return next;
}
```
If the service lied (said starting but isn't actually ready), the next periodic probes will fail and the time math will degrade it on schedule. Because lastSuccessAt was reset to now, the cost of trusting a lie is bounded by the same 90-second grace window any failing service gets. The cost of not trusting an honest signal is a guaranteed UX bug on every restart.
Live updates without a refresh button
The probe loop runs server-side. Browsers don't poll the monolith for health — that would N-times the load for no benefit. Instead, every authenticated socket joins a public room (microservices:public-health) and the monolith broadcasts a slim event on every probe result:
```ts
type MicroservicePublicHealthEvent = {
  slug: string;
  status: 'running' | 'degraded' | 'down' | 'unknown';
  lastCheckedAt: string | null;
  message: string | null;
};
```
The React hook that powers feature-gating UI (useMicroserviceHealth(slug)) fetches once on mount, then subscribes to this socket event filtered by slug. Stop a container — the Scan Receipt button in the expense form goes red within ~2 seconds, no refresh.
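The per-slug filtering inside the hook can be sketched as a small reducer, independent of React and of the socket library. `reduceHealth` is a hypothetical name, and the stale-event guard is an addition of this sketch, not something the hook is known to do.

```typescript
type MicroservicePublicHealthEvent = {
  slug: string;
  status: 'running' | 'degraded' | 'down' | 'unknown';
  lastCheckedAt: string | null;
  message: string | null;
};

// Fold one broadcast event into the value a single hook instance holds.
function reduceHealth(
  current: MicroservicePublicHealthEvent | null,
  incoming: MicroservicePublicHealthEvent,
  slug: string,
): MicroservicePublicHealthEvent | null {
  // The room carries every service; keep only the one this hook watches.
  if (incoming.slug !== slug) return current;
  // Assumed guard: drop events older than what we already hold, in case the
  // initial fetch and the first socket event arrive out of order.
  if (
    current !== null &&
    current.lastCheckedAt !== null &&
    incoming.lastCheckedAt !== null &&
    Date.parse(incoming.lastCheckedAt) < Date.parse(current.lastCheckedAt)
  ) {
    return current;
  }
  return incoming;
}
```

In the hook, the initial fetch would seed `current`, and each socket event would pass through `reduceHealth` before reaching component state.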
The admin page gets a richer event with internal details on a separate room. Same data flow, different shape.
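One way to keep the two shapes apart is to project the internal state down to the public event at the broadcast boundary. The sketch below assumes a hypothetical `toPublicEvent` helper, a trimmed `HealthState`, and that the public `message` comes from `selfReportedMessage`; the real mapping may differ.

```typescript
type PublicStatus = 'running' | 'degraded' | 'down' | 'unknown';

// Trimmed internal state, using field names from markStarting.
interface HealthState {
  slug: string;
  status: PublicStatus;
  lastCheckedAt: Date | null;
  lastLatencyMs: number | null;
  lastError: string | null;
  selfReportedMessage: string | null;
}

type MicroservicePublicHealthEvent = {
  slug: string;
  status: PublicStatus;
  lastCheckedAt: string | null;
  message: string | null;
};

// Copy only the public fields. Latency and raw errors stay on the admin side.
function toPublicEvent(state: HealthState): MicroservicePublicHealthEvent {
  return {
    slug: state.slug,
    status: state.status,
    lastCheckedAt: state.lastCheckedAt === null ? null : state.lastCheckedAt.toISOString(),
    message: state.selfReportedMessage, // assumption: which field feeds `message`
  };
}
```

Building the public object field by field, rather than spreading the state and deleting keys, means a field added to `HealthState` later cannot leak into the public room by accident.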
The protocol fits on a napkin
After all of this — three revisions, one push channel, one sticky flag, one optimistic case, one live broadcast — the protocol still fits on a napkin:
- The monolith pulls `GET /health` on a cadence each service declares.
- Status derives from time since last success: 90 s = degraded, 300 s = down.
- Services may push `{event: shutdown|starting}` to short-circuit the grace window.
- Push is a hint. Successful probes override pushes. Failed probes after `shutdown` honor the sticky flag.
- Status changes broadcast to a public room. UI subscribes by slug.
That's it. No service mesh. No Consul. One config file lists the services. One probe loop owns the state. One push route handles the two cases where pull is too slow. One room delivers the truth to every browser.
What I'd tell my past self
If we'd reached for an orchestrator on day one, we'd still be configuring it. Pull-first gave us 80% of the value with 5% of the moving parts. The push lane closed the last 20% — but only after we had a real-world failure that told us which 20% mattered.
Boring infrastructure scales until the day it doesn't. Then you patch the one thing. Then it's boring again.