The Daemon Died, the Bot Kept Polling: Layered Process Supervision in Rust
We were testing a Telegram interface for a Rust agentic orchestrator when the bot went quiet. The user, on the other hand, went very loud.
user: please list the MCP connections you have
bot: lists 7 tools connected to mcsoftsolution… ✅
user: please use mcsoftsolution mcp and tell me everything about this company
bot: silence
user: any update on my question?
user: you are not answering...
No thinking indicator. No error. Just the vibes of being ghosted. By a bot. That we wrote.
Debugging begins where debugging always begins — ps:
$ ps aux | grep bot_telegram | grep -v grep | wc -l
13
Thirteen. Supposed to be one. We had a coven.
The crime scene
telegram.log was one unbroken scream:
[telegram] getUpdates error: Conflict: terminated by other getUpdates request;
make sure that only one bot instance is running
[telegram] getUpdates error: Conflict: ...
[telegram] getUpdates error: Conflict: ...
Telegram's Bot API allows exactly one long-poller per bot token. A second process calling getUpdates kicks the first one off the lease and takes over. With thirteen processes elbowing each other for the token, every poll cycle had a different winner — and each of them thought it was doing great.
The first user message happened to land on the one live bot, the one still tied to a running daemon. That's why the first reply got through. The follow-up got picked up by an orphan whose parent had died hours ago. The orphan received the message, nodded sagely, and had absolutely nowhere to route the reply. Classic listener, not a communicator.
That's the symptom. The bug was further upstream: nothing was actually stopping orphans from accumulating.
How do you get thirteen orphans?
On paper, shutdown was simple: daemon SIGTERMs the bot, waits 3 seconds, exits. Three bugs were quietly holding hands:
- The bot had no SIGTERM handler. It sat in a 40-second HTTP long-poll like it was waiting for its turn at the DMV.
- The daemon never escalated to SIGKILL. After 3 seconds it shrugged and walked away. The bot shrugged back and kept polling.
- Startup didn't check for prior instances. Every start spawned a new bot unconditionally. Orphan, meet orphan.
Each stop && start added one. Over a few hours: thirteen.
And here's the thing — you can't fix this with just a SIGTERM handler. Parents die in creative ways:
- A panic Tokio can't unwind
- An OOM kill from the kernel (the kernel does not care about your feelings)
- kill -9 from an impatient human
- Machine shutdown
In every one of those, SIGTERM never fires. A SIGTERM handler protects you from exactly zero of these. You need defenses that hold when the parent disappears without saying goodbye.
Four layers, because one is never enough
L1 — The stdin pipe trick
When the daemon spawns the bot, it creates a pipe and holds the write end. It never writes anything. When the daemon dies — any way, including SIGKILL — the kernel closes the write end. The bot's stdin read returns Ok(0) (EOF), a tokio::select! arm fires, and the bot exits.
The kernel is the one holding the gun here. The daemon doesn't get a vote. This is the trick you cannot fake with signals.
tokio::select! {
_ = poll_loop(...) => {}
_ = watch_parent_death() => {
eprintln!("[telegram] parent daemon exited, shutting down");
}
_ = sigterm.recv() => { /* ... */ }
_ = sigint.recv() => { /* ... */ }
}
Cancel-safety comes free: when any arm resolves, the others drop, which cancels the in-flight getUpdates request. Actual shutdown takes tens of milliseconds.
L2 — Singleton via flock
On startup the bot grabs flock(LOCK_EX | LOCK_NB) on telegram.lock and holds it for life. A second instance fails the lock and exits with code 2.
Why flock and not a PID file? Because PID files lie. A process can crash without cleaning up its PID file — or worse, the PID gets recycled and now your "liveness check" is pointing at some unrelated process living its best life. The kernel's lock table cannot lie. When you die, your locks die with you. No stale state possible, no forensics required.
L3 — SIGTERM → 5s → SIGKILL
The daemon stopped being polite: it sends SIGTERM, counts to five, then unconditionally sends SIGKILL. The audit log records every escalation. We also shortened the bot's internal timeouts so the select! can actually wake up in time:
| Constant | Before | After |
|---|---|---|
| getUpdates long-poll | 30 s | 10 s |
| reqwest HTTP timeout | 40 s | 15 s |
| Daemon shutdown grace | 3 s (no SIGKILL) | 5 s (SIGKILL fallback) |
The 10s long-poll is just a ceiling — in practice select! cancels it the moment another arm fires.
L4 — Startup reap
Before spawning, the daemon finds any prior PID, SIGTERMs it, waits, SIGKILLs the survivor. Mostly belt-and-suspenders with L2, but it catches the case where the previous daemon died ungracefully and its child is still mid-poll, blissfully unaware that anything happened.
Gotcha we hit building this: kill(pid, 0) returns success for zombies. If a test harness was somehow the parent, we'd loop forever thinking the zombie was alive. Fix: non-blocking waitpid(WNOHANG) inside the liveness check. In production it's a no-op. In tests it keeps us from lying to ourselves.
Why four?
Because each layer lives somewhere different:
- L1 in the kernel's file descriptor table
- L2 in the kernel's file lock table
- L3 in the daemon's tokio runtime
- L4 on the filesystem
No shared failure domain. To produce an orphan, all four would have to fail at once — at which point you have bigger problems than a rogue bot.
| Failure mode | L1 | L2 | L3 | L4 |
|---|---|---|---|---|
| Clean daemon shutdown | — | — | ✓ | — |
| Daemon SIGKILLed / OOM / panic | ✓ | — | — | ✓ |
| Manual duplicate spawn | — | ✓ | — | — |
| Child ignores SIGTERM | ✓ | — | ✓ | — |
| Orphan from previous crash | ✓ | ✓ | — | ✓ |
Every failure mode has a primary defender and a backstop.
Beyond one Telegram bot
This isn't a Telegram fix — it's a pattern for any long-lived child of a daemon. Gmail bridges, SMTP workers, channel bots, cron-like poll loops. When we wire up the Gmail channel next, spawn_satellite(SatelliteSpec::gmail(...)) gives it L1–L4 for free. The failure modes were thought through once, and now they're reused forever.
Most process-supervision tutorials stop at SIGTERM handlers. That's fine when the parent can always send signals. It falls apart the moment the parent dies unexpectedly — and production is nothing but creative new ways for parents to die unexpectedly.
The one-line version
Don't trust the daemon to kill the bot. Give the bot four independent ways to figure out it's time to go — and make the happy path just one of them.
Thirteen orphans taught us that lesson. Don't let them die in vain.
If you're building anything with long-lived worker processes or writing system software in Rust, the pattern generalises. Default to defenses that hold under crashes, not defenses that assume a clean handoff. Get in touch if you've hit a subtle supervision bug of your own.