Apr 14, 2026 · 6 min read

The Daemon Died, the Bot Kept Polling: Layered Process Supervision in Rust

Thirteen Telegram bot processes were fighting for the same API token like gulls on a chip. The daemon was long dead. The bots had not gotten the memo.

Rust Tokio POSIX System Design Process Supervision

We were testing a Telegram interface for a Rust agentic orchestrator when the bot went quiet. The user, on the other hand, went very loud.

user: please list the MCP connections you have

bot: lists 7 tools connected to mcsoftsolution…

user: please use mcsoftsolution mcp and tell me everything about this company

bot: silence

user: any update on my question?

user: you are not answering...

No thinking indicator. No error. Just the vibes of being ghosted. By a bot. That we wrote.

Debugging begins where debugging always begins — ps:

$ ps aux | grep bot_telegram | grep -v grep | wc -l
13

Thirteen. Supposed to be one. We had a coven.

The crime scene

telegram.log was one unbroken scream:

[telegram] getUpdates error: Conflict: terminated by other getUpdates request;
  make sure that only one bot instance is running
[telegram] getUpdates error: Conflict: ...
[telegram] getUpdates error: Conflict: ...

Telegram's Bot API allows exactly one long-poller per bot token. A second process calling getUpdates kicks the first one off the lease and takes over. With thirteen processes elbowing each other for the token, every poll cycle had a different winner — and each of them thought it was doing great.

The first user message happened to land on the one live bot, the one still tied to a running daemon. That's why the first reply got through. The follow-up got picked up by an orphan whose parent had died hours ago. The orphan received the message, nodded sagely, and had absolutely nowhere to route the reply. Classic listener, not a communicator.

That's the symptom. The bug was further upstream: nothing was actually stopping orphans from accumulating.

How do you get thirteen orphans?

On paper, shutdown was simple: daemon SIGTERMs the bot, waits 3 seconds, exits. Three bugs were quietly holding hands:

  1. The bot had no SIGTERM handler. It sat in a 40-second HTTP long-poll like it was waiting for its turn at the DMV.
  2. The daemon never escalated to SIGKILL. After 3 seconds it shrugged and walked away. The bot shrugged back and kept polling.
  3. Startup didn't check for prior instances. Every start spawned a new bot unconditionally. Orphan, meet orphan.

Each stop && start added one. Over a few hours: thirteen.

And here's the thing — you can't fix this with just a SIGTERM handler. Parents die in creative ways:

  • A panic Tokio can't unwind
  • An OOM kill from the kernel (the kernel does not care about your feelings)
  • kill -9 from an impatient human
  • Machine shutdown

In every one of those, SIGTERM never fires, so a SIGTERM handler protects you from exactly zero of them. You need defenses that hold when the parent disappears without saying goodbye.

Four layers, because one is never enough

Four layers of satellite lifetime binding:

  • L1 — Parent-death via stdin pipe. Covers: daemon SIGKILL, OOM, panic. Child reads EOF on stdin and self-exits.
  • L2 — Singleton via flock. Covers: duplicate spawns. Kernel releases the flock on any process death.
  • L3 — SIGTERM → 5s → SIGKILL escalation. Covers: stubborn child. Daemon guarantees the child is dead before returning.
  • L4 — Startup reap. Covers: leftover orphans from prior crashes. Fresh daemon reaps stale PID before spawning.

L1 — The stdin pipe trick

When the daemon spawns the bot, it creates a pipe and holds the write end. It never writes anything. When the daemon dies — any way, including SIGKILL — the kernel closes the write end. The bot's stdin read returns Ok(0) (EOF), a tokio::select! arm fires, and the bot exits.

The kernel is the one holding the gun here. The daemon doesn't get a vote. This is the trick you cannot fake with signals.

tokio::select! {
    _ = poll_loop(...) => {}
    _ = watch_parent_death() => {
        eprintln!("[telegram] parent daemon exited, shutting down");
    }
    _ = sigterm.recv() => { /* ... */ }
    _ = sigint.recv()  => { /* ... */ }
}

Cancel-safety comes free: when any arm resolves, the others drop, which cancels the in-flight getUpdates request. Actual shutdown takes tens of milliseconds.
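The kernel side of the trick can be shown with the standard library alone. A minimal POSIX sketch (the `sh` child and pipe wiring are illustrative, not the bot's actual code): the parent holds the sole write end of the child's stdin pipe, and dropping it, which stands in for the daemon dying, makes the child's read return EOF.

```rust
use std::io::Read;
use std::process::{Command, Stdio};

fn main() {
    // Parent holds the write end of the child's stdin pipe but never
    // writes. Dropping it (standing in for the daemon dying) makes the
    // kernel close the write end, and the child's read returns EOF.
    let mut child = Command::new("sh")
        .arg("-c")
        // The child blocks on `cat` until stdin hits EOF, then prints.
        .arg("cat >/dev/null; echo parent-gone")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()
        .expect("spawn child");

    // Simulate the daemon dying: the only copy of the write end is dropped.
    drop(child.stdin.take());

    let mut out = String::new();
    child
        .stdout
        .take()
        .unwrap()
        .read_to_string(&mut out)
        .expect("read child stdout");
    child.wait().expect("wait on child");

    assert_eq!(out.trim(), "parent-gone");
    println!("child observed EOF and exited");
}
```

No signal is involved anywhere: closing the last write end of a pipe is a file-descriptor operation the kernel performs on process exit no matter how the process died.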

L2 — Singleton via flock

On startup the bot grabs flock(LOCK_EX | LOCK_NB) on telegram.lock and holds it for life. A second instance fails the lock and exits with code 2.

Why flock and not a PID file? Because PID files lie. A process can crash without cleaning up its PID file — or worse, the PID gets recycled and now your "liveness check" is pointing at some unrelated process living its best life. The kernel's lock table cannot lie. When you die, your locks die with you. No stale state possible, no forensics required.

L3 — SIGTERM → 5s → SIGKILL

The daemon stopped being polite. It sends SIGTERM, counts to five, then unconditionally sends SIGKILL; an audit log records every escalation. We also shortened the bot's internal timeouts so the select! can actually wake up in time:

Constant                 Before             After
getUpdates long-poll     30 s               10 s
reqwest HTTP timeout     40 s               15 s
Daemon shutdown grace    3 s (no SIGKILL)   5 s (SIGKILL fallback)

The 10s long-poll is just a ceiling — in practice select! cancels it the moment another arm fires.
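The escalation itself fits in a few lines of std-only Rust. A sketch with the grace period shortened so it runs quickly; the stubborn `sh` child, which traps and ignores SIGTERM, stands in for a bot stuck mid-poll:

```rust
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::{Duration, Instant};

// SIGTERM first; if the child is still alive after the grace period,
// unconditionally SIGKILL. Returns whether escalation was needed.
fn stop_with_escalation(child: &mut Child, grace: Duration) -> std::io::Result<bool> {
    // Send SIGTERM via the shell's kill builtin to stay std-only.
    Command::new("sh")
        .arg("-c")
        .arg(format!("kill -TERM {}", child.id()))
        .status()?;
    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        if child.try_wait()?.is_some() {
            return Ok(false); // exited politely, no SIGKILL needed
        }
        sleep(Duration::from_millis(50));
    }
    child.kill()?; // SIGKILL: the child does not get a vote
    child.wait()?;
    Ok(true) // escalated
}

fn main() -> std::io::Result<()> {
    // A child that ignores SIGTERM, like a bot stuck in a long poll.
    let mut stubborn = Command::new("sh")
        .arg("-c")
        .arg("trap '' TERM; sleep 60")
        .spawn()?;
    sleep(Duration::from_millis(100)); // let the trap install
    let escalated = stop_with_escalation(&mut stubborn, Duration::from_millis(500))?;
    assert!(escalated);
    println!("escalated to SIGKILL");
    Ok(())
}
```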

L4 — Startup reap

Before spawning, the daemon finds any prior PID, SIGTERMs it, waits, SIGKILLs the survivor. Mostly belt-and-suspenders with L2, but it catches the case where the previous daemon died ungracefully and its child is still mid-poll, blissfully unaware that anything happened.

Gotcha we hit building this: kill(pid, 0) returns success for zombies. If a test harness was somehow the parent, we'd loop forever thinking the zombie was alive. Fix: non-blocking waitpid(WNOHANG) inside the liveness check. In production it's a no-op. In tests it keeps us from lying to ourselves.

Why four?

Because each layer lives somewhere different:

  • L1 in the kernel's file descriptor table
  • L2 in the kernel's file lock table
  • L3 in the daemon's tokio runtime
  • L4 on the filesystem

No shared failure domain. To produce an orphan, all four would have to fail at once — at which point you have bigger problems than a rogue bot.

Failure mode                      Primary defense
Clean daemon shutdown             L3 (SIGTERM, then SIGKILL)
Daemon SIGKILLed / OOM / panic    L1 (stdin EOF)
Manual duplicate spawn            L2 (flock)
Child ignores SIGTERM             L3 (SIGKILL fallback)
Orphan from previous crash        L4 (startup reap)

Every failure mode has a primary defender and a backstop.

Beyond one Telegram bot

This isn't a Telegram fix — it's a pattern for any long-lived child of a daemon. Gmail bridges, SMTP workers, channel bots, cron-like poll loops. When we wire up the Gmail channel next, spawn_satellite(SatelliteSpec::gmail(...)) gives it L1–L4 for free. The failure modes were thought through once, and now they're reused forever.

Most process-supervision tutorials stop at SIGTERM handlers. That's fine when the parent can always send signals. It falls apart the moment the parent dies unexpectedly — and production is nothing but creative new ways for parents to die unexpectedly.

The one-line version

Don't trust the daemon to kill the bot. Give the bot four independent ways to figure out it's time to go — and make the happy path just one of them.

Thirteen orphans taught us that lesson. Don't let them die in vain.


If you're building anything with long-lived worker processes or writing system software in Rust, the pattern generalises. Default to defenses that hold under crashes, not defenses that assume a clean handoff. Get in touch if you've hit a subtle supervision bug of your own.