Centralizing AI Keys: One Gateway, Many Microservices

The receipt parser microservice makes three AI calls per scan:

OCR. A vision model (gpt-4o-mini today) reads pixels into text.
Extraction. A reasoning model (Qwen-2.5-72B today) parses the OCR text into a structured receipt: store, line items, totals, tax.
Embedding. A small embedding model (text-embedding-3-small) generates a 1536-dim vector for future RAG.

Three calls, three providers, three sets of API keys. Per environment. The first version of the parser had all of them in its own .env file. That immediately felt wrong.

Here's the question that started this work: where should an LLM API key live when a microservice needs it?

The bad answers

In the microservice's environment. Every microservice now has its own key rotation cadence. Every new microservice that needs AI re-implements the provider-selection logic. Every cost dashboard has to scrape N services to attribute spend. Every key leak from a misconfigured deploy has its own blast radius.

Pass the key from the monolith on each call. Now we're shipping upstream providers' bearer tokens through services we don't fully trust, and the same key crosses more network hops, more logs, more memory dumps. Worse.

Run a separate "AI proxy" service. Now we have a fourth thing to deploy, secure, and monitor. For three microservices. No.

The answer we shipped

The monolith already has every LLM provider's keys — it uses them for its own AI assistants and workspace tasks. It already has a key-encryption layer (encryptedApiKey on PlatformAIProvider). It already has an admin UI for routing AI tasks to providers.

So we exposed that capability on an internal surface, and put the keys behind it:

POST /api/internal/ai/chat
POST /api/internal/ai/embed

OpenAI-compatible request/response. Microservices send messages and a model hint; the gateway looks up which provider that task is routed to, decrypts the upstream key, forwards the call, logs it, returns the response.

A single MS-side helper (Python, ~80 lines) replaces three SDK integrations.

The routing question

There are two distinct shapes of "AI task" in c10r, and conflating them was the first thing we got wrong.

Workspace tasks. Things like "draft a marketing email" or "score this contact's loyalty." These are workspace-level features. The admin picks a provider AND a model. At request time, the gateway uses exactly those — no caller choice. This is "fixed mode."

Internal microservice tasks. Things like parser:extraction or parser:ocr-vision. Here the caller knows what model it needs — OCR needs a vision model, extraction needs reasoning, embedding needs an embedding model. The admin only chooses which provider to route through. The caller picks the model in each request, subject to the provider's allowed-models list.

We called this "provider-passthrough mode." The schema change was two new fields on the TaskRoute document, mode and audience:

{
  taskId: 'parser:extraction',
  providerType: 'openrouter',
  mode: 'provider-passthrough',  // <-- NEW
  audience: 'internal',           // <-- NEW
  // model: omitted in passthrough; required in fixed
}

The admin UI splits into two sections — Workspace Tasks (provider + model picker, mode='fixed') and Internal Microservice Tasks (provider-only picker, mode='provider-passthrough', grouped by microservice). The save handler is mode-aware: fixed rows require both fields, passthrough rows require only provider.
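To make the two modes concrete, here is a minimal sketch of the save-time check and the request-time model choice. The field names mirror the TaskRoute example above; validateRoute, resolveModel, the 'workspace' audience value, and the allowedModels parameter are hypothetical names introduced for illustration, not the actual gateway code.

```typescript
type RouteMode = "fixed" | "provider-passthrough";

interface TaskRoute {
  taskId: string;
  providerType: string;
  mode: RouteMode;
  audience: "workspace" | "internal"; // 'workspace' is an assumed value
  model?: string; // required in fixed mode, omitted in passthrough
}

// Save-time validation: fixed rows need provider and model,
// passthrough rows need only a provider.
function validateRoute(route: TaskRoute): string[] {
  const errors: string[] = [];
  if (!route.providerType) errors.push("providerType is required");
  if (route.mode === "fixed" && !route.model) {
    errors.push("model is required in fixed mode");
  }
  return errors;
}

// Request-time model selection.
function resolveModel(
  route: TaskRoute,
  requestedModel: string | undefined,
  allowedModels: readonly string[],
): string {
  if (route.mode === "fixed") {
    if (!route.model) throw new Error(`route ${route.taskId} has no model`);
    return route.model; // the caller's choice is ignored
  }
  if (!requestedModel) {
    throw new Error("model is required for passthrough tasks");
  }
  if (allowedModels.indexOf(requestedModel) === -1) {
    throw new Error(`model ${requestedModel} is not allowed for ${route.providerType}`);
  }
  return requestedModel;
}
```

In fixed mode the admin's model always wins; in passthrough mode the caller's model is used only if it appears on the provider's allowed-models list.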

Catalog-driven, so every new MS comes for free

The first version of internal tasks had them hardcoded:

export const AI_TASK_TYPES = [
  'workspace:marketing-email',
  'workspace:loyalty-score',
  // ... etc
  'parser:extraction',
  'parser:embedding',
  'parser:ocr-vision',
] as const;

Two problems. First, every new MS that needs AI requires editing two files in lockstep — the microservice catalog AND this enum. Second, the admin UI had a static layout that couldn't group dynamically by microservice.

So we moved the declaration to the microservice catalog itself:

{
  slug: 'parser',
  // ...
  aiTasks: [
    { id: 'extraction', shape: 'chat',      label: 'Receipt extraction' },
    { id: 'embedding',  shape: 'embedding', label: 'Receipt embedding' },
    { id: 'ocr-vision', shape: 'chat',      label: 'Receipt OCR' },
  ],
}

The wire ID is ${slug}:${id} ("parser:extraction"), so route persistence keys stay stable. The routing module exposes getAITaskTypes() which computes "workspace tasks ∪ catalog-derived internal tasks" at request time. The admin UI iterates services and renders one card per MS with its task rows.
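A minimal sketch of how that union could be computed is below. Only getAITaskTypes, the aiTasks field, and the ${slug}:${id} wire format come from the text above; the ServiceEntry type, the abbreviated WORKSPACE_TASKS list, and the services array are assumptions made for illustration.

```typescript
type TaskShape = "chat" | "embedding";

interface CatalogAITask {
  id: string;
  shape: TaskShape;
  label: string;
}

interface ServiceEntry {
  slug: string;
  aiTasks?: CatalogAITask[]; // services without AI simply omit this
}

// Abbreviated: the real workspace list is longer.
const WORKSPACE_TASKS: readonly string[] = [
  "workspace:marketing-email",
  "workspace:loyalty-score",
];

// Stand-in for the microservice catalog (services.ts in the text).
const services: readonly ServiceEntry[] = [
  {
    slug: "parser",
    aiTasks: [
      { id: "extraction", shape: "chat", label: "Receipt extraction" },
      { id: "embedding", shape: "embedding", label: "Receipt embedding" },
      { id: "ocr-vision", shape: "chat", label: "Receipt OCR" },
    ],
  },
];

// Workspace tasks followed by catalog-derived internal tasks.
function getAITaskTypes(catalog: readonly ServiceEntry[] = services): string[] {
  const internal: string[] = [];
  for (const service of catalog) {
    for (const task of service.aiTasks ?? []) {
      internal.push(`${service.slug}:${task.id}`);
    }
  }
  return WORKSPACE_TASKS.concat(internal);
}
```

Because the list is derived on each call, a new catalog entry shows up in routing and in the admin UI without touching this function.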

Adding a new microservice that needs AI is now: write the service, add a block to services.ts with aiTasks: [...], set its aiTokenEnvVar. The admin UI grows a new card automatically. The routing table accepts the new task IDs automatically. Zero edits to the gateway, zero edits to the admin page.

Shape guards

Each task declares a shape: 'chat' | 'embedding'. The gateway uses this to validate at the boundary:

POST /api/internal/ai/chat rejects a task with shape: 'embedding' — HTTP 400, error code wrong_endpoint.
POST /api/internal/ai/embed rejects a task with shape: 'chat' — same idea.

Caller bugs surface here, with a clear error, instead of as confusing upstream responses ("404 model not found" from OpenAI when you POST embedding input to chat-completions).
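The guard itself can be very small. The sketch below assumes a lookup table from endpoint path to expected shape; checkShape, the GuardResult type, and the unknown_endpoint code are hypothetical, while the paths, the 400 status, and the wrong_endpoint code come from the description above.

```typescript
type TaskShape = "chat" | "embedding";

interface GuardResult {
  ok: boolean;
  status: number;
  errorCode?: string;
}

// Which task shape each internal endpoint accepts.
const ENDPOINT_SHAPE: Record<string, TaskShape | undefined> = {
  "/api/internal/ai/chat": "chat",
  "/api/internal/ai/embed": "embedding",
};

function checkShape(endpoint: string, taskShape: TaskShape): GuardResult {
  const expected = ENDPOINT_SHAPE[endpoint];
  if (expected === undefined) {
    // Hypothetical fallback for paths this sketch does not know about.
    return { ok: false, status: 404, errorCode: "unknown_endpoint" };
  }
  if (expected !== taskShape) {
    // Task declared one shape, caller hit the other endpoint.
    return { ok: false, status: 400, errorCode: "wrong_endpoint" };
  }
  return { ok: true, status: 200 };
}
```

Running this before any provider lookup means a mismatched call never reaches an upstream at all.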

Auth, in two directions

Microservices already had a shared secret for c10r → MS calls (the tokenEnvVar). We didn't reuse it for MS → c10r: with one secret covering both directions, a leak on either side would have compromised both. Instead, every microservice catalog entry now also declares aiTokenEnvVar:

{
  slug: 'parser',
  tokenEnvVar: 'MS_PARSER_TOKEN',         // c10r → parser
  aiTokenEnvVar: 'MS_PARSER_AI_TOKEN',    // parser → c10r
}

The gateway accepts an inbound Bearer token, compares it against the token configured through each catalog entry's aiTokenEnvVar, and identifies the calling MS from the match. The resulting (msSlug, taskId) pair is what drives routing, audit, and the shape guard. No separate identity assertion is needed: the secret IS the identity.
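One way to write that lookup is sketched below. identifyCaller and constantTimeEqual are hypothetical names, and the environment is passed in as a plain object so the sketch stays self-contained. The hand-rolled comparison is only illustrative; in a Node service, crypto.timingSafeEqual on equal-length buffers is the more usual choice.

```typescript
interface ServiceEntry {
  slug: string;
  aiTokenEnvVar: string;
}

// Compares every position so the loop does not exit early
// on the first mismatching character.
function constantTimeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    // charCodeAt past the end yields NaN, which `|| 0` turns into 0.
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

// Returns the slug of the calling microservice, or null if no token matches.
function identifyCaller(
  authHeader: string | undefined,
  catalog: readonly ServiceEntry[],
  env: Record<string, string | undefined>,
): string | null {
  if (!authHeader || authHeader.slice(0, 7) !== "Bearer ") return null;
  const presented = authHeader.slice(7);
  if (presented.length === 0) return null;

  let match: string | null = null;
  for (const service of catalog) {
    const expected = env[service.aiTokenEnvVar];
    if (!expected) continue; // an unset token must never match
    if (constantTimeEqual(presented, expected)) match = service.slug;
  }
  return match;
}
```

Skipping unset variables matters: without that check, a service whose token was never configured could be matched by an empty or missing value.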

Audit, separated from the data path

Every successful or failed call writes one row to AiCallLog (manager DB):

{
  ts, msSlug, workspaceId, task, provider, model,
  status, latency, tokens, errorCode?, errorMessage?
}

Cost attribution by (msSlug, task). Latency percentiles per provider. Error-rate dashboards per model. All from one collection.
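As an illustration of the first of those reports, the sketch below totals tokens per (msSlug, task) over a batch of log rows. It assumes tokens is a single number per row, which the log shape above does not confirm, and tokensByServiceAndTask is a hypothetical helper rather than part of the gateway.

```typescript
// Only the fields this report needs; tokens as one number is an assumption.
interface AiCallLogRow {
  msSlug: string;
  task: string;
  tokens: number;
}

// Sums tokens per "msSlug/task" key.
function tokensByServiceAndTask(
  rows: readonly AiCallLogRow[],
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    const key = `${row.msSlug}/${row.task}`;
    totals[key] = (totals[key] ?? 0) + row.tokens;
  }
  return totals;
}
```

In practice this grouping would more likely run as a database aggregation than in application code; the in-memory version just shows the shape of the query.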

One discipline: audit failures must never throw. The data path is a passthrough to a paid provider, so the outcome to avoid is "the LLM call succeeded but we couldn't write the log row, so we returned an error to the caller and they retried", which pays for the same call twice. The log write is wrapped, errors are swallowed (logged to stderr), and the response goes through.
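A minimal sketch of that wrapping is below. callWithAudit, the CallStatus values, and the injected forward, writeRow, and report functions are all hypothetical. The point it illustrates is that only the upstream call can fail the request, never the log write.

```typescript
type CallStatus = "ok" | "error"; // hypothetical status values

async function callWithAudit<T>(
  forward: () => Promise<T>,
  writeRow: (status: CallStatus, latencyMs: number) => Promise<void>,
  report: (err: unknown) => void = (err) =>
    console.error("audit write failed:", err),
): Promise<T> {
  const started = Date.now();

  // Never throws: a failed log write is reported and then dropped.
  const audit = async (status: CallStatus): Promise<void> => {
    try {
      await writeRow(status, Date.now() - started);
    } catch (err) {
      report(err);
    }
  };

  let result: T;
  try {
    result = await forward();
  } catch (err) {
    await audit("error");
    throw err; // upstream failures still reach the caller
  }
  await audit("ok");
  return result;
}
```

The upstream call sits in its own try block so that an audit failure after a successful call cannot be mistaken for an upstream error.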

The local-models adapter

This was the surprise complication. We also run an in-house model server on a GPU box (models.c10r.io) hosting open-weight models — gemma3:27b for vision, qwen3:8b for reasoning, etc. Two things make it different:

No bearer auth. It runs on a private network. Adding auth there would have added little: anyone who can reach that network can already reach the server.
Bespoke wire shape. POST /run/<model> with Ollama-style messages ({role, content, images: [b64]}) instead of OpenAI's /chat/completions.

We did NOT want to leak these details to microservice clients. The whole point of the gateway is that an MS implements one OpenAI-compat client and stops caring about provider differences.

So the gateway adapts:

Request:  OpenAI content array  →  Ollama images field
Response: {model, type, result} →  OpenAI choices[]
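The two mappings might look roughly like the sketch below. The OpenAI side uses the standard chat content parts (text and image_url with a base64 data URL). On the local side, treating result as a plain string, hardcoding finish_reason to "stop", and joining multiple text parts with newlines are assumptions, as are the names toOllamaMessage and toOpenAIResponse.

```typescript
type OpenAIPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface OpenAIMessage {
  role: string;
  content: string | OpenAIPart[];
}

interface OllamaMessage {
  role: string;
  content: string;
  images?: string[]; // raw base64, without the data URL prefix
}

// Request direction: OpenAI content array -> Ollama-style message.
function toOllamaMessage(msg: OpenAIMessage): OllamaMessage {
  if (typeof msg.content === "string") {
    return { role: msg.role, content: msg.content };
  }
  const texts: string[] = [];
  const images: string[] = [];
  for (const part of msg.content) {
    if (part.type === "text") {
      texts.push(part.text);
    } else {
      const url = part.image_url.url;
      const comma = url.indexOf(",");
      if (url.slice(0, 5) !== "data:" || comma === -1) {
        throw new Error("this sketch only handles base64 data URLs");
      }
      images.push(url.slice(comma + 1)); // keep only the base64 payload
    }
  }
  const out: OllamaMessage = { role: msg.role, content: texts.join("\n") };
  if (images.length > 0) out.images = images;
  return out;
}

// Assumed shape of the local server's reply.
interface LocalResponse {
  model: string;
  type: string;
  result: string;
}

// Response direction: local reply -> OpenAI-style choices[].
function toOpenAIResponse(res: LocalResponse) {
  return {
    object: "chat.completion",
    model: res.model,
    choices: [
      {
        index: 0,
        message: { role: "assistant", content: res.result },
        finish_reason: "stop", // assumed; the local reply carries no reason
      },
    ],
  };
}
```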

A new provider type (models-c10r) and a KEYLESS_PROVIDERS set make the rule explicit — the admin UI hides the API key field for this type, the create endpoint accepts missing/empty apiKey, and the routing logic skips decryption. A typo in the enum can't accidentally enable auth where the upstream doesn't expect it.
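A sketch of how a single set can drive both the create-time check and the routing-time decryption follows. KEYLESS_PROVIDERS and the models-c10r type come from the text; isKeyless, validateApiKeyInput, resolveUpstreamKey, and the injected decrypt function are hypothetical names.

```typescript
const KEYLESS_PROVIDERS: ReadonlySet<string> = new Set(["models-c10r"]);

function isKeyless(providerType: string): boolean {
  return KEYLESS_PROVIDERS.has(providerType);
}

// Create-time check: returns an error message, or null if the input is fine.
function validateApiKeyInput(
  providerType: string,
  apiKey: string | undefined,
): string | null {
  if (isKeyless(providerType)) return null; // missing or empty key is accepted
  return apiKey && apiKey.trim().length > 0
    ? null
    : "apiKey is required for this provider type";
}

// Routing-time lookup: returns the bearer to send upstream,
// or null when the upstream expects no auth at all.
function resolveUpstreamKey(
  providerType: string,
  encryptedApiKey: string | undefined,
  decrypt: (ciphertext: string) => string,
): string | null {
  if (isKeyless(providerType)) return null; // skip decryption entirely
  if (!encryptedApiKey) {
    throw new Error(`provider ${providerType} has no stored key`);
  }
  return decrypt(encryptedApiKey);
}
```

Keeping the membership test in one helper means the admin UI, the create endpoint, and the router cannot drift apart on which types are keyless.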

Microservice clients still send the same OpenAI-compat body. The gateway figures out the rest.

The result

The parser's .env is one line shorter than its first version: no provider keys. A single setting points it at the c10r AI gateway URL and a single bearer identifies the service. Three AI calls flow through one OpenAI-compatible client.

A platform admin can re-route parser:ocr-vision from gpt-4o-mini to our in-house Gemma model from a dropdown. The parser doesn't notice — same request shape, different upstream. We can A/B providers per task, swap models per environment, and pull cost reports per microservice — all without redeploying anything.

The next microservice that needs AI inherits all of this for the cost of one block in services.ts.

The bigger pattern

Looking back: a microservice should hold business logic, not credentials.

Credentials are an undifferentiated commodity — provider keys, encryption keys, secrets to other internal services. They belong wherever you've already built the muscle to store, rotate, and audit them. For us, that was the monolith. For your team it might be Vault, or a secrets manager, or a single environment.

But it should not be "wherever the microservice happens to need them." Once you've centralized that — built the gateway, the audit log, the routing UI — the marginal cost of the next microservice drops to nearly zero. And that's the test for whether a piece of infrastructure has earned its place.