Unifying two registries into one: the provider catalog refactor

When two registries do the same thing two different ways, you don't have two registries; you have a coordination tax. Every code path that needs to talk to "a thing that responds over HTTP and authenticates with a bearer token" first has to decide which lookup table it's reading, which key field it cares about, and which set of feature flags applies. Doubling that surface area is what bit us.

This is the story of collapsing two registries into one inside c10r, what the seams looked like before, and what we kept when we redrew them.

Why we had two registries at all

We started with two completely separate ideas in the codebase.

Microservices were our internal HTTP services: the receipt parser, the echo service, the models server. They were defined in config/microservices/services.ts, a TypeScript file checked into the repo. Each entry described the slug, base URL, health endpoint, log/queue capabilities, and which AI tasks (if any) it could serve. Authentication was an MS_<slug>_TOKEN env variable, looked up at request time.

AI Providers were external LLM vendors: OpenAI, Anthropic, Google, OpenRouter, plus our self-hosted models-c10r. They were defined in a MongoDB collection called platformaiproviders. Each row held an encrypted API key, a list of allowed models, and a provider-name discriminator that gateway code switched on.

Both registries answered the same fundamental question — "how do I dispatch this request to a backend?" — but with different storage, different auth shapes, different admin UIs, and different code paths in the routing layer.

What that cost us

Three concrete pain points kept showing up:

Routing logic forked on the wrong axis. The AI task router dispatched on provider === 'openai' vs 'anthropic' vs 'models-c10r'. But the actual difference that mattered — the wire shape of the request — is orthogonal to the provider name. OpenRouter speaks OpenAI's wire format. Self-hosted Ollama instances speak something close to OpenAI but with quirks. Provider name was a bad proxy for protocol.

Credentials lived in three places. Internal MS tokens lived in env vars. AI provider keys lived encrypted inside platformaiproviders rows. Workspace-scoped API keys lived in ai_assistant_config documents. Rotating a key meant remembering which path it took.

Admin UI doubled up. /manage/microservices listed one kind of thing. /manage/settings/ai-providers listed another kind of thing. Both pages had a status pill, a test button, a credentials field, a "pause" toggle. We were maintaining two implementations of the same screen.

The refactor in one sentence

A single providers collection in MongoDB, with a type field (internal | external), capability flags that describe what the provider can actually do, and a separate credentials collection that any provider can point at.

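import type { ObjectId } from 'mongodb';

interface Provider {
  slug: string;
  type: 'internal' | 'external';
  baseUrl: string;
  category: string;
  lifecycleStatus: 'active' | 'deploying' | 'archived';
  pausedAt?: Date | null;
  // Every capability is optional: a provider declares only what it can do.
  capabilities: {
    http?: { /* ... */ };
    mcp?: { /* ... */ };
    logs?: { /* ... */ };
    queue?: { /* ... */ };
    aiProvider?: {
      // Selects the gateway adapter; the provider name plays no part.
      wireShape: 'openai-compat' | 'anthropic-native'
                | 'google-native' | 'ollama-c10r';
      keyless: boolean;
      models: string[];
    };
  };
  // Both ids point at rows in the separate credentials collection.
  apiKeyCredentialId?: ObjectId;
  inboundTokenCredentialId?: ObjectId;
}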

Two things make this load-bearing:

wireShape instead of provider name. The gateway no longer cares whether you call your provider "Frank's Discount LLMs". It cares that Frank speaks openai-compat. Routing dispatch reads capabilities.aiProvider.wireShape and picks the right adapter. Provider name becomes a display label, nothing more.

Capabilities, not type. A provider isn't "an AI provider" or "a microservice". A provider has capabilities. The receipt parser has capabilities.http and capabilities.queue and capabilities.aiProvider (because it does on-device extraction). The same row also has capabilities.logs so the admin UI can show its log stream. Asking "can this thing serve AI task X?" becomes a capability check, not a type check.
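To make the capability check concrete, here is a minimal sketch. It assumes a trimmed-down provider shape, a hypothetical canServeModel helper, and invented model names. It also assumes that "can serve" means "lists the model and is not paused", which the real gateway may define differently.

```typescript
// Hypothetical, trimmed-down Provider shape: only the fields this sketch needs.
type WireShape = 'openai-compat' | 'anthropic-native' | 'google-native' | 'ollama-c10r';

interface AiProviderCapability {
  wireShape: WireShape;
  keyless: boolean;
  models: string[];
}

interface ProviderLike {
  slug: string;
  type: 'internal' | 'external';
  pausedAt?: Date | null;
  capabilities: { aiProvider?: AiProviderCapability };
}

// Capability check: asks what the provider can do, never what kind of thing it is.
function canServeModel(provider: ProviderLike, model: string): boolean {
  if (provider.pausedAt) return false; // assumption: a paused provider serves nothing
  const ai = provider.capabilities.aiProvider;
  return ai !== undefined && ai.models.some((m) => m === model);
}

// Example rows; the wire shape and model name are invented for illustration.
const parser: ProviderLike = {
  slug: 'parser',
  type: 'internal',
  capabilities: {
    aiProvider: { wireShape: 'ollama-c10r', keyless: true, models: ['receipt-extract'] },
  },
};
const echo: ProviderLike = { slug: 'echo-service', type: 'internal', capabilities: {} };
```

Both rows are type 'internal', yet only the parser passes the check. The type field never enters the decision.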

What the architecture looks like now

graph TB
  subgraph DB["MongoDB - manager scope"]
    P[providers]
    C[credentials]
    R[ai_task_routing]
  end

  subgraph App["c10r app"]
    H[hydrateCatalog]
    Cache["globalThis catalog cache"]
    G[AI gateway]
    UI["/manage/providers"]
    Admin["/manage/credentials"]
  end

  subgraph External
    O[OpenAI]
    A[Anthropic]
    M[Self-hosted models]
    Par[Receipt parser MS]
  end

  H --> P
  H --> Cache
  Cache --> G
  Cache --> UI
  P --> C
  R --> P
  G --> O
  G --> A
  G --> M
  G --> Par
  UI --> P
  Admin --> C

A few things are doing real work in that diagram.

hydrateCatalog() runs once at boot and reads every row from providers into a globalThis-pinned cache; a background loop then refreshes the cache every 60 seconds. The gateway and UI query the catalog synchronously: async hydration up front buys sync access at request time. The globalThis pin survives Next.js HMR, which matters more than you'd think during development.
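The shape of that cache can be sketched roughly as follows. This is an illustration rather than the actual c10r code: the __c10rCatalog key, the ProviderRow shape, and the loadRows callback (standing in for the MongoDB read) are all assumptions.

```typescript
// Hypothetical minimal row shape; the real Provider has many more fields.
interface ProviderRow {
  slug: string;
  baseUrl: string;
}

interface CatalogCache {
  bySlug: Map<string, ProviderRow>;
  hydratedAt: number | null;
}

// Pin the cache on globalThis so a hot module reload reuses the same object
// instead of starting from an empty catalog. The key name is made up.
const store = globalThis as unknown as { __c10rCatalog?: CatalogCache };
if (!store.__c10rCatalog) {
  store.__c10rCatalog = { bySlug: new Map(), hydratedAt: null };
}
const cache: CatalogCache = store.__c10rCatalog;

// Build the new snapshot first, then swap it in with a single assignment.
function storeCatalog(rows: ProviderRow[]): void {
  const bySlug = new Map<string, ProviderRow>();
  for (const row of rows) bySlug.set(row.slug, row);
  cache.bySlug = bySlug;
  cache.hydratedAt = Date.now();
}

// Async at the edges: loadRows stands in for the database read.
async function hydrateCatalog(loadRows: () => Promise<ProviderRow[]>): Promise<void> {
  storeCatalog(await loadRows());
}

// Sync at request time: callers never await the database.
function getProvider(slug: string): ProviderRow | undefined {
  return cache.bySlug.get(slug);
}

// Background refresh; a failed cycle keeps the previous snapshot in place.
function startRefreshLoop(loadRows: () => Promise<ProviderRow[]>, everyMs = 60_000) {
  return setInterval(() => {
    hydrateCatalog(loadRows).catch((err) => console.warn('[catalog] refresh failed', err));
  }, everyMs);
}
```

Because storeCatalog replaces the map in one assignment, a synchronous reader sees either the old snapshot or the new one, never a partially filled catalog.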

Fallback is loud. If providers is empty at boot (fresh DB, post-migration before seed, whatever), hydrateCatalog() falls back to the legacy services.ts file and logs a warning every cycle. We deliberately kept the fallback because it keeps a broken seed migration from taking down the platform, but we made sure you can't ignore it.
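A minimal sketch of the "loud fallback" decision, under stated assumptions: selectCatalog, CatalogEntry, and the LEGACY_SERVICES values are hypothetical stand-ins, and the warning text is invented.

```typescript
interface CatalogEntry {
  slug: string;
  baseUrl: string;
}
type CatalogSource = 'database' | 'legacy-file';

// Stand-in for the entries exported by the legacy services.ts file (values invented).
const LEGACY_SERVICES: CatalogEntry[] = [
  { slug: 'parser', baseUrl: 'http://parser.internal' },
];

// Pick the catalog source for one hydration cycle. The warning fires on every
// cycle that falls back, so the condition cannot go quiet after the first log.
function selectCatalog(
  dbRows: CatalogEntry[],
  warn: (message: string) => void = console.warn,
): { source: CatalogSource; entries: CatalogEntry[] } {
  if (dbRows.length > 0) return { source: 'database', entries: dbRows };
  warn('[catalog] providers collection is empty; serving legacy services.ts entries');
  return { source: 'legacy-file', entries: LEGACY_SERVICES };
}
```

Passing the logger in as a parameter keeps the function easy to exercise in a test; calling it twice with an empty array produces two warnings, not one.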

Credentials are not optional. The previous code path had an env-var fallback: if no API key was found in the DB, look at MS_PARSER_TOKEN. We removed that. The gateway requires a linked credential. The seed migration imports the legacy env values into credentials rows during the cutover, so existing services keep working — but new providers must go through the credentials path.
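The "no env-var fallback" rule might look something like this sketch. The requireCredential helper is hypothetical, ids are plain strings rather than ObjectId to keep the example self-contained, and skipping the lookup for keyless providers is an assumption about how the keyless flag is used.

```typescript
interface CredentialRow {
  id: string;
  slug: string;
  encryptedValue: string;
}

interface ProviderRef {
  slug: string;
  apiKeyCredentialId?: string;
  keyless?: boolean;
}

// No env-var fallback: a provider either links a credential row or the call fails.
function requireCredential(
  provider: ProviderRef,
  credentials: Map<string, CredentialRow>,
): CredentialRow | null {
  if (provider.keyless) return null; // assumption: keyless providers skip the lookup
  const id = provider.apiKeyCredentialId;
  const row = id === undefined ? undefined : credentials.get(id);
  if (!row) {
    throw new Error(`provider "${provider.slug}" has no linked credential`);
  }
  return row;
}
```

Failing with an error that names the provider slug makes a missing link show up at the first request instead of silently picking up a stale token from the environment.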

The wire shape question

This is the part that took the longest to get right.

Every AI provider speaks one of four wire formats:

Wire shape	Used by	What's special
`openai-compat`	OpenAI, OpenRouter, vLLM	The de-facto baseline. Everything else is "OpenAI but with…".
`anthropic-native`	Claude API	Different message shape, system prompt is top-level, tool use is structured differently.
`google-native`	Gemini API	Generative Language API with its own request envelope.
`ollama-c10r`	Our self-hosted models	Ollama's API with c10r-specific extensions for streaming and batching.

The gateway has one adapter per wire shape. Adding a new provider that speaks an existing wire shape is a database insert, not a code change. Adding a genuinely new wire format means writing a new adapter — but at least that's the only thing it means.
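As a rough illustration of "one adapter per wire shape", the sketch below maps a provider-neutral request onto two of the four shapes. The ChatRequest and WireRequest types, the buildWireRequest helper, and the request paths are assumptions for this example; the google-native and ollama-c10r adapters are left out on purpose.

```typescript
type WireShape = 'openai-compat' | 'anthropic-native' | 'google-native' | 'ollama-c10r';

// Provider-neutral request used inside the gateway (hypothetical shape).
interface ChatRequest {
  model: string;
  system?: string;
  messages: { role: 'user' | 'assistant'; content: string }[];
  maxTokens: number;
}

interface WireRequest {
  path: string; // assumed to be appended to the provider's baseUrl
  body: Record<string, unknown>;
}

type Adapter = (req: ChatRequest) => WireRequest;

// Only two adapters are sketched here.
const adapters: Partial<Record<WireShape, Adapter>> = {
  // OpenAI-style chat completions: the system prompt travels as the first message.
  'openai-compat': (req) => ({
    path: '/v1/chat/completions',
    body: {
      model: req.model,
      max_tokens: req.maxTokens,
      messages: req.system
        ? [{ role: 'system', content: req.system }, ...req.messages]
        : req.messages,
    },
  }),
  // Anthropic Messages API: the system prompt is a top-level field.
  'anthropic-native': (req) => ({
    path: '/v1/messages',
    body: {
      model: req.model,
      max_tokens: req.maxTokens,
      system: req.system,
      messages: req.messages,
    },
  }),
};

// Dispatch reads the wire shape only; the provider's name never appears.
function buildWireRequest(wireShape: WireShape, req: ChatRequest): WireRequest {
  const adapter = adapters[wireShape];
  if (!adapter) throw new Error(`no adapter registered for wire shape "${wireShape}"`);
  return adapter(req);
}
```

With this layout, a new provider that speaks openai-compat needs no entry in the adapters table at all, which is the "database insert, not a code change" property described above.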

We tried the alternative: dispatching on provider name and letting each branch handle its own quirks. It worked for the four providers we had. It would have broken on the fifth. We caught it because OpenRouter spoke OpenAI's wire format but had provider !== 'openai', and the special-case ladder got ugly.

Credentials as a first-class store

interface Credential {
  slug: string;
  type: 'internal' | 'external';
  knownProvider?: 'openai' | 'anthropic' | 'google'
                | 'openrouter' | 'c10r-internal-bearer';
  encryptedValue: string;
  ownedBy?: ObjectId;
}

The knownProvider field is metadata: it tells the UI what shape the credential is supposed to take so it can validate format and offer a smarter form. The encryption is the same scheme the workspace tier uses — we deliberately did not invent a new one.

The admin UI gets its own page at /manage/credentials. The form has type pickers driven by icons (not dropdowns) because a credentials list is something you scan, not something you read. We added a usage-aware delete guard: deleting a credential that's referenced by any provider gives you the list of references and refuses to delete it until the references are cleared.
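The delete guard can be sketched in a few lines. This is an in-memory illustration, not the real handler: deleteCredential, ProviderLinks, and the DeleteResult shape are hypothetical, and ids are plain strings.

```typescript
interface ProviderLinks {
  slug: string;
  apiKeyCredentialId?: string;
  inboundTokenCredentialId?: string;
}

type DeleteResult =
  | { deleted: true }
  | { deleted: false; referencedBy: string[] };

// Refuse to delete a credential while any provider still points at it,
// and report which providers hold the references.
function deleteCredential(
  credentialId: string,
  credentials: Map<string, unknown>,
  providers: ProviderLinks[],
): DeleteResult {
  const referencedBy = providers
    .filter(
      (p) =>
        p.apiKeyCredentialId === credentialId ||
        p.inboundTokenCredentialId === credentialId,
    )
    .map((p) => p.slug);
  if (referencedBy.length > 0) return { deleted: false, referencedBy };
  credentials.delete(credentialId);
  return { deleted: true };
}
```

Returning the list of referencing slugs, rather than a bare refusal, is what lets the UI tell the operator exactly which providers to relink first.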

This was the smallest of the three sub-changes but probably the highest-leverage. Once credentials are a real noun in the system, every other thing — provider rotation, audit, env var migration — gets simpler.

What we kept

A few things we explicitly did not change:

The legacy platformaiproviders collection is still there. We don't drop it during migration. The new code path doesn't read it after seed, but rollback works by just redeploying the previous build — no data restoration needed. After a stable window, it gets dropped in a cleanup PR.

services.ts is still in the repo. It's what hydrateCatalog() falls back to when the providers collection is empty at boot. We treat it as a safety rail, not as load-bearing config. The next PR removes it once we're confident the seed migration is bulletproof in production.

Slugs are plain strings, not a typed union. The original services.ts exported a MicroserviceSlug union ('parser' | 'echo-service' | 'models'). The unified registry deals in arbitrary user-created slugs, so the union doesn't make sense anymore. We marked it @deprecated and widened the call sites instead of forcing every consumer to keep up with a moving target.

Lessons we'll carry forward

Stop dispatching on names. Names are labels. The thing you actually care about — protocol, capability, behavior — should be its own field. We knew this in the abstract; it took the AI provider switch ladder getting unwieldy to actually act on it.

Capability is the right unit. When you find yourself writing if (provider.type === 'foo') { check bar }, that's usually a sign that you want if (provider.capabilities.bar) instead. The check gets local, the type gets descriptive, and you stop needing to update every site when you add a new type.

Make rollback boring. The migration scripts are idempotent and additive. They don't drop legacy collections. They snapshot what they replace. The deploy is just a code push; the data migration is a separate operator step. If anything goes wrong, the worst-case path is to redeploy the previous build, and the data sits there harmlessly until you're ready to try again.
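"Idempotent and additive" can be illustrated with an in-memory sketch. The seedProviders function and both row shapes are hypothetical; against MongoDB, an upsert whose fields all sit under $setOnInsert would be one common way to get the same insert-only behavior.

```typescript
interface LegacyRow {
  name: string;
  baseUrl: string;
}

interface SeededProvider {
  slug: string;
  baseUrl: string;
  seededFrom: 'legacy' | 'operator';
}

// Additive and idempotent: insert rows that are missing, never update or
// delete rows that already exist, and never modify the legacy input.
function seedProviders(
  legacy: readonly LegacyRow[],
  target: Map<string, SeededProvider>,
): number {
  let inserted = 0;
  for (const row of legacy) {
    if (target.has(row.name)) continue; // already seeded, or created by an operator
    target.set(row.name, { slug: row.name, baseUrl: row.baseUrl, seededFrom: 'legacy' });
    inserted += 1;
  }
  return inserted;
}
```

Running the seed a second time inserts nothing, and a row an operator edited between runs is left alone, which is what makes a retry after a failed attempt safe.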

The fallback path earns its keep. We could have made hydration mandatory and let the app refuse to boot on an empty DB. We didn't, because the fallback is what makes the migration safe to run during business hours. The cost is a code path that mostly never executes — the benefit is that nobody has to take the platform down to seed a database.

If you're staring at two registries in your own system and wondering whether to unify them: the answer is usually yes, but the part that takes thinking is finding the right axis of variation. Type fields don't generalize. Capabilities and protocols do.