The Bill That Started This
I opened our OpenAI dashboard on a Tuesday morning and stared at four thousand dollars. Not bad for a quarter, except this was the previous month. I pulled the usage breakdown and the picture got worse. Roughly eighty percent of that spend was code-review prompts and pull-request summaries: tasks with a closed output shape, a clear evaluation rubric, and a small ceiling on how much "intelligence" was actually required. We were paying frontier-model prices to surface missing semicolons.
That was the moment routing stopped being a clever optimization and started being the obvious move. The interesting part wasn't realizing we were over-spending; anyone running a serious LLM workload has that suspicion. It was realizing we'd never decided to over-spend. We'd defaulted to it. The biggest model was the easiest model to integrate, "we'll downgrade later" went on the roadmap, and then it lived there forever.
The Three Axes
Routing, when you strip the marketing away, is the discipline of scoring each task against three independent dimensions. Most teams collapse these into one, "good vs. cheap," and then route everything to good. That's the leak.
The first axis is capability. Does this task need genuine reasoning across ambiguous specs, or is it pattern-matching against a closed structure? "Classify this email as spam" and "design a refactor for the auth module" sit on opposite ends. Most teams know this in theory and ignore it in practice: every prompt goes to the same endpoint because the same endpoint is what got wired up first.
The second axis is context. How much input has to fit? A two-million-token Gemini window costs nothing if you don't use it. It costs you provider portability if you do. Some tasks genuinely need long context: auditing a contract, summarizing a 200-page deposition, debugging across a sprawling repo. Most tasks don't, and shipping every prompt to a long-context model is paying tax for capacity you never touch.
The third axis is price: not just the headline per-million figure, but the blended cost on real traffic. Input vs. output tokens, cached vs. fresh, batch vs. live, sustained-discount tiers vs. on-demand. Two providers can advertise the same per-token price and bill you completely differently after a month of real workload. The list price is a starting point; the bill is the truth.
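To make "blended" concrete, here's a minimal sketch of the arithmetic, with made-up per-million rates standing in for any real rate card and a hypothetical request log:

def blended_cost(traffic, prices):
    # traffic: one dict per request with token counts by kind
    # prices: illustrative per-million-token rates, not any provider's real card
    total = 0.0
    for t in traffic:
        total += t["fresh_input"] / 1e6 * prices["input"]
        total += t["cached_input"] / 1e6 * prices["cached_input"]
        total += t["output"] / 1e6 * prices["output"]
    return total

# Same headline input price, very different totals once output-heavy
# tasks and cache hit rates enter the sum.
month_a = blended_cost(last_month_traffic, {"input": 3.00, "cached_input": 0.30, "output": 15.00})
month_b = blended_cost(last_month_traffic, {"input": 3.00, "cached_input": 1.50, "output": 10.00})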
A Working Taxonomy
After running this against real production traffic for two quarters, here's the taxonomy that survived. Classification, extraction, and simple routing belong on the cheapest tier that still produces clean, schema-compliant JSON: Haiku 4.5, GPT-4o-mini, Gemini Flash, or a local 7B. Code generation and code review want the middle tier: Sonnet 4.6 or GPT-4.1. Frontier rarely pays for itself on routine PRs and pays even less for boilerplate. Reserve the frontier tier (Opus 4.7, GPT-5, Gemini 2.5 Pro) for hard reasoning, agent orchestration, and ambiguous specs where one wrong step compounds into ten. Long-context document grounding splits between Gemini for raw window size and Claude for retrieval-quality answers on long inputs; benchmark on your corpus before committing. And privacy-sensitive or high-volume work belongs on your own GPUs running Llama, Qwen, or DeepSeek: cheap at scale, zero data leaving your perimeter.
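Written down, those starting positions fit in a table the router can read directly; the task-kind names here are illustrative, and the tier labels map to whichever models you've qualified for each:

ROUTES = {
    "classification": "small",        # cheapest tier that still returns clean JSON
    "extraction": "small",
    "code_review": "medium",
    "code_generation": "medium",
    "agent_orchestration": "frontier",
    "ambiguous_spec": "frontier",
    "long_document": "long_context",  # Gemini or Claude, per your corpus benchmark
    "pii_heavy": "self_hosted",       # never leaves your perimeter
}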
None of these is a permanent assignment. They're starting positions. Your evals tell you when to move a workload up a tier. Your finance team tells you when to move it down.
Keep the Router Boring
The router is the simplest part of a routing system, and somehow always becomes the most over-engineered. A function takes a task descriptor and returns a model client. That's the whole thing:
def route(task):
    # Closed-output tasks stay on the small tier.
    if task.kind == "classification":
        return haiku
    if task.kind == "code_review":
        return sonnet
    # Anything that genuinely needs the huge window goes long-context.
    if task.context_tokens > 200_000:
        return gemini_pro
    # Everything ambiguous defaults to the frontier tier.
    return opus
Resist the urge to start with LLM-as-judge routing. Static rules cover eighty percent of cases, they're debuggable when something misbehaves, and they don't add a hidden inference call to every request. Learned routing is worth building when you have evals to ground it and enough traffic that the routing decision itself becomes a meaningful cost. Not before. Most teams that build the fancy version first end up with a slow, opaque router whose mistakes nobody can diagnose.
The Context-Portability Trap
Switching providers mid-conversation breaks more things than people expect. The shape of the failure depends on which abstraction layer you trusted, and most teams don't realize what their abstraction layer is doing on their behalf until something breaks in production.
Prompt caching is the loudest gotcha. Anthropic's cache invalidates the moment you route a turn to another provider; long agent loops can lose the cache discount entirely if your router bounces between Sonnet and GPT every other turn. Tool schemas are the next one: OpenAI, Anthropic, Google, and open-weight stacks serialize function definitions differently, and the abstraction layer you chose (LiteLLM, OpenRouter, the Vercel AI SDK) is doing real translation work that occasionally leaks at the edges. System-prompt drift is the third: what counts as "system" vs. "user" varies enough across providers that the same prompt can behave subtly differently. Test for behavioral consistency on real tasks, not just response equivalence on canned ones.
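To see what that translation layer actually has to do, here's one hypothetical tool defined in OpenAI's and Anthropic's native shapes; the field names match my reading of both APIs, but verify against your SDK version before relying on them:

openai_tool = {
    "type": "function",
    "function": {
        "name": "get_pr_diff",                 # hypothetical tool
        "description": "Fetch the diff for a pull request",
        "parameters": {                        # JSON Schema lives under "parameters"
            "type": "object",
            "properties": {"pr_number": {"type": "integer"}},
            "required": ["pr_number"],
        },
    },
}

anthropic_tool = {
    "name": "get_pr_diff",
    "description": "Fetch the diff for a pull request",
    "input_schema": {                          # same schema, different key and nesting
        "type": "object",
        "properties": {"pr_number": {"type": "integer"}},
        "required": ["pr_number"],
    },
}

Same tool, different nesting and key names; multiply by every tool in your agent and every provider in your router.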
The structural fix is to route at the request boundary, not inside a conversation. Each request goes to one model; the whole conversation lives on that model. Most of the headaches disappear. Finer-grained routing is possible, but it's a problem to solve when you have a reason, not the default starting point.
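A minimal sketch of that boundary, assuming the route() function above and a provider client that exposes some complete() method:

class Conversation:
    def __init__(self, task):
        # Route once, at creation; every turn in this conversation stays on this model.
        self.model = route(task)
        self.messages = []

    def send(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        reply = self.model.complete(self.messages)  # placeholder client interface
        self.messages.append({"role": "assistant", "content": reply})
        return reply

The routing decision lives in the constructor and nowhere else, so the cache, the tool schemas, and the system prompt never cross a provider boundary mid-conversation.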
Caching That Survives Routing
Caching and routing are the two real cost levers in a production LLM workload, and they fight each other if you let them. A handful of rules hold across providers. Cache at the boundary your router doesn't cross: if routing happens at the request level, cache stable system prompts per provider; if routing happens mid-conversation, cache nothing, because eviction cost exceeds hit value. Long, stable prefixes pay for themselves; a 4K-token system prompt cached on Claude is roughly ninety percent cheaper on hits, so move the long, repeated material to the cacheable prefix and pass dynamic values in later messages. Avoid string interpolation in cached regions, because one character change invalidates the whole prefix; treat the cache key like an HTTP ETag. And don't share cache across tenants; cross-tenant cache-key collisions are the kind of mistake you make exactly once.
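On Anthropic, for instance, that division of labor looks roughly like the following; the model id is illustrative and the cache_control semantics are worth re-checking in the current docs:

stable_prefix = load_review_guidelines()  # hypothetical loader: long, static, no interpolated values

response = anthropic_client.messages.create(
    model="claude-sonnet-4-5",            # illustrative model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": stable_prefix,
            "cache_control": {"type": "ephemeral"},  # the cacheable boundary
        }
    ],
    messages=[
        # Anything dynamic stays out of the cached region.
        {"role": "user", "content": "Review this diff:\n" + diff},
    ],
)

load_review_guidelines and diff are stand-ins for wherever your stable prompt and per-request payload actually come from.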
If your routing strategy fights your caching strategy, the routing strategy is wrong. Cache is real money on a real bill. Routing without cache awareness is theatre.
Token Counting Is Not Cross-Provider
One subtlety bites every team eventually. A token in one provider is not a token in another. Anthropic's tokenizer, OpenAI's tokenizer, Google's tokenizer, and any Llama-family BPE all encode the same input string into different token counts. Your "10K-token context limit" enforced upstream is meaningful per provider, not in aggregate.
This shows up in three places. Budget accounting: use the provider's own tokenizer when you check limits. Cost prediction: if your router decides "this prompt is too expensive for Opus," compute Opus tokens, not GPT tokens, to make that decision. And truncation: cutting to "120K tokens" with the wrong tokenizer cuts the prompt in the wrong place, and the resulting failure is usually invisible until someone notices missing context in an answer. Most abstraction layers paper this over with approximations. Read what yours actually does before you trust it for billing decisions.
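A sketch of doing the count per provider rather than once; tiktoken is OpenAI's tokenizer package, Anthropic's SDK exposes a count_tokens call, and both are worth verifying against the versions you actually run:

import tiktoken

def openai_token_count(text, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def anthropic_token_count(messages, model="claude-sonnet-4-5"):
    # Ask the provider rather than approximating locally.
    result = anthropic_client.messages.count_tokens(model=model, messages=messages)
    return result.input_tokens

Whichever provider the router picks, the budget check and the truncation point should be computed with that provider's counter, not a shared approximation.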
What to Measure
You can't optimize what you don't measure, and aggregate spend hides the leak you're trying to find. Track cost per task type, not just total; that's how you discover that code review is eating sixty percent of your bill, and how you surface the next routing decision. Track output quality per route, against a small fixed eval set per task category, run weekly; capability cliffs only become visible against a stable baseline. Track routing-decision latency, because a router that takes 200ms eats a meaningful share of the gains it was supposed to deliver. And track cache hit rate per provider; below forty percent on a routed workload means your routing strategy is fighting your prompt caching, and the fix is usually to push caching boundaries inward.
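The per-request record that makes all four measurable later is small; the field names here are illustrative:

from dataclasses import dataclass

@dataclass
class RouteRecord:
    task_kind: str          # classification, code_review, ...
    model: str              # which route was actually taken
    provider: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int      # feeds cache hit rate per provider
    cost_usd: float         # priced with the provider's own rates and tokenizer
    routing_ms: float       # latency of the routing decision itself
    eval_score: float | None = None  # filled in by the weekly eval run

Aggregate by task_kind for the cost breakdown, by model for quality trends, and by provider for cache hit rate.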
The Default to Avoid
The most expensive mistake I keep seeing is the plan to start with the biggest model and downgrade later. Teams never downgrade. The cost compounds, the eval pipeline never gets built, and three quarters later someone notices the bill and asks how it got this big.
Invert the default. Start with a router that has three branches (small, medium, large) and route most traffic to small. The tasks that genuinely need frontier-tier reasoning will surface themselves through your evals; the capability cliffs will be visible against the baseline. Build the router first; tune it later. Most production AI is shaped like rules with vibes, and most of those rules fit comfortably on the small tier.
One model for everything is a strategy. It's just the wrong one for almost everything you'd actually want a model to do.