Hallucinations Are Architectural, Not Bugs: Manage Them, Don't Fix Them

The API That Didn't Exist

A senior engineer on my team shipped what looked like a routine migration script. Forty lines of TypeScript, one schema change, well-tested in staging. The script called db.batchInsert(rows, { onConflict: "merge" }). The model that wrote it had been confident. The code reviewer had been confident. The migration ran in production at 3am on a Tuesday — and we caught the problem three hours later when an alert fired on duplicate user records.

The onConflict option didn't exist. Not in the version of the driver we were using, not in any version. The driver had silently accepted the unknown configuration key, ignored it, and inserted duplicates instead of merging them. Nobody had checked. The model had been so confident in a plausible-sounding option that following its suggestion had become the path of least resistance for everyone who looked at the code.

The hallucination wasn't a fluke. It was the architecture working exactly as designed. And once you really internalize that, the rest of your engineering practice around LLMs changes shape.

Why "Bigger Models Will Fix It" Is Wrong

Autoregressive transformers predict the next token by minimizing cross-entropy loss on their training data. The loss rewards producing tokens that are plausible given everything before them. It does not reward being calibrated about uncertainty. It does not reward saying "I don't know." It does not, in any direct sense, reward being true; it rewards being consistent with the training distribution.
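In symbols, the pretraining objective is nothing more than the negative log-likelihood of each next token given its prefix (a standard formulation, not specific to any one model):

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Nothing in that sum refers to truth, sources, or the model's own uncertainty. A continuation that matches the training distribution scores well whether or not the claim it completes is real.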

There is no "I don't know" head in the architecture. There is no epistemic state being tracked alongside the next-token distribution. The model produces tokens with high probability, full stop, and "high probability" doesn't mean "true." It means "consistent with what the model has seen before." A confident hallucination and a confidently correct answer are produced by exactly the same mechanism. The model has no way to tell them apart, because at the level it operates, there's nothing to tell apart.

Scaling laws say that bigger models are better at the loss they optimize. They don't say bigger models are better at knowing when they don't know — that's a different objective, and current pretraining doesn't optimize for it. We may eventually have architectures that do, but the ones we have right now don't, and waiting for them is not a strategy.

You don't fix hallucination. You build systems that survive it.

Pattern 1: Retrieval — Ground the Answer

The model that doesn't know which version of Postgres added the MERGE statement is a fundamentally different problem from the model that has the Postgres 15 release notes in context and is asked to summarize what they say about MERGE. The first problem is unsolvable inside the model. The second is well-bounded and tractable. Retrieval is what moves you from the first problem to the second.

That's why retrieval-augmented generation isn't a feature; it's a discipline. The naive version (embed everything, search at query time, prepend top results) works in demos and fails in production for reasons that take months to discover. The hard parts are the parts the tutorials skip.

Indexing strategy matters more than people realize: chunk size, overlap, semantic vs. keyword vs. hybrid retrieval. The defaults are usually wrong for your specific corpus, and finding the right configuration is empirical work that nobody puts on the roadmap.

Recency matters: your index will go stale, sometimes faster than you expect, and you need to plan reindex cadence and handle deprecated documents that still live in the index.

Citation matters: make the model show its sources, and if it can't cite, you should be suspicious. If it cites and the citations don't resolve, refuse to answer.

And failure modes matter most: a retrieval miss is invisible unless you measure it, because the model will generate something regardless of whether retrieval surfaced anything useful, and that something will look fine if you don't have a labeled eval set telling you otherwise.
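To make the citation rule concrete, here is a minimal TypeScript sketch; the RetrievedDoc and Answer shapes are hypothetical stand-ins for whatever your retrieval layer and model output actually look like:

```typescript
interface RetrievedDoc { id: string; text: string; }
interface Answer { text: string; citations: string[]; }

// Reject any answer whose citations don't all resolve to documents we
// actually retrieved. An uncited or mis-cited answer counts as a
// failure, not as a lower-quality success.
function checkCitations(answer: Answer, docs: RetrievedDoc[]): boolean {
  const known = new Set(docs.map((d) => d.id));
  return (
    answer.citations.length > 0 &&
    answer.citations.every((id) => known.has(id))
  );
}
```

The check itself is trivial; the discipline is wiring it in as a hard gate rather than a logging statement.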

Retrieval doesn't eliminate hallucination. It constrains the surface where hallucination can happen. That's a real win — instead of the model confabulating across the entire training distribution, it can only confabulate within the documents you've retrieved. The smaller surface is dramatically easier to monitor.

Pattern 2: Verification — Check Before You Commit

The strongest single defense against hallucination is execution. If the model wrote SQL, run it in a sandbox before you show it to the user. If it cited an API endpoint, hit the endpoint and check the response shape matches what the model claimed. If it wrote a function, run the unit tests. If it produced a config file, validate it against the schema.

The principle generalizes: turn open-ended generation into verifiable claims, and then verify them. A claim you can check is a claim you can catch wrong. A claim you can't check is a claim you have to trust, and trust scales badly when the source is an LLM with no epistemic state.
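A minimal sketch of what "turn generation into verifiable claims" looks like for a config file, assuming a hypothetical DeployConfig shape; a real system would validate against its actual schema:

```typescript
interface DeployConfig { replicas: number; image: string; }

// Parse the model's output and check every field we depend on.
// Anything that fails a check is rejected outright; we never pass a
// half-validated config downstream.
function verifyConfig(raw: string): DeployConfig | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not even valid JSON: reject, don't guess
  }
  const c = parsed as Partial<DeployConfig>;
  if (typeof c.replicas !== "number" || !Number.isInteger(c.replicas) || c.replicas < 1) {
    return null;
  }
  if (typeof c.image !== "string" || c.image.length === 0) {
    return null;
  }
  return { replicas: c.replicas, image: c.image };
}
```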

This is why agentic coding tools have turned out to be more reliable than their chat-based predecessors. The agent runs the code, sees the failure, and corrects. The verification step is in the loop, not bolted on afterward. The model still hallucinates, in the same architectural sense it always has — but the bad outputs get caught and corrected before they ship, because the next step of the loop reveals them. The verification step turns a single confident wrong answer into a recoverable iteration. That changes everything about how reliable the overall system can be.
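The shape of that loop, sketched in TypeScript; generate and runTests are placeholders for a model call and a sandboxed test runner, not any real SDK:

```typescript
interface TestResult { passed: boolean; log: string; }

// Placeholder declarations: a model call and a sandboxed test runner.
declare function generate(prompt: string): Promise<string>;
declare function runTests(code: string): Promise<TestResult>;

// Verification inside the loop: a confident wrong answer becomes a
// recoverable iteration instead of a shipped defect.
async function generateVerified(task: string, maxTries = 3): Promise<string | null> {
  let prompt = task;
  for (let attempt = 0; attempt < maxTries; attempt++) {
    const code = await generate(prompt);
    const result = await runTests(code);
    if (result.passed) return code;
    // Feed the observed failure back so the next attempt can correct it.
    prompt = `${task}\n\nPrevious attempt failed with:\n${result.log}\nFix the code.`;
  }
  return null; // out of attempts: surface the failure, don't ship a guess
}
```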

Pattern 3: Refusal — Make "I Don't Know" First-Class

The third pattern is the hardest to deploy because it requires a cultural shift inside your team. Most engineering organizations are trained to measure success as "did the system produce an answer?" — and that metric optimizes directly for false confidence. The system that always answers, regardless of whether it knows, will outperform on that metric while quietly making everything worse downstream.

The fix is to make "I cannot determine X with the given context" a first-class output. Train or prompt the system to produce it. Plumb it through your downstream code as a real success state, not as a fallback. Measure how often it fires; treat low refusal rates on adversarial queries as a red flag, not a green one.

Concretely, this looks like structured outputs with a confidence field and a missing_information field, and a downstream router that escalates low-confidence answers to a human reviewer or to a more capable model. The router is the unglamorous piece that makes the whole architecture work. The model doesn't have to be perfect; the system around it has to acknowledge imperfection and act accordingly.
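Sketched in TypeScript; the field names, the 0.8 threshold, and the self-reported confidence score are all illustrative choices, not a standard:

```typescript
interface ModelOutput {
  answer: string;
  confidence: number;            // model's self-reported score in [0, 1]
  missing_information: string[]; // gaps the model says it couldn't fill
}

type Routed =
  | { kind: "accept"; answer: string }
  | { kind: "escalate"; reason: string };

// "I don't know" is a first-class state: declared gaps or low
// confidence route to a human reviewer or a stronger model instead of
// shipping a confident guess.
function route(out: ModelOutput, threshold = 0.8): Routed {
  if (out.missing_information.length > 0) {
    return { kind: "escalate", reason: `missing: ${out.missing_information.join(", ")}` };
  }
  if (out.confidence < threshold) {
    return { kind: "escalate", reason: `confidence ${out.confidence} below ${threshold}` };
  }
  return { kind: "accept", answer: out.answer };
}
```

Note that a self-reported confidence score is itself a model output and inherits every caveat above, which is exactly why the router, not the score, is the part you should trust: it guarantees that low scores and declared gaps always have somewhere safe to go.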

The Stack, Together

Reliable real-world systems compose all three patterns. Retrieve the relevant context from your knowledge base. Generate a candidate answer with explicit citations to the retrieved sources. Verify the answer: execute code if applicable, confirm cited sources exist, run schema or constraint checks against the output. Refuse if verification fails or confidence falls below a threshold. Escalate refused answers to a more capable model or to a human reviewer.
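As control flow, the stack looks roughly like this sketch, reusing the hypothetical RetrievedDoc, Answer, and checkCitations pieces from the retrieval section; every declared function is a placeholder for the corresponding layer:

```typescript
// Placeholders for the layers described above.
declare function retrieve(query: string): Promise<RetrievedDoc[]>;
declare function generateAnswer(query: string, docs: RetrievedDoc[]): Promise<Answer>;
declare function verify(answer: Answer): Promise<boolean>;
declare function escalate(query: string): Promise<string>;

async function answerReliably(query: string): Promise<string> {
  const docs = await retrieve(query);
  if (docs.length === 0) return escalate(query); // retrieval miss: don't generate at all
  const answer = await generateAnswer(query, docs);
  if (!checkCitations(answer, docs)) return escalate(query); // citations don't resolve: refuse
  if (!(await verify(answer))) return escalate(query); // failed execution or schema checks: refuse
  return answer.text;
}
```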

Each layer catches a category of hallucination the others miss. None of them is sufficient alone. A retrieval-only system still hallucinates within retrieved documents. A verification-only system still hallucinates on tasks it can't verify. A refusal-only system collapses to uselessness if your refusal threshold is wrong. The composition is where reliability lives.

How to Evaluate It

The default evaluation question, "does the model hallucinate?", is unmeasurable in any production setting. There's no single binary signal. The useful questions are narrower and individually answerable.

Did retrieval surface the answer? Build a labeled set of queries with known-good source documents, and measure recall@k of your retrieval layer independent of the LLM. If recall@5 is below ninety percent on a representative set, no model on earth will save the downstream answer.

Did the model cite correctly? Parse citations out of the response and verify each one maps to a real retrieved document; citation accuracy is the leading indicator of overall hallucination rate.

Did verification catch the bad answer? Track the rate at which your verification layer rejects model output; if it's near zero, either the model is unusually reliable on your task (possible but suspicious) or your verifier isn't doing anything.

Did refusal fire when it should have? Build adversarial queries where the answer isn't in the retrieved context, and confirm the system refuses instead of confabulating.

Each of these is a separately measurable signal. Track them as four independent KPIs, not one aggregate "quality" number. The aggregate hides which layer is failing, and you need to know which layer is failing to fix anything.
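The retrieval question is the easiest to automate. Here is a sketch of recall@k over a labeled eval set, with EvalCase and search as hypothetical stand-ins for your own labels and retrieval layer:

```typescript
interface EvalCase { query: string; goldDocIds: string[]; }

// Placeholder for the retrieval layer: returns the top-k document IDs.
declare function search(query: string, k: number): Promise<string[]>;

// recall@k: the fraction of labeled queries for which at least one
// known-good document appears in the top k retrieved results.
async function recallAtK(cases: EvalCase[], k: number): Promise<number> {
  let hits = 0;
  for (const c of cases) {
    const retrieved = await search(c.query, k);
    const gold = new Set(c.goldDocIds);
    if (retrieved.some((id) => gold.has(id))) hits++;
  }
  return hits / cases.length;
}
```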

What This Means for Engineering Teams

Stop chasing "the model that doesn't hallucinate." That model isn't shipping. The pretraining objective doesn't optimize for it, and changing the pretraining objective is a research problem that won't be solved on your release timeline.

Start designing for the model that always might. The right metric isn't hallucination rate; it's caught hallucination rate — what percentage of model errors does your system catch before they reach a user or a downstream system? Drive that number to ninety-nine percent, and your system is reliable even when the underlying model is occasionally confidently wrong. The model's failures stop mattering because the system around it is designed to assume them.

That's an engineering problem. It's solvable. It's not waiting for a new model release. And it's the work that distinguishes a team that ships reliable LLM features from a team that ships impressive demos and lives with the production fallout.