When 'We Own the Model' Beats the API: A Self-Hosting Playbook

The Spreadsheet That Decides It

Every quarter, somewhere in the engineering org, a senior person opens a spreadsheet and asks the question: "At what point does running this model ourselves beat paying the API?" The honest answer is: it depends, and most of the time the answer changes inside of six months because someone shipped a new model or someone's GPU got cheaper. The right way to approach self-hosting is not as a religious commitment to the open-source stack, but as a math problem with a moving solution. At low requests-per-second, the API wins on cost and operational simplicity. At high RPS, owning the inference flips the math.

A rule of thumb that holds across teams I've worked with: sustained 5 RPS at frontier-equivalent quality is roughly where self-hosting starts beating frontier APIs. Below that, the API is cheaper after you account for the SRE time you didn't have to spend. Above that, the GPU-hour math becomes attractive in a hurry. The exact crossover depends on your model choice, your GPU lease price, and your sustained utilization — but the order of magnitude is real, and it's worth doing the math on your own workload before you commit either direction.
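
The crossover itself is short arithmetic once you pin down a handful of numbers. Here's a minimal sketch of that spreadsheet in Python; every constant is a placeholder to replace with your own API pricing, GPU lease rate, SRE overhead, and traffic profile, and the break-even it prints moves with whatever you plug in.

# Back-of-envelope crossover estimate. Every constant is a placeholder;
# substitute your own API pricing, lease rates, SRE overhead, and traffic profile.

API_COST_PER_1K_TOKENS = 0.004   # assumed blended input+output price, $/1K tokens
TOKENS_PER_REQUEST = 1200        # assumed average prompt + completion size
GPU_HOURLY_RATE = 3.50           # assumed $/hour per H100-class GPU
GPU_COUNT = 2                    # e.g. a tensor-parallel 70B deployment
MONTHLY_SRE_COST = 10_000.0      # assumed slice of an SRE's time, in dollars

SECONDS_PER_MONTH = 3600 * 24 * 30

def monthly_api_cost(rps: float) -> float:
    tokens = rps * SECONDS_PER_MONTH * TOKENS_PER_REQUEST
    return tokens / 1000 * API_COST_PER_1K_TOKENS

def monthly_self_host_cost() -> float:
    # GPUs bill around the clock whether or not they're busy.
    return GPU_HOURLY_RATE * GPU_COUNT * 24 * 30 + MONTHLY_SRE_COST

per_request_api = TOKENS_PER_REQUEST / 1000 * API_COST_PER_1K_TOKENS
crossover_rps = monthly_self_host_cost() / (per_request_api * SECONDS_PER_MONTH)
print(f"self-hosting breaks even around {crossover_rps:.1f} RPS sustained")
for rps in (1, 2, 5, 10):
    print(f"{rps:>3} RPS: api ${monthly_api_cost(rps):>9,.0f}/mo "
          f"vs self-host ${monthly_self_host_cost():>8,.0f}/mo")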

The Four Candidates

The 2026 open-weight landscape is crowded enough to be confusing and concentrated enough to have clear leaders. Four families matter for serious self-hosting.

Llama 3.3 (Meta) is the safe institutional default. 70B-class quality, mature ecosystem, permissive license. Every serving stack, every fine-tuning framework, every inference optimization library supports Llama first. The choice that's hardest to second-guess in a budget review, which matters more than engineering teams want to admit.

Qwen 2.5 (Alibaba) brings strong multilingual support, strong code performance, and smaller variants — 1.5B, 7B, 32B — that hold their own against same-size competitors. If your workload is multilingual or your team is allergic to Meta's licensing terms, Qwen is the move.

DeepSeek-V3 is the model that finally made open-weight frontier quality feel real. Frontier-tier reasoning on a mixture-of-experts architecture, outstanding quality, demanding to run. The operational footprint reflects the architecture: multiple GPUs, careful memory tuning, more SRE attention than the dense models. Worth the trouble if you need it; not worth it if you don't.

Mistral Large 2 and Codestral round out the field. Codestral 22B is the dark-horse pick for code workloads on commodity hardware. Specialized, narrow, predictable. If you're building a code-specific feature and you don't need general-purpose capability, Codestral is consistently the right call.

None of these are interchangeable. Don't pick three "to compare" — that's how you get a model-zoo strategy where none of the deployments gets the operational attention it needs. Pick one. Eval it on your actual tasks. Switch only if you hit a capability cliff you can't engineer around.
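
If "eval it on your actual tasks" sounds heavier than it is, a throwaway harness against an OpenAI-compatible endpoint is a few dozen lines. A minimal sketch, assuming a local vLLM server on port 8000 and substring checks standing in for whatever judgment step your tasks really need:

# Minimal eval sketch against an OpenAI-compatible endpoint. The base_url,
# model name, and task format are assumptions; adapt them to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Each task: a prompt plus a substring the answer must contain. Real evals
# should use your production prompts and a proper judgment step.
tasks = [
    {"prompt": "Return the SQL keyword used to remove duplicate rows.", "expect": "DISTINCT"},
    {"prompt": "What HTTP status code means 'resource not found'?", "expect": "404"},
]

passed = 0
for task in tasks:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": task["prompt"]}],
        temperature=0,
    )
    answer = resp.choices[0].message.content or ""
    passed += task["expect"].lower() in answer.lower()

print(f"{passed}/{len(tasks)} tasks passed")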

On a Mac

Apple Silicon keeps punching above its weight, thanks to unified memory. An M3 Max with 64GB of RAM will run a quantized 70B model at usable speed — not production-grade throughput, but more than enough for a developer to iterate on prompts, debug behaviors, and run internal tools off a laptop.

# Easiest path
ollama pull llama3.3:70b-instruct-q4_K_M
ollama run llama3.3:70b-instruct-q4_K_M

# Faster, less ergonomic
pip install mlx-lm
python -m mlx_lm.generate \
  --model mlx-community/Llama-3.3-70B-Instruct-4bit \
  --prompt "Write a haiku about caching"

Mac is for development and small-scale internal tools. It's not the right place to serve customer traffic at scale. But the dev loop on Apple Silicon is genuinely the best on any platform — fast model swaps, no GPU driver pain, no CUDA version mismatches eating an afternoon.

On Linux

For real serving, vLLM is the default and has been for over a year. The operational surface is small, the throughput is excellent, and the OpenAI-compatible endpoint means your existing application code stays unchanged:

pip install vllm

# Note: --quantization awq expects weights that are already AWQ-quantized;
# point the model argument at an AWQ checkpoint of Llama 3.3 70B rather than
# the FP16 repo shown here.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 32768

vLLM exposes /v1/chat/completions on the standard OpenAI shape. Throughput on dual A100 80GB lands around 400 tokens per second aggregate at AWQ-int4 — enough for hundreds of concurrent users on a single instance. For multi-node setups, Ray Serve and TGI both have their advocates; for single-node deployments, vLLM is the obvious choice and the one I'd recommend you start with.
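
Before trusting anyone's throughput number, measure your own. A quick smoke test, assuming the server above is listening on localhost:8000; the concurrency level and prompt are arbitrary knobs, and the figure that matters is aggregate completion tokens per second under continuous batching:

# Quick throughput smoke test against a vLLM OpenAI-compatible endpoint.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": "Summarize the benefits of caching."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main(concurrency: int = 32) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{sum(counts) / elapsed:.0f} completion tokens/sec aggregate")

asyncio.run(main())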

On DigitalOcean

DigitalOcean's GPU droplet pricing makes the math work for mid-scale deployments where the AWS sticker shock would have made it a non-starter. The relevant tiers as of Q1 2026: one H100 80GB fits a 70B model at AWQ-int4 with about 32K context, comfortable for steady 10 RPS workloads. One L40S 48GB fits 32B models comfortably — a great fit for Qwen 2.5 32B or Codestral 22B, cheaper per hour than the H100, often the sweet spot for production code-specific workloads. Two H100s are required for full-precision 70B inference. DeepSeek-V3 is a different class of commitment: its mixture-of-experts weights run to hundreds of gigabytes even quantized, so plan on a full multi-GPU node rather than a droplet pair.
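
Those sizing claims come down to two terms: weight bytes and KV-cache bytes. A rough sanity check in Python, using approximate Llama-3.3-70B architecture numbers (80 layers, 8 KV heads via grouped-query attention, head dimension 128); treat it as back-of-envelope, since vLLM's paged KV cache only allocates what live sequences actually use:

# Rough VRAM estimate: weights plus KV cache. Architecture constants are
# approximate Llama-3.3-70B values; this is a sanity check, not a capacity plan.

def weight_gb(params_billions: float, bits: int) -> float:
    return params_billions * 1e9 * bits / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, batch: int, bytes_per_elem: int = 2) -> float:
    # Factor of 2 covers keys and values; bytes_per_elem=2 assumes an FP16 cache.
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem / 1e9

weights = weight_gb(70, bits=4)                        # ~35 GB at AWQ-int4
cache = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                    context=32_768, batch=4)           # ~43 GB for 4 full-length sequences
print(f"weights ~{weights:.0f} GB + KV cache ~{cache:.0f} GB = ~{weights + cache:.0f} GB")
# ~78 GB: right at the edge of one H100 80GB, which is why "about 32K context"
# is the honest phrasing for a single-card 70B deployment.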

The provisioning pattern that's emerged: GPU droplet on a private VPC, vLLM running inside a Docker container, nginx in front for TLS termination, API-key auth on inbound, and your own monitoring stack — Prometheus, node_exporter, a vLLM metrics scraper. It's a couple of days of plumbing for someone who's done it before, and a couple of weeks for someone learning as they go. Either way it's a finite project, not an ongoing operational burden. Once it's set up, it mostly just runs.

Which Model When

A practical decision frame for picking among the four. Default to Llama 3.3 70B if you want permissive licensing, broad tooling, and a model that just works with every serving stack and fine-tuning framework — the safe institutional choice. Pick Qwen 2.5 32B if your workload is multilingual or code-heavy; it fits on a single L40S, runs faster than 70B-class, and matches Llama on most code benchmarks. Pick DeepSeek-V3 if you need frontier-tier reasoning and you've got the GPUs to back it up — the MoE architecture means high quality at lower active-parameter cost, but the operational footprint is real and the multi-GPU setup demands more SRE attention. Pick Codestral 22B for code-only deployments where you want maximum throughput per dollar — narrow surface, predictable behavior, the kind of model that does exactly what you need and nothing more.

The Integration Story (Where Most Teams Stall)

This is the part most tutorials skip, and it's the part where most teams get stuck. Once you've stood up vLLM, you don't have to rewrite your application. You don't even have to change your client libraries. The OpenAI SDK works against your own GPU exactly as it works against OpenAI's:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.example.com/v1",
    api_key="your-internal-key"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)

Same SDK. Same code path. The endpoint is the only thing that changes. This is the real unlock — your existing application doesn't know it's talking to your GPU instead of someone else's. Every modern serving stack exposes the OpenAI shape on purpose, because the industry collectively decided this was the right boundary to standardize on. Use that. The boundary is the URL, not the code.

Observability Is Not Optional

The hardest part of self-hosting isn't the GPU bill or the model selection. It's the loss of the vendor's observability. APIs ship with dashboards, alerts, and SLAs that someone else maintains. Your vLLM container ships with stdout.

Wire these up before you put real traffic on the box. Per-request metrics first — latency p50, p95, p99, time-to-first-token, input and output token counts. vLLM exposes Prometheus metrics on /metrics; scrape them. Then queue depth, which vLLM's continuous batcher reports; if it grows unbounded, you're under-provisioned and you'll feel it in tail latencies before you see it in dashboards. Then OOM canaries on KV-cache exhaustion — the most common production failure mode for vLLM, fully preventable with an alert at 85% cache utilization. And finally output sampling: log a sampled 1% of completions to S3 for offline quality review, because model behavior drifts and you'll want the history when someone asks "did the output get worse last week?"
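
A concrete version of the KV-cache canary, assuming a vLLM server on localhost:8000. The metric names here match recent vLLM releases but have shifted across versions, so confirm them against your own /metrics output before wiring this into an alert:

# Minimal KV-cache canary: scrape vLLM's /metrics and warn above 85% usage.
import re
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"   # assumed vLLM server address
CACHE_METRIC = "vllm:gpu_cache_usage_perc"      # gauge, 0.0-1.0
QUEUE_METRIC = "vllm:num_requests_waiting"

def scrape(url: str) -> dict[str, float]:
    body = urllib.request.urlopen(url, timeout=5).read().decode()
    values = {}
    for line in body.splitlines():
        if line.startswith("#"):
            continue
        # Prometheus text format: metric_name{optional="labels"} value
        match = re.match(r"(\S+?)(?:\{[^}]*\})?\s+([\d.eE+-]+)$", line)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values

metrics = scrape(METRICS_URL)
cache_usage = metrics.get(CACHE_METRIC, 0.0)
queue_depth = metrics.get(QUEUE_METRIC, 0.0)
print(f"KV cache at {cache_usage:.0%}, {queue_depth:.0f} requests queued")
if cache_usage > 0.85:
    print("WARNING: KV cache above 85% -- add capacity or shrink max-model-len")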

None of this is hard. It's all work you didn't have to do when the vendor handled it.

The Honest Tradeoff

Self-hosting buys you data sovereignty, predictable cost at scale, complete control over model choice, and the ability to fine-tune for your domain. It costs you an SRE function you didn't have before, a 24/7 oncall surface that someone has to wake up for, and the engineering time to keep up with weekly model releases.

For most teams, the right answer in 2026 is hybrid. Route privacy-sensitive workloads and high-volume workloads to your own infrastructure. Send everything else to the APIs and don't apologize for the cost. Use the OpenAI-compatible endpoint as the boundary between the two — switching workloads from API to self-hosted should be a configuration change, not a code change. Switch the URL, not the code, and the math gets easier to revisit every quarter.
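
One way to keep that boundary honest is to make the route an entry in a config table rather than a branch in application code. A minimal sketch; the tier names, URLs, environment variables, and model IDs below are all illustrative:

# Hybrid routing as configuration, not code. Everything here is illustrative;
# the point is that client construction is the only thing that varies per workload.
import os
from openai import OpenAI

ROUTES = {
    # privacy-sensitive and high-volume traffic stays on your own GPUs
    "internal": {"base_url": "https://llm.internal.example.com/v1",
                 "model": "meta-llama/Llama-3.3-70B-Instruct",
                 "api_key": os.environ.get("INTERNAL_LLM_KEY", "unused")},
    # everything else goes to a hosted API
    "hosted":   {"base_url": "https://api.openai.com/v1",
                 "model": "gpt-4o",
                 "api_key": os.environ.get("OPENAI_API_KEY", "")},
}

def complete(workload: str, prompt: str) -> str:
    route = ROUTES[workload]
    client = OpenAI(base_url=route["base_url"], api_key=route["api_key"])
    resp = client.chat.completions.create(
        model=route["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

print(complete("internal", "Classify this support ticket: 'login page returns 500'"))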