The economics of LLM inference at scale

Training gets the headlines, but inference is where the money quietly goes. A look at what actually drives the bill — context length, cache hit rate, batching — and the levers that move it.


Everyone budgets for training and gets surprised by inference. Training is a capital expense you incur once and amortize; inference is an operating expense you pay on every single request, forever, and it scales with success. The more people use the thing you built, the more it costs to run — which is the opposite of how software economics are supposed to work. If you don’t understand the cost structure, growth feels like a tax.

The good news is that the bill is legible. Inference cost is not a black box; it’s a handful of variables you can name, measure, and pull on. Most teams overpay not because the technology is expensive but because they never looked at which lever was stuck.

Three things move the bill more than anything else: how many tokens you push through the context window, how much of your prompt you can cache, and how you batch requests. Get those right and inference stays affordable. Ignore any one of them and you pay for the oversight on every request.

Tokens are the unit of cost

The first thing to internalize is that you pay per token, in both directions, and the two directions are not priced the same. Input tokens (the prompt, the context, the history you stuff in) are usually cheaper per token than output tokens, but there are typically far more of them. A long system prompt repeated on every request is a recurring charge most people forget they’re paying.

This reframes prompt design as a cost-engineering problem. Every token in your context window is a token you pay for on every call, so the discipline is not “what could I include” but “what must I include.” The agent that re-reads the entire codebase on every turn isn’t thorough; it’s expensive. The one that’s handed exactly the three files it needs can do the same work for a fraction of the cost.
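To make the per-token arithmetic concrete, here is a minimal sketch of a request cost estimate. The `Pricing` class, the `request_cost` helper, and every number in it are illustrative assumptions, not any provider's real rates; substitute your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in dollars per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Dollar cost of one request, with each direction billed at its own rate."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


# Illustrative numbers only: output priced higher per token than input.
pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)

# Same answer length, very different context sizes.
bloated = request_cost(input_tokens=50_000, output_tokens=500, pricing=pricing)
trimmed = request_cost(input_tokens=5_000, output_tokens=500, pricing=pricing)

print(f"bloated context: ${bloated:.4f} per request")
print(f"trimmed context: ${trimmed:.4f} per request")
```

Under these assumed prices the input side dominates the bloated request even though output tokens cost more each, which is the point of trimming context before tuning anything else.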

“Inference is the only part of an AI product where being popular makes you poorer. Treat every token like it’s billed, because it is.” (What I tell every team that’s about to ship.)

Prompt caching is the highest-leverage lever

The single biggest win available to most teams is caching. If a large chunk of your prompt is identical across requests — a fixed system prompt, a shared instruction block, a stable document you keep asking questions about — you can have the provider cache the computed representation of that prefix and reuse it.

The economics are stark. A cache hit on a long shared prefix can cost a fraction of recomputing it, sometimes an order of magnitude less. The catch is that caching is prefix-sensitive: the cacheable part has to come first and stay byte-identical. Reorder your prompt, inject a timestamp near the top, or vary the instructions per request, and you blow the cache without realizing it.

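from datetime import datetime, timezone

HUGE_SYSTEM_PROMPT = "You are a support assistant. " * 500  # stand-in for a long, fixed prompt
user_id, question = "u-123", "How do I reset my password?"  # stand-in request data

# cache-hostile: the variable part is at the front, so nothing caches
prompt = f"User {user_id} at {datetime.now(timezone.utc).isoformat()}\n\n{HUGE_SYSTEM_PROMPT}\n{question}"

# cache-friendly: stable prefix first, variable suffix last
prompt = f"{HUGE_SYSTEM_PROMPT}\n\nUser context: {user_id}\n{question}"
#          ^^^^^^^^^^^^^^^^^^^^ identical across calls -> cached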

The two prompts ask the same question. One pays full price for the system prompt on every request; the other pays full price once, then a discounted cache-read rate for as long as the cached prefix stays warm. The difference at scale is the difference between a sustainable product and a science project.
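One way to catch a blown cache before the invoice does is to measure how much of two consecutive prompts is actually shared, then estimate what the hit rate is worth. The sketch below is a rough illustration under stated assumptions: the helper names are hypothetical, characters stand in for tokens, the `cached_read_multiplier` of 0.1 simply encodes the "order of magnitude" case, and the model ignores cache-write surcharges and expiry, which vary by provider.

```python
import os

# Stand-in for a long, fixed system prompt (hypothetical content).
SYSTEM_PROMPT = "You are a careful assistant. " * 200


def hostile_prompt(user_id: str, timestamp: str, question: str) -> str:
    """Variable data first: the shared prefix ends almost immediately."""
    return f"User {user_id} at {timestamp}\n\n{SYSTEM_PROMPT}\n{question}"


def friendly_prompt(user_id: str, question: str) -> str:
    """Stable prefix first, per-request data last."""
    return f"{SYSTEM_PROMPT}\n\nUser context: {user_id}\n{question}"


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the identical leading run of two prompts.

    Characters are only a proxy; real caches match on tokens or bytes.
    """
    return len(os.path.commonprefix([a, b]))


def cached_input_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    hit_rate: float,
    price_per_mtok: float,
    cached_read_multiplier: float = 0.1,  # assumption: cached reads cost 10% of normal
) -> float:
    """Expected input cost per request, in dollars, for a given cache hit rate."""
    miss = (prefix_tokens + suffix_tokens) * price_per_mtok / 1_000_000
    hit = (prefix_tokens * cached_read_multiplier + suffix_tokens) * price_per_mtok / 1_000_000
    return hit_rate * hit + (1 - hit_rate) * miss


hostile_shared = shared_prefix_chars(
    hostile_prompt("alice", "2024-05-01T09:00:00", "How do refunds work?"),
    hostile_prompt("bob", "2024-05-01T09:00:01", "Where is my order?"),
)
friendly_shared = shared_prefix_chars(
    friendly_prompt("alice", "How do refunds work?"),
    friendly_prompt("bob", "Where is my order?"),
)
print(f"hostile layout shares {hostile_shared} chars, friendly shares {friendly_shared}")

# Illustrative price of $3 per million input tokens.
cold = cached_input_cost(10_000, 200, hit_rate=0.0, price_per_mtok=3.0)
warm = cached_input_cost(10_000, 200, hit_rate=0.95, price_per_mtok=3.0)
print(f"no cache: ${cold:.5f} per request, 95% hit rate: ${warm:.5f} per request")
```

A check like `shared_prefix_chars` is cheap enough to run in a test or a logging hook, so a stray timestamp at the top of the prompt shows up as a number dropping to near zero rather than as a surprise on the bill.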

Batching and the latency–throughput trade

Underneath the API, inference servers are throughput machines. They process requests in batches because a GPU running one request at a time is a GPU mostly sitting idle. Larger batches mean higher throughput and lower cost per token — but they also mean any individual request may wait a few milliseconds for its batch to fill.

This is the central tension of inference at scale: throughput and latency pull in opposite directions, and you have to pick a point on that curve deliberately. A user-facing chat wants low latency and will pay for smaller batches; a nightly bulk-processing job wants maximum throughput and doesn’t care if any single item waits. Running both through the same path at the same settings means one of them is paying for a trade-off it didn’t need.
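The shape of that curve is easier to see with numbers. What follows is a toy model, not a benchmark: it assumes each batch costs a fixed overhead plus a small per-request increment, and that requests arrive at a steady rate. The function name, the `fixed_ms` and `per_item_ms` defaults, and the arrival rate are all made up for illustration.

```python
def batch_profile(
    batch_size: int,
    arrivals_per_sec: float,
    fixed_ms: float = 20.0,    # assumed per-batch overhead, paid regardless of size
    per_item_ms: float = 1.0,  # assumed extra time each request adds to the batch
) -> dict[str, float]:
    """Throughput, latency, and GPU time per request for one batch size."""
    step_ms = fixed_ms + per_item_ms * batch_size

    # With steady arrivals the first request waits for batch_size - 1 others
    # and the last waits for none, so the average wait is half the fill time.
    fill_ms = (batch_size - 1) / arrivals_per_sec * 1000.0
    avg_wait_ms = fill_ms / 2.0

    return {
        "throughput_rps": batch_size / step_ms * 1000.0,
        "avg_latency_ms": avg_wait_ms + step_ms,
        "gpu_ms_per_request": step_ms / batch_size,
    }


for size in (1, 4, 16, 64):
    p = batch_profile(size, arrivals_per_sec=200.0)
    print(
        f"batch={size:>2}  "
        f"throughput={p['throughput_rps']:7.1f} req/s  "
        f"latency={p['avg_latency_ms']:6.1f} ms  "
        f"gpu={p['gpu_ms_per_request']:5.2f} ms/req"
    )
```

In this model GPU time per request falls steadily as the batch grows while average latency climbs, which is the trade you are choosing a point on. Real servers complicate the picture with continuous batching and variable sequence lengths, but the direction of the trade is the same.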

The teams that keep their inference bill sane are the ones that route by workload. Interactive traffic gets latency-optimized serving; batch traffic gets throughput-optimized serving; cacheable prefixes get cached; and context gets trimmed to what the task actually requires. None of it is exotic. It’s just refusing to pay for capacity, tokens, or latency you didn’t need — the same discipline that makes any system at scale affordable.
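Routing by workload can be as simple as a lookup. Here is a minimal sketch, assuming a request carries a flag saying whether a person is waiting on the answer; the `Request`, `Workload`, and `ServingProfile` types and the batch and wait settings are hypothetical placeholders, not settings from any particular serving stack.

```python
from dataclasses import dataclass
from enum import Enum


class Workload(Enum):
    INTERACTIVE = "interactive"
    BULK = "bulk"


@dataclass(frozen=True)
class ServingProfile:
    name: str
    max_batch_size: int  # larger batches: cheaper per token, slower per request
    max_wait_ms: int     # how long a request may wait for its batch to fill


@dataclass(frozen=True)
class Request:
    prompt: str
    user_is_waiting: bool


# Illustrative settings only; tune against your own traffic.
PROFILES = {
    Workload.INTERACTIVE: ServingProfile("latency-optimized", max_batch_size=4, max_wait_ms=5),
    Workload.BULK: ServingProfile("throughput-optimized", max_batch_size=64, max_wait_ms=500),
}


def classify(request: Request) -> Workload:
    """Interactive if someone is waiting on the response, bulk otherwise."""
    return Workload.INTERACTIVE if request.user_is_waiting else Workload.BULK


def route(request: Request) -> ServingProfile:
    """Pick the serving profile that matches the request's workload."""
    return PROFILES[classify(request)]


chat = route(Request("Summarize this ticket", user_is_waiting=True))
nightly = route(Request("Classify archived document", user_is_waiting=False))
print(chat.name, nightly.name)
```

The classification rule here is deliberately crude. In practice the signal might be an API endpoint, a queue name, or a deadline on the request; what matters is that the two kinds of traffic stop sharing one set of settings.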

Common questions

Why does inference cost more than training over time?

Training is a one-time capital cost you amortize. Inference is an operating cost you pay on every request, forever, and it grows with usage. The more people use the product, the more it costs to run, which is the opposite of normal software economics.

What drives the cost of LLM inference?

Three levers: context length, since you pay per token in both directions and input tokens pile up fast; cache hit rate on the stable prefix of your prompt; and batching, which trades latency for throughput. Most overspending traces back to one of these being left untuned.

How does prompt caching cut inference costs?

If a large chunk of your prompt is identical across requests, the provider can cache its computed representation and reuse it for a fraction of the cost of recomputing, sometimes an order of magnitude less. The catch is that the cacheable part must come first and stay byte-identical, so do not put a timestamp or per-request data at the top.

Should every workload use the same inference settings?

No. Interactive traffic wants low latency and smaller batches. Bulk jobs want maximum throughput and do not care if an item waits. Routing by workload, caching prefixes, and trimming context are what keep the bill sane.