What inference costs now

The AI bill stopped being about access to models in early 2026. It became about what it costs to run them.


There is a conversation happening inside engineering organizations right now that was not happening eighteen months ago. It is about the AI line on the cloud bill.

Not the storage line. Not networking. The inference line — the cost of actually running a model on real requests, at the scale of a real business, across a workload that someone deployed because the demo looked good and the capability was real. That line has been growing faster than most of the teams who own it anticipated, because the pattern they are running changed without a formal decision to change it.

In early 2026, inference crossed a threshold the industry has taken to calling the inference flip. Inference now accounts for eighty-five percent of the enterprise AI budget and roughly fifty-five percent of total cloud AI spend, according to analysis by Spheron and byteiota. A year ago, the ratios were inverted: training dominated the cost conversation, and inference was the cheap part.

The flip happened because of a shift in what the tools are actually doing.

The AI pattern that most companies budgeted for in 2024 was the chat pattern: a user sends a message, a model replies, the interaction is relatively self-contained. Per-call cost is visible. Usage is intermittent. The bill is manageable. The AI pattern that most companies are deploying in 2026 is the agentic pattern: a model plans, retrieves context from multiple sources, invokes external tools, evaluates its own output, and iterates. A task that takes a user one minute to describe might take a model ten or twenty steps to complete, and each step is a call. Some steps invoke other models for specialized subtasks. Context windows are large and stay large throughout.
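
To make the gap between the two patterns concrete, here is a minimal sketch of how input tokens add up when every step of an agentic task re-sends a context that keeps growing. The step count, the starting context size, and the per-step growth are illustrative assumptions, not measurements from any real workload.

```python
def chain_input_tokens(steps: int, base_context: int, growth_per_step: int) -> int:
    """Total input tokens sent across a chain in which each step re-sends
    the full context accumulated so far (hypothetical model, no caching)."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context             # the whole context goes out on this call
        context += growth_per_step   # tool results and model output get appended
    return total


# Assumed numbers: one chat call versus a fifteen-step agentic task.
chat_tokens = chain_input_tokens(steps=1, base_context=2_000, growth_per_step=0)
agent_tokens = chain_input_tokens(steps=15, base_context=2_000, growth_per_step=1_500)

print(chat_tokens, agent_tokens, round(agent_tokens / chat_tokens, 1))
```

Under these assumptions the fifteen-step task sends 187,500 input tokens against 2,000 for the single chat call. The growth is faster than linear in the number of steps, which is why long chains with large contexts dominate the bill.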

According to analysis by Zylos Research, a multi-step agentic task completion costs between $0.10 and $1.00 per task — a hundred to a thousand times more than a simple chatbot API call. A workflow that runs ten thousand times a day at the lower end of that range costs roughly $365,000 a year. At the upper end, it costs $3.65 million. Neither figure was in the 2024 AI budget for a typical mid-size operator.
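
The annual figures above follow from simple multiplication, and the same arithmetic is worth running against your own volumes. A small sketch, where the function name is our own and the inputs are the per-task range and daily run count quoted in the paragraph:

```python
def annual_cost(cost_per_task: float, runs_per_day: int, days: int = 365) -> float:
    """Yearly spend for a workflow that runs a fixed number of times per day."""
    return cost_per_task * runs_per_day * days


low = annual_cost(0.10, 10_000)    # lower end of the quoted per-task range
high = annual_cost(1.00, 10_000)   # upper end of the quoted per-task range

print(f"${low:,.0f} to ${high:,.0f} per year")
```

Swapping in a different daily run count or per-task cost shows how quickly the total moves: the result scales linearly with both.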

The teams discovering this disparity did not make a wrong technical decision. The capability they wanted, an AI that can actually complete a multi-step task end to end, was not available when they wrote their budgets, so the budgets priced in the cheaper chat pattern. Now it is available. They deployed it. The bill is catching up.

There are real ways to manage this. The RouteLLM framework demonstrates that routing between a stronger and a weaker model based on task complexity can cut inference costs by more than half while preserving roughly ninety-five percent of the stronger model's quality on a mixed workload. The substrate argument we examined earlier this month feeds directly into this: above roughly a hundred million tokens per month, self-hosting a model you own is usually cheaper than renting one you do not. Prompt caching, which reuses the already-processed prefix of a prompt across the steps of a workflow instead of paying full price to reprocess it on each call, materially reduces input-token cost on long agentic chains.
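
A rough way to see how routing and caching combine is a back-of-the-envelope cost model. The sketch below is not RouteLLM's API or any provider's pricing; the prices, the cached-token discount, and the traffic shares are all hypothetical placeholders to replace with your own figures.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """Hypothetical prices in dollars per million input tokens."""
    strong: float = 10.0          # assumed price of the stronger model
    weak: float = 1.0             # assumed price of the weaker model
    cached_discount: float = 0.9  # assumed: cached tokens cost 90% less


def blended_cost(tokens_m: float, weak_share: float, cached_share: float,
                 prices: Prices) -> float:
    """Estimated input cost for `tokens_m` million tokens when `weak_share`
    of traffic goes to the weaker model and `cached_share` of tokens are
    served from a prompt cache."""
    per_million = weak_share * prices.weak + (1 - weak_share) * prices.strong
    cache_factor = 1 - cached_share * prices.cached_discount
    return tokens_m * per_million * cache_factor


prices = Prices()
baseline = blended_cost(100, weak_share=0.0, cached_share=0.0, prices=prices)
routed = blended_cost(100, weak_share=0.6, cached_share=0.0, prices=prices)
routed_cached = blended_cost(100, weak_share=0.6, cached_share=0.5, prices=prices)

print(round(baseline), round(routed), round(routed_cached))
```

The model ignores output tokens, quality loss from routing, and the cost of writing to the cache, so treat it as a first estimate of which lever matters most for a given workload rather than a forecast.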

None of these optimizations are defaults. They require someone on the team to have made inference economics a first-class engineering concern rather than a quarterly line-item review. Most teams have not done this yet, because the budget conversation was still about access — which model, which tier, which API — rather than about operations.

The shape of the AI infrastructure decision has changed. The question used to be: which model do we access? The question now is: what does it cost to run our specific workload pattern at our actual scale, and are we running it on the right substrate? The second question requires knowing your token volume by workflow type, your average context length per workflow, your call frequency, and your latency tolerance at each step. Most teams know the first number and not the others.
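
Getting those numbers usually starts with tagging each model call by workflow and aggregating the logs. A minimal sketch, assuming a hypothetical call record with a workflow label, token counts, and latency; the field names and sample values are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Call:
    workflow: str        # hypothetical label attached when the call is made
    input_tokens: int    # context length sent on this call
    output_tokens: int
    latency_ms: float


def summarize(calls: list[Call]) -> dict[str, dict[str, float]]:
    """Group calls by workflow and report volume, context size, and latency."""
    by_workflow: dict[str, list[Call]] = defaultdict(list)
    for call in calls:
        by_workflow[call.workflow].append(call)

    summary = {}
    for name, group in by_workflow.items():
        summary[name] = {
            "calls": len(group),
            "total_tokens": sum(c.input_tokens + c.output_tokens for c in group),
            "avg_context_tokens": mean(c.input_tokens for c in group),
            "max_latency_ms": max(c.latency_ms for c in group),
        }
    return summary


# Invented sample data for two workflow types.
calls = [
    Call("support_chat", 1_000, 200, 800.0),
    Call("support_chat", 3_000, 400, 1_200.0),
    Call("invoice_agent", 20_000, 1_000, 4_500.0),
]
report = summarize(calls)
for name, stats in report.items():
    print(name, stats)
```

Observed latency is only half of the last item on the list: the tolerance at each step is a product decision, and comparing it against the measured figures shows where a slower, cheaper model would be acceptable.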

The inference bill is the first honest measure of whether an AI deployment is actually working at the scale it needs to. That conversation has been sitting in the engineering team. It is about to move to the CFO's office.

The short of it.

In early 2026, AI inference — running models in production — flipped from a minor budget item to fifty-five percent of total cloud AI spend. The cause is the shift from simple chat interactions to multi-step agentic workflows, which cost between $0.10 and $1.00 per completion — a hundred to a thousand times more per task than the pattern most teams budgeted for. Optimizations exist: model routing, prompt caching, self-hosting above certain volumes. Most teams haven't applied them yet, and the bill is arriving before the conversation does.

Working with us: hire MonArch

Founder-led studio. Two engagements at a time. Discovery first, software if needed.