Updated 9 Mar 2025 • 5 mins read
Khushi Dubey | Author

For decades, Moore’s Law shaped how we think about technology costs. Faster chips meant lower prices over time. More power, less expense. That pattern trained leaders to expect efficiency gains to translate directly into savings.
In artificial intelligence, the story sounds similar at first. The cost per token for large language model inference continues to fall. According to Epoch AI, token pricing has dropped sharply in recent years. At the unit level, AI is getting cheaper.
Yet in real-world systems, total spending is rising.
As a cloud engineer working with AI workloads, I see this disconnect daily. The per-token price may decline, but the number of tokens consumed per task is growing at a much faster rate. The result is a cost illusion. On paper, inference looks inexpensive. In practice, total AI spend often increases.
Let us unpack what is really happening.
Research from Andreessen Horowitz and Epoch AI shows that LLM inference costs have dropped by more than 10 times per year in some cases. Andreessen Horowitz even coined the term LLMflation to describe this rapid price decline.
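The cost illusion is easy to see with back-of-envelope arithmetic. The prices and token counts below are hypothetical, chosen only to mirror the trend described above: the unit price falls roughly 10x, but tokens consumed per task grow even faster, so cost per task still rises.

```python
# Illustrative arithmetic only: these rates are hypothetical, not real vendor prices.

def total_cost_per_task(price_per_million_tokens: float, tokens_per_task: int) -> float:
    """Cost of one task in dollars."""
    return price_per_million_tokens * tokens_per_task / 1_000_000

# Year 1: a simple single-shot prompt.
year1 = total_cost_per_task(price_per_million_tokens=10.0, tokens_per_task=2_000)

# Year 2: unit price drops 10x, but an agentic workflow now burns 100x the tokens.
year2 = total_cost_per_task(price_per_million_tokens=1.0, tokens_per_task=200_000)

print(f"year 1: ${year1:.4f} per task, year 2: ${year2:.4f} per task")
```

Despite the 10x cheaper tokens, the cost per task in this sketch rises 10x, which is exactly the disconnect between unit pricing and total spend.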
For basic use cases, per-token pricing keeps trending downward.
However, the complexity of AI applications has expanded just as quickly.
According to reporting from The Wall Street Journal, average token consumption per task can vary widely between simple requests and complex multi-step workflows. That variation explains why total AI bills are climbing.
Modern models no longer generate a single response and stop. They reason through tasks, retry failures, call external tools, and chain multiple steps together. Each step consumes additional tokens. Some advanced systems may execute dozens or even hundreds of internal reasoning steps before returning a final answer.
A typical AI reasoning loop often includes an initial planning step, one or more external tool calls, evaluation of intermediate results, and retries on failure before a final answer is produced.
Agentic frameworks such as AutoGPT and OpenAgents operate this way. Developer tools like Cursor and collaborative platforms such as Replit and Notion are increasingly embedding similar logic.
These systems are not simple chatbots. They are autonomous engines executing layered workflows. More intelligence requires more computation. More computation requires more tokens.
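The reasoning loop described above can be sketched in a few lines. The step names, per-step token costs, and the `call_model` stub are all hypothetical; real agents make an LLM API call at each step. The point is structural: every extra step and every retry adds tokens.

```python
# A minimal sketch of an agentic reasoning loop that tallies token consumption.
# Step costs and the call_model stub are hypothetical placeholders for real LLM calls.

def call_model(step: str) -> int:
    """Stub for an LLM call; returns tokens consumed (made-up figures)."""
    costs = {"plan": 1_500, "tool_call": 800, "evaluate": 600, "answer": 1_200}
    return costs[step]

def run_agent(task_steps: list[str], max_retries: int = 2) -> int:
    """Execute a chain of steps, retrying flaky ones, and tally total tokens."""
    total = 0
    for step in task_steps:
        # Pretend every tool call fails and is retried, to show the cost of retries.
        attempts = 1 + (max_retries if step == "tool_call" else 0)
        for _ in range(attempts):
            total += call_model(step)
    return total

single_shot = run_agent(["answer"])  # a classic one-response chatbot
agentic = run_agent(["plan", "tool_call", "evaluate",
                     "tool_call", "evaluate", "answer"])  # multi-step workflow
print(single_shot, agentic)
```

In this toy setup the agentic workflow consumes several times the tokens of a single-shot response, before any per-token price is even considered.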
When AI features scale across thousands or millions of users, token-heavy workflows drive substantial infrastructure costs. Even if each token is cheaper than last year, the total cost per task can grow dramatically.
TechRepublic reported that Notion experienced a 10 percentage point decline in profit margins linked to AI-related costs. That is not a minor fluctuation. It is a strategic concern.
An even more striking example surfaced in coverage by Business Insider. Some platforms discovered what they call inference whales: users consuming tens of thousands of dollars in compute under flat-rate pricing plans. One case highlighted a developer who consumed over 35,000 dollars in compute while paying only 200 dollars on a fixed subscription.
That pricing mismatch creates serious financial exposure.
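The exposure math, using the figures reported above, is stark:

```python
# Back-of-envelope exposure from one "inference whale" on a flat-rate plan,
# using the $200 subscription vs. $35,000 compute figures cited above.

subscription_revenue = 200.0
compute_cost = 35_000.0

loss_per_whale = compute_cost - subscription_revenue
cost_ratio = compute_cost / subscription_revenue

print(f"loss per whale: ${loss_per_whale:,.0f} (compute is {cost_ratio:.0f}x revenue)")
```

A single such user erases the margin from hundreds of well-behaved subscribers, which is why flat-rate plans become untenable when usage varies this widely.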
Meanwhile, reporting from The Wall Street Journal noted that users of Cursor were exhausting usage credits within days. Replit introduced effort-based pricing to control usage, but that decision triggered public backlash and concerns about value perception.
These examples illustrate a broader issue. AI expands product capability and can accelerate growth. At the same time, it can compress margins if cost visibility and pricing discipline are weak.
In traditional SaaS, the Rule of 40 balances revenue growth and profit margin. AI complicates that balance.
AI features may boost customer acquisition and increase revenue. However, if inference costs rise faster than monetization, margins shrink. When margins fall, overall Rule of 40 scores decline. A company may grow rapidly yet drift below sustainable thresholds.
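The arithmetic is simple but the effect is easy to underestimate. The growth and margin figures below are hypothetical; the 10-point margin hit echoes the Notion example reported above.

```python
# Rule of 40: revenue growth rate (%) plus profit margin (%) should stay >= 40.
# Figures are hypothetical, illustrating how AI costs can drag a healthy score under.

def rule_of_40(growth_pct: float, margin_pct: float) -> float:
    return growth_pct + margin_pct

before_ai = rule_of_40(growth_pct=30.0, margin_pct=15.0)            # comfortably above 40
after_ai = rule_of_40(growth_pct=33.0, margin_pct=15.0 - 10.0)      # growth up, margin down 10 pts
print(before_ai, after_ai)  # 45.0 38.0
```

Growth actually improved in this sketch, yet the company still slipped below the threshold because inference costs compressed margin faster than revenue expanded.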
As T3 Chat CEO Theo Browne stated in a Wall Street Journal interview, the competition to build the smartest system has also become a competition to build the most expensive system.
From an engineering perspective, this is not surprising. Complex reasoning chains, recursive calls, and multi-agent coordination require substantial compute. The surprise lies in how quickly those costs accumulate when deployed at scale.
Organizations are experimenting with different approaches to manage AI economics.
Some enterprise platforms choose to absorb inference costs temporarily to gain adoption and build a strategic advantage. Notion and GitHub Copilot initiatives illustrate this approach. The goal is long-term market position, even if short-term margins tighten.
Other companies implement usage-based pricing or increase subscription tiers. Flat-rate plans have proven risky when usage varies widely between customers.
Dynamic routing sends simple tasks to lightweight models and reserves premium models for complex work. This architectural decision reduces the average cost per request without degrading user experience.
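A minimal sketch of that routing decision follows. The model names, prices, and the length-based complexity heuristic are hypothetical; production routers typically use trained classifiers or benchmarked task categories rather than string checks.

```python
# A toy dynamic router: cheap model for simple prompts, premium model for complex ones.
# Model names, $ rates, and the heuristic are hypothetical illustrations.

PRICE_PER_MILLION = {"small-model": 0.5, "premium-model": 10.0}

def route(prompt: str) -> str:
    """Send short, simple prompts to the cheap model; everything else goes premium."""
    looks_complex = len(prompt) > 500 or "step by step" in prompt.lower()
    return "premium-model" if looks_complex else "small-model"

def estimated_cost(prompt: str, expected_tokens: int) -> float:
    """Dollar cost of a request under the chosen model's rate."""
    return PRICE_PER_MILLION[route(prompt)] * expected_tokens / 1_000_000

simple = route("Translate 'hello' to French")
complex_task = route("Analyze this contract step by step and flag risky clauses")
print(simple, complex_task)
```

The design choice worth noting: routing happens before the expensive call, so the average cost per request drops without touching the quality of genuinely hard tasks.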
Some providers invest in specialized accelerators or custom silicon designed specifically for inference workloads. This lowers the cost per output at the infrastructure level.
Engineering teams now implement retry caps, depth limits, throttling rules, and budget constraints. These controls resemble classic cloud FinOps governance practices, adapted for AI workloads.
Despite these strategies, one challenge remains consistent. Many companies lack detailed visibility into what AI workflows truly cost.
Blended infrastructure metrics are no longer sufficient. Leaders need to understand what each AI workflow actually costs, which features and customers drive token consumption, and which models account for the spend.
Without this level of detail, companies risk scaling usage without protecting profitability.
Opslyft’s State of AI Costs in 2025 report found that only 51 percent of organizations feel confident evaluating AI return on investment. That statistic reflects a visibility gap. Teams see total cloud spend rising, but cannot trace costs back to specific AI behaviors.
From my experience, effective AI cost management requires treating token consumption as a constrained resource. Just as early cloud adopters learned to manage compute and storage carefully, AI-native teams must design systems where cost is part of the architecture.
This includes setting token budgets per task, routing work to the cheapest capable model, capping retries and reasoning depth, and continuously weighing consumption against the business value it creates.
A new discipline is emerging to address this challenge: AI FinOps.
AI FinOps extends traditional cloud financial management into the world of tokens, models, and autonomous agents. It focuses on aligning AI infrastructure spend directly with business value.
Key capabilities include granular cost attribution across models, features, and customers, budget enforcement for agentic workflows, and detection of outlier users and workloads before they erode margins.
The goal is not simply to reduce spending. The goal is to understand it. Visibility enables control. Control protects margins.
Falling token prices can create a false sense of security. At the unit level, AI inference is cheaper than before. At the system level, however, growing task complexity often drives total costs higher.
As AI applications evolve into multi-step, autonomous workflows, token consumption grows rapidly. This shift affects pricing models, profit margins, and even long-term growth narratives.
Sustainable AI adoption requires disciplined cost architecture. Companies must treat inference spend as a strategic resource, not an invisible byproduct of innovation.
In this new AI economy, margin discipline becomes a competitive advantage. Smarter systems are powerful. Profitable systems endure.