Updated 9 Mar 2025 • 5 mins read
Khushi Dubey | Author

For decades, Moore’s Law shaped how we think about technology costs. Faster chips meant lower prices over time. More power, less expense. That pattern trained leaders to expect efficiency gains to translate directly into savings.
In artificial intelligence, the story sounds similar at first. The cost per token for large language model inference continues to fall. According to Epoch AI, token pricing has dropped sharply in recent years. At the unit level, AI is getting cheaper.
Yet in real-world systems, total spending is rising.
As a cloud engineer working with AI workloads, I see this disconnect daily. The per-token price may decline, but the number of tokens consumed per task is growing at a much faster rate. The result is a cost illusion. On paper, inference looks inexpensive. In practice, total AI spend often increases.
Let us unpack what is really happening.
Research from Andreessen Horowitz and Epoch AI shows that LLM inference costs have dropped by more than 10 times per year in some cases. Andreessen Horowitz even coined the term LLMflation to describe this rapid price decline.
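The cost illusion is easy to see with back-of-envelope arithmetic. The prices and token counts below are hypothetical, chosen only to mirror the trend described above: the unit price falls roughly 10x, but tokens consumed per task grow even faster, so cost per task still rises.

```python
# Illustrative arithmetic only: these rates are hypothetical, not real vendor prices.

def total_cost_per_task(price_per_million_tokens: float, tokens_per_task: int) -> float:
    """Cost of one task in dollars."""
    return price_per_million_tokens * tokens_per_task / 1_000_000

# Year 1: a simple single-shot prompt.
year1 = total_cost_per_task(price_per_million_tokens=10.0, tokens_per_task=2_000)

# Year 2: unit price drops 10x, but an agentic workflow now burns 100x the tokens.
year2 = total_cost_per_task(price_per_million_tokens=1.0, tokens_per_task=200_000)

print(f"year 1: ${year1:.4f} per task, year 2: ${year2:.4f} per task")
```

Despite the 10x cheaper tokens, the cost per task in this sketch rises 10x, which is exactly the disconnect between unit pricing and total spend.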
For basic use cases, per-token pricing keeps trending downward.
However, the complexity of AI applications has expanded just as quickly.
According to reporting from The Wall Street Journal, average token consumption per task can vary widely between simple requests and complex multi-step workflows. That variation explains why total AI bills are climbing.
Modern models no longer generate a single response and stop. They reason through tasks, retry failures, call external tools, and chain multiple steps together. Each step consumes additional tokens. Some advanced systems may execute dozens or even hundreds of internal reasoning steps before returning a final answer.
A typical AI reasoning loop often includes an initial planning step, one or more external tool calls, evaluation of intermediate results, and retries on failure before a final answer is produced.
Agentic frameworks such as AutoGPT and OpenAgents operate this way. Developer tools like Cursor and collaborative platforms such as Replit and Notion are increasingly embedding similar logic.
These systems are not simple chatbots. They are autonomous engines executing layered workflows. More intelligence requires more computation. More computation requires more tokens.
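The reasoning loop described above can be sketched in a few lines. The step names, per-step token costs, and the `call_model` stub are all hypothetical; real agents make an LLM API call at each step. The point is structural: every extra step and every retry adds tokens.

```python
# A minimal sketch of an agentic reasoning loop that tallies token consumption.
# Step costs and the call_model stub are hypothetical placeholders for real LLM calls.

def call_model(step: str) -> int:
    """Stub for an LLM call; returns tokens consumed (made-up figures)."""
    costs = {"plan": 1_500, "tool_call": 800, "evaluate": 600, "answer": 1_200}
    return costs[step]

def run_agent(task_steps: list[str], max_retries: int = 2) -> int:
    """Execute a chain of steps, retrying flaky ones, and tally total tokens."""
    total = 0
    for step in task_steps:
        # Pretend every tool call fails and is retried, to show the cost of retries.
        attempts = 1 + (max_retries if step == "tool_call" else 0)
        for _ in range(attempts):
            total += call_model(step)
    return total

single_shot = run_agent(["answer"])  # a classic one-response chatbot
agentic = run_agent(["plan", "tool_call", "evaluate",
                     "tool_call", "evaluate", "answer"])  # multi-step workflow
print(single_shot, agentic)
```

In this toy setup the agentic workflow consumes several times the tokens of a single-shot response, before any per-token price is even considered.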
When AI features scale across thousands or millions of users, token-heavy workflows drive substantial infrastructure costs. Even if each token is cheaper than last year, the total cost per task can grow dramatically.
TechRepublic reported that Notion experienced a 10 percentage point decline in profit margins linked to AI-related costs. That is not a minor fluctuation. It is a strategic concern.
An even more striking example surfaced in coverage by Business Insider. Some platforms discovered what they call inference whales: users consuming tens of thousands of dollars in compute under flat-rate pricing plans. One case highlighted a developer who consumed over 35,000 dollars in compute while paying only 200 dollars on a fixed subscription.
That pricing mismatch creates serious financial exposure.
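The exposure math, using the figures reported above, is stark:

```python
# Back-of-envelope exposure from one "inference whale" on a flat-rate plan,
# using the $200 subscription vs. $35,000 compute figures cited above.

subscription_revenue = 200.0
compute_cost = 35_000.0

loss_per_whale = compute_cost - subscription_revenue
cost_ratio = compute_cost / subscription_revenue

print(f"loss per whale: ${loss_per_whale:,.0f} (compute is {cost_ratio:.0f}x revenue)")
```

A single such user erases the margin from hundreds of well-behaved subscribers, which is why flat-rate plans become untenable when usage varies this widely.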
Meanwhile, reporting from The Wall Street Journal noted that users of Cursor were exhausting usage credits within days. Replit introduced effort-based pricing to control usage, but that decision triggered public backlash and concerns about value perception.
These examples illustrate a broader issue. AI expands product capability and can accelerate growth. At the same time, it can compress margins if cost visibility and pricing discipline are weak.
In traditional SaaS, the Rule of 40 balances revenue growth and profit margin. AI complicates that balance.
AI features may boost customer acquisition and increase revenue. However, if inference costs rise faster than monetization, margins shrink. When margins fall, overall Rule of 40 scores decline. A company may grow rapidly yet drift below sustainable thresholds.
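The arithmetic is simple but the effect is easy to underestimate. The growth and margin figures below are hypothetical; the 10-point margin hit echoes the Notion example reported above.

```python
# Rule of 40: revenue growth rate (%) plus profit margin (%) should stay >= 40.
# Figures are hypothetical, illustrating how AI costs can drag a healthy score under.

def rule_of_40(growth_pct: float, margin_pct: float) -> float:
    return growth_pct + margin_pct

before_ai = rule_of_40(growth_pct=30.0, margin_pct=15.0)            # comfortably above 40
after_ai = rule_of_40(growth_pct=33.0, margin_pct=15.0 - 10.0)      # growth up, margin down 10 pts
print(before_ai, after_ai)  # 45.0 38.0
```

Growth actually improved in this sketch, yet the company still slipped below the threshold because inference costs compressed margin faster than revenue expanded.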
As T3 Chat CEO Theo Browne stated in a Wall Street Journal interview, the competition to build the smartest system has also become a competition to build the most expensive system.
From an engineering perspective, this is not surprising. Complex reasoning chains, recursive calls, and multi-agent coordination require substantial compute. The surprise lies in how quickly those costs accumulate when deployed at scale.
Organizations are experimenting with different approaches to manage AI economics.
Some enterprise platforms choose to absorb inference costs temporarily to gain adoption and build a strategic advantage. Notion and GitHub Copilot initiatives illustrate this approach. The goal is long-term market position, even if short-term margins tighten.
Other companies implement usage-based pricing or increase subscription tiers. Flat-rate plans have proven risky when usage varies widely between customers.
Dynamic routing sends simple tasks to lightweight models and reserves premium models for complex work. This architectural decision reduces the average cost per request without degrading user experience.
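A minimal sketch of that routing decision follows. The model names, prices, and the length-based complexity heuristic are hypothetical; production routers typically use trained classifiers or benchmarked task categories rather than string checks.

```python
# A toy dynamic router: cheap model for simple prompts, premium model for complex ones.
# Model names, $ rates, and the heuristic are hypothetical illustrations.

PRICE_PER_MILLION = {"small-model": 0.5, "premium-model": 10.0}

def route(prompt: str) -> str:
    """Send short, simple prompts to the cheap model; everything else goes premium."""
    looks_complex = len(prompt) > 500 or "step by step" in prompt.lower()
    return "premium-model" if looks_complex else "small-model"

def estimated_cost(prompt: str, expected_tokens: int) -> float:
    """Dollar cost of a request under the chosen model's rate."""
    return PRICE_PER_MILLION[route(prompt)] * expected_tokens / 1_000_000

simple = route("Translate 'hello' to French")
complex_task = route("Analyze this contract step by step and flag risky clauses")
print(simple, complex_task)
```

The design choice worth noting: routing happens before the expensive call, so the average cost per request drops without touching the quality of genuinely hard tasks.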
Some providers invest in specialized accelerators or custom silicon designed specifically for inference workloads. This lowers the cost per output at the infrastructure level.
Engineering teams now implement retry caps, depth limits, throttling rules, and budget constraints. These controls resemble classic cloud FinOps governance practices, adapted for AI workloads.
Despite these strategies, one challenge remains consistent. Many companies lack detailed visibility into what AI workflows truly cost.
Blended infrastructure metrics are no longer sufficient. Leaders need to understand what each AI workflow actually costs, which features and customers drive token consumption, and which models account for the spend.
Without this level of detail, companies risk scaling usage without protecting profitability.
Opslyft’s State of AI Costs in 2025 report found that only 51 percent of organizations feel confident evaluating AI return on investment. That statistic reflects a visibility gap. Teams see total cloud spend rising, but cannot trace costs back to specific AI behaviors.
From my experience, effective AI cost management requires treating token consumption as a constrained resource. Just as early cloud adopters learned to manage compute and storage carefully, AI-native teams must design systems where cost is part of the architecture.
This includes setting token budgets per task, routing work to the cheapest capable model, capping retries and reasoning depth, and continuously weighing consumption against the business value it creates.
A new discipline is emerging to address this challenge: AI FinOps.
AI FinOps extends traditional cloud financial management into the world of tokens, models, and autonomous agents. It focuses on aligning AI infrastructure spend directly with business value.
Key capabilities include granular cost attribution across models, features, and customers, budget enforcement for agentic workflows, and detection of outlier users and workloads before they erode margins.
The goal is not simply to reduce spending. The goal is to understand it. Visibility enables control. Control protects margins.
Falling token prices can create a false sense of security. At the unit level, AI inference is cheaper than before. At the system level, however, growing task complexity often drives total costs higher.
As AI applications evolve into multi-step, autonomous workflows, token consumption grows rapidly. This shift affects pricing models, profit margins, and even long-term growth narratives.
Sustainable AI adoption requires disciplined cost architecture. Companies must treat inference spend as a strategic resource, not an invisible byproduct of innovation.
In this new AI economy, margin discipline becomes a competitive advantage. Smarter systems are powerful. Profitable systems endure.