

Updated 4 May 2026 • 8 mins read

A Snowflake AI agent cost anomaly is excess Cortex credit usage caused by inefficient query patterns or warehouse behavior. It typically falls into one of three failure modes, LLM calls in query predicates, unbounded embedding jobs, and warehouse auto-resume loops, and is detectable via ACCOUNT_USAGE views.
Most cloud cost anomalies show up in compute, storage, or egress. Snowflake Cortex breaks this mental model. The pricing unit is the Snowflake credit, and a single credit bundles three things into one non-decomposable line item: warehouse compute, internal network movement, and the AI inference call itself. There is no breakdown on the invoice. There is no separate inference rate you can compare against OpenAI or Anthropic. The credit is the only thing the finance team sees.
This bundling matters because the cost driver shifts depending on call shape. For high-throughput workloads, the inference itself dominates. For low-frequency or sparse workloads, the warehouse wake dominates. A 200 millisecond Cortex call that wakes a stopped warehouse bills at the warehouse rate for the 60-second minimum, not the inference rate. The same call running inside a busy warehouse bills at near-zero marginal cost. The structural difference is enormous, and standard FinOps dashboards do not surface it.
This is why we wrote this playbook. Generic anomaly detection tools watch warehouse-level credit consumption. The Cortex Spend Triad, the three-pattern framework this playbook is built around, operates one analytical layer deeper, at the function-call-shape level. If you are new to FinOps as a discipline, our introduction to FinOps walks through the foundational principles that make function-level visibility possible in the first place.
We define a Cortex cost anomaly as any credit consumption pattern from Snowflake Cortex AI functions, including COMPLETE, EMBED_TEXT_768, SUMMARIZE, EXTRACT_ANSWER, and CLASSIFY_TEXT, that breaks the linear, predictable scaling assumed in capacity planning. In statistical terms, we flag any deviation greater than two standard deviations above the 30-day rolling baseline of credits attributed to CORTEX_FUNCTIONS_USAGE_HISTORY for a given warehouse, role, or query tag.
Definition.
Cortex cost anomaly: a deviation greater than two standard deviations above the 30-day rolling baseline of credits attributed to CORTEX_FUNCTIONS_USAGE_HISTORY for a given warehouse, role, or query tag.
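The definition above translates directly into SQL. This is a sketch, not a production query: the `token_credits` column and view path reflect how Snowflake currently documents `CORTEX_FUNCTIONS_USAGE_HISTORY`, but verify both against your account, and note that `STDDEV` with a sliding window frame is assumed to be available in your Snowflake edition.

```sql
-- Sketch: flag any day where a warehouse's Cortex credits exceed the
-- trailing 30-day mean by more than two standard deviations.
WITH daily AS (
    SELECT warehouse_id,
           DATE_TRUNC('day', start_time) AS usage_day,
           SUM(token_credits)            AS credits
    FROM SNOWFLAKE.ACCOUNT_USAGE.CORTEX_FUNCTIONS_USAGE_HISTORY
    GROUP BY 1, 2
), stats AS (
    SELECT d.*,
           AVG(credits)    OVER (PARTITION BY warehouse_id ORDER BY usage_day
                                 ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS baseline_mean,
           STDDEV(credits) OVER (PARTITION BY warehouse_id ORDER BY usage_day
                                 ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS baseline_stddev
    FROM daily d
)
SELECT *
FROM stats
WHERE credits > baseline_mean + 2 * baseline_stddev;
```

The same shape works per role or query tag by swapping the PARTITION BY key and the grouping column.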
The break is almost never the model itself. Cortex pricing per token is consistent and predictable. The break sits in the surrounding query plan, the warehouse lifecycle, or the call pattern. This is good news, because it means we can detect anomalies without touching the model layer. A few SQL queries against the right account-usage views catch nine out of ten incidents in our audit set.
The Triad is a diagnostic shortcut. Rather than auditing every Cortex call individually, we classify each anomaly into one of three failure modes. Each mode has a single SQL detection signature and a one-line architectural fix. Coverage in our audit set was 91 percent, meaning nine of ten anomalies fit one of the three patterns cleanly.
The three patterns differ in cost shape, detection signal, and blast radius. WHERE-clause LLMs scale with rows scanned multiplied by token cost, and they typically drive 10 to 100 times the baseline daily spend on a single offending query. Unbounded embeddings scale linearly with row count and run 500 to 3,000 credits per incident. Auto-resume loops show up as low utilization with a high credit-to-compute ratio, and they quietly absorb 20 to 40 percent of a warehouse's monthly bill before anyone notices.
Across the 47 Snowflake accounts we audited in Q1 2026, 73 percent had at least one Triad pattern active in the prior 30 days. AI workloads now represent 20 to 35 percent of cloud spend for AI-native companies in our customer base, so the Triad is not a niche concern. It is a primary cost driver. We will walk through each pattern in turn, starting with the most common.
This is the classic anomaly, and we see it more often than the other two combined. A developer writes a query like the following, against a 10 million row table:
SELECT * FROM tickets
WHERE SNOWFLAKE.CORTEX.COMPLETE('llama3-8b',
          CONCAT('classify: ', body)) = 'urgent'
The intent looks reasonable. The execution is catastrophic. Snowflake evaluates the function row by row across the full scan. There is no predicate pushdown, no caching, and no opportunity for the optimizer to short-circuit the work. Token cost compounds with row count, and a single query can run for hours while billing thousands of credits.
We run a query against ACCOUNT_USAGE.QUERY_HISTORY filtered to the last seven days, flagging any query where the QUERY_TEXT contains "CORTEX." inside a WHERE, HAVING, JOIN, or CASE predicate, and ROWS_PRODUCED multiplied by the embedded function count exceeds 100,000. The signature is unambiguous. False positive rates in our internal testing came in under four percent.
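A hedged sketch of that signature follows. Regex matching on query text is a coarse stand-in for true predicate parsing, and for simplicity the 100,000 threshold is applied to `rows_produced` alone rather than multiplied by an embedded function count:

```sql
-- Sketch: surface last week's queries that appear to call a Cortex
-- function inside a WHERE/HAVING/JOIN/CASE clause and touch many rows.
SELECT query_id,
       warehouse_name,
       rows_produced,
       total_elapsed_time / 1000 AS elapsed_seconds
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND REGEXP_LIKE(query_text, '.*(WHERE|HAVING|JOIN|CASE)[^;]*CORTEX\\..*', 'is')
  AND rows_produced > 100000
ORDER BY rows_produced DESC;
```

The `'is'` parameter makes the match case-insensitive and lets `.` span newlines, which matters for multi-line query text.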
Move the Cortex call into a CTE that filters first with cheap predicates, then apply the LLM only to a bounded result set. For repeated classifications, precompute the labels into a materialized column on a daily schedule and JOIN against the precomputed values. Pattern 1 typically drops 95 percent in cost after this rewrite.
The bad version scans every row and calls the LLM on each. With 10 million rows in the tickets table, that is 10 million inference calls. The good version filters first:
WITH filtered AS (
    SELECT * FROM tickets
    WHERE created_at > CURRENT_DATE - 7
      AND priority IS NULL
)
SELECT *,
       SNOWFLAKE.CORTEX.COMPLETE('llama3-8b',
           CONCAT('classify: ', body)) AS classification
FROM filtered

The result set is bounded to roughly 50,000 rows. The same business logic now runs at a fraction of the original cost, with no change for the downstream consumer.
Embedding pipelines are the second-largest source of anomalies in our audit data. The pattern reads "embed all rows in customer_messages and store in a vector column," with no sample, no checkpoint, and no idempotency check. A re-run after a partial failure recomputes every embedding from scratch. A schema migration that adds a new field can trigger a full re-embed of historical data without anyone noticing.
Real incident.
We worked with a customer in February 2026 whose unbounded EMBED_TEXT_768 query against a 41 million row table consumed 2,340 credits in six hours. That single query represented 19 percent of their monthly Snowflake budget. The job was running as a one-time backfill, kicked off by a data scientist who did not realize the query had no LIMIT and no incremental WHERE predicate. After we restructured the same job with a hash-based checkpoint and a daily incremental, it now runs on 180 credits per month for the same business outcome.
We query CORTEX_FUNCTIONS_USAGE_HISTORY for any EMBED_TEXT_768 call with credits_used greater than 50, then join against QUERY_HISTORY on the query ID to confirm the absence of a LIMIT, a SAMPLE, or an incremental WHERE predicate. The combination of high credit use and unbounded query shape is a reliable signal, and the query takes seconds to run against a year of history.
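A sketch of that join follows. It assumes per-query Cortex usage is exposed with a `query_id`; in current Snowflake releases that is the `CORTEX_FUNCTIONS_QUERY_USAGE_HISTORY` view rather than the hourly-aggregated one, so confirm the view and its columns in your account before relying on this:

```sql
-- Sketch: high-credit embedding calls joined back to their query text,
-- keeping only queries with no bounding clause at all.
SELECT q.query_id,
       q.warehouse_name,
       c.token_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.CORTEX_FUNCTIONS_QUERY_USAGE_HISTORY c
JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY q
  ON q.query_id = c.query_id
WHERE c.function_name = 'EMBED_TEXT_768'
  AND c.token_credits > 50
  AND NOT REGEXP_LIKE(q.query_text, '.*(LIMIT|SAMPLE|WHERE).*', 'is')
ORDER BY c.token_credits DESC;
```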
Three guardrails work together. First, check a hash of the input column against an embedded_at column before re-embedding any row, so re-runs become idempotent. Second, process in batches of 100,000 rows with a checkpoint table that records progress, so a partial failure resumes where it stopped instead of restarting from zero. Third, cap any single embedding job at a configurable credit budget using budget alerts tied to the warehouse query tag. The 2,340 credit incident would have stopped at 200 credits with this third guardrail in place.
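The first guardrail can be sketched as an idempotent incremental update. The `embedding`, `embedding_hash`, and `embedded_at` columns and the `e5-base-v2` model choice are illustrative assumptions, not prescriptions:

```sql
-- Sketch: only re-embed rows whose content hash changed since the last
-- run, so a re-run after a partial failure skips already-finished rows.
UPDATE customer_messages
SET embedding      = SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', body),
    embedding_hash = MD5(body),
    embedded_at    = CURRENT_TIMESTAMP()
WHERE embedding_hash IS DISTINCT FROM MD5(body);
```

The second guardrail layers on top of this by restricting the UPDATE to one checkpoint range of IDs at a time and recording the completed range in a checkpoint table after each commit.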
We have written more on operationalizing budget caps and tag-based alerting in our guide to FinOps cost allocation practices, which covers the broader pattern of tying spend to ownership through tags, hierarchies, and chargeback.
This is the most expensive pattern per Cortex call, because the Cortex inference itself becomes rounding error compared to the warehouse wake. The shape is simple. A Streamlit application, a Snowflake Task, or a Snowpipe streaming job triggers a single small CORTEX.COMPLETE call once every 90 seconds. Each call wakes a stopped warehouse for the 60-second minimum charge. The Cortex inference takes 200 milliseconds. The warehouse charges for 60 seconds. The resulting credit-to-actual-compute ratio sits around 4.8 times on average across the Streamlit and Tasks deployments we have audited.
The implication is counterintuitive. A small Cortex application can outspend the production data pipeline running on the same warehouse, because the production pipeline keeps the warehouse warm while the Cortex calls each incur a fresh wake cycle. Teams routinely discover that a Streamlit dashboard with 20 daily users is costing more in credits than the nightly batch job that processes 50 million rows.
We query QUERY_HISTORY for CORTEX traffic, sum credits used and execution time per warehouse, and flag any warehouse where the credit-to-compute ratio exceeds 3.0. A healthy ratio sits near 1.0. Anything above 3.0 indicates significant warehouse-wake amplification.
SELECT warehouse_name,
       SUM(credits_used)
         / SUM(execution_time / 1000 / 3600) AS credit_compute_ratio
FROM QUERY_HISTORY
WHERE query_text ILIKE '%cortex.%'
GROUP BY warehouse_name
HAVING SUM(credits_used) / SUM(execution_time / 1000 / 3600) > 3.0
Route Cortex traffic to a dedicated XS warehouse with a 60-second auto-suspend and batched task scheduling, so a single wake serves many calls. For very infrequent Cortex use, move the calls to a serverless task instead. Our architectural rule of thumb: any Cortex call frequency below five per minute should never wake its own dedicated warehouse, because the wake economics will dominate the inference economics every time.
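A sketch of that routing fix, under assumed names (`cortex_xs_wh`, `pending_requests`, and `classifications` are hypothetical):

```sql
-- Sketch: a dedicated XS warehouse with aggressive auto-suspend, plus a
-- batched task so one wake cycle serves many Cortex calls.
CREATE WAREHOUSE IF NOT EXISTS cortex_xs_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND   = 60
    AUTO_RESUME    = TRUE;

CREATE OR REPLACE TASK batch_cortex_classify
    WAREHOUSE = cortex_xs_wh
    SCHEDULE  = '15 MINUTE'
AS
    INSERT INTO classifications (id, label)
    SELECT id,
           SNOWFLAKE.CORTEX.COMPLETE('llama3-8b', CONCAT('classify: ', body))
    FROM pending_requests;

-- Tasks are created suspended; start this one explicitly.
ALTER TASK batch_cortex_classify RESUME;
```

Batching every 15 minutes trades a little latency for one warehouse wake per window instead of one per call.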
The three queries above are cheap to run. The challenge is making them part of an operational rhythm rather than an audit you do once and forget. Our recommended approach is to wire all three queries into a daily Snowflake Task with an alert action that pushes results into Slack or PagerDuty. The detection lag drops from "next monthly invoice" to under 24 hours.
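One way to wire this up with native primitives is a Snowflake alert per pattern. This sketch covers the WHERE-clause threshold only; `monitoring_wh`, the notification integration name, and the recipient address are assumptions, and `credits_used_cloud_services` is only a partial proxy for a query's total credits:

```sql
-- Sketch: daily alert that fires when any Cortex-calling query
-- crosses the 50-credit threshold in the last 24 hours.
CREATE OR REPLACE ALERT triad_where_clause_llm
    WAREHOUSE = monitoring_wh
    SCHEDULE  = '1440 MINUTE'
    IF (EXISTS (
        SELECT 1
        FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
        WHERE start_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
          AND query_text ILIKE '%cortex.%'
          AND credits_used_cloud_services > 50))
    THEN CALL SYSTEM$SEND_EMAIL(
        'triad_notification_int',        -- assumed notification integration
        'finops-team@example.com',       -- assumed recipient
        'Cortex Triad: WHERE-clause LLM threshold exceeded',
        'A Cortex-calling query crossed 50 credits in the last 24 hours.');

ALTER ALERT triad_where_clause_llm RESUME;
```

Routing to Slack or PagerDuty instead of email works the same way, with the alert action calling whatever notification integration you have configured.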
The thresholds we use in our customer deployments are the following. For WHERE-clause LLMs, alert on any single query above 50 credits with a one-hour lookback window, sourced from QUERY_HISTORY. For unbounded embeddings, alert on any single function call above 100 credits with a 24-hour lookback, sourced from CORTEX_FUNCTIONS_USAGE_HISTORY. For auto-resume loops, alert on any warehouse where the credit-to-compute ratio exceeds 3.0 over a 24-hour window, sourced from WAREHOUSE_METERING_HISTORY combined with QUERY_HISTORY.
These numbers are starting points, not absolutes. A team running heavy nightly batch embeddings might raise the embedding threshold to 250 credits to avoid noise. A team running mostly interactive queries might lower the WHERE-clause threshold to 25 credits. The principle stays the same: catch the offending query while the developer still remembers writing it, not at the next monthly review.
If you want to go deeper on the broader topic of building a forecast and detection layer, our beginner guide to cloud cost forecasting and anomaly detection covers how forecasting and anomaly detection work together to move teams from reactive billing reviews to proactive cost management.
We get one question more often than any other. If Cortex is causing this much pain, why not switch to OpenAI direct API calls or run our own LLM on EC2? The honest answer is that each approach has a different cost shape and different operational tradeoffs, and the right choice depends on workload pattern. The table below summarizes how the three options compare across the dimensions that actually matter in a quarterly cost review.
| Dimension | Snowflake Cortex | OpenAI Direct API | Self-Hosted (EC2 + GPU) |
|---|---|---|---|
| Per-token cost (low frequency) | 4 to 10x baseline | Baseline | 8 to 20x baseline |
| Per-token cost (high throughput) | 1.2 to 1.8x baseline | Baseline | 0.6 to 1.0x baseline |
| Setup time | Minutes (function call) | Hours (API key plus integration) | Weeks (infra, monitoring, deploy) |
| Data residency | Stays in Snowflake | Leaves your perimeter | Stays in your VPC |
| Cost line on invoice | Bundled credits | Itemized per token | EC2 plus storage plus egress |
| Operational burden | Low | Low | High |
| Anomaly detectability | Hard (bundled) | Easy (itemized) | Medium (compute-level) |
| Best for | Data already in Snowflake, batch jobs | App-tier inference, predictable QPS | High-volume sustained workloads |
The structural insight from this table is the one we lead with in customer conversations. Cortex is cheaper than building your own LLM infrastructure on EC2 for most teams. Cortex is more expensive than calling a hosted API directly for low-frequency or sparse workloads. The cost-effectiveness claim depends entirely on whether you would have built the surrounding infrastructure anyway, and on whether your call pattern keeps a warehouse warm or wakes one for every request.
For the 1,000-call-per-day Streamlit pattern, Cortex can run ten times the cost of an equivalent OpenAI integration. For a nightly batch job that processes 50 million rows on a warehouse that is already running, Cortex is competitive or better, and the data residency benefit is real. The honest framing is workload-dependent, not vendor-dependent.
If you are evaluating broader AI cost economics, our analysis of AI versus manual cloud cost optimization covers how AI workload patterns differ from traditional compute and where teams typically miss the early signals on inference spend.
We see the same diagnostic gap in every Cortex audit we run. CloudZero, Finout, Vantage, and Kubecost are excellent at warehouse-level cost attribution, multi-cloud reporting, and tag-based showback. None of them currently parse Cortex function-level cost. They report that warehouse X spent 200 credits yesterday. They do not report that 47 of those credits came from a single CORTEX.COMPLETE call inside a WHERE clause on the tickets table.
Snowflake's native cost dashboards have the same limitation. They surface warehouse spend and query history, but they do not classify queries against workload-shape anomaly signatures. Account administrators see the totals and the top queries, but the Triad patterns sit one analytical layer below what the dashboards expose. You would need to write the detection SQL yourself, schedule it as a Task, route the alerts somewhere actionable, and keep the thresholds tuned as your workload mix evolves.
This is not a criticism of these tools. They were designed before Cortex existed as a billable workload, and warehouse-level visibility was the right resolution for the previous generation of Snowflake cost questions. The Cortex Spend Triad is a workload-shape problem, and solving it requires workload-shape telemetry. We have written separately on how to evaluate the broader FinOps tool landscape, including which categories of tools handle which categories of cost questions.
Our Cost Explorer++ ingests Snowflake ACCOUNT_USAGE and Cortex telemetry into the same Cloud CMDB context as our customers' AWS, GCP, and Azure workloads. The Triad detection queries run continuously in our backend, not as a one-off audit. Three things happen automatically.
Each Cortex query is classified against the three Triad signatures within five minutes of execution. We do not wait for the monthly close. By the time the offending query has finished running, our customers already have an alert in Slack or PagerDuty with the query text, the warehouse, the role, and the suspected pattern. Five-minute resolution is the difference between catching a bad query before it scales and explaining it after the fact.
Our Cloud CMDB ties the offending query to the dbt model, Streamlit application, or Task that triggered it. Generic dashboards show "warehouse spent 200 credits." We show "the customer-classifier dbt model triggered 47 WHERE-clause LLM scans on the tickets table this morning, owned by the data platform team, attributed to project alpha." The attribution chain is critical, because without it the alert sits on a generic Snowflake admin mailing list and goes unactioned for days.
Budget alerts tied to query tags can pause the warehouse before a single embedding job exceeds a credit cap. The 2,340 credit incident from February 2026 would have stopped at 200 credits with the guardrails our customers run by default. We treat the cap as a circuit breaker, not a hard prevention, so legitimate large jobs can still run with explicit approval.
We hear consistently from customers that the attribution layer is what changes their operational posture. Engineering teams who can see "the customer-classifier dbt model is the source of this spike" can ship a fix in hours. Engineering teams who only see "warehouse X is over budget" spend days hunting for the cause. If you want to read more on the attribution philosophy, our guide to cloud cost optimization strategies covers the broader pattern of tying every dollar to an owner.
Onboarding takes under seven days in our experience. Connection is via a read-only Snowflake role, no agent install, and no tagging campaign required upfront. The Triad detection runs from day one, and the CMDB attribution layer fills in over the first 24 hours as we observe query patterns and map them to source jobs.
A Cortex cost anomaly is an unplanned credit spike from Snowflake Cortex AI functions, typically caused by one of the Triad patterns: LLM functions in WHERE clauses, unbounded embeddings, or warehouse auto-resume loops. We define the threshold as a deviation greater than two standard deviations above the 30-day baseline.
Three queries against SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY, CORTEX_FUNCTIONS_USAGE_HISTORY, and WAREHOUSE_METERING_HISTORY catch the three Triad patterns. Run them as a daily Snowflake Task with credit thresholds: above 50 for WHERE-clause LLMs, above 100 for unbounded embeddings, and a credit-to-compute ratio above 3.0 for auto-resume loops.
Cortex credits bundle warehouse compute, network egress, and AI inference into a single non-decomposable unit. A 200 millisecond inference call that wakes a warehouse bills at the warehouse rate, not the inference rate. For identical Llama 3 8B or equivalent workloads, Cortex runs 4 to 10 times the cost of OpenAI direct API once warehouse wake is included.
As of May 2026, CloudZero, Finout, Vantage, and Kubecost do not parse Snowflake Cortex function-level cost. They report warehouse-level spend. The Triad patterns are workload-shape anomalies that require function-level telemetry, which our Cost Explorer++ provides natively.