

Updated 4 May 2026 • 8 mins read

A Snowflake AI agent cost anomaly is excess Cortex credit usage caused by inefficient query patterns or warehouse behavior. It typically falls into one of three failure modes, LLM calls in query predicates, unbounded embedding jobs, and warehouse auto-resume loops, and is detectable via ACCOUNT_USAGE views.
Most cloud cost anomalies show up in compute, storage, or egress. Snowflake Cortex breaks this mental model. The pricing unit is the Snowflake credit, and a single credit bundles three things into one non-decomposable line item: warehouse compute, internal network movement, and the AI inference call itself. There is no breakdown on the invoice. There is no separate inference rate you can compare against OpenAI or Anthropic. The credit is the only thing the finance team sees.
This bundling matters because the cost driver shifts depending on call shape. For high-throughput workloads, the inference itself dominates. For low-frequency or sparse workloads, the warehouse wake dominates. A 200 millisecond Cortex call that wakes a stopped warehouse bills at the warehouse rate for the 60-second minimum, not the inference rate. The same call running inside a busy warehouse bills at near-zero marginal cost. The structural difference is enormous, and standard FinOps dashboards do not surface it.
This is why we wrote this playbook. Generic anomaly detection tools watch warehouse-level credit consumption. The Cortex Spend Triad, the three-pattern framework this playbook is built around, operates one analytical layer deeper, at the function-call-shape level. If you are new to FinOps as a discipline, our introduction to FinOps walks through the foundational principles that make function-level visibility possible in the first place.
We define a Cortex cost anomaly as any credit consumption pattern from Snowflake Cortex AI functions, including COMPLETE, EMBED_TEXT_768, SUMMARIZE, EXTRACT_ANSWER, and CLASSIFY_TEXT, that breaks the linear, predictable scaling assumed in capacity planning. In statistical terms, we flag any deviation greater than two standard deviations above the 30-day rolling baseline of credits attributed to CORTEX_FUNCTIONS_USAGE_HISTORY for a given warehouse, role, or query tag.
Definition.
Cortex cost anomaly: a deviation greater than two standard deviations above the 30-day rolling baseline of credits attributed to CORTEX_FUNCTIONS_USAGE_HISTORY for a given warehouse, role, or query tag.
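The definition above translates directly into SQL. This is a sketch, not a production query: the `token_credits` column and view path reflect how Snowflake currently documents `CORTEX_FUNCTIONS_USAGE_HISTORY`, but verify both against your account, and note that `STDDEV` with a sliding window frame is assumed to be available in your Snowflake edition.

```sql
-- Sketch: flag any day where a warehouse's Cortex credits exceed the
-- trailing 30-day mean by more than two standard deviations.
WITH daily AS (
    SELECT warehouse_id,
           DATE_TRUNC('day', start_time) AS usage_day,
           SUM(token_credits)            AS credits
    FROM SNOWFLAKE.ACCOUNT_USAGE.CORTEX_FUNCTIONS_USAGE_HISTORY
    GROUP BY 1, 2
), stats AS (
    SELECT d.*,
           AVG(credits)    OVER (PARTITION BY warehouse_id ORDER BY usage_day
                                 ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS baseline_mean,
           STDDEV(credits) OVER (PARTITION BY warehouse_id ORDER BY usage_day
                                 ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS baseline_stddev
    FROM daily d
)
SELECT *
FROM stats
WHERE credits > baseline_mean + 2 * baseline_stddev;
```

The same shape works per role or query tag by swapping the PARTITION BY key and the grouping column.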
The break is almost never the model itself. Cortex pricing per token is consistent and predictable. The break sits in the surrounding query plan, the warehouse lifecycle, or the call pattern. This is good news, because it means we can detect anomalies without touching the model layer. A few SQL queries against the right account-usage views catch nine out of ten incidents in our audit set.
The Triad is a diagnostic shortcut. Rather than auditing every Cortex call individually, we classify each anomaly into one of three failure modes. Each mode has a single SQL detection signature and a one-line architectural fix. Coverage in our audit set was 91 percent, meaning nine of ten anomalies fit one of the three patterns cleanly.
The three patterns differ in cost shape, detection signal, and blast radius. WHERE-clause LLMs scale with rows scanned multiplied by token cost, and they typically drive 10 to 100 times the baseline daily spend on a single offending query. Unbounded embeddings scale linearly with row count and run 500 to 3,000 credits per incident. Auto-resume loops show up as low utilization with a high credit-to-compute ratio, and they quietly absorb 20 to 40 percent of a warehouse's monthly bill before anyone notices.
Across the 47 Snowflake accounts we audited in Q1 2026, 73 percent had at least one Triad pattern active in the prior 30 days. AI workloads now represent 20 to 35 percent of cloud spend for AI-native companies in our customer base, so the Triad is not a niche concern. It is a primary cost driver. We will walk through each pattern in turn, starting with the most common.
This is the classic anomaly, and we see it more often than the other two combined. A developer writes a query like the following, against a 10 million row table:
SELECT * FROM tickets
WHERE SNOWFLAKE.CORTEX.COMPLETE('llama3-8b',
          CONCAT('classify: ', body)) = 'urgent'
The intent looks reasonable. The execution is catastrophic. Snowflake evaluates the function row by row across the full scan. There is no predicate pushdown, no caching, and no opportunity for the optimizer to short-circuit the work. Token cost compounds with row count, and a single query can run for hours while billing thousands of credits.
We run a query against ACCOUNT_USAGE.QUERY_HISTORY filtered to the last seven days, flagging any query where the QUERY_TEXT contains "CORTEX." inside a WHERE, HAVING, JOIN, or CASE predicate, and ROWS_PRODUCED multiplied by the embedded function count exceeds 100,000. The signature is unambiguous. False positive rates in our internal testing came in under four percent.
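A hedged sketch of that signature follows. Regex matching on query text is a coarse stand-in for true predicate parsing, and for simplicity the 100,000 threshold is applied to `rows_produced` alone rather than multiplied by an embedded function count:

```sql
-- Sketch: surface last week's queries that appear to call a Cortex
-- function inside a WHERE/HAVING/JOIN/CASE clause and touch many rows.
SELECT query_id,
       warehouse_name,
       rows_produced,
       total_elapsed_time / 1000 AS elapsed_seconds
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND REGEXP_LIKE(query_text, '.*(WHERE|HAVING|JOIN|CASE)[^;]*CORTEX\\..*', 'is')
  AND rows_produced > 100000
ORDER BY rows_produced DESC;
```

The `'is'` parameter makes the match case-insensitive and lets `.` span newlines, which matters for multi-line query text.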
Move the Cortex call into a CTE that filters first with cheap predicates, then apply the LLM only to a bounded result set. For repeated classifications, precompute the labels into a materialized column on a daily schedule and JOIN against the precomputed values. Pattern 1 typically drops 95 percent in cost after this rewrite.
The bad version scans every row and calls the LLM on each. With 10 million rows in the tickets table, that is 10 million inference calls. The good version filters first:
WITH filtered AS (
    SELECT * FROM tickets
    WHERE created_at > CURRENT_DATE - 7
      AND priority IS NULL
)
SELECT *,
       SNOWFLAKE.CORTEX.COMPLETE('llama3-8b',
           CONCAT('classify: ', body)) AS classification
FROM filtered

The result set is bounded to roughly 50,000 rows. The same business logic now runs at a fraction of the original cost, with no change for the downstream consumer.
Embedding pipelines are the second-largest source of anomalies in our audit data. The pattern reads "embed all rows in customer_messages and store in a vector column," with no sample, no checkpoint, and no idempotency check. A re-run after a partial failure recomputes every embedding from scratch. A schema migration that adds a new field can trigger a full re-embed of historical data without anyone noticing.
Real incident.
We worked with a customer in February 2026 whose unbounded EMBED_TEXT_768 query against a 41 million row table consumed 2,340 credits in six hours. That single query represented 19 percent of their monthly Snowflake budget. The job was running as a one-time backfill, kicked off by a data scientist who did not realize the query had no LIMIT and no incremental WHERE predicate. After we restructured the same job with a hash-based checkpoint and a daily incremental, it now runs on 180 credits per month for the same business outcome.
We query CORTEX_FUNCTIONS_USAGE_HISTORY for any EMBED_TEXT_768 call with credits_used greater than 50, then join against QUERY_HISTORY on the query ID to confirm the absence of a LIMIT, a SAMPLE, or an incremental WHERE predicate. The combination of high credit use and unbounded query shape is a reliable signal, and the query takes seconds to run against a year of history.
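A sketch of that join follows. It assumes per-query Cortex usage is exposed with a `query_id`; in current Snowflake releases that is the `CORTEX_FUNCTIONS_QUERY_USAGE_HISTORY` view rather than the hourly-aggregated one, so confirm the view and its columns in your account before relying on this:

```sql
-- Sketch: high-credit embedding calls joined back to their query text,
-- keeping only queries with no bounding clause at all.
SELECT q.query_id,
       q.warehouse_name,
       c.token_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.CORTEX_FUNCTIONS_QUERY_USAGE_HISTORY c
JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY q
  ON q.query_id = c.query_id
WHERE c.function_name = 'EMBED_TEXT_768'
  AND c.token_credits > 50
  AND NOT REGEXP_LIKE(q.query_text, '.*(LIMIT|SAMPLE|WHERE).*', 'is')
ORDER BY c.token_credits DESC;
```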
Three guardrails work together. First, check a hash of the input column against an embedded_at column before re-embedding any row, so re-runs become idempotent. Second, process in batches of 100,000 rows with a checkpoint table that records progress, so a partial failure resumes where it stopped instead of restarting from zero. Third, cap any single embedding job at a configurable credit budget using budget alerts tied to the warehouse query tag. The 2,340 credit incident would have stopped at 200 credits with this third guardrail in place.
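The first guardrail can be sketched as an idempotent incremental update. The `embedding`, `embedding_hash`, and `embedded_at` columns and the `e5-base-v2` model choice are illustrative assumptions, not prescriptions:

```sql
-- Sketch: only re-embed rows whose content hash changed since the last
-- run, so a re-run after a partial failure skips already-finished rows.
UPDATE customer_messages
SET embedding      = SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', body),
    embedding_hash = MD5(body),
    embedded_at    = CURRENT_TIMESTAMP()
WHERE embedding_hash IS DISTINCT FROM MD5(body);
```

The second guardrail layers on top of this by restricting the UPDATE to one checkpoint range of IDs at a time and recording the completed range in a checkpoint table after each commit.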
We have written more on operationalizing budget caps and tag-based alerting in our guide to FinOps cost allocation practices, which covers the broader pattern of tying spend to ownership through tags, hierarchies, and chargeback.
This is the most expensive pattern per Cortex call, because the Cortex inference itself becomes rounding error compared to the warehouse wake. The shape is simple. A Streamlit application, a Snowflake Task, or a Snowpipe streaming job triggers a single small CORTEX.COMPLETE call once every 90 seconds. Each call wakes a stopped warehouse for the 60-second minimum charge. The Cortex inference takes 200 milliseconds. The warehouse charges for 60 seconds. The resulting credit-to-actual-compute ratio sits around 4.8 times on average across the Streamlit and Tasks deployments we have audited.
The implication is counterintuitive. A small Cortex application can outspend the production data pipeline running on the same warehouse, because the production pipeline keeps the warehouse warm while the Cortex calls each incur a fresh wake cycle. Teams routinely discover that a Streamlit dashboard with 20 daily users is costing more in credits than the nightly batch job that processes 50 million rows.
We query QUERY_HISTORY for CORTEX traffic, sum credits used and execution time per warehouse, and flag any warehouse where the credit-to-compute ratio exceeds 3.0. A healthy ratio sits near 1.0. Anything above 3.0 indicates significant warehouse-wake amplification.
SELECT warehouse_name,
       SUM(credits_used)
         / SUM(execution_time / 1000 / 3600) AS credit_compute_ratio
FROM QUERY_HISTORY
WHERE query_text ILIKE '%cortex.%'
GROUP BY warehouse_name
HAVING SUM(credits_used) / SUM(execution_time / 1000 / 3600) > 3.0
Route Cortex traffic to a dedicated XS warehouse with a 60-second auto-suspend and batched task scheduling, so a single wake serves many calls. For very infrequent Cortex use, move the calls to a serverless task instead. Our architectural rule of thumb: any Cortex call frequency below five per minute should never wake its own dedicated warehouse, because the wake economics will dominate the inference economics every time.
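A sketch of that routing fix, under assumed names (`cortex_xs_wh`, `pending_requests`, and `classifications` are hypothetical):

```sql
-- Sketch: a dedicated XS warehouse with aggressive auto-suspend, plus a
-- batched task so one wake cycle serves many Cortex calls.
CREATE WAREHOUSE IF NOT EXISTS cortex_xs_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND   = 60
    AUTO_RESUME    = TRUE;

CREATE OR REPLACE TASK batch_cortex_classify
    WAREHOUSE = cortex_xs_wh
    SCHEDULE  = '15 MINUTE'
AS
    INSERT INTO classifications (id, label)
    SELECT id,
           SNOWFLAKE.CORTEX.COMPLETE('llama3-8b', CONCAT('classify: ', body))
    FROM pending_requests;

-- Tasks are created suspended; start this one explicitly.
ALTER TASK batch_cortex_classify RESUME;
```

Batching every 15 minutes trades a little latency for one warehouse wake per window instead of one per call.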
The three queries above are cheap to run. The challenge is making them part of an operational rhythm rather than an audit you do once and forget. Our recommended approach is to wire all three queries into a daily Snowflake Task with an alert action that pushes results into Slack or PagerDuty. The detection lag drops from "next monthly invoice" to under 24 hours.
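One way to wire this up with native primitives is a Snowflake alert per pattern. This sketch covers the WHERE-clause threshold only; `monitoring_wh`, the notification integration name, and the recipient address are assumptions, and `credits_used_cloud_services` is only a partial proxy for a query's total credits:

```sql
-- Sketch: daily alert that fires when any Cortex-calling query
-- crosses the 50-credit threshold in the last 24 hours.
CREATE OR REPLACE ALERT triad_where_clause_llm
    WAREHOUSE = monitoring_wh
    SCHEDULE  = '1440 MINUTE'
    IF (EXISTS (
        SELECT 1
        FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
        WHERE start_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
          AND query_text ILIKE '%cortex.%'
          AND credits_used_cloud_services > 50))
    THEN CALL SYSTEM$SEND_EMAIL(
        'triad_notification_int',        -- assumed notification integration
        'finops-team@example.com',       -- assumed recipient
        'Cortex Triad: WHERE-clause LLM threshold exceeded',
        'A Cortex-calling query crossed 50 credits in the last 24 hours.');

ALTER ALERT triad_where_clause_llm RESUME;
```

Routing to Slack or PagerDuty instead of email works the same way, with the alert action calling whatever notification integration you have configured.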
The thresholds we use in our customer deployments are the following. For WHERE-clause LLMs, alert on any single query above 50 credits with a one-hour lookback window, sourced from QUERY_HISTORY. For unbounded embeddings, alert on any single function call above 100 credits with a 24-hour lookback, sourced from CORTEX_FUNCTIONS_USAGE_HISTORY. For auto-resume loops, alert on any warehouse where the credit-to-compute ratio exceeds 3.0 over a 24-hour window, sourced from WAREHOUSE_METERING_HISTORY combined with QUERY_HISTORY.
These numbers are starting points, not absolutes. A team running heavy nightly batch embeddings might raise the embedding threshold to 250 credits to avoid noise. A team running mostly interactive queries might lower the WHERE-clause threshold to 25 credits. The principle stays the same: catch the offending query while the developer still remembers writing it, not at the next monthly review.
If you want to go deeper on the broader topic of building a forecast and detection layer, our beginner guide to cloud cost forecasting and anomaly detection covers how forecasting and anomaly detection work together to move teams from reactive billing reviews to proactive cost management.
We get one question more often than any other. If Cortex is causing this much pain, why not switch to OpenAI direct API calls or run our own LLM on EC2? The honest answer is that each approach has a different cost shape and different operational tradeoffs, and the right choice depends on workload pattern. The table below summarizes how the three options compare across the dimensions that actually matter in a quarterly cost review.
| Dimension | Snowflake Cortex | OpenAI Direct API | Self-Hosted (EC2 + GPU) |
|---|---|---|---|
| Per-token cost (low frequency) | 4 to 10x baseline | Baseline | 8 to 20x baseline |
| Per-token cost (high throughput) | 1.2 to 1.8x baseline | Baseline | 0.6 to 1.0x baseline |
| Setup time | Minutes (function call) | Hours (API key plus integration) | Weeks (infra, monitoring, deploy) |
| Data residency | Stays in Snowflake | Leaves your perimeter | Stays in your VPC |
| Cost line on invoice | Bundled credits | Itemized per token | EC2 plus storage plus egress |
| Operational burden | Low | Low | High |
| Anomaly detectability | Hard (bundled) | Easy (itemized) | Medium (compute-level) |
| Best for | Data already in Snowflake, batch jobs | App-tier inference, predictable QPS | High-volume sustained workloads |
The structural insight from this table is the one we lead with in customer conversations. Cortex is cheaper than building your own LLM infrastructure on EC2 for most teams. Cortex is more expensive than calling a hosted API directly for low-frequency or sparse workloads. The cost-effectiveness claim depends entirely on whether you would have built the surrounding infrastructure anyway, and on whether your call pattern keeps a warehouse warm or wakes one for every request.
For the 1,000-call-per-day Streamlit pattern, Cortex can run ten times the cost of an equivalent OpenAI integration. For a nightly batch job that processes 50 million rows on a warehouse that is already running, Cortex is competitive or better, and the data residency benefit is real. The honest framing is workload-dependent, not vendor-dependent.
If you are evaluating broader AI cost economics, our analysis of AI versus manual cloud cost optimization covers how AI workload patterns differ from traditional compute and where teams typically miss the early signals on inference spend.
We see the same diagnostic gap in every Cortex audit we run. CloudZero, Finout, Vantage, and Kubecost are excellent at warehouse-level cost attribution, multi-cloud reporting, and tag-based showback. None of them currently parse Cortex function-level cost. They report that warehouse X spent 200 credits yesterday. They do not report that 47 of those credits came from a single CORTEX.COMPLETE call inside a WHERE clause on the tickets table.
Snowflake's native cost dashboards have the same limitation. They surface warehouse spend and query history, but they do not classify queries against workload-shape anomaly signatures. Account administrators see the totals and the top queries, but the Triad patterns sit one analytical layer below what the dashboards expose. You would need to write the detection SQL yourself, schedule it as a Task, route the alerts somewhere actionable, and keep the thresholds tuned as your workload mix evolves.
This is not a criticism of these tools. They were designed before Cortex existed as a billable workload, and warehouse-level visibility was the right resolution for the previous generation of Snowflake cost questions. The Cortex Spend Triad is a workload-shape problem, and solving it requires workload-shape telemetry. We have written separately on how to evaluate the broader FinOps tool landscape, including which categories of tools handle which categories of cost questions.
Our Cost Explorer++ ingests Snowflake ACCOUNT_USAGE and Cortex telemetry into the same Cloud CMDB context as our customers' AWS, GCP, and Azure workloads. The Triad detection queries run continuously in our backend, not as a one-off audit. Three things happen automatically.
Each Cortex query is classified against the three Triad signatures within five minutes of execution. We do not wait for the monthly close. By the time the offending query has finished running, our customers already have an alert in Slack or PagerDuty with the query text, the warehouse, the role, and the suspected pattern. Five-minute resolution is the difference between catching a bad query before it scales and explaining it after the fact.
Our Cloud CMDB ties the offending query to the dbt model, Streamlit application, or Task that triggered it. Generic dashboards show "warehouse spent 200 credits." We show "the customer-classifier dbt model triggered 47 WHERE-clause LLM scans on the tickets table this morning, owned by the data platform team, attributed to project alpha." The attribution chain is critical, because without it the alert sits on a generic Snowflake admin mailing list and goes unactioned for days.
Budget alerts tied to query tags can pause the warehouse before a single embedding job exceeds a credit cap. The 2,340 credit incident from February 2026 would have stopped at 200 credits with the guardrails our customers run by default. We treat the cap as a circuit breaker, not a hard prevention, so legitimate large jobs can still run with explicit approval.
We hear consistently from customers that the attribution layer is what changes their operational posture. Engineering teams who can see "the customer-classifier dbt model is the source of this spike" can ship a fix in hours. Engineering teams who only see "warehouse X is over budget" spend days hunting for the cause. If you want to read more on the attribution philosophy, our guide to cloud cost optimization strategies covers the broader pattern of tying every dollar to an owner.
Onboarding takes under seven days in our experience. Connection is via a read-only Snowflake role, no agent install, and no tagging campaign required upfront. The Triad detection runs from day one, and the CMDB attribution layer fills in over the first 24 hours as we observe query patterns and map them to source jobs.
A Cortex cost anomaly is an unplanned credit spike from Snowflake Cortex AI functions, typically caused by one of the Triad patterns: LLM functions in WHERE clauses, unbounded embeddings, or warehouse auto-resume loops. We define the threshold as a deviation greater than two standard deviations above the 30-day baseline.
Three queries against SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY, CORTEX_FUNCTIONS_USAGE_HISTORY, and WAREHOUSE_METERING_HISTORY catch the three Triad patterns. Run them as a daily Snowflake Task with credit thresholds: above 50 for WHERE-clause LLMs, above 100 for unbounded embeddings, and a credit-to-compute ratio above 3.0 for auto-resume loops.
Cortex credits bundle warehouse compute, network egress, and AI inference into a single non-decomposable unit. A 200 millisecond inference call that wakes a warehouse bills at the warehouse rate, not the inference rate. For identical Llama 3 8B or equivalent workloads, Cortex runs 4 to 10 times the cost of OpenAI direct API once warehouse wake is included.
As of May 2026, CloudZero, Finout, Vantage, and Kubecost do not parse Snowflake Cortex function-level cost. They report warehouse-level spend. The Triad patterns are workload-shape anomalies that require function-level telemetry, which our Cost Explorer++ provides natively.