The setup
Atlas Voyages had hourly post-booking KPIs that mattered (Shoppers, Purchasers, CVR, hour over hour) and a monitoring story that did not. The legacy stack was a managed-BI dashboard that took 30 minutes per refresh and cost nearly $5,000 a month, and it had no native anomaly detection. When CVR dropped at 14:00, an analyst noticed the next morning. By then the cause had moved.
How it works
The replacement runs as a Cloud Run job on a 60-second cadence, scanning a 42-day window from a single BigQuery view. Two anomaly detectors run in parallel on every hour of every monitored KPI:
- The prior-weeks method compares the current hour against the same weekday and same hour over the prior five weeks (e.g., today’s 14:00 vs the previous five Tuesdays at 14:00). This is the production default; it controls for weekly seasonality.
- The rolling-days method compares the current hour against the same hour over the prior seven days. It is useful when weekly seasonality is unstable or the day-of-week prior is thin.
Both compute the mean and standard deviation of the baseline hours and draw a sigma band around them. An hour escapes the band when it sits more than the configured sigma threshold (default 1σ) above or below the baseline mean. Severity is classified by how far the hour escapes: Info, Warning, or Critical.
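As a rough illustration, the two baselines and the band check reduce to a few lines. This is a minimal sketch assuming hourly KPI values arrive as a pandas Series indexed by timestamp; the function names and the Info/Warning/Critical cutoffs are illustrative, not the production code.

```python
import pandas as pd

def prior_weeks_baseline(kpi: pd.Series, ts: pd.Timestamp, weeks: int = 5) -> pd.Series:
    """Same weekday and hour over the prior `weeks` weeks (e.g. the last five Tuesdays at 14:00)."""
    idx = [ts - pd.Timedelta(weeks=w) for w in range(1, weeks + 1)]
    return kpi.reindex(idx).dropna()

def rolling_days_baseline(kpi: pd.Series, ts: pd.Timestamp, days: int = 7) -> pd.Series:
    """Same hour over the prior `days` days, ignoring weekday."""
    idx = [ts - pd.Timedelta(days=d) for d in range(1, days + 1)]
    return kpi.reindex(idx).dropna()

def classify(kpi: pd.Series, ts: pd.Timestamp, sigma: float = 1.0) -> tuple[str, float]:
    """Return (severity, z-score) for the hour at `ts` against the prior-weeks band."""
    prior = prior_weeks_baseline(kpi, ts)
    if len(prior) < 2 or prior.std() == 0:
        return "ok", 0.0                      # no usable prior: don't alert
    z = (kpi[ts] - prior.mean()) / prior.std()
    if abs(z) <= sigma:
        return "ok", z                        # inside the band
    # Info/Warning/Critical cutoffs below are illustrative placeholders
    if abs(z) > 3 * sigma:
        return "critical", z
    return ("warning", z) if abs(z) > 2 * sigma else ("info", z)
```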
Three KPIs are monitored at hourly granularity. Each gets a severity-tinted card on the dashboard, an inline chart with the sigma band, an anomaly log of the last triggered hours, and a 7-day heatmap. A health banner at the top shows the worst current state in red, amber, or green.
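The health banner itself is just a worst-of reduction over the per-KPI states. A minimal Streamlit sketch, reusing the severity labels from the detector above; the function and state names are hypothetical, not the production layout.

```python
import streamlit as st

SEVERITY_RANK = {"ok": 0, "info": 1, "warning": 2, "critical": 3}

def render_health_banner(kpi_states: dict[str, str]) -> None:
    """Show the worst current state across the monitored KPIs as red/amber/green."""
    worst = max(kpi_states.values(), key=SEVERITY_RANK.get)
    flagged = ", ".join(k for k, s in kpi_states.items() if s != "ok")
    if worst == "critical":
        st.error(f"Critical anomaly: {flagged}")
    elif worst in ("warning", "info"):
        st.warning(f"Anomalies in: {flagged}")
    else:
        st.success("All KPIs within band")

# e.g. render_health_banner({"Shoppers": "ok", "Purchasers": "info", "CVR": "critical"})
```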
What it ships
- Detection latency went from “next morning” to under 60 seconds. The CVR anomaly at 14:00 surfaces by 14:01.
- Refresh time went from 30 minutes to under 60 seconds, a 30x speedup.
- Monthly cost went from $4,995 to about $54, roughly a 98 percent reduction. The hot path no longer touches a managed BI platform.
- Hour-over-hour alerts catch crashes the sigma method does not (a sudden drop within the active band still trips an alert if it exceeds the HoH-drop threshold and the prior hour had real traffic); a sketch of this check follows the list.
- Active hours are 8 to 23 EST, set after off-peak alerts proved noisy.
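The hour-over-hour check is a guarded ratio. A sketch under assumed names and thresholds; `hoh_drop_alert`, `drop_threshold`, and `min_prior_traffic` are illustrative, not the shipped configuration.

```python
def hoh_drop_alert(current: float, previous: float,
                   drop_threshold: float = 0.30,
                   min_prior_traffic: float = 50.0) -> bool:
    """Flag a sudden hour-over-hour collapse even when the hour stays inside the sigma band.

    Only fires when the prior hour had real traffic, so a wobble on a handful of
    off-peak sessions does not page anyone. Both thresholds here are illustrative.
    """
    if previous < min_prior_traffic:
        return False
    drop = (previous - current) / previous
    return drop >= drop_threshold
```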
The pattern is now the team’s default for any analytical monitoring pipeline.
What was hard
The architectural rewrite was the easy part. Three things took the rest of the project.
1. Statistical anomaly detection has more knobs than people admit. A single sigma threshold across every metric every hour is wrong: 1σ on CVR catches important shifts, while the same threshold on absolute Shoppers fires constantly during low-traffic hours and never during peak. The product needs per-metric, per-hour sensitivity, plus an hour-of-day filter (active hours 8 to 23) so the dashboard does not page on a 4 AM ghost-town anomaly; a sketch of this configuration follows the list. There is also the difference between rolling-days (comparing today’s 14:00 to the last seven 14:00s) and prior-weeks (comparing today’s Tuesday-14:00 to the last five Tuesday-14:00s); the right answer is “both, side by side, and let the operator pick which one to trust based on whether weekly seasonality is stable this month.”
2. Performance optimization is not optional once a Streamlit app crosses 100 charts. The naive build of this dashboard called the anomaly detector 120-plus times per refresh and re-rendered every Plotly figure on every interaction. We had to cache the anomaly results, cache the band computations separately (they survive hour-slider changes within a fragment), pre-compute daily statistics, lazy-load the overlay and cumulative chart sections behind opt-in checkboxes, force the heavy bar charts to render as static SVG instead of interactive Plotly, and move the BigQuery import inside the cold-cache branch so warm cache hits never pay for grpc and protobuf (a sketch of this caching layout appears under Stack, below). None of this is glamorous. Without it the page took eight seconds to load on warm cache and twenty on cold; with it, sub-second warm and four seconds cold.
3. Data drift in the source quietly poisoned the dashboard for two weeks. A GA4 session-inflation incident upstream caused per-session pageviews to drop 21 percent against baseline, which silently shifted every CVR metric without flagging anything wrong. The dashboard was correctly reading correctly-shaped data; the data itself was wrong. The fix was to centralize the data-quality status into a single source-of-truth join (a DataQualityNotes table), tag the affected period across all six dependent views, render a non-removable banner on the dashboard while the period is in scope (sketched after this list), and treat upstream data-quality detection as a permanent capability rather than an incident response. Production monitoring is only as honest as the upstream pipeline.
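For the first point, per-metric, per-hour sensitivity boils down to a small config table consulted before any evaluation. The shape below is a hypothetical sketch; the thresholds are placeholders rather than the tuned production values.

```python
from dataclasses import dataclass

@dataclass
class KpiAlertConfig:
    """Hypothetical per-KPI tuning; names and defaults are placeholders, not the shipped config."""
    sigma_threshold: float = 1.0          # band width in standard deviations
    active_hours: range = range(8, 24)    # 8:00-23:00; outside this window, never page
    method: str = "prior_weeks"           # or "rolling_days" when weekly seasonality is unstable
    hoh_drop_threshold: float = 0.30      # hour-over-hour crash check

CONFIGS = {
    "CVR": KpiAlertConfig(sigma_threshold=1.0),
    # Absolute-volume metrics get a wider band so low-traffic hours stop firing constantly.
    "Shoppers": KpiAlertConfig(sigma_threshold=2.0),
    "Purchasers": KpiAlertConfig(sigma_threshold=2.0),
}

def should_evaluate(kpi: str, hour: int) -> bool:
    """Hour-of-day filter applied before any sigma or HoH check runs."""
    return hour in CONFIGS[kpi].active_hours
```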
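For the third point, the dashboard-side half of the DataQualityNotes mechanism can be as small as a window-overlap query plus a banner. The dataset, column names, and rendering below are assumptions for illustration only.

```python
import pandas as pd
import streamlit as st

# Hypothetical shape of the DataQualityNotes lookup; dataset and column names are illustrative.
DQ_QUERY = """
SELECT note, affected_start, affected_end
FROM analytics.DataQualityNotes
WHERE affected_end >= @window_start AND affected_start <= @window_end
"""

def render_data_quality_banner(notes: pd.DataFrame) -> None:
    """Non-removable banner shown while any data-quality note overlaps the displayed window."""
    for _, row in notes.iterrows():
        st.error(
            f"Data quality: {row.note} "
            f"({row.affected_start:%Y-%m-%d} to {row.affected_end:%Y-%m-%d}). "
            "Metrics in this period are not comparable to baseline."
        )
```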
Outcome
Refresh time: 30 minutes to 1 minute (30x). Monthly infra cost: $4,995 to $54 (98 percent reduction). Detection latency: next-morning to under 60 seconds. Two anomaly methods running in parallel on three KPIs at hourly granularity, with severity classification, HoH alerts, and a health banner that an executive can read at a glance.
Stack
Cloud Run for the scheduled job (60-second cadence, ~$0.42 per hour). BigQuery for the source view (42-day scan window, 1-hour cache TTL). Streamlit for the operator interface, with aggressive @st.cache_data wrapping of the anomaly results, band computations, and daily statistics. Plotly for charts, configured static where interactivity is unnecessary. No managed BI platform anywhere on the hot path.
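A minimal sketch of that caching layout, with the view name and helper names assumed for illustration: the BigQuery client import lives inside the cached fetch so warm hits skip grpc/protobuf, and band statistics are cached separately from the fetch.

```python
import streamlit as st

@st.cache_data(ttl=3600)  # mirrors the 1-hour BigQuery cache TTL
def load_kpi_window():
    # Import inside the cached function: warm cache hits never pay the
    # grpc/protobuf import cost of the BigQuery client.
    from google.cloud import bigquery

    client = bigquery.Client()
    return client.query(
        "SELECT * FROM analytics.kpi_hourly_view "  # view name is illustrative
        "WHERE hour >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 42 DAY)"
    ).to_dataframe()

@st.cache_data
def band_stats(values: tuple[float, ...]) -> tuple[float, float]:
    """Band mean/std cached separately, so hour-slider changes inside a fragment reuse them."""
    import statistics
    return statistics.mean(values), statistics.stdev(values)

# One way to make the heavy bar charts non-interactive (the exact static-SVG
# route used in production may differ):
# st.plotly_chart(fig, config={"staticPlot": True})
```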