[Dashboard figure · Atlas v2.4 · 14:00 snapshot: Shoppers 4,127 (−1.4σ vs P5W, Warning) · CVR 3.7% (−2.6σ vs P5W, Critical) · Purchasers 153 (−0.7σ vs P5W, Normal) · refresh 60s · Cloud Run job · ~$0.42/hr]
A representation of the Atlas monitoring workspace tracking hourly post-booking KPIs. Three KPI cards show today's 14:00 hour with severity tints; CVR is critical at minus 2.6 sigma versus the prior five weeks. The hourly chart plots today's CVR against the prior-five-weeks mean and a one-sigma band, highlighting the 14:00 anomaly. The anomaly log lists recent flagged hours for CVR, Shoppers, and Purchasers with severity classifications.
Outcome

30× faster anomaly detection at 1.8% of monthly cost

Atlas Voyages · Operational project

Cloud Run · BigQuery · Streamlit · Plotly

The setup

Atlas Voyages had hourly post-booking KPIs that mattered (Shoppers, Purchasers, CVR, hour over hour) and a monitoring story that did not. The legacy stack was a managed-BI dashboard that took 30 minutes per refresh and cost nearly $5,000 a month, and it had no native anomaly detection. When CVR dropped at 14:00, an analyst noticed the next morning. By then the cause had moved.

How it works

The replacement runs as a Cloud Run job on a 60-second cadence, scanning a 42-day window from a single BigQuery view. Two parallel anomaly detectors run on every hour of every monitored KPI:

  • Rolling-days · compares the hour to the same clock hour over the prior seven days.
  • Prior-weeks · compares the hour to the same weekday-and-hour over the prior five weeks.

Both compute the mean, standard deviation, and a sigma band around those baseline values. An hour escapes the band when it sits more than the configured sigma threshold (default 1σ) above or below the mean. Severity is classified by how far the hour escapes: Info, Warning, or Critical.
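A minimal sketch of the band check and severity classification. The Info/Warning/Critical cutoffs below are illustrative assumptions, not the production per-metric values:

```python
from statistics import mean, stdev

# Illustrative cutoffs (assumptions): |z| <= threshold -> Normal,
# then Info, Warning above 1.3, Critical above 2.0.
WARNING_Z, CRITICAL_Z = 1.3, 2.0

def classify_hour(value, baseline, threshold=1.0):
    """Score one hour against its baseline values; return (z, severity)."""
    mu, sd = mean(baseline), stdev(baseline)
    if sd == 0:
        return 0.0, "Normal"  # flat baseline: nothing to escape
    z = (value - mu) / sd
    if abs(z) <= threshold:
        return z, "Normal"
    if abs(z) >= CRITICAL_Z:
        return z, "Critical"
    if abs(z) >= WARNING_Z:
        return z, "Warning"
    return z, "Info"
```

Under these cutoffs an hour sitting at −2.6σ, like the 14:00 CVR in the figure, classifies as Critical.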

Three KPIs are monitored at hourly granularity. Each gets a severity-tinted card on the dashboard, an inline chart with the sigma band, an anomaly log of the last triggered hours, and a 7-day heatmap. A health banner at the top shows the worst current state in red, amber, or green.
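The banner rollup is a simple worst-of across KPIs; the severity ordering and color mapping below are assumptions consistent with the red/amber/green description:

```python
SEVERITY_RANK = {"Normal": 0, "Info": 1, "Warning": 2, "Critical": 3}
BANNER_COLOR = {"Critical": "red", "Warning": "amber"}  # everything else: green

def health_banner(latest_severity_by_kpi):
    """Worst current severity across monitored KPIs drives the banner color."""
    worst = max(latest_severity_by_kpi.values(), key=SEVERITY_RANK.__getitem__)
    return worst, BANNER_COLOR.get(worst, "green")
```

With the 14:00 state in the figure ({"CVR": "Critical", "Shoppers": "Warning", "Purchasers": "Normal"}), the banner goes red.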

What it ships

The pattern is now the team’s default for any analytical monitoring pipeline.

What was hard

The architectural rewrite was the easy part. Three things took the rest of the project.

1. Statistical anomaly detection has more knobs than people admit. A single sigma threshold across every metric and every hour is wrong: 1σ on CVR catches important shifts, while the same threshold on absolute Shoppers fires constantly during low-traffic hours and never during peak. The product needs per-metric, per-hour sensitivity, plus an hour-of-day filter (active hours 8 to 23) so the dashboard does not page on a 4 AM ghost-town anomaly. There is also the difference between rolling-days (comparing today’s 14:00 to the last seven 14:00s) and prior-weeks (comparing today’s Tuesday 14:00 to the last five Tuesday 14:00s); the right answer is “both, side by side, and let the operator pick which one to trust based on whether weekly seasonality is stable this month.”
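The two baselines and the active-hours filter can be sketched as follows; the sensitivity table is hypothetical and the series is assumed to be a dict keyed by hourly timestamps:

```python
from datetime import datetime, timedelta

# Hypothetical per-metric sensitivity (illustrative values, not production config).
SIGMA_THRESHOLD = {"cvr": 1.0, "shoppers": 1.5}
ACTIVE_HOURS = range(8, 24)  # suppress the 4 AM ghost-town anomalies

def rolling_days_baseline(series, ts, days=7):
    """Same clock hour over the prior days: today's 14:00 vs the last seven 14:00s."""
    hours = (ts - timedelta(days=d) for d in range(1, days + 1))
    return [series[h] for h in hours if h in series]

def prior_weeks_baseline(series, ts, weeks=5):
    """Same weekday and hour: today's Tuesday 14:00 vs the last five Tuesday 14:00s."""
    hours = (ts - timedelta(weeks=w) for w in range(1, weeks + 1))
    return [series[h] for h in hours if h in series]
```

Running both on the same hour gives the operator the side-by-side view the paragraph describes: when weekly seasonality is stable, the prior-weeks band is tighter and more trustworthy; when it is not, rolling-days reacts faster.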

2. Performance optimization is not optional once a Streamlit app crosses 100 charts. The naive build of this dashboard called the anomaly detector 120+ times per refresh and re-rendered every Plotly figure on every interaction. We had to cache the anomaly results, cache the band computations separately (they survive hour-slider changes within a fragment), pre-compute daily statistics, lazy-load the overlay and cumulative-chart sections behind opt-in checkboxes, force the heavy bar charts to render as static SVG instead of interactive Plotly, and move the BigQuery import inside the cold-cache branch so warm-cache hits never pay for gRPC and protobuf. None of this is glamorous. Without it, the page took eight seconds to load on a warm cache and twenty on a cold one; with it, sub-second warm and four seconds cold.

3. Schema drift in the source data quietly poisoned the dashboard for two weeks. A GA4 session-inflation incident upstream caused per-session pageviews to drop 21 percent against baseline, which silently shifted every CVR metric without flagging anything wrong. The dashboard was correctly reading correctly-shaped data; the data itself was wrong. The fix was to centralize the data-quality status into a single source-of-truth join (a DataQualityNotes table), tag the affected period across all six dependent views, render a non-removable banner on the dashboard while the period is in scope, and treat upstream data-quality detection as a permanent capability rather than an incident response. Production monitoring is only as honest as the upstream pipeline.
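A sketch of the single source-of-truth gate. The DataQualityNotes name comes from the writeup; the schema and dates below are assumptions for illustration:

```python
from datetime import date

# One row per incident; in the real pipeline this table is joined into
# all six dependent views. Dates here are placeholders.
DATA_QUALITY_NOTES = [
    {"start": date(2024, 3, 1), "end": date(2024, 3, 14),
     "note": "GA4 session inflation: per-session pageviews -21% vs baseline."},
]

def active_notes(today):
    """Notes whose period is in scope drive a non-removable dashboard banner."""
    return [n for n in DATA_QUALITY_NOTES
            if n["start"] <= today <= n["end"]]

def render_banner(today):
    notes = active_notes(today)
    if notes:
        return "DATA QUALITY: " + " | ".join(n["note"] for n in notes)
    return None  # no banner once the tagged period passes out of scope
```

Because every view reads the same table, tagging the affected period once propagates the warning everywhere, instead of relying on each chart owner to remember the incident.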

Outcome

Refresh time: 30 minutes to 1 minute (30×). Monthly infra cost: $4,995 to $54 (a 98 percent reduction). Detection latency: next morning to under 60 seconds. Two anomaly methods running in parallel on three KPIs at hourly granularity, with severity classification, hour-over-hour (HoH) alerts, and a health banner an executive can read at a glance.

Stack

Cloud Run for the scheduled job (60-second cadence, ~$0.42 per hour). BigQuery for the source view (42-day scan window, 1-hour cache TTL). Streamlit for the operator interface, with aggressive @st.cache_data wrapping of the anomaly cache, band computations, and daily statistics. Plotly for charts, configured static where interactivity is unnecessary. No managed BI platform anywhere on the hot path.
