A representation of the Atlas analytics workspace (v2.4), answering the question "Do early shoppers spend more, and what is actually driving it?", which the agent classifies as Type B, structural. The revenue-per-booking chart rises from $289 in the 0-4 week days-to-departure window to a peak of $468 at 4-6 months, easing to $427 at 6+ months. Supporting findings note inverted conversion and a stable category mix, and a six-step phase indicator shows the agent concluded its investigation after 21 queries and zero analyst hours.

Outcome

60% reduction in analyst reporting load via internal LLM agent

Atlas Voyages · 8 weeks

LangGraph · Claude API · Streamlit · BigQuery

The setup

Atlas Voyages’ analytics team was drowning in ad-hoc reporting requests. Most were variants of “give me X cut by Y for Z window.” A capable agent could handle any of them, end to end, without a human reviewer in the loop.

How it works

A LangGraph agent translates plain-English questions into BigQuery SQL, executes against a read-only role, and returns Plotly charts plus a short narrative through a Streamlit interface. Function calling, schema introspection, and a one-shot SQL repair pass handle the long tail.
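
A minimal sketch of that execute-and-repair loop, assuming the Anthropic Python SDK and google-cloud-bigquery. The helper names, prompts, and model ID are illustrative, not lifted from the production code.

# sketch: generate SQL, run it against the read-only client, allow one repair
from anthropic import Anthropic
from google.cloud import bigquery

llm = Anthropic()        # reads ANTHROPIC_API_KEY from the environment
bq = bigquery.Client()   # the service account carries the read-only role

def ask_llm(prompt: str) -> str:
    resp = llm.messages.create(
        model="claude-sonnet-4-20250514",   # model choice is illustrative
        max_tokens=1500,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def run_with_repair(question: str, schema_context: str):
    """Generate SQL, execute it, and permit exactly one repair pass on failure."""
    sql = ask_llm(f"{schema_context}\n\nWrite BigQuery SQL for: {question}")
    try:
        return sql, list(bq.query(sql).result())
    except Exception as err:   # BadRequest, NotFound, and friends
        # one-shot repair: feed the error back once, then fail loudly
        sql = ask_llm(
            f"This BigQuery SQL failed:\n{sql}\n\nError:\n{err}\n\n"
            "Return only a corrected query."
        )
        return sql, list(bq.query(sql).result())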

The load-bearing decision is the first one. Every incoming question is classified before any SQL runs. Misclassification leads to the wrong data shape, so the classifier is the most important model call in the agent.

Question
├─ Type A · CHANGE     ── decompose → temporal → segment → drill → confirm → conclude
└─ Type B · STRUCTURAL ── overview → segment → drill → conclude

each phase: generate SQL → run on BigQuery → repair on error → render chart → generate insight

What it found

In one early evaluation, an executive asked the agent: “Do early shoppers spend more, and what is actually driving it?”

The agent ran 21 queries across six investigation phases. Anonymized findings:

  • +62% revenue per booking at 4-6 months before departure versus the 0-4 week window ($468 vs $289)
  • Conversion rate inverted across days to departure, falling from 8.83% to 4.10%
  • Category mix held flat: the premium comes from within-category spend intensity, not mix shift
  • The 0-4 week window still accounted for 39% of total revenue

The agent produced an executive-ready narrative without an analyst writing a single SQL query.

Behind the scenes

The classification step itself is a single short prompt that gates everything downstream. It is small, stable, and deliberately load-bearing.

# from deep_analysis.py: the classifier that owns the agent's first move

# STEP 0. CLASSIFY THE QUESTION

# TYPE A. CHANGE ANALYSIS:
#   "Why did X drop?"  "What changed this week?"
#   Use 3-period comparisons (Current vs prior 5 weeks vs prior year).

# TYPE B. STRUCTURAL ANALYSIS:
#   "How does X vary by Y?"  "Which segment has the highest X?"
#   Use CURRENT PERIOD ONLY. Do NOT add _P5W or _PY columns.

# The first message MUST include either
#   "Question type: A (change)"  or  "Question type: B (structural)"

The rest of the agent runs a phase-by-phase plan with explicit tool calls. The framework is opinionated on purpose: a senior analyst would not start by querying random dimensions; they would decompose, then look at trends, then segment. The agent does the same, in the same order.
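
One way to wire that plan, sketched with LangGraph's StateGraph. The state fields and node bodies are placeholders; only the phase ordering and the conditional routing on question type mirror the plan described above.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    question: str
    question_type: str   # "A" (change) or "B" (structural)
    findings: list

def make_phase(name: str):
    def phase(state: AgentState) -> AgentState:
        # each phase: generate SQL -> run on BigQuery -> repair on error
        #             -> render chart -> generate insight
        state["findings"].append(f"{name}: ...")
        return state
    return phase

graph = StateGraph(AgentState)
for node in ["decompose", "temporal", "segment", "drill",
             "confirm", "overview", "conclude"]:
    graph.add_node(node, make_phase(node))

is_change = lambda s: s["question_type"] == "A"

# Type A: decompose -> temporal -> segment -> drill -> confirm -> conclude
# Type B: overview  ->             segment -> drill ->            conclude
graph.add_conditional_edges(START, lambda s: "decompose" if is_change(s) else "overview")
graph.add_edge("decompose", "temporal")
graph.add_edge("temporal", "segment")
graph.add_edge("overview", "segment")
graph.add_edge("segment", "drill")
graph.add_conditional_edges("drill", lambda s: "confirm" if is_change(s) else "conclude")
graph.add_edge("confirm", "conclude")
graph.add_edge("conclude", END)

agent = graph.compile()
# agent.invoke({"question": "...", "question_type": "B", "findings": []})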

What was hard

The agent was the easy part. Four things were the hard part.

1. Schema sprawl, and which table to trust. The warehouse had hundreds of tables. Some were deprecated but not deleted. Three of them had a revenue column that meant three different things. The agent does not know which is canonical, and frankly half the data team is not always sure either. Left to its own devices, an LLM will pick fct_orders_legacy because the column names look cleaner, return a confident answer, and be wrong by 12 percent. Nobody catches it for two weeks. The fix is not better prompting. It is curation: deciding which tables are canonical, writing good descriptions for them, and excluding the rest from the agent’s retrieval scope. That is a data governance problem dressed up as an AI problem, and it is usually the customer’s to solve. Implementation timelines depend on how mature the warehouse is.
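
A sketch of what that curation can look like in practice: an explicit allowlist of canonical tables with human-written descriptions, which is the only schema the SQL-generation prompt ever sees. Table names and descriptions here are hypothetical.

# curation layer: the agent's retrieval scope is this dict, nothing else
CANONICAL_TABLES = {
    "analytics.fct_bookings": (
        "One row per confirmed booking. revenue_usd is net of refunds and is "
        "the only revenue figure finance signs off on."
    ),
    "analytics.dim_customers": "One row per customer, current attributes only.",
    # fct_orders_legacy and other deprecated tables are deliberately absent
}

def schema_context(client) -> str:
    """Build the schema block shown to the model: canonical tables only."""
    lines = []
    for table_id, description in CANONICAL_TABLES.items():
        table = client.get_table(table_id)   # google-cloud-bigquery
        cols = ", ".join(f"{c.name} {c.field_type}" for c in table.schema)
        lines.append(f"-- {table_id}: {description}\n--   columns: {cols}")
    return "\n".join(lines)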

2. Trust collapses on a single bad answer. Trust is asymmetric. A user can get fifty correct answers in a row, then catch one wrong number on a board deck, and they are done with the tool, forever. They will tell their team it is unreliable and the org never recovers. The implication for design: over-invest in surfacing uncertainty, even when it makes the product feel less magical. A system that says “here is my SQL, please verify” is more trustworthy than one that confidently answers wrong 5 percent of the time. Demos sell on confidence; production survives on humility.
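
One concrete way to build that humility into the product is to make every answer carry its own evidence. A sketch of the payload, with the fields and rendering as assumptions:

from dataclasses import dataclass, field

@dataclass
class AgentAnswer:
    narrative: str                 # short executive-facing summary
    sql: str                       # always shown, always copy-pastable
    row_count: int
    caveats: list = field(default_factory=list)   # e.g. "mix table refreshed 3 days ago"

    def render(self) -> str:
        notes = "".join(f"\nCaveat: {c}" for c in self.caveats)
        return (f"{self.narrative}\n\n"
                f"SQL used ({self.row_count} rows returned):\n{self.sql}{notes}")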

3. Evaluation is a living system, not a project. Building an eval set once is straightforward. Keeping it honest as schemas drift, business logic changes, judge models update, and the production question distribution shifts away from launch assumptions is the actual job. Mitigations exist: schema-change detection, embedding refresh on warehouse updates, periodic re-eval against fresh data, cross-model judging plus human spot-check to keep the rubric honest. Most teams do not build them until they have been burned.
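
A sketch of the first of those mitigations, schema-change detection: fingerprint the warehouse schema from BigQuery's INFORMATION_SCHEMA and trigger a re-eval when it moves. The dataset name and where the last fingerprint lives are assumptions.

import hashlib
import json

def schema_fingerprint(client, dataset: str = "analytics") -> str:
    """Hash of every (table, column, type) triple in the dataset."""
    rows = client.query(
        f"""
        SELECT table_name, column_name, data_type
        FROM `{dataset}.INFORMATION_SCHEMA.COLUMNS`
        ORDER BY table_name, column_name
        """
    ).result()
    payload = json.dumps([(r.table_name, r.column_name, r.data_type) for r in rows])
    return hashlib.sha256(payload.encode()).hexdigest()

def schema_drifted(client, last_known_fingerprint: str) -> bool:
    """True if the schema changed since the eval set was last validated."""
    return schema_fingerprint(client) != last_known_fingerprint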

4. Production performance is a different beast than pre-launch evals. Pre-launch you have a fixed dataset and ground truth. In production you have a flood of unlabeled questions and users who say “this was helpful” while internally rolling their eyes. Stated preference is noise. Revealed preference is the signal: re-ask rate (user re-issues a near-duplicate within five minutes), SQL edit distance, abandonment versus export, repeat usage by cohort, and a quiet narrowing of question diversity per user as they learn what the tool can and cannot do. Plus a small canary suite of questions whose answer you know cold, run hourly, that pages on regression. Plus a weekly human review of fifty triaged interactions, the unglamorous workhorse. None of that is built unless someone owns it, and most teams do not assign that owner until after the first churn. Build time on the agent itself was a couple of weeks. Build time on everything around the agent was the rest of the project.
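
The re-ask rate is the cheapest of those signals to compute, so it is the one worth sketching. The similarity threshold and the log shape below are assumptions, not the deployed values; only the five-minute window comes from the description above.

from datetime import timedelta
from difflib import SequenceMatcher

REASK_WINDOW = timedelta(minutes=5)
SIMILARITY_THRESHOLD = 0.85   # "near-duplicate", judged on normalized text

def is_reask(prev, curr) -> bool:
    """prev/curr: (user_id, timestamp, question_text) rows from the query log."""
    same_user = prev[0] == curr[0]
    close_in_time = (curr[1] - prev[1]) <= REASK_WINDOW
    similar = SequenceMatcher(
        None, prev[2].lower(), curr[2].lower()
    ).ratio() >= SIMILARITY_THRESHOLD
    return same_user and close_in_time and similar

def reask_rate(log) -> float:
    """Share of questions that near-duplicate the same user's previous question."""
    last_by_user, reasks, total = {}, 0, 0
    for entry in sorted(log, key=lambda e: e[1]):
        total += 1
        prev = last_by_user.get(entry[0])
        if prev is not None and is_reask(prev, entry):
            reasks += 1
        last_by_user[entry[0]] = entry
    return reasks / total if total else 0.0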

Outcome

A 60 percent reduction in analyst hours spent on routine reporting. The agent became the template for org-wide agentic AI adoption; the same architecture now sits in three other functional areas at Atlas Voyages.

Stack

LangGraph for orchestration, Claude API for reasoning, Streamlit for the operator surface, BigQuery for execution.
