The setup
Atlas Voyages’ analytics team was drowning in ad-hoc reporting requests. Most were variants of “give me X cut by Y for Z window.” A capable agent could handle any of them, end to end, without a human reviewer in the loop.
How it works
A LangGraph agent translates plain-English questions into BigQuery SQL, executes against a read-only role, and returns Plotly charts plus a short narrative through a Streamlit interface. Function calling, schema introspection, and a one-shot SQL repair pass handle the long tail.
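The "one-shot SQL repair pass" can be sketched as a try/retry wrapper: run the generated SQL, and on failure hand the query plus the error message back to the model exactly once. This is a minimal sketch under assumptions; `run_query` and `repair` stand in for the BigQuery client call and the LLM call, and the names are illustrative, not from the actual codebase.

```python
from typing import Callable

def execute_with_repair(
    sql: str,
    run_query: Callable[[str], list],
    repair: Callable[[str, str], str],
) -> tuple[list, str]:
    """Run SQL; on failure, ask the model for one repaired query and retry."""
    try:
        return run_query(sql), sql
    except Exception as err:
        fixed = repair(sql, str(err))   # single repair attempt, no loop
        return run_query(fixed), fixed  # a second failure propagates to the caller
```

Capping the repair at one attempt is a deliberate choice: an unbounded retry loop burns tokens on queries the model fundamentally misunderstands, and surfacing the second failure is more honest than grinding toward a plausible-looking wrong answer.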
The load-bearing decision is the first one. Every incoming question is classified before any SQL runs. Misclassification leads to the wrong data shape, so the classifier is the most important model call in the agent.
```
Question
├─ Type A · CHANGE     ── decompose → temporal → segment → drill → confirm → conclude
└─ Type B · STRUCTURAL ── overview → segment → drill → conclude

each phase: generate SQL → run on BigQuery → repair on error → render chart → generate insight
```
What it found
In one early evaluation, an executive asked the agent: “Do early shoppers spend more, and what is actually driving it?”
The agent ran 21 queries across six investigation phases. Anonymized findings:
- Revenue per booking rose monotonically with days-to-departure: from $289 at 0 to 4 weeks to a peak of $468 at 4 to 6 months out, a 62 percent premium for early shoppers.
- Conversion moved in the opposite direction: 8.83 percent for last-minute shoppers, 4.10 percent for early ones.
- Category mix held flat across every window. The premium came from within-category spend intensity, not category shift. Early shoppers were spending more on the same products.
- The 0 to 4 week window generated 39 percent of total revenue despite the lowest RPB, powered by its 2x conversion advantage and a much larger shopper pool.
The agent produced an executive-ready narrative without an analyst writing a single SQL query.
Behind the scenes
The classification step itself is a single short prompt that gates everything downstream. It is small, stable, and deliberately load-bearing.
```
# from deep_analysis.py: the classifier that owns the agent's first move

# STEP 0. CLASSIFY THE QUESTION
#
# TYPE A. CHANGE ANALYSIS:
#   "Why did X drop?" "What changed this week?"
#   Use 3-period comparisons (current vs prior 5 weeks vs prior year).
#
# TYPE B. STRUCTURAL ANALYSIS:
#   "How does X vary by Y?" "Which segment has the highest X?"
#   Use CURRENT PERIOD ONLY. Do NOT add _P5W or _PY columns.
#
# The first message MUST include either
#   "Question type: A (change)" or "Question type: B (structural)"
```
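Because the prompt mandates an explicit "Question type" declaration in the first message, the gate can be enforced mechanically: parse the declaration out of the model's reply and refuse to run any SQL if it is missing. A hypothetical sketch of that check (the function name and error handling are assumptions, not the production code):

```python
import re

def parse_question_type(first_message: str) -> str:
    """Extract the mandated 'Question type: A/B' declaration from the
    model's first message; raise so nothing runs unclassified."""
    match = re.search(r"Question type:\s*([AB])", first_message)
    if match is None:
        raise ValueError("classifier did not declare a question type")
    return match.group(1)
```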
The rest of the agent runs a phase-by-phase plan with explicit tool calls. The framework is opinionated on purpose: a senior analyst would not start by querying random dimensions; they would decompose, then look at trends, then segment. The agent does the same, in the same order.
What was hard
The agent itself was the easy part. Four things were not.
1. Schema sprawl, and which table to trust. The warehouse had hundreds of tables. Some were deprecated but not deleted. Three of them had a revenue column that meant three different things. The agent does not know which is canonical, and frankly half the data team is not always sure either. Left to its own devices, an LLM will pick fct_orders_legacy because the column names look cleaner, return a confident answer, and be wrong by 12 percent. Nobody catches it for two weeks. The fix is not better prompting. It is curation: deciding which tables are canonical, writing good descriptions for them, and excluding the rest from the agent’s retrieval scope. That is a data governance problem dressed up as an AI problem, and it is usually the customer’s to solve. Implementation timelines depend on how mature the warehouse is.
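The curation fix above amounts to a hard allowlist in front of retrieval: the agent only ever sees tables someone has vouched for, each paired with a human-written description. A minimal sketch, with invented table names and descriptions standing in for the real catalog:

```python
# Hypothetical curation layer: only canonical tables, with human-written
# descriptions, are exposed to the agent's retrieval scope.
CANONICAL_TABLES = {
    "fct_bookings": "One row per confirmed booking; `revenue` is net of refunds.",
    "dim_customer": "Current customer attributes, one row per customer.",
}

def retrieval_scope(warehouse_tables: list[str]) -> dict[str, str]:
    """Filter the full warehouse listing down to the curated allowlist,
    pairing each surviving table with its canonical description."""
    return {t: CANONICAL_TABLES[t] for t in warehouse_tables if t in CANONICAL_TABLES}
```

The point of the allowlist shape is that deprecated tables like a legacy orders fact simply never enter the prompt, so the model cannot prefer them for their cleaner column names.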
2. Trust collapses on a single bad answer. Trust is asymmetric. A user can get fifty correct answers in a row, then catch one wrong number on a board deck, and they are done with the tool, forever. They will tell their team it is unreliable and the org never recovers. The implication for design: over-invest in surfacing uncertainty, even when it makes the product feel less magical. A system that says “here is my SQL, please verify” is more trustworthy than one that confidently answers wrong 5 percent of the time. Demos sell on confidence; production survives on humility.
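One concrete way to "over-invest in surfacing uncertainty" is to make the answer object itself carry the SQL and any caveats, so verification material ships with every response rather than being an opt-in debug view. A sketch under assumptions (the class and field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class AgentAnswer:
    """An answer that ships with the SQL that produced it, so a skeptical
    user can verify the number instead of trusting the narrative."""
    narrative: str
    sql: str
    caveats: list[str] = field(default_factory=list)

    def render(self) -> str:
        parts = [self.narrative, "", "SQL used (please verify):", self.sql]
        parts += [f"Caveat: {c}" for c in self.caveats]
        return "\n".join(parts)
```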
3. Evaluation is a living system, not a project. Building an eval set once is straightforward. Keeping it honest as schemas drift, business logic changes, judge models update, and the production question distribution shifts away from launch assumptions is the actual job. Mitigations exist: schema-change detection, embedding refresh on warehouse updates, periodic re-eval against fresh data, cross-model judging plus human spot-check to keep the rubric honest. Most teams do not build them until they have been burned.
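Schema-change detection, the first of those mitigations, can be as simple as fingerprinting each table's column set and comparing fingerprints on a schedule: a changed hash flags that cached embeddings and eval ground truth may be stale. A minimal sketch, assuming column metadata arrives as a name-to-type mapping:

```python
import hashlib
import json

def schema_fingerprint(columns: dict[str, str]) -> str:
    """Stable hash of a table's (column, type) pairs, insensitive to
    column ordering; a changed hash triggers embedding refresh and re-eval."""
    canonical = json.dumps(sorted(columns.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()
```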
4. Production performance is a different beast than pre-launch evals. Pre-launch you have a fixed dataset and ground truth. In production you have a flood of unlabeled questions and users who say “this was helpful” while internally rolling their eyes. Stated preference is noise. Revealed preference is the signal: re-ask rate (user re-issues a near-duplicate within five minutes), SQL edit distance, abandonment versus export, repeat usage by cohort, and a quiet narrowing of question diversity per user as they learn what the tool can and cannot do. Plus a small canary suite of questions whose answer you know cold, run hourly, that pages on regression. Plus a weekly human review of fifty triaged interactions, the unglamorous workhorse. None of that is built unless someone owns it, and most teams do not assign that owner until after the first churn. Build time on the agent itself was a couple of weeks. Build time on everything around the agent was the rest of the project.
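The re-ask-rate signal above reduces to a small predicate: two questions from the same user, close in time and close in text, count as one failed answer. A sketch using stdlib string similarity; the window and threshold values are assumptions to tune, not production constants:

```python
from difflib import SequenceMatcher

def is_reask(prev_q: str, cur_q: str, seconds_apart: float,
             window: float = 300.0, threshold: float = 0.85) -> bool:
    """Flag a near-duplicate question re-issued within the window: a
    revealed-preference signal that the first answer did not land."""
    if seconds_apart > window:
        return False
    ratio = SequenceMatcher(None, prev_q.lower(), cur_q.lower()).ratio()
    return ratio >= threshold
```

Counting `is_reask` hits per day gives a trend line that, unlike thumbs-up buttons, users cannot politely inflate.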
Outcome
A 60 percent reduction in analyst hours spent on routine reporting. The agent became the template for org-wide agentic AI adoption; the same architecture now sits in three other functional areas at Atlas Voyages.
Stack
LangGraph for orchestration, Claude API for reasoning, Streamlit for the operator surface, BigQuery for execution.