
Week 17, 2026 · post-booking

Targeting queue

  • Targeted: 7,213 guests · 4 to 45 DTD
  • Projected lift: +$284K vs business as usual
  • Models active: 7 · AUC 0.81 to 0.89
  • Holdout: 20% stratified · since wk 8

  • Photo · Scheduled · Audience 985 · AUC 0.893 · High 142 / Med 364 / Low 479
  • Beverage Package · Sending · Audience 1,842 · AUC 0.876 · High 218 / Med 612 / Low 1,012
  • Spa & Wellness · Scheduled · Audience 1,058 · AUC 0.865 · High 124 / Med 405 / Low 529
  • Internet · Scheduled · Audience 901 · AUC 0.860 · High 88 / Med 343 / Low 470
  • Speciality Restaurants · Queued · Audience 723 · AUC 0.844 · High 64 / Med 271 / Low 388
  • Delights · Queued · Audience 1,127 · AUC 0.839 · High 96 / Med 421 / Low 610
  • Excursion · Queued · Audience 577 · AUC 0.811 · High 41 / Med 198 / Low 338

7 categories · 7,213 sends · 1,803 in 20% control · A/B running since week 8

A representation of the Atlas marketing workspace showing the weekly post-booking targeting queue. Seven product categories list audience size, model AUC, tier mix proportions across High, Medium, and Low buckets, and send status. Top stat tiles show 7,213 guests targeted, a projected lift of $284K, seven active models with AUC between 0.81 and 0.89, and a stratified 20 percent control holdout running since week 8.
Outcome

$15M+ annual incremental revenue from agentic propensity modeling

Atlas Voyages · 12 weeks

Python · BigQuery · H2O AutoML · LangGraph

The setup

Atlas Voyages had 2.2 million historical post-booking events across seven ancillary product lines (Beverage Package, Internet, Excursion, Spa & Wellness, Speciality Restaurants, Delights, Photo) but no production ML to act on them. CRM was firing the same pre-departure offers at every guest. The hypothesis was simple: if you can score a guest’s intent per category before the booking ages out, the email you actually send is worth multiples of the email you would have sent.

How it works

Seven separate H2O AutoML classifiers, one per ancillary category, trained on 21 browse-only features per booking-category pair. Models score every open booking weekly in the 4 to 45 days-to-departure window. Each guest gets a per-category probability, gets bucketed into tiers (High, Medium, Low-Medium, Low), and the top three tiers route into the appropriate CRM campaign queue. A 20 percent stratified holdout sits out, untouched, so the lift is measured against a comparable control rather than a fuzzy “before vs after.”
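The tiering and routing step described above can be sketched as follows. This is an illustrative reconstruction, not the production code: the cutoff values, category keys, and function names are all invented, and the real per-category thresholds are calibrated quarterly.

```python
# Hypothetical per-category probability cutoffs (illustrative values only).
TIER_CUTS = {  # category -> (High, Medium, Low-Medium) cutoffs
    "photo": (0.70, 0.45, 0.25),
    "excursion": (0.75, 0.50, 0.30),
}

def tier(category: str, prob: float) -> str:
    """Bucket a per-category probability into one of four tiers."""
    high, med, low_med = TIER_CUTS[category]
    if prob >= high:
        return "High"
    if prob >= med:
        return "Medium"
    if prob >= low_med:
        return "Low-Medium"
    return "Low"  # bottom tier never receives a send

def route(scored):
    """scored: iterable of (booking_id, category, probability).

    Returns the top three tiers for handoff to the CRM queue."""
    return [
        (bid, cat, tier(cat, p))
        for bid, cat, p in scored
        if tier(cat, p) != "Low"
    ]
```

The bottom tier is filtered here rather than in CRM so that the output table downstream systems read only ever contains sendable audiences.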

The platform handles the orchestration: feature extraction in BigQuery, scoring in Python, dedup against prior weeks (one email per booking-category, ever), audience handoff to CRM, and a weekly snapshot to a single output table that downstream marketing systems read.
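The dedup rule ("one email per booking-category, ever") reduces to a set-membership check against a ledger of prior sends. A minimal sketch, with invented names, assuming the ledger of previously sent pairs is loaded from the output table:

```python
def dedup_sends(candidates, sent_ledger):
    """Drop candidates already emailed in any prior week.

    candidates: list of (booking_id, category) pairs for this week.
    sent_ledger: set of (booking_id, category) pairs ever sent (mutated).
    """
    fresh = [pair for pair in candidates if pair not in sent_ledger]
    sent_ledger.update(fresh)  # record this week's sends for future weeks
    return fresh
```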

What it ships

Tier precision held up on unseen test data once the leakage fixes settled. AUC ranges from 0.811 (Excursion, the hardest category) to 0.893 (Photo, the easiest). Every model is validated out-of-time, not with random k-fold, because production scoring sees fresh weeks, not random rows.
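The out-of-time split is the whole trick: train on weeks strictly before a cutoff, evaluate on the weeks after it. A minimal sketch, where the `week` field name is hypothetical:

```python
def out_of_time_split(rows, cutoff_week):
    """Split scored rows by calendar week, never randomly, so the test
    set mirrors what production scoring actually sees: fresh weeks."""
    train = [r for r in rows if r["week"] < cutoff_week]
    test = [r for r in rows if r["week"] >= cutoff_week]
    return train, test
```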

The stack identified $15M+ in annual incremental revenue against the holdout, validated across four months. The scoring pipeline became Atlas Voyages’ template for how the org runs targeted offers across all ancillary categories.

What was hard

The model architecture was straightforward. Three other things were the actual project.

1. Leakage you would not catch in code review. The first version had view_cart and add_to_cart as features, and they dominated at 80 percent feature importance. The model was excellent. It was also useless: it was predicting “guest is about to buy” rather than “guest is intent-signaling early.” Both look correct on a confusion matrix; only one targets the audience worth targeting. Removing cart signals tanked AUC from 0.93 to 0.87 and made the system actually work. The same trap fired again on categories_purchased_count (had to exclude the current category and only count pre-45-DTD purchases) and on recency features keyed off CURRENT_DATE() (training-time and scoring-time distributions did not match). Each one passed unit tests. Each one would have shipped a smaller, slower, wrong version of the system.
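The two non-cart fixes can be sketched like this (all names and shapes are hypothetical): recency is computed against the weekly snapshot date the model scores "as of", never the wall clock, and the prior-purchase count excludes both the current category and anything bought inside the 45-DTD window.

```python
from datetime import date

def days_since_last_browse(last_browse: date, snapshot: date) -> int:
    # Recency relative to the scoring snapshot, not CURRENT_DATE(), so
    # training-time and scoring-time distributions match.
    return (snapshot - last_browse).days

def prior_category_count(purchases, current_category):
    """purchases: list of (category, days_to_departure_at_purchase)."""
    return sum(
        1 for cat, dtd in purchases
        if cat != current_category and dtd > 45  # pre-45-DTD purchases only
    )
```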

2. Per-category calibration, not a universal threshold. A 0.7 probability means different things across the seven categories. Photo’s High tier converts at 81 percent; Excursion’s High tier at 73 percent on a much smaller pool. A single threshold optimizes the average and hurts every category. The product needs per-category tier definitions, monitored and recalibrated quarterly, with decisions like “Excursion has no usable High tier this quarter, send to Medium only.” That is product work, not modeling work.
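A hedged sketch of what that quarterly recalibration decision might look like in code. The candidate cutoffs, data shape, and the target-rate policy are invented for illustration; the point is that the High-tier definition is chosen per category, and a category can fail to earn one.

```python
def high_cutoff(holdout, target_rate, candidates=(0.6, 0.7, 0.8, 0.9)):
    """Pick the lowest cutoff whose tier converts at or above target.

    holdout: list of (probability, converted 0/1) from the control group.
    Returns a cutoff, or None for "no usable High tier this quarter".
    """
    for cut in candidates:  # ascending: prefer the widest qualifying tier
        picked = [conv for prob, conv in holdout if prob >= cut]
        if picked and sum(picked) / len(picked) >= target_rate:
            return cut
    return None
```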

3. Attribution is a separate engineering problem. The hard question after launch is not “did the model score correctly,” it is “did sending change behavior.” 20 percent stratified holdout, A/B since week 8, segment-level lift measured against a comparable control. Without this, every lift number is contaminated: you are measuring “guests who would have bought anyway” plus “guests we nudged” without separating them. Half the project was building the lift-measurement plumbing, not the model. It is also what made the $15M number defensible to a CFO.
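The core lift arithmetic, stripped to its simplest form (names and numbers invented): incremental revenue in a segment is what the targeted group produced minus what the control's per-guest rate predicts for a pool of the same size.

```python
def incremental_revenue(targeted_rev, targeted_n, control_rev, control_n):
    """Segment-level lift vs the untouched stratified holdout."""
    control_rate = control_rev / control_n  # revenue per untouched guest
    return targeted_rev - control_rate * targeted_n
```

Without the holdout term, the first term alone would count "guests who would have bought anyway" as lift.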

Outcome

$15M+ in identified, holdout-validated annual incremental revenue. The pipeline is now the org’s template for any ancillary cross-sell motion. Three categories have reached production maturity; the other four are calibrated and queued.

Stack

Python end to end. BigQuery for feature engineering (always via the bq CLI, never the Python SDK). H2O AutoML for the candidate-model sweep, GBM winning every category. LangGraph for the orchestration layer that hands scored bookings to downstream offer systems.
