The setup
Atlas Voyages had 2.2 million historical post-booking events across seven ancillary product lines (Beverage Package, Internet, Excursion, Spa & Wellness, Speciality Restaurants, Delights, Photo) but no production ML to act on them. CRM was firing the same pre-departure offers at every guest. The hypothesis was simple: if you can score a guest’s intent per category before the booking ages out, a targeted email is worth multiples of the blanket one that would have gone out anyway.
How it works
Seven separate H2O AutoML classifiers, one per ancillary category, trained on 21 browse-only features per booking-category pair. Models score every open booking weekly in the 4 to 45 days-to-departure window. Each guest gets a per-category probability, is bucketed into a tier (High, Medium, Low-Medium, Low), and the top three tiers route into the appropriate CRM campaign queue. A 20 percent stratified holdout sits out, untouched, so the lift is measured against a comparable control rather than a fuzzy “before vs after.”
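A minimal sketch of the tier bucketing step, assuming a scored DataFrame with one row per open booking-category pair and a probability column from that category’s model; the cut points here are the default ones from the tier table below, with per-category overrides handled later (see “What was hard”).

```python
import pandas as pd

EMAILABLE_TIERS = {"High", "Medium", "Low-Medium"}  # Low is never emailed

def assign_tier(p: float) -> str:
    """Default cut points; per-category overrides are applied downstream."""
    if p >= 0.7:
        return "High"
    if p >= 0.4:
        return "Medium"
    if p >= 0.2:
        return "Low-Medium"
    return "Low"

def bucket_scores(scored: pd.DataFrame) -> pd.DataFrame:
    """scored: one row per open booking-category pair in the 4-45 DTD window,
    with a 'probability' column from that category's classifier."""
    out = scored.copy()
    out["tier"] = out["probability"].map(assign_tier)
    out["route_to_crm"] = out["tier"].isin(EMAILABLE_TIERS)
    return out
```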
The platform handles the orchestration: feature extraction in BigQuery, scoring in Python, dedup against prior weeks (one email per booking-category, ever), audience handoff to CRM, and a weekly snapshot to a single output table that downstream marketing systems read.
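A sketch of the dedup step, under the assumption that prior sends are logged in a table keyed by booking and category; column and table names are illustrative, not the production schema.

```python
import pandas as pd

def dedup_audience(candidates: pd.DataFrame, prior_sends: pd.DataFrame) -> pd.DataFrame:
    """Drop any booking-category pair that has ever been emailed before,
    so each pair is contacted at most once across all weekly runs."""
    key = ["booking_id", "category"]
    merged = candidates.merge(
        prior_sends[key].drop_duplicates(), on=key, how="left", indicator=True
    )
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Weekly run: score, dedup against the send log, hand off to CRM, then append
# this week's sends to the log and write the snapshot table marketing reads.
```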
What it ships
Tier precision, on unseen test data, after the leakage fixes settled:
- High tier (probability ≥ 0.7): 73 to 81 percent likelihood to purchase, varying by category.
- Medium tier (0.4 to 0.7): 48 to 76 percent.
- Low-Medium tier (0.2 to 0.4): 19 to 51 percent.
- Low tier (below 0.2): 3 to 15 percent. Not emailed. Sending here costs more than it earns.
AUC ranges from 0.811 (Excursion, the hardest) to 0.893 (Photo, the easiest). Every model is validated out-of-time, not with random k-fold, because production scoring sees fresh weeks, not random rows.
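A minimal sketch of the out-of-time split, assuming each training row carries the snapshot date it was built from; the cutoff date is illustrative.

```python
import pandas as pd

def out_of_time_split(rows: pd.DataFrame, cutoff: str = "2024-01-01"):
    """Train on snapshots before the cutoff, test on snapshots on or after it,
    so evaluation mimics production: fresh weeks, never shuffled rows."""
    snap = pd.to_datetime(rows["snapshot_date"])
    train = rows[snap < pd.Timestamp(cutoff)]
    test = rows[snap >= pd.Timestamp(cutoff)]
    return train, test
```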
The stack identified $15M+ in annual incremental revenue against the holdout, validated across four months. The scoring pipeline became Atlas Voyages’ template for how the org runs targeted offers across all ancillary categories.
What was hard
The model architecture was straightforward. Three other things were the actual project.
1. Leakage you would not catch in code review. The first version had view_cart and add_to_cart as features, and they dominated at 80 percent feature importance. The model was excellent. It was also useless: it was predicting “guest is about to buy” rather than “guest is signaling intent early.” Both look correct on a confusion matrix; only one targets the audience worth targeting. Removing cart signals tanked AUC from 0.93 to 0.87 and made the system actually work. The same trap fired again on categories_purchased_count (had to exclude the current category and count only pre-45-DTD purchases) and on recency features keyed off CURRENT_DATE() (training-time and scoring-time distributions did not match). Each one passed unit tests. Each one would have shipped a smaller, slower, wrong version of the system. A sketch of the leakage-safe feature construction follows this list.
2. Per-category calibration, not a universal threshold. A 0.7 probability means different things across the seven categories. Photo’s High tier converts at 81 percent; Excursion’s High tier at 73 percent on a much smaller pool. A single threshold optimizes the average and hurts every category. The product needs per-category tier definitions, monitored and recalibrated quarterly, with decisions like “Excursion has no usable High tier this quarter, send to Medium only.” That is product work, not modeling work; an illustrative config shape follows this list.
3. Attribution is a separate engineering problem. The hard question after launch is not “did the model score correctly,” it is “did sending change behavior.” The answer is the 20 percent stratified holdout, an A/B test running since week 8, and segment-level lift measured against a comparable control. Without this, every lift number is contaminated: you are measuring “guests who would have bought anyway” plus “guests we nudged” without separating them. Half the project was building the lift-measurement plumbing, not the model. It is also what made the $15M number defensible to a CFO. A sketch of the holdout assignment follows this list.
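On the first point, a sketch of the leakage fix: compute “days since last browse” relative to the weekly snapshot date the row belongs to, not the wall-clock date the query happens to run, and keep the cart signals out of the feature list outright. Column names are illustrative.

```python
import pandas as pd

# Signals that leak "about to buy" into a model that should predict
# "signaling intent early" -- excluded from the feature set entirely.
LEAKY_COLUMNS = ["view_cart", "add_to_cart"]

def browse_recency(events: pd.DataFrame, snapshot_date: str) -> pd.Series:
    """Days since last browse per booking-category pair, anchored to the
    snapshot date so training-time and scoring-time distributions match
    (the broken version anchored to the query run date, CURRENT_DATE())."""
    anchor = pd.Timestamp(snapshot_date)
    last_browse = events.groupby(["booking_id", "category"])["event_ts"].max()
    return (anchor - pd.to_datetime(last_browse)).dt.days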
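On the second point, one possible shape for the per-category tier config. The cut points below are made up to show the mechanism, not the real quarterly values; a missing High cut point encodes “no usable High tier this quarter, send to Medium only.”

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TierConfig:
    high: Optional[float]   # None => no usable High tier this quarter
    medium: float
    low_medium: float

# Illustrative values only -- real cut points are recalibrated quarterly
# from each category's observed conversion by tier.
TIER_CONFIG = {
    "Photo":     TierConfig(high=0.70, medium=0.40, low_medium=0.20),
    "Excursion": TierConfig(high=None, medium=0.45, low_medium=0.25),
    # remaining five categories omitted here
}

def assign_tier_for(category: str, p: float) -> str:
    cfg = TIER_CONFIG[category]
    if cfg.high is not None and p >= cfg.high:
        return "High"
    if p >= cfg.medium:
        return "Medium"
    if p >= cfg.low_medium:
        return "Low-Medium"
    return "Low"
```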
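On the third point, one way to implement a holdout that genuinely sits out: make the assignment deterministic per booking (a salted hash rather than a fresh random draw), so a booking that lands in control stays in control across every weekly run, and check the 20 percent rate per category and tier stratum. The salt and helper names are assumptions, not the production implementation.

```python
import hashlib

HOLDOUT_RATE = 0.20
SALT = "atlas-holdout-v1"  # illustrative; fixed for the life of the test

def in_holdout(booking_id: str) -> bool:
    """Deterministic 20% holdout: the same booking always gets the same
    assignment, so control bookings are never emailed in any week."""
    digest = hashlib.md5(f"{SALT}:{booking_id}".encode()).hexdigest()
    return int(digest, 16) % 10_000 < HOLDOUT_RATE * 10_000

# Lift = conversion among emailed bookings minus conversion among comparable
# holdout bookings, measured per category and tier segment.
```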
Outcome
$15M+ in identified, holdout-validated annual incremental revenue. The pipeline is now the org’s template for any ancillary cross-sell motion. Three categories have reached production maturity; the other four are calibrated and queued.
Stack
Python end to end. BigQuery for feature engineering (always via the bq CLI, never the Python SDK). H2O AutoML for the candidate-model sweep, GBM winning every category. LangGraph for the orchestration layer that hands scored bookings to downstream offer systems.
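A sketch of how the feature-extraction step can be driven from Python while still going through the bq CLI rather than the SDK; the SQL file and destination table names are placeholders.

```python
import subprocess

def run_feature_extraction(sql_path: str, destination: str) -> None:
    """Run a feature-engineering query via the bq CLI and materialize the
    result into a destination table for the weekly scoring pass."""
    with open(sql_path) as f:
        query = f.read()
    subprocess.run(
        [
            "bq", "query",
            "--use_legacy_sql=false",
            "--replace",
            f"--destination_table={destination}",
        ],
        input=query,
        text=True,
        check=True,
    )

# Example (placeholder paths):
# run_feature_extraction("features/browse_features.sql",
#                        "analytics.weekly_browse_features")
```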