# Lift 2: Trust Through Measurement

## Where We're Starting
You're running so much in parallel that you can't manually check all of it anymore. Things look like they're working — but you're not sure. You've lost visibility. You can't review every output.
Lift 1 ended with the visibility problem: when your AI analysis layer generates a briefing about the Salt Lake zone, does it accurately reflect what the forecaster identified? Your tests confirm the deterministic pipeline runs. Your acceptance criteria confirm the parser extracts the right fields. But neither tells you whether the AI-generated analysis is correct.
This lift gives you the tools to answer that question systematically — not by reviewing more, but by measuring better.
## What You'll Learn
- How golden datasets turn expert judgment into automated correctness checks — and why your project already has them
- The difference between tests (does the code work?) and evals (does the system produce correct results?) — and why both matter
- How structured logging with trace IDs gives you observability across parallel agents
- How eval results feed back into AI instruction refinement — closing the loop between measurement and improvement
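To make the first two bullets concrete, here is a minimal Python sketch of a golden-dataset eval harness. The case data, field names, and the stubbed `run_analysis` are illustrative, not the project's actual code; a real harness would call the AI analysis layer instead of the stub, and would run over the pre-seeded golden datasets.

```python
# Sketch of a golden-dataset eval (all names are illustrative).
# A test asks "did the code run?"; an eval asks "is the output correct?"

GOLDEN = [
    {
        "input": "Forecaster flags high danger above treeline in the Salt Lake zone.",
        "expected": {"zone": "salt-lake", "danger_above_treeline": "high"},
    },
]

def run_analysis(text: str) -> dict:
    # Stand-in for the AI analysis layer; a real eval calls the model here.
    return {"zone": "salt-lake", "danger_above_treeline": "high"}

def run_evals(golden: list[dict]) -> float:
    passed = 0
    for case in golden:
        actual = run_analysis(case["input"])
        # Field-by-field comparison against expert-judged expected output.
        if all(actual.get(k) == v for k, v in case["expected"].items()):
            passed += 1
    return passed / len(golden)

print(run_evals(GOLDEN))  # pass rate between 0.0 and 1.0
```

The pass rate, not a green checkmark, becomes the signal you track: a test suite at 100% tells you the pipeline runs, while a falling eval score tells you the AI output drifted from what the forecaster actually said.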
## Sections
- From Tests to Evals — Golden datasets, eval harnesses, and the difference between "it works" and "it's right"
- Observability Across Agents — Structured logging, trace IDs, and following a data point through the system
- Closing the Loop — How eval results drive AI instruction refinement, turning measurement into improvement
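As a preview of the observability section, here is a minimal sketch of structured logging with a trace ID in Python. The stage names and fields are hypothetical; the point is that every record for one data point carries the same `trace_id`, so filtering the logs on that ID reconstructs its path even when many agents run in parallel.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

def log_event(trace_id: str, stage: str, **fields) -> str:
    # One JSON object per line: machine-parseable, greppable by trace_id.
    line = json.dumps({"trace_id": trace_id, "stage": stage, **fields})
    log.info(line)
    return line

# Illustrative stages for a single data point moving through the system.
trace_id = uuid.uuid4().hex
log_event(trace_id, "fetch", zone="salt-lake")
log_event(trace_id, "parse", fields_extracted=7)
log_event(trace_id, "analyze", briefing_generated=True)
```

Because each line is self-describing JSON rather than free text, you can answer questions like "which stage dropped this zone's data?" with a filter instead of a manual read-through.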
## By the End of This Lift
- You understand the difference between tests and evals — and why correctness requires both
- You can describe what a golden dataset is, what goes into one, and how to use it as an eval harness
- You know how structured logging with trace IDs provides observability across parallel workstreams
- You can describe the feedback loop: eval failure → diagnose → update AI instructions → re-eval
- You've examined the pre-seeded golden datasets and can connect inputs to expected outputs
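The feedback loop in the fourth bullet can be sketched as a small driver function. Everything here is illustrative: `run_evals` and `revise` stand in for your eval harness and for however you update the AI instructions, whether by hand after diagnosing the failures or with model assistance.

```python
# Sketch of the loop: eval failure -> diagnose -> update instructions -> re-eval.

def refine_until_passing(instructions, golden, run_evals, revise,
                         threshold=1.0, max_rounds=3):
    """Re-run evals after each revision until the pass rate clears the threshold."""
    score = 0.0
    for _ in range(max_rounds):
        score, failures = run_evals(instructions, golden)
        if score >= threshold:
            break
        # Each failure carries input, expected, and actual output: the
        # evidence needed to diagnose why the instructions fell short.
        instructions = revise(instructions, failures)
    return instructions, score
```

Note that the golden dataset stays fixed across rounds; only the instructions change. That is what makes the score comparable between rounds and the improvement measurable.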