# Lift 2: Trust Through Measurement

## Where We're Starting
You're running so much in parallel that you can't manually check all of it anymore. Things look like they're working — but you're not sure. You've lost visibility. You can't review every output.
Lift 1 ended with the visibility problem: when your AI analysis layer generates a briefing about the Salt Lake zone, does it accurately reflect what the forecaster identified? Your tests confirm the deterministic pipeline runs. Your acceptance criteria confirm the parser extracts the right fields. But neither tells you whether the AI-generated analysis is correct.
This lift gives you the tools to answer that question systematically — not by reviewing more, but by measuring better.
## What You'll Learn
- How golden datasets turn expert judgment into automated correctness checks — and why your project already has them
- The difference between tests (does the code work?) and evals (does the system produce correct results?) — and why both matter
- How structured logging with trace IDs gives you observability across parallel agents
- How eval results feed back into AI instruction refinement — closing the loop between measurement and improvement
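To make the first two bullets concrete, here is a minimal Python sketch of a golden-dataset eval harness. The case data, field names, and the stubbed `run_analysis` are illustrative, not the project's actual code; a real harness would call the AI analysis layer instead of the stub, and would run over the pre-seeded golden datasets.

```python
# Sketch of a golden-dataset eval (all names are illustrative).
# A test asks "did the code run?"; an eval asks "is the output correct?"

GOLDEN = [
    {
        "input": "Forecaster flags high danger above treeline in the Salt Lake zone.",
        "expected": {"zone": "salt-lake", "danger_above_treeline": "high"},
    },
]

def run_analysis(text: str) -> dict:
    # Stand-in for the AI analysis layer; a real eval calls the model here.
    return {"zone": "salt-lake", "danger_above_treeline": "high"}

def run_evals(golden: list[dict]) -> float:
    passed = 0
    for case in golden:
        actual = run_analysis(case["input"])
        # Field-by-field comparison against expert-judged expected output.
        if all(actual.get(k) == v for k, v in case["expected"].items()):
            passed += 1
    return passed / len(golden)

print(run_evals(GOLDEN))  # pass rate between 0.0 and 1.0
```

The pass rate, not a green checkmark, becomes the signal you track: a test suite at 100% tells you the pipeline runs, while a falling eval score tells you the AI output drifted from what the forecaster actually said.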
## Sections
- From Tests to Evals — Golden datasets, eval harnesses, and the difference between "it works" and "it's right"
- Observability Across Agents — Structured logging, trace IDs, and following a data point through the system
- Closing the Loop — How eval results drive AI instruction refinement, turning measurement into improvement
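As a preview of the observability section, here is a minimal sketch of structured logging with a trace ID in Python. The stage names and fields are hypothetical; the point is that every record for one data point carries the same `trace_id`, so filtering the logs on that ID reconstructs its path even when many agents run in parallel.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

def log_event(trace_id: str, stage: str, **fields) -> str:
    # One JSON object per line: machine-parseable, greppable by trace_id.
    line = json.dumps({"trace_id": trace_id, "stage": stage, **fields})
    log.info(line)
    return line

# Illustrative stages for a single data point moving through the system.
trace_id = uuid.uuid4().hex
log_event(trace_id, "fetch", zone="salt-lake")
log_event(trace_id, "parse", fields_extracted=7)
log_event(trace_id, "analyze", briefing_generated=True)
```

Because each line is self-describing JSON rather than free text, you can answer questions like "which stage dropped this zone's data?" with a filter instead of a manual read-through.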
## By the End of This Lift
- You understand the difference between tests and evals — and why correctness requires both
- You can describe what a golden dataset is, what goes into one, and how to use it as an eval harness
- You know how structured logging with trace IDs provides observability across parallel workstreams
- You can describe the feedback loop: eval failure → diagnose → update AI instructions → re-eval
- You've examined the pre-seeded golden datasets and can connect inputs to expected outputs
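The feedback loop in the fourth bullet can be sketched as a small driver function. Everything here is illustrative: `run_evals` and `revise` stand in for your eval harness and for however you update the AI instructions, whether by hand after diagnosing the failures or with model assistance.

```python
# Sketch of the loop: eval failure -> diagnose -> update instructions -> re-eval.

def refine_until_passing(instructions, golden, run_evals, revise,
                         threshold=1.0, max_rounds=3):
    """Re-run evals after each revision until the pass rate clears the threshold."""
    score = 0.0
    for _ in range(max_rounds):
        score, failures = run_evals(instructions, golden)
        if score >= threshold:
            break
        # Each failure carries input, expected, and actual output: the
        # evidence needed to diagnose why the instructions fell short.
        instructions = revise(instructions, failures)
    return instructions, score
```

Note that the golden dataset stays fixed across rounds; only the instructions change. That is what makes the score comparable between rounds and the improvement measurable.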