# Evals & Observability

## The Gap Tests Don't Close
Tests verify that your code does what you told it to do — behavior. Your acceptance criteria turn into executable checks: given this input, the function produces that output. Tests catch regressions and broken behavior.
But tests don't verify correctness. Your deterministic pipeline can pass every test — data ingestion works, the parser extracts danger ratings, alerts fire at configured thresholds — and the AI analysis layer can still generate a briefing that mischaracterizes the forecast. When the AI-generated briefing highlights persistent weak layers as the primary concern for Salt Lake, is that what the forecaster actually identified? That's not a test question. That's an eval question.
| | Tests | Evals |
|---|---|---|
| Question asked | Does the code behave as specified? | Does the system produce correct results? |
| Source of truth | Your acceptance criteria | Domain expert judgment |
| What it catches | Regressions, broken behavior | Wrong answers, bad analysis, missed patterns |
| When it runs | Every code change | Every system change — including data, thresholds, and skills |
Tests and evals are complementary. Tests protect against breaking what works. Evals protect against being confidently wrong.
## Golden Datasets: The Answer Key
An evaluation harness runs your system against inputs where you already know the correct answer, then checks whether the system got it right. The mechanism that makes this possible is a golden dataset — a curated set of real inputs paired with expert-verified expected outputs.
Your project already has them. The data/shared/golden-datasets/ directory contains eighteen scenarios built from real forecast data from four US avalanche centers. What makes them "golden": UAC forecasters are professional avalanche scientists. When they rate a zone as "Considerable" with "Wind Slab" and "Persistent Weak Layer" problems, that IS the expert assessment. It's the answer key.
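The shape of such a record can be sketched in Python. The field names below are illustrative, not the project's actual schema: real inputs on one side, the forecaster's expert assessment as the expected output on the other.

```python
import json

# Hypothetical golden-dataset record: raw inputs paired with the
# forecaster's expert-verified assessment as the answer key.
golden_record = {
    "id": "gs-001-salt-lake",
    "inputs": {
        "forecast_text": "Strong winds have loaded leeward slopes...",
        "weather": {"wind_mph": 35, "new_snow_in": 8},
    },
    "expected": {
        "danger_rating": "Considerable",
        "problems": ["Wind Slab", "Persistent Weak Layer"],
    },
}

print(json.dumps(golden_record["expected"], indent=2))
```

The point of the structure: nothing in `expected` is invented by the system under test. It comes straight from the published forecast, which is why it can serve as ground truth.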
Two grading approaches:
Deterministic checks compare outputs to expected values directly — did the system rate Salt Lake as "Considerable"? Binary pass or fail. Fast, cheap, reproducible.
AI-graded rubrics use a language model to evaluate subjective qualities — does the analysis narrative correctly summarize the key risk factors? Handles nuance that string comparison can't, but slower and needs periodic calibration.
The practical pattern: start deterministic for everything with a verifiable answer. Layer AI-graded rubrics only for subjective dimensions. Most teams find 80% of their eval surface can be covered deterministically.
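The deterministic side can be sketched as a small grader; the function and field names here are hypothetical, assuming the record shape described above.

```python
def grade_deterministic(actual: dict, expected: dict) -> dict:
    """Compare system output to golden expected values, field by field."""
    results = {}
    # Exact match on the danger rating: binary pass/fail.
    results["danger_rating"] = actual.get("danger_rating") == expected["danger_rating"]
    # Set comparison on problems: order doesn't matter, content does.
    results["problems"] = set(actual.get("problems", [])) == set(expected["problems"])
    return results

expected = {"danger_rating": "Considerable",
            "problems": ["Wind Slab", "Persistent Weak Layer"]}
actual = {"danger_rating": "Considerable",
          "problems": ["Persistent Weak Layer", "Wind Slab"]}

print(grade_deterministic(actual, expected))  # both checks pass
```

Anything this grader can express stays fast and reproducible; only what it can't express (narrative quality, emphasis, omissions) needs an AI-graded rubric.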
## Structured Logging and Trace IDs
An eval tells you that something is wrong — the system produced the wrong answer. Structured logging with trace IDs tells you where and why.
A trace ID is a unique identifier assigned at the start of a workflow that follows a data point through every stage. When the Salt Lake forecast enters the system, it gets a trace ID. That same ID appears on the ingestion log, the analysis log, and the alert decision log. When an eval fails, you search for that trace ID and see exactly what happened at each step.
Without observability, every eval failure requires manual investigation across the entire pipeline. With trace IDs, you follow the data point from ingestion through analysis to alert decision and pinpoint exactly where the system diverged from the expected output.
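A minimal sketch of the pattern, assuming JSON-structured log lines and a helper named `log_stage` (both are illustrative, not the project's actual logging setup):

```python
import json
import logging
import uuid

logger = logging.getLogger("pipeline")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_stage(trace_id: str, stage: str, **fields) -> None:
    """Emit one JSON log record; every record carries the same trace_id."""
    logger.info(json.dumps({"trace_id": trace_id, "stage": stage, **fields}))

# Assigned once, at ingestion; the same ID follows the data point everywhere.
trace_id = str(uuid.uuid4())
log_stage(trace_id, "ingestion", zone="salt-lake")
log_stage(trace_id, "analysis", danger_rating="Considerable")
log_stage(trace_id, "alert_decision", alert_fired=True)
```

Because every record is machine-parseable JSON with the same `trace_id`, a single grep or log query reconstructs the data point's full path through the pipeline.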
## The Three Levers and the Feedback Loop
When an eval fails, the fix falls into one of three categories:
- Fix the analysis logic. A code bug or wrong assumption. Fix the logic, run tests, re-eval.
- Adjust the thresholds. The configuration doesn't match how experts actually make decisions. Update the thresholds, re-eval.
- Refine the skills. The AI coding assistant's instructions are producing the wrong behavior. Update the skill, re-eval.
The feedback loop: eval failure → diagnose with traces → update the right lever → re-eval to confirm. The golden dataset plays a dual role: it's both the test that reveals failures and the acceptance criteria that confirm fixes. This is the same closed-loop pattern — criteria → tests → fail → implement → pass — operating at the system level.
The ratchet effect makes this compound over time: every eval failure you fix becomes a permanent regression guard. The system gets strictly better because it can never regress past a fixed failure.
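The ratchet can be as simple as promoting each diagnosed failure into the dataset once an expert has verified the correct answer. A sketch, with a hypothetical helper name:

```python
def promote_to_golden(dataset: list, failed_case: dict, verified_expected: dict) -> None:
    """After a failure is diagnosed and the correct answer is expert-verified,
    the case joins the golden dataset as a permanent regression guard."""
    dataset.append({"inputs": failed_case["inputs"], "expected": verified_expected})

golden = []
promote_to_golden(
    golden,
    {"inputs": {"forecast_text": "Heavy storm snow overnight..."}},
    {"danger_rating": "High", "problems": ["Storm Slab"]},
)
# The fixed failure is now part of every future eval run.
```

Each promoted case raises the floor: the system can drift on new inputs, but it can never silently regress on a failure that has already been fixed and captured.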
## Try It: Walk Through a Golden Dataset
Ask your AI coding assistant to read data/shared/golden-datasets/gs-001-salt-lake.json and explain the structure — inputs vs. expected outputs.
read data/shared/golden-datasets/gs-001-salt-lake.json and explain what makes this a golden dataset — what are the inputs, what are the expected outputs, and where does the "answer" come from?
The same prompt works in Codex or pi: each assistant reads the file and explains the structure.
Note the structure: raw inputs (forecast + weather + snowpack) paired with the forecaster's expert assessment. The expected outputs serve dual purposes: direct comparison targets for testing the deterministic pipeline (did you parse the danger rating correctly?), and rubric criteria for evaluating the AI analysis layer (does the generated briefing accurately reflect the forecaster's assessment?). The forecaster's judgment is the standard of truth.
## Team Discussion: From Measurement to Systems
Format: Team Discussion. Time: ~3 minutes.
Your project involves ingesting data from two avalanche centers with genuinely different formats (CAIC and UAC). Each center's data flows through normalization, analysis, alerting, and a dashboard.
Discuss: If you built eval harnesses for each center's pipeline independently, what happens when you try to evaluate the unified platform? Where do golden datasets for the normalization layer come from — the individual centers, the unified schema, or both? What does a trace ID look like when it spans two data formats that merge into one? And the forward-looking question: what happens when the system that produces the eval failures is also the system that fixes them? That's the self-healing factory — Lift 3.
## Key Insight
The measurement infrastructure — golden datasets, eval harnesses, structured logging with trace IDs, the three levers, the feedback loop — is the foundation that makes everything in this track possible. The shared stack (Lift 2) requires evals that work across contributors. The self-healing factory (Lift 3) automates the feedback loop itself. If the measurement infrastructure is weak, nothing built on top of it is trustworthy. The system-level implication: the quality of your evals determines the ceiling for how much autonomy you can safely grant your system.