From Tests to Evals¶
The Gap Tests Don't Close¶
Tests verify that your code does what you told it to do. They're the safety net from the Blue Square track — acceptance criteria turned into executable checks. If your function should split a bill equally, a test confirms it splits correctly. If your API should return JSON, a test confirms the format.
But here's the gap: tests verify behavior, not correctness. Your deterministic pipeline can pass every test — data ingestion works, the parser extracts the right danger rating, the alert generator fires at the configured thresholds — and the AI analysis layer can still generate a briefing that mischaracterizes the forecast.
The question from Lift 1's visibility problem isn't "does the code work?" It's: when your AI analysis layer generates a briefing about the Salt Lake zone, does it accurately reflect what the forecaster identified?
That's not a test question. That's an eval question.
Evals: Checking Answers Against Expert Judgment¶
An evaluation harness (eval) runs your system against inputs where you already know the correct answer, then checks whether the system got it right.
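At its core, an eval harness is a loop: feed known inputs through the system, compare each output to the known-correct answer, and report a pass rate. A minimal sketch (the `analyze` function and case fields are hypothetical stand-ins for your own system, not the project's actual code):

```python
def analyze(forecast_text: str) -> str:
    # Stand-in for the system under evaluation.
    return "Considerable" if "considerable" in forecast_text.lower() else "Moderate"

def run_evals(cases):
    """Run every case through the system and compare to the known answer."""
    results = []
    for case in cases:
        got = analyze(case["input"])
        results.append({"id": case["id"], "passed": got == case["expected"]})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

cases = [
    {"id": "gs-001", "input": "Danger is CONSIDERABLE above treeline.",
     "expected": "Considerable"},
    {"id": "gs-002", "input": "Generally quiet conditions today.",
     "expected": "Moderate"},
]
rate, results = run_evals(cases)
# rate is 1.0 for this toy dataset
```

The shape stays the same whether the comparison is an exact string match or an AI-graded rubric; only the grading step changes.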
| | Tests | Evals |
|---|---|---|
| Question asked | Does the code behave as specified? | Does the system produce correct results? |
| Source of truth | Your acceptance criteria | Domain expert judgment |
| What it catches | Regressions, broken behavior | Wrong answers, bad analysis, missed patterns |
| When it runs | Every code change | Every system change — including data, thresholds, and skills |
| Example | "Given a zone with danger level 3, the alert generator produces a draft" | "Given the Salt Lake forecast from March 6, the AI-generated briefing identifies all three avalanche problems and correctly characterizes the persistent weak layer as trending toward dormancy — matching the forecaster's assessment" |
Tests and evals aren't competing concepts — they're complementary. Tests protect against breaking what works. Evals protect against being confidently wrong.
Golden Datasets: The Answer Key¶
The mechanism that makes evals work is a golden dataset — a curated set of real inputs paired with expert-verified expected outputs.
Your project already has them. The data/shared/golden-datasets/ directory contains eighteen scenarios built from real forecast data from four US avalanche centers. Here's what makes them "golden": UAC forecasters are professional avalanche scientists. When they rate a zone as "Considerable" with "Wind Slab" and "Persistent Weak Layer" problems, that IS the expert assessment. It's the answer key.
Each golden dataset file contains:
- Inputs — raw forecast, weather, and snowpack data for one zone on one date
- Expected outputs — the forecaster's danger rating, avalanche problems, and the alert decision derived from the alert threshold rules
Example (Salt Lake, gs-001):
| Field | Value |
|---|---|
| Input | UAC forecast + NWS weather + SNOTEL snowpack for Salt Lake on March 6, 2026 |
| Expected danger rating | Considerable (level 3) |
| Expected problems | Wind Drifted Snow, New Snow, Persistent Weak Layer |
| Expected alert decision | Human review (level 3 triggers review per alert-thresholds.json) |
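The gs-001 record above can be pictured as a single inputs/expected-outputs pair. A hypothetical sketch of its shape (field names here are illustrative, not the project's actual schema in `data/shared/golden-datasets/`):

```python
# Illustrative structure of one golden-dataset record.
golden = {
    "id": "gs-001",
    "inputs": {
        "forecast": "...",   # raw UAC forecast text for Salt Lake, March 6, 2026
        "weather": "...",    # NWS weather data
        "snowpack": "...",   # SNOTEL snowpack data
    },
    "expected": {
        "overall_danger_rating": "Considerable",
        "danger_level": 3,
        "problems": ["Wind Drifted Snow", "New Snow", "Persistent Weak Layer"],
        "alert_decision": "human_review",  # level 3 triggers review
    },
}
```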
The golden dataset's expected outputs serve dual purposes. For the deterministic pipeline, they're direct comparison targets: did the parser extract "Considerable"? Did the alert router trigger "human_review"? Those are tests. For the AI analysis layer, they become rubric criteria: does the AI-generated briefing accurately reflect Considerable danger? Does it mention all three avalanche problems? Does it correctly characterize the persistent weak layer situation? Those are evals.
Deterministic Checks vs. AI-Graded Rubrics¶
Not every eval question has a clean yes/no answer. Two grading approaches:
Deterministic checks compare outputs to expected values directly. Did the parser extract danger level 3? Did the alert action match "human_review"? Does the problem list include all three expected problems? These are binary — pass or fail — and they're fast, cheap, and reproducible. They verify the deterministic pipeline.
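A deterministic check for gs-001 is just direct comparison against the expected values. A minimal sketch, where `parsed` stands in for your parser's output and the field names are assumptions rather than the project's actual schema:

```python
expected = {
    "danger_level": 3,
    "alert_action": "human_review",
    "problems": {"Wind Drifted Snow", "New Snow", "Persistent Weak Layer"},
}

# Hypothetical parser output for the March 6 Salt Lake forecast.
parsed = {
    "danger_level": 3,
    "alert_action": "human_review",
    "problems": {"New Snow", "Wind Drifted Snow", "Persistent Weak Layer"},
}

checks = {
    "danger_level": parsed["danger_level"] == expected["danger_level"],
    "alert_action": parsed["alert_action"] == expected["alert_action"],
    # Sets make the comparison order-independent.
    "problems": parsed["problems"] == expected["problems"],
}
passed = all(checks.values())  # binary pass/fail: fast, cheap, reproducible
```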
AI-graded rubrics use a language model to evaluate subjective qualities. Does the AI-generated briefing correctly summarize the key risk factors? Is the contextual alert message clear enough for a field team to act on? Does the cross-zone synthesis accurately reflect patterns across zones? These handle nuance that string comparison can't, but they're slower, cost tokens, and need periodic calibration against human judgment. They verify the AI analysis layer.
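An AI-graded rubric can still be structured: give the grading model explicit criteria and a constrained response format, then parse its verdicts deterministically. A sketch under loud assumptions: `call_grader_model` is a placeholder for whatever LLM client you use, and the rubric wording and PASS/FAIL format are illustrative.

```python
RUBRIC = """You are grading an avalanche briefing against the forecaster's assessment.
Answer PASS or FAIL for each criterion, one per line:
1. The briefing reflects Considerable (level 3) danger.
2. All three avalanche problems are mentioned.
3. The persistent weak layer is characterized as trending toward dormancy.
"""

def grade(briefing: str, call_grader_model) -> bool:
    """Send the rubric plus briefing to a grader model; pass only if
    every criterion comes back PASS."""
    response = call_grader_model(RUBRIC + "\nBriefing:\n" + briefing)
    verdicts = [line.strip().upper() for line in response.splitlines() if line.strip()]
    return bool(verdicts) and all(v.endswith("PASS") for v in verdicts)

# A canned grader response stands in for a real model call here.
fake_grader = lambda prompt: "1. PASS\n2. PASS\n3. PASS"
ok = grade("(briefing text)", fake_grader)  # True for this canned response
```

Because the grader's output is parsed mechanically, the subjective judgment is isolated in one place, which is also where periodic calibration against human judgment happens.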
The practical pattern: start with deterministic checks for everything that has a verifiable answer. Layer AI-graded rubrics only for subjective dimensions — summary quality, clarity of communication, completeness of analysis narrative. Most teams find that 80% of their eval surface can be covered deterministically.
Team Activity: Walk Through a Golden Dataset¶
- Format: Mob Session
- Time: ~3 minutes
- Setup: One person drives, everyone navigates.
Ask your AI coding assistant to read data/shared/golden-datasets/gs-001-salt-lake.json and explain the structure.
Discuss as a team:
- What are the inputs? What are the expected outputs?
- Where does the "answer" come from? (The UAC forecaster's assessment)
- What would a deterministic check look like for this scenario? (Compare overall_danger_rating to "Considerable")
- What would need an AI-graded rubric? (Whether the analysis narrative captures the key risk factors)
- How would you build more golden datasets? (Fetch forecasts from different dates — the danger ratings change daily)
Key Insight¶
Tests verify behavior — the deterministic pipeline does what you told it to. Evals verify correctness — the AI analysis layer produces accurate, complete output that reflects what the forecaster actually identified. Golden datasets make both possible: expected outputs serve as direct comparison targets for the pipeline (tests) and rubric criteria for the AI layer (evals). Start with deterministic checks for the pipeline, layer AI-graded rubrics for the analysis layer, and expand the golden dataset over time as you discover new edge cases.