Closing the Loop¶
From Measurement to Improvement¶
You have golden datasets that tell you when the system produces the wrong answer. You have structured logging that tells you where the pipeline diverged. What you do with that information is what separates measurement from improvement.
The pattern is a feedback loop: eval failure → diagnose (using traces) → update the system → re-eval to confirm the fix. The question is: what part of the system do you update?
Three Levers¶
When an eval fails, the fix falls into one of three categories:
1. Fix the pipeline logic. The code has a bug or a wrong assumption. The deterministic parser misreads a field from the CAIC format, or the alert router doesn't handle the multi-problem escalation rule correctly. This is a code change — fix the logic, run tests, re-eval.
2. Adjust the thresholds. The alert configuration doesn't match how forecasters actually make decisions. Your alert-thresholds.json says danger level 3 triggers human review, but the eval scenarios show that when three or more problems are present at level 3, forecasters escalate to "High." This is a data change — update the thresholds, re-eval.
3. Refine the AI instructions. Your AI analysis layer calls the Anthropic API with prompts and system instructions that shape the generated output. Those instructions are producing the wrong behavior. Maybe the briefing prompt says "summarize all avalanche problems equally" when persistent weak layers should be highlighted as higher-consequence than wind slab. Or the contextual alert prompt doesn't account for problem likelihood when characterizing severity. This is where the "We Do, You Do" pattern from Lift 1 becomes a measurement-driven refinement loop — but now you're iterating on runtime AI instructions, not just development-time skills.
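Lever 2's "data change" can be made concrete. Below is a minimal sketch, assuming a hypothetical structure for `alert-thresholds.json` (loaded here as a plain dict) and an illustrative `route_alert` function — neither is the actual project code, but together they encode the escalation rule the eval scenarios revealed:

```python
# Hypothetical threshold configuration mirroring alert-thresholds.json.
# Field names are illustrative assumptions, not the real schema.
THRESHOLDS = {
    "human_review_level": 3,      # danger level 3 triggers human review
    "escalation_level": 3,
    "escalation_min_problems": 3, # 3+ problems at level 3 -> escalate to "High"
}

def route_alert(danger_level: int, problem_count: int) -> str:
    """Route a zone forecast to an alert tier based on configured thresholds."""
    if (danger_level >= THRESHOLDS["escalation_level"]
            and problem_count >= THRESHOLDS["escalation_min_problems"]):
        return "escalate-high"
    if danger_level >= THRESHOLDS["human_review_level"]:
        return "human-review"
    return "auto-publish"
```

Because the rule lives in data rather than code, fixing the multi-problem escalation behavior means editing the JSON and re-running the evals — no code change required.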
AI Instruction Feedback Loops¶
In the Blue Square track, skills were refined through iteration — you worked with your AI coding assistant until the output matched what you'd produce, then captured those instructions as a skill. The test for completeness was fresh context: does the skill reproduce quality output without your guidance?
At this level, you add a quantitative dimension. Eval results become the test for AI instruction quality. Instead of "does this look right to me?" the question becomes "does the AI analysis layer produce correct results across the golden dataset?"
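To make that quantitative dimension concrete, here is a minimal sketch of a deterministic check against one golden-dataset entry. The field names (`expected_danger`, `expected_problems`) and the `deterministic_check` helper are illustrative assumptions, not the actual harness:

```python
# Hypothetical golden-dataset entry for one zone (field names are illustrative).
golden = {
    "zone": "Ogden",
    "expected_danger": "Moderate",
    "expected_problems": [
        {"type": "persistent weak layer", "likelihood": "low"},
        {"type": "wind slab", "likelihood": "possible"},
    ],
}

def deterministic_check(parsed: dict, expected: dict) -> list:
    """Compare parsed pipeline output against a golden entry; return mismatches."""
    failures = []
    if parsed.get("danger") != expected["expected_danger"]:
        failures.append(f"danger: got {parsed.get('danger')!r}, "
                        f"expected {expected['expected_danger']!r}")
    got = {p["type"] for p in parsed.get("problems", [])}
    want = {p["type"] for p in expected["expected_problems"]}
    if got != want:
        failures.append(f"problems: got {sorted(got)}, expected {sorted(want)}")
    return failures
```

An empty failure list answers "did the pipeline parse correctly?" objectively; the AI-graded rubric handles the softer "does the briefing accurately reflect the forecast?" question.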
The loop:
- Run evals. Your AI analysis layer generates briefings for all nine zones. Compare against golden dataset expected outputs — both deterministic checks (did the pipeline parse correctly?) and AI-graded rubrics (does the briefing accurately reflect the forecast?).
- Identify failures. Salt Lake's briefing is accurate — it correctly highlights all three avalanche problems. Ogden's briefing is wrong — it characterizes danger as elevated due to a "high-consequence persistent weak layer" when the forecaster described that problem as "trending more stubborn" with low likelihood.
- Diagnose with traces. Pull the Ogden trace. The AI analysis prompt doesn't distinguish between problem types by likelihood — it treats all listed problems as equally severe.
- Determine the lever. Is this a pipeline logic bug (the parser or router mishandled the data), a threshold issue (the alert routing overreacted), or an AI instruction issue (the prompt needs to account for problem likelihood)? Looking at the golden dataset: the forecaster rated Ogden as Moderate despite two problems because one had low likelihood. The AI instruction needs refinement.
- Update. Refine the AI analysis prompt to consider problem likelihood when characterizing severity. Or update the threshold configuration to require high-likelihood problems for escalation. Either way, the golden dataset is the acceptance test.
- Re-eval. Run the full golden dataset again. Does Ogden's briefing now accurately reflect Moderate danger? Did the prompt change break Salt Lake's briefing accuracy?
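The steps above can be sketched as one pass of the loop. In this sketch, `generate_briefing` and `grade` are stand-ins for the AI analysis layer and the AI-graded rubric — both names are assumptions, not a real API:

```python
def feedback_loop(golden_dataset, generate_briefing, grade):
    """Run every golden entry through the pipeline and collect failures.

    generate_briefing(zone) -> briefing text (stand-in for the AI analysis layer)
    grade(briefing, entry)  -> {"pass": bool, "reason": str} (stand-in rubric)
    """
    failures = []
    for entry in golden_dataset:
        briefing = generate_briefing(entry["zone"])
        result = grade(briefing, entry)
        if not result["pass"]:
            failures.append({"zone": entry["zone"], "reason": result["reason"]})
    return failures
```

After pulling a lever — prompt, threshold, or code — you re-run this same function over the full dataset. Ogden moving into the passing set confirms the fix; Salt Lake staying out of the failure list confirms no regression.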
This is the closed loop from the Blue Square track — criteria → tests → fail → implement → pass — operating at the system level instead of the code level. The golden dataset is your acceptance criteria. The eval harness is your test suite. The AI instruction or threshold update is your implementation. The re-eval is your verification.
Team Discussion: What Would You Change?¶
Format: Team Discussion · Time: ~2 minutes
Your eval harness shows that the AI analysis layer generates an accurate briefing for Salt Lake (correctly highlighting Considerable danger with all three problems) but generates an inaccurate briefing for Ogden — characterizing it as elevated risk when the forecaster rated it Moderate.
Discuss: Walk through the three levers. Is this most likely a pipeline logic bug, a threshold configuration issue, or an AI instruction refinement need? How would you decide which lever to pull? What's the risk of each fix — could fixing Ogden's briefing break Salt Lake's? How does the golden dataset protect you from that regression?
The Wall You're About to Hit¶
You now have the measurement infrastructure: golden datasets for correctness, structured logging for diagnostics, and feedback loops for improvement. You can identify what's wrong, find where it went wrong, and fix it.
But: you're still the one doing all of that. Every eval failure lands on your desk. Every trace investigation is your time. Every instruction refinement is your judgment. You've automated the measurement, but the response to measurement is still manual.
When five parallel workstreams each produce eval failures, you're the bottleneck again — just at a higher level than before. You've gone from reviewing code to reviewing system correctness, which is better, but it still doesn't scale.
That's what Lift 3 addresses.
Key Insight¶
Measurement without a feedback loop is just noise. The power of evals isn't knowing what's wrong — it's having a systematic path from "wrong" to "fixed." Eval failure → diagnose with traces → update the right lever (logic, thresholds, or AI instructions) → re-eval to confirm. The golden dataset plays dual roles: it's both the test that reveals failures and the acceptance criteria that confirms fixes. The same closed-loop pattern from Blue Square — criteria → fail → implement → pass — works at the system level when your criteria are golden datasets and your tests are evals.