From Eval Failures to Work Items¶
The Manual Response Problem¶
Lift 2 introduced the feedback loop: eval failure → diagnose with traces → update the right lever → re-eval. That loop works. The problem is who runs it.
Right now, an eval failure produces a result — "Ogden rated Considerable, expected Moderate" — and that result sits in a log or a terminal output. Someone (you) reads the log, interprets the failure, decides what to change, makes the change, and re-runs the eval. Multiply that across nine zones, three data sources, and five parallel workstreams, and you're spending your time triaging, not directing.
The shift: turn eval failures into work items that agents can act on without you as the middleman.
From Failure to Issue¶
The pattern is straightforward: when an eval fails, the pipeline creates a work item — a GitHub issue, a task in a project board, or an entry in a task graph — with enough context for an agent (or a teammate) to pick it up and act on it.
A well-formed issue from an eval failure includes:
| Field | What It Contains | Where It Comes From |
|---|---|---|
| Title | What failed | Eval harness output |
| Expected vs. Actual | The golden dataset answer vs. the system's answer | Eval comparison |
| Trace context | The trace ID and key span data showing where divergence occurred | Structured logging |
| Suggested lever | Whether this looks like a logic bug, threshold issue, or skill refinement | Pattern matching against the three levers from Lift 2 |
| Regression scope | Which other golden dataset scenarios might be affected | Cross-scenario analysis |
The issue isn't just a notification — it's a delegation contract. It has scope (what failed), intent (match the golden dataset), and structure (the trace data and suggested lever). It passes the delegation-ready test from Lift 1: you could write acceptance criteria ("Ogden rates as Moderate, Salt Lake still rates as Considerable"), the scope is bounded, and it can be built and tested independently.
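That delegation contract can be made concrete. Below is a minimal sketch of rendering one eval failure into an issue payload with the fields from the table above; the field names (`scenario_id`, `trace_id`, `suggested_lever`) and the payload shape are illustrative, not tied to any particular eval framework or issue tracker.

```python
from dataclasses import dataclass

@dataclass
class EvalFailure:
    scenario_id: str      # e.g. "gs-002"
    zone: str             # e.g. "Ogden"
    expected: str         # golden dataset answer
    actual: str           # the system's answer
    trace_id: str         # links back to structured logging
    suggested_lever: str  # "logic bug" | "threshold" | "skill refinement"

def failure_to_issue(f: EvalFailure) -> dict:
    """Render a failure as an issue with scope, intent, and structure."""
    title = f"[eval:{f.scenario_id}] {f.zone} rated {f.actual}, expected {f.expected}"
    body = "\n".join([
        f"**Expected vs. Actual:** {f.expected} vs. {f.actual}",
        f"**Trace:** {f.trace_id}",
        f"**Suggested lever:** {f.suggested_lever}",
        f"**Acceptance criteria:** {f.zone} rates as {f.expected}; "
        "all other golden dataset scenarios still pass.",
    ])
    return {"title": title, "body": body, "labels": ["eval-failure", f.suggested_lever]}

issue = failure_to_issue(EvalFailure(
    "gs-002", "Ogden", "Moderate", "Considerable",
    trace_id="tr-8f3a", suggested_lever="logic bug",
))
print(issue["title"])  # [eval:gs-002] Ogden rated Considerable, expected Moderate
```

Note that the acceptance criteria are generated, not hand-written: they fall directly out of the golden dataset answer, which is what makes the issue actionable without a clarifying conversation.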
Task Graphs: Structuring the Work¶
When you have dozens of eval failures, individual issues aren't enough. You need structure. A task graph organizes work items into a dependency tree: which issues block which, which can be worked in parallel, which share a root cause.
For your analysis engine, a task graph might look like:
[Root] Analysis engine escalation logic
├── [Issue] Fix multi-problem escalation rule
│   └── [Blocked by] Clarify problem likelihood weighting
├── [Issue] Ogden danger rating incorrect (gs-002)
│   └── [Depends on] Fix multi-problem escalation rule
└── [Issue] Logan danger rating incorrect (gs-004)
    └── [Depends on] Fix multi-problem escalation rule
Both Ogden and Logan failures share the same root cause — the escalation logic — so fixing the root issue resolves both. The task graph makes that relationship visible. An agent working on the root cause doesn't need to be told about the downstream issues; the structure handles it.
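The tree above is just a dependency graph, and the useful operation on it is scheduling: walk the graph in waves so that everything in a wave can run in parallel and nothing runs before its blockers. Here is a minimal sketch; the task ids mirror the example tree and are illustrative, not a specific project-board API.

```python
# Work items keyed by id; each lists the ids that block it.
tasks = {
    "clarify-weights": [],
    "fix-escalation": ["clarify-weights"],
    "gs-002-ogden": ["fix-escalation"],
    "gs-004-logan": ["fix-escalation"],
}

def parallel_waves(graph: dict) -> list[list[str]]:
    """Group tasks into waves: within a wave, everything can run in parallel."""
    done, waves = set(), []
    while len(done) < len(graph):
        wave = sorted(t for t, deps in graph.items()
                      if t not in done and all(d in done for d in deps))
        if not wave:
            raise ValueError("dependency cycle")
        waves.append(wave)
        done.update(wave)
    return waves

print(parallel_waves(tasks))
# [['clarify-weights'], ['fix-escalation'], ['gs-002-ogden', 'gs-004-logan']]
```

The last wave makes the shared-root-cause point visible in code: the Ogden and Logan issues only become workable once the escalation fix lands, and then they can be verified in parallel.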
Decomposition at Scale¶
This is decomposition from the Blue Square track — breaking big problems into independently shippable pieces — operating at the system level. In Blue Square, you decomposed a feature into stories. Here, you decompose eval failures into work items with dependency relationships.
The same principles apply:
- Each work item should be independently completable — an agent can pick it up without understanding the entire system
- The acceptance criteria come from the golden dataset — the eval either passes or it doesn't
- Work items that share dependencies are sequenced; independent items can run in parallel
What changes is that the decomposition can happen automatically. Your eval harness knows which scenarios failed. Your structured logging knows where the divergence occurred. The three-lever framework from Lift 2 suggests what category of fix is needed. A skill that encodes this diagnostic pattern can turn a batch of eval failures into a structured set of work items — without you reading every log line.
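One piece of that automatic decomposition can be sketched directly: if structured logging records the span where each failure diverged from the golden answer, failures that diverged at the same span are candidates for a single root-cause work item. The span names and the failure records below are hypothetical illustrations, not output from a real harness.

```python
from collections import defaultdict

# Hypothetical eval-harness output: each failure carries the span
# (from structured logging) where it diverged from the golden dataset.
failures = [
    {"scenario": "gs-002", "zone": "Ogden",  "divergent_span": "escalation_rule"},
    {"scenario": "gs-004", "zone": "Logan",  "divergent_span": "escalation_rule"},
    {"scenario": "gs-006", "zone": "Uintas", "divergent_span": "snowpack_parser"},
]

def group_by_root_cause(failures: list[dict]) -> dict[str, list[str]]:
    """Failures that diverged at the same span likely share a root cause."""
    groups = defaultdict(list)
    for f in failures:
        groups[f["divergent_span"]].append(f["scenario"])
    return dict(groups)

print(group_by_root_cause(failures))
# {'escalation_rule': ['gs-002', 'gs-004'], 'snowpack_parser': ['gs-006']}
```

Each group then becomes one root-cause issue with the member scenarios as its regression scope, which is exactly the batch-to-work-items step described above.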
Team Discussion: Structuring Work from Failures¶
Format: Team Discussion. Time: ~2 minutes.
Imagine your eval harness runs all seven golden dataset scenarios, and three fail: Ogden, Logan, and Uintas all rate one level too high.
Discuss: What questions would you need to answer before creating work items? How would you determine if the three failures share a root cause or are independent? What would the task graph look like? What information would an agent need in the issue to fix the problem without asking you clarifying questions?
Key Insight¶
The gap between "my evals tell me what's wrong" and "the system fixes what's wrong" is a structured work pipeline. Eval failures become issues with enough context to be delegation contracts. Task graphs organize those issues by dependency. The decomposition patterns from Blue Square — independently shippable, bounded scope, testable criteria — work at the system level when the criteria come from golden datasets and the scope comes from structured logging. Your role shifts from triaging failures to designing the pipeline that triages them.