From Eval Failures to Work Items

The Manual Response Problem

Lift 2 introduced the feedback loop: eval failure → diagnose with traces → update the right lever → re-eval. That loop works. The problem is who runs it.

Right now, an eval failure produces a result — "Ogden rated Considerable, expected Moderate" — and that result sits in a log or a terminal output. Someone (you) reads the log, interprets the failure, decides what to change, makes the change, and re-runs the eval. Multiply that across nine zones, three data sources, and five parallel workstreams, and you're spending your time triaging, not directing.

The shift: turn eval failures into work items that agents can act on without you as the middleman.

From Failure to Issue

The pattern is straightforward: when an eval fails, the pipeline creates a work item — a GitHub issue, a task in a project board, or an entry in a task graph — with enough context for an agent (or a teammate) to pick it up and act on it.

A well-formed issue from an eval failure includes:

| Field | What It Contains | Where It Comes From |
| --- | --- | --- |
| Title | What failed | Eval harness output |
| Expected vs. Actual | The golden dataset answer vs. the system's answer | Eval comparison |
| Trace context | The trace ID and key span data showing where divergence occurred | Structured logging |
| Suggested lever | Whether this looks like a logic bug, threshold issue, or skill refinement | Pattern matching against the three levers from Lift 2 |
| Regression scope | Which other golden dataset scenarios might be affected | Cross-scenario analysis |

The issue isn't just a notification — it's a delegation contract. It has scope (what failed), intent (match the golden dataset), and structure (the trace data and suggested lever). It passes the delegation-ready test from Lift 1: you could write acceptance criteria ("Ogden rates as Moderate, Salt Lake still rates as Considerable"), the scope is bounded, and it can be built and tested independently.
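As a concrete sketch, the fields above can be assembled into an issue payload mechanically. This is a minimal illustration, not a real harness integration: the `EvalFailure` record, the `make_issue` helper, and the trace ID are all hypothetical names standing in for whatever your eval harness and issue tracker actually expose.

```python
from dataclasses import dataclass, field

# Hypothetical shapes for illustration; your eval harness output and
# issue tracker API will differ.

@dataclass
class EvalFailure:
    scenario_id: str        # e.g. "gs-002"
    title: str              # what failed
    expected: str           # golden dataset answer
    actual: str             # the system's answer
    trace_id: str           # from structured logging
    suggested_lever: str    # "logic bug" | "threshold" | "skill refinement"
    regression_scope: list = field(default_factory=list)

def make_issue(f: EvalFailure) -> dict:
    """Turn an eval failure into a delegation-ready issue body."""
    body = "\n".join([
        f"**Expected vs. actual:** {f.expected} vs. {f.actual}",
        f"**Trace:** {f.trace_id}",
        f"**Suggested lever:** {f.suggested_lever}",
        f"**Regression scope:** {', '.join(f.regression_scope) or 'none identified'}",
        f"**Acceptance:** scenario {f.scenario_id} passes; regression scope stays green.",
    ])
    return {"title": f.title, "body": body,
            "labels": ["eval-failure", f.suggested_lever]}

issue = make_issue(EvalFailure(
    scenario_id="gs-002",
    title="Ogden rated Considerable, expected Moderate",
    expected="Moderate", actual="Considerable",
    trace_id="trace-8f3a", suggested_lever="logic bug",
    regression_scope=["gs-004"],
))
```

Note that the acceptance criterion is generated from the golden dataset scenario itself, which is what makes the issue a delegation contract rather than a notification.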

Task Graphs: Structuring the Work

When you have dozens of eval failures, individual issues aren't enough. You need structure. A task graph organizes work items into a dependency tree: which issues block which, which can be worked in parallel, which share a root cause.

For your analysis engine, a task graph might look like:

[Root] Analysis engine escalation logic
  ├── [Issue] Fix multi-problem escalation rule
  │     └── [Blocked by] Clarify problem likelihood weighting
  ├── [Issue] Ogden danger rating incorrect (gs-002)
  │     └── [Depends on] Fix multi-problem escalation rule
  └── [Issue] Logan danger rating incorrect (gs-004)
        └── [Depends on] Fix multi-problem escalation rule

Both Ogden and Logan failures share the same root cause — the escalation logic — so fixing the root issue resolves both. The task graph makes that relationship visible. An agent working on the root cause doesn't need to be told about the downstream issues; the structure handles it.
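The dependency structure above is small enough to encode directly. A minimal sketch, assuming hypothetical issue IDs, using Python's standard-library `graphlib` to compute a valid work order (anything not blocking anything else could equally run in parallel):

```python
from graphlib import TopologicalSorter

# The task graph above, as "item -> set of items it is blocked by".
# Issue IDs are hypothetical.
graph = {
    "fix-escalation-rule":         {"clarify-likelihood-weighting"},
    "gs-002-ogden-rating":         {"fix-escalation-rule"},
    "gs-004-logan-rating":         {"fix-escalation-rule"},
    "clarify-likelihood-weighting": set(),
}

# static_order() yields each item only after everything it depends on:
# the weighting clarification, then the root-cause fix, then the two
# downstream rating issues.
order = list(TopologicalSorter(graph).static_order())
```

Because both rating issues depend on `fix-escalation-rule`, an agent (or scheduler) walking this order never picks up a downstream issue before its root cause is resolved.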

Decomposition at Scale

This is decomposition from the Blue Square track — breaking big problems into independently shippable pieces — operating at the system level. In Blue Square, you decomposed a feature into stories. Here, you decompose eval failures into work items with dependency relationships.

The same principles apply:

  • Each work item should be independently completable — an agent can pick it up without understanding the entire system
  • The acceptance criteria come from the golden dataset — the eval either passes or it doesn't
  • Work items that share dependencies are sequenced; independent items can run in parallel

What changes is that the decomposition can happen automatically. Your eval harness knows which scenarios failed. Your structured logging knows where the divergence occurred. The three-lever framework from Lift 2 suggests what category of fix is needed. A skill that encodes this diagnostic pattern can turn a batch of eval failures into a structured set of work items — without you reading every log line.
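The automatic grouping step can be sketched simply: cluster failures by where the trace shows divergence, and treat clusters as shared-root-cause candidates. The failure records and field names here are hypothetical; in practice they come from joining eval harness output with trace data.

```python
from collections import defaultdict

# Hypothetical failure records: eval output joined with the span
# where structured logging shows the divergence occurred.
failures = [
    {"scenario": "gs-002", "zone": "Ogden",  "diverged_at": "escalation_rule"},
    {"scenario": "gs-004", "zone": "Logan",  "diverged_at": "escalation_rule"},
    {"scenario": "gs-006", "zone": "Uintas", "diverged_at": "wind_slab_threshold"},
]

# Failures diverging at the same span are candidates for one shared
# root-cause issue; singletons become independent work items.
by_root = defaultdict(list)
for f in failures:
    by_root[f["diverged_at"]].append(f["scenario"])

work_items = [
    {"root_cause": cause, "scenarios": scenarios, "shared": len(scenarios) > 1}
    for cause, scenarios in by_root.items()
]
```

Here `escalation_rule` covers two scenarios and becomes a single root-cause issue with two downstream checks, while `wind_slab_threshold` stands alone. That clustering decision is exactly what a diagnostic skill can encode.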

Team Discussion: Structuring Work from Failures

Format: Team Discussion
Time: ~2 minutes

Imagine your eval harness runs all seven golden dataset scenarios and three fail: Ogden, Logan, and Uintas all rate one level too high.

Discuss:

  • What questions would you need to answer before creating work items?
  • How would you determine whether the three failures share a root cause or are independent?
  • What would the task graph look like?
  • What information would an agent need in the issue to fix the problem without asking you clarifying questions?

Key Insight

The gap between "my evals tell me what's wrong" and "the system fixes what's wrong" is a structured work pipeline. Eval failures become issues with enough context to be delegation contracts. Task graphs organize those issues by dependency. The decomposition patterns from Blue Square — independently shippable, bounded scope, testable criteria — work at the system level when the criteria come from golden datasets and the scope comes from structured logging. Your role shifts from triaging failures to designing the pipeline that triages them.