
Factory-Level Feedback Loops

The Loop You Already Know — Automated

In Lift 1, you learned the feedback loop: eval failure → diagnose with traces → update the right lever → re-eval to confirm. The three levers are: fix the analysis logic (a code bug), adjust the thresholds (configuration doesn't match expert judgment), or refine the skills (the AI coding assistant's instructions produce wrong behavior).

That loop works. But every iteration requires you. You read the eval output. You follow the trace IDs. You decide which lever to pull. You make the change. You re-run the eval.

A factory-level feedback loop automates this cycle. The eval harness detects a failure. The system diagnoses which lever to pull using the structured logs and trace IDs you designed in Lift 1. It proposes or applies the fix. It re-runs the eval to confirm. The ratchet effect — every fixed failure becomes a permanent regression guard — now compounds without waiting for your attention.

The system-level question: which parts of this loop can safely run without you?
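One turn of that automated cycle can be sketched in a few lines. This is a minimal sketch, not a real harness API: the callables `run_evals`, `diagnose`, and `apply_fix` are illustrative stand-ins for your eval runner, trace-log diagnosis, and remediation step.

```python
def feedback_loop(run_evals, diagnose, apply_fix, regression_set):
    """One turn of the factory-level ratchet (illustrative sketch).

    run_evals()          -> list of current failures (empty = all green)
    diagnose(failure)    -> which lever to pull: "code" | "threshold" | "skill"
    apply_fix(f, lever)  -> True if a remediation was applied
    """
    for failure in run_evals():
        lever = diagnose(failure)
        if apply_fix(failure, lever):
            # Re-run to confirm the fix, then ratchet: the fixed
            # case becomes a permanent regression guard.
            if failure not in run_evals():
                regression_set.add(failure)
    return regression_set
```

The ratchet lives in the last two lines: a confirmed fix is added to the regression set, so the same failure can never silently return.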

Automated Skill Refinement

The shared skills library from Lift 2 is where factory-level feedback loops deliver the most leverage. When an eval failure traces back to a skill producing incorrect output, the remediation path is:

  1. Detect — the eval harness identifies a failure and the trace logs pinpoint the skill responsible
  2. Diagnose — the system compares the skill's actual output against the expected output from the golden dataset, identifying the specific gap
  3. Propose — an AI coding assistant drafts a skill refinement that addresses the gap, using the same "We Do, You Do" pattern — but now the "We Do" is between the eval result and the AI, not between you and the AI
  4. Verify — the updated skill runs against the full golden dataset, not just the failing scenario, to confirm it doesn't break other cases
  5. Gate — the refinement either auto-merges (if the change is low-risk and all evals pass) or queues for human review (if the change affects high-stakes components)
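The verify step deserves emphasis: the refined skill must pass the entire golden dataset, not just the scenario that failed. A minimal sketch, assuming a skill is any callable from input to output and the golden dataset is a list of `(input, expected)` pairs:

```python
def verify_refinement(skill, golden_dataset):
    """Step 4: run the refined skill against the FULL golden dataset
    so a fix for one scenario can't silently regress the others.

    skill          -- callable: input -> output (illustrative shape)
    golden_dataset -- iterable of (input, expected_output) pairs
    Returns the list of inputs that still fail; empty means the
    refinement is ready for the gate in step 5.
    """
    return [
        inp for inp, expected in golden_dataset
        if skill(inp) != expected
    ]
```

An empty return value is the precondition for step 5; any surviving failures send the refinement back to step 3.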

The autonomy slider from Lift 1 determines step 5. A skill refinement that changes how the CAIC pipeline normalizes date formats might auto-merge. A skill refinement that changes how the alert engine evaluates danger thresholds should require human approval. The reversibility principle applies: reversible changes with low blast radius earn higher autonomy.
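The gate decision itself can be expressed as a small function. This is a sketch of the policy described above, with hypothetical parameter names; your real gate would read these properties from the change's metadata:

```python
def gate_decision(reversible: bool, blast_radius: str, all_evals_pass: bool) -> str:
    """Step 5 gate, per the reversibility principle: reversible,
    low-blast-radius changes with green evals earn auto-merge;
    everything else queues for human review.

    blast_radius -- "low" or "high" (illustrative two-level scale)
    """
    if reversible and blast_radius == "low" and all_evals_pass:
        return "auto-merge"
    return "human-review"
```

Under this policy the date-format normalization change auto-merges, while the danger-threshold change routes to a human regardless of eval results, because its blast radius is high.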

Quality Gate Failures as Triggers, Not Blockers

In the traditional model, a quality gate failure blocks deployment. You investigate, fix, and retry. The gate is a wall.

In a self-healing factory, a quality gate failure is a trigger. The failure event initiates a remediation workflow:

| Gate Failure | Automated Response | Human Review Required? |
| --- | --- | --- |
| Lint violation | Auto-fix and re-run | No — deterministic fix |
| Type error | Diagnose and propose fix | Depends on scope |
| Test failure | Trace to root cause, propose fix | Yes — behavioral change |
| Eval regression | Identify lever, propose remediation | Yes — correctness judgment |

The pattern: deterministic failures get deterministic fixes. A lint violation typically has a single mechanical fix. A type error usually has a narrow solution space. These can auto-heal. Test failures and eval regressions require judgment about intent — the system proposes, but a human confirms.
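That routing policy can be captured in a small dispatch table. A sketch under stated assumptions: the failure kinds and the `scope_is_local` flag are illustrative, not tied to any real CI system's event schema.

```python
# Hypothetical routing table: (automated action, needs human review).
# None means the review decision depends on the failure's scope.
ROUTES = {
    "lint":            ("auto-fix", False),
    "type-error":      ("propose-fix", None),
    "test-failure":    ("propose-fix", True),
    "eval-regression": ("propose-remediation", True),
}

def route_gate_failure(kind: str, scope_is_local: bool = True):
    """Map a quality-gate failure to its remediation workflow."""
    action, needs_review = ROUTES[kind]
    if needs_review is None:
        # Type errors: a narrowly scoped fix can auto-apply;
        # a change that ripples across modules needs a reviewer.
        needs_review = not scope_is_local
    return action, needs_review
```

The table makes the policy auditable: adding a new gate type means adding one row, not editing branching logic.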

This is the SonarQube Remediation Agent pattern emerging across the industry in 2026: when a quality gate fails in CI/CD, a specialized agent triggers automatically, analyzes the failure, and posts a suggested fix directly to the PR. The review-fix-verify bottleneck — historically the slowest step in CI/CD — becomes an automated loop.

Try It: Design an Automated Feedback Loop

Think about the multi-center platform's eval harness. When the CAIC pipeline misparses a danger rating, or the AI analysis layer generates a briefing that mischaracterizes the forecast, what should happen automatically?

Ask your AI coding assistant:

Design a factory-level feedback loop for our avalanche platform's eval harness. When an eval fails — the deterministic pipeline misparses a field or the AI analysis layer generates an inaccurate briefing — describe the automated workflow: how does the system diagnose which of the three levers to pull, what does the remediation look like for each lever, and where should a human stay in the loop?

Claude Code designs the feedback loop and can scaffold the automation — eval runner, diagnostic logic, and PR-generation workflow.

Codex designs the feedback loop architecture. Implementation follows from the design.

pi designs the feedback loop. Create the implementation based on its architecture.

Evaluate the result: does the system correctly distinguish between the three levers? Does the autonomy slider setting make sense for each type of remediation?

Key Insight

The factory-level feedback loop is the same closed loop from Lift 1 — eval failure → diagnose → update lever → re-eval — running autonomously within constraints you define. The ratchet effect compounds without waiting for your attention. The autonomy slider determines which remediations auto-merge and which queue for human review. The system-level principle: deterministic failures get deterministic fixes; judgment-dependent failures get human review. The factory gets strictly better over time because the ratchet never stops turning.