
The Deliberate Line

Orchestration for Critical Decisions

For the decisions where you've set the slider to Approver or Collaborator — the irreversible, high-consequence ones — how do you make sure the system gives you the best possible information before you decide?

Single-agent analysis has a problem: one agent, one perspective. If the agent's skill instructions contain a bias (overweighting recent snowfall, underweighting wind loading), every analysis reflects that bias. The obedience problem from Section 1 means the agent won't flag its own bias.

Orchestration patterns address this by introducing structured disagreement — multiple agents with different perspectives, each challenging the others.

Two Patterns

Team Lead + Specialists: A central agent decomposes a complex assessment into sub-tasks and delegates to specialized agents. One specialist analyzes snowpack stability. Another analyzes weather trends. A third evaluates historical patterns for the zone. The lead synthesizes their results into a unified assessment.

The lead doesn't just concatenate — it resolves conflicts. If the snowpack specialist says "stable" and the weather specialist says "rapid warming trend," the lead must weigh those signals and explain its reasoning. You review the synthesis and the individual specialist reports.
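As a minimal sketch, the pattern might look like the code below. The specialist "agents" here are stand-in functions rather than real agent calls, and the lead's conflict resolution is a hand-written rule — every name and threshold (`snowpack_specialist`, `weak_layer_depth_cm`, the 2°C/day cutoff) is an assumption for illustration, not part of any real framework:

```python
from dataclasses import dataclass

@dataclass
class SpecialistReport:
    specialist: str
    finding: str        # e.g. "stable", "rapid warming trend"
    confidence: float

def snowpack_specialist(data: dict) -> SpecialistReport:
    # Stand-in for an agent call; classifies stability from a stub signal.
    stable = data.get("weak_layer_depth_cm", 0) > 60
    return SpecialistReport("snowpack", "stable" if stable else "unstable", 0.8)

def weather_specialist(data: dict) -> SpecialistReport:
    warming = data.get("temp_trend_c_per_day", 0) > 2
    return SpecialistReport("weather",
                            "rapid warming trend" if warming else "steady", 0.75)

def team_lead_assess(data: dict) -> dict:
    reports = [snowpack_specialist(data), weather_specialist(data)]
    findings = {r.specialist: r.finding for r in reports}
    # The lead resolves conflicts rather than concatenating: a stable
    # snowpack under rapid warming is still a concern, and the synthesis
    # must say why.
    if findings["snowpack"] == "stable" and findings["weather"] == "rapid warming trend":
        synthesis = "elevated concern: stability may degrade as warming continues"
    elif findings["snowpack"] == "unstable":
        synthesis = "high concern: snowpack instability observed"
    else:
        synthesis = "low concern"
    # Individual reports are returned alongside the synthesis so the
    # human can review both.
    return {"synthesis": synthesis, "reports": reports}
```

The key design choice is the return value: the synthesis never replaces the specialist reports — both travel together to whoever reviews the assessment.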

Debate and Consensus: Two or more agents independently analyze the same data, then a reviewer agent compares their assessments. If they agree, the consensus is high-confidence. If they disagree, the reviewer highlights the disagreement and escalates for human review.

This pattern is the structural equivalent of the "steelman the opposite" technique from conversation-level anti-obedience — except it happens automatically, every time, without you having to ask.
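A minimal sketch of the reviewer step, assuming each assessment carries a danger rating plus a set of identified avalanche problems (a hypothetical structure, not a real API). Consensus is keyed on the rating; problem-level differences are surfaced either way so they reach the human:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assessment:
    analyst: str
    rating: str             # e.g. "Considerable"
    problems: frozenset     # identified avalanche problems
    confidence: float

def review(a: Assessment, b: Assessment) -> dict:
    # Surface problem-level differences whether or not the ratings
    # agree, so the human always sees the full picture.
    detail = {
        "ratings": {a.analyst: a.rating, b.analyst: b.rating},
        "shared_problems": sorted(a.problems & b.problems),
        "disputed_problems": sorted(a.problems ^ b.problems),
    }
    if a.rating == b.rating:
        # Ratings agree: high-confidence consensus, alert can proceed.
        return {"status": "consensus", "rating": a.rating, "detail": detail}
    # Ratings disagree: escalate with both assessments attached.
    return {"status": "escalate_to_human", "detail": detail}
```

Note the asymmetry: agreement produces a result the system can act on, while disagreement produces no rating at all — only an escalation payload.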

| Pattern | Best For | Tradeoff |
| --- | --- | --- |
| Team Lead + Specialists | Complex assessments requiring multiple data sources | Requires well-defined sub-tasks; quality depends on the lead's synthesis |
| Debate and Consensus | High-stakes decisions where you need confidence | Costs more (multiple agents analyzing the same data); slower |

For your alert system, consider: a "Considerable" or higher danger rating triggers the debate pattern before an alert is dispatched. Two agents independently assess the data. If they agree, the alert proceeds. If they disagree, the system escalates to human review — with both assessments and their reasoning visible to you.
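One way to wire that trigger, sketched below. The ordering uses the standard five-level avalanche danger scale; the function names and the shape of the analyst callables are assumptions for illustration:

```python
DANGER_SCALE = ["Low", "Moderate", "Considerable", "High", "Extreme"]

def needs_debate(rating: str, threshold: str = "Considerable") -> bool:
    # Ratings at or above the threshold trigger the debate pattern
    # before any alert is dispatched.
    return DANGER_SCALE.index(rating) >= DANGER_SCALE.index(threshold)

def gate_alert(rating: str, assess_a, assess_b) -> dict:
    """assess_a / assess_b are callables standing in for independent agents."""
    if not needs_debate(rating):
        # Low-consequence ratings skip the extra cost of a second agent.
        return {"action": "dispatch", "mode": "single-agent"}
    a, b = assess_a(), assess_b()
    if a["rating"] == b["rating"]:
        return {"action": "dispatch", "mode": "debate-confirmed",
                "assessments": [a, b]}
    # Disagreement: hold the alert and surface both assessments
    # and their reasoning for human review.
    return {"action": "human_review", "assessments": [a, b]}
```

The threshold is itself a slider setting: lowering it buys more structural disagreement at the cost of more compute per alert.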

The Approval Spectrum

Orchestration patterns aren't an alternative to the autonomy slider — they're what makes the Approver position effective. When a decision requires your approval, the orchestration pattern determines what you see when you make that decision:

| What You See | Quality of Your Decision |
| --- | --- |
| "Alert: Considerable danger, Salt Lake" | Low — you're approving based on the system's conclusion with no context |
| "Alert: Considerable danger, Salt Lake. Analysis confidence: 0.89. Wind Slab + Persistent Weak Layer identified. Matches golden dataset pattern gs-001." | Medium — you have the reasoning and reference data |
| "Alert: Considerable danger, Salt Lake. Two agents agree. Analyst A identified Wind Slab + PWL (confidence 0.91). Analyst B identified Wind Slab + PWL + New Snow (confidence 0.87). Disagreement on New Snow — Analyst B weighted recent 24hr precipitation higher. Both agree on Considerable." | High — you see the consensus, the disagreement, and the reasoning behind each |

The third option takes more compute and more time. It's not worth it for every decision. It IS worth it for the decisions where being wrong has consequences — the ones where you've deliberately set the slider to Approver.
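To make the high-detail option concrete, here's a sketch of how such an approval payload might be assembled from two analysts' outputs. The field names (`analyst`, `rating`, `problems`, `confidence`) are invented for illustration:

```python
def approval_message(zone: str, rating: str, assessments: list) -> str:
    # Build the high-context approval payload: consensus, per-analyst
    # findings, and any disagreement — so the approver sees reasoning,
    # not just a conclusion.
    lines = [f"Alert: {rating} danger, {zone}."]
    ratings = {a["rating"] for a in assessments}
    lines.append("Agents agree on rating."
                 if len(ratings) == 1
                 else f"Rating disagreement: {sorted(ratings)}.")
    for a in assessments:
        lines.append(f"{a['analyst']} identified {' + '.join(a['problems'])} "
                     f"(confidence {a['confidence']}).")
    problem_sets = [set(a["problems"]) for a in assessments]
    disputed = set.union(*problem_sets) - set.intersection(*problem_sets)
    if disputed:
        lines.append(f"Disagreement on: {', '.join(sorted(disputed))}.")
    return " ".join(lines)
```

The point of the function is what it refuses to omit: even under consensus, any problem one analyst saw and the other didn't is named explicitly in the message.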

Team Discussion: Where Does Debate Add Value?

Format: Team Discussion Time: ~2 minutes

Think about the decisions in your system that sit at the Approver position on the autonomy slider.

Discuss: Which of those decisions would benefit from multi-agent debate? Is the danger rating itself a good candidate (two agents independently assess, reviewer compares)? What about the alert decision (one agent assesses danger, another evaluates whether the alert threshold logic applies correctly)? Where would the cost of running multiple agents NOT be worth the improvement in decision quality?

Drawing the Line

You now have every piece:

| Lift | What You Built | What It Enables |
| --- | --- | --- |
| Lift 1 | Context engineering, skills, delegation contracts | AI understands your project, your conventions, and what "done" looks like |
| Lift 2 | Golden datasets, eval harness, structured logging | You know when the system produces wrong answers, and you can trace why |
| Lift 3 | Auto-created work items, worktrees, quality gates | The system responds to failures and builds in parallel with guardrails |
| Lift 4 | Anti-obedience, autonomy slider, orchestration patterns | You've deliberately chosen where humans stay in the loop |

The journey across this track: Delegator → Director. In Blue Square, you learned to delegate — trusting AI to do work you'd verify. In Black Diamond, you learned to direct — designing systems that do the delegating, the measuring, and the responding, while you decide where to trust and where to intervene.

The line you've drawn isn't permanent. As the golden dataset grows, as evals get tighter, as skills encode more judgment, you can push the slider further. A component that starts at Approver today might move to Monitor next quarter — not because you trust AI more, but because you've built the infrastructure that makes monitoring safe.

The line is deliberate. That's what matters.

Team Discussion: The Line You've Drawn

Format: Team Discussion Time: ~3 minutes

As a team, reflect on the system you've designed across all four lifts.

Discuss:
- Where have you drawn the line between autonomous and human-approved? Are you comfortable with those choices?
- What would need to change for you to push the slider further on the components that currently require approval?
- What's the difference between "I trust the system" and "I've built infrastructure that makes the system trustworthy"?
- The track started with the visibility problem — "I can't manually check all of it anymore." Do you feel like you've solved that problem? Or reframed it into something more manageable?

Key Insight

The autonomy slider isn't about trusting AI. It's about trusting the infrastructure you've built around AI — evals that catch wrong answers, logging that traces why, quality gates that block regressions, orchestration patterns that introduce structural disagreement for critical decisions, and the deliberate design choices about where humans stay in the loop. The line between autonomous and human-approved isn't drawn by capability — it's drawn by consequence, reversibility, and the strength of your measurement infrastructure. A director doesn't do the work. A director designs the systems that do the work and knows exactly where to intervene.