Directing AI Systems¶
From Individual Toolkit to System Design¶
The Black Diamond track teaches the individual director's toolkit: context engineering, skills, delegation contracts, parallel execution with worktrees, quality gates, and the autonomy slider. Each of these is a tool for one practitioner directing one system.
The Double Black Diamond question is different: what happens when you're designing systems that multiple contributors — human and AI — operate simultaneously? The tools don't change. The design challenge does.
This section recaps the directing toolkit with an eye toward what changes at organizational scale.
Skills, Context, and Parallel Execution¶
Skills encode judgment into reusable, version-controlled files. The "We Do, You Do" pattern captures your process through refinement — the value isn't the file, it's the iteration that shaped it. At the individual level, skills solve a personal problem: "I keep re-explaining the same convention." At the team level, they solve role gaps: a decomposition skill acts as a PM, a test generation skill acts as QA.
Context engineering at scale is architecture, not authoring. The project context file is the foundation — same bootstrap pattern, same "INDEX, not ENCYCLOPEDIA" principle. What changes at scale is the layered architecture: path-scoped rules for conditional context, subdirectory context for domain-specific knowledge. The discipline is knowing what loads when and keeping each layer focused. Unfollowed rules are worse than no rules — they teach your AI coding assistant to ignore instructions.
Worktrees give each workstream its own isolated copy of the codebase. Each agent works on its own branch, in its own directory, with its own context window. The delegation-ready test determines whether a task can be parallelized: can it be built and tested independently? If yes, it goes in a worktree. If it shares files with another workstream, it needs sequencing or careful merge planning.
- **Claude Code**: Launch with `claude --worktree fix-escalation` to create an isolated worktree with its own branch and conversation. Check status with `/tasks`. Skills and context load automatically in each worktree.
- **Codex**: Create worktrees manually with `git worktree add ../fix-escalation -b fix-escalation`, then run a separate Codex instance in that directory.
- **pi**: Create worktrees manually with `git worktree add`, then launch a separate `pi` instance in each worktree directory.
Quality Gates and the Ratchet¶
Quality gates are automated checks that block progress if the output doesn't meet a defined standard. The full stack for AI-speed development:
| Gate | What It Checks | When It Runs |
|---|---|---|
| Linting | Code style, anti-patterns, common errors | Every change |
| Type checking | Type safety, interface contracts | Every change |
| Unit tests | Individual function behavior | Every change |
| Acceptance tests | Feature-level behavior against user stories | Every change |
| Eval harness | System correctness against golden datasets | Every change that affects analysis, thresholds, or skills |
| Deployment gate | All of the above must pass | Before any deployment |
The deployment gate is the critical addition. No code reaches production unless it passes your full quality bar. A single verification command runs everything — pass means deploy, fail means fix.
The ratchet effect from the eval harness means the system gets strictly better over time. Every eval failure you fix becomes a permanent regression guard. A skill refinement that fixes one zone but breaks another gets caught before deployment.
The Autonomy Slider¶
Autonomy isn't binary. It's a slider with five positions:
| Position | Human Role | AI Role |
|---|---|---|
| Operator | Directs every action | Executes what you say |
| Collaborator | Works alongside AI | Co-produces with guidance |
| Approver | Reviews before execution | Proposes, waits for approval |
| Monitor | Watches for problems | Acts within constraints |
| Observer | Reviews after the fact | Fully autonomous |
The reversibility principle determines where each component sits: reversible actions with low consequences can run at higher autonomy; irreversible actions with real-world impact need human approval. Different components of the same system sit at different positions — this is a per-component design decision, not a system-wide setting.
Human-in-the-loop (Approver): the human approves before execution. Safe, but doesn't scale. At high volume, approval degrades into rubber-stamping.
Human-on-the-loop (Monitor): the system acts autonomously within constraints. Scalable, but requires trust. That trust comes from the measurement infrastructure — evals, logging, quality gates.
The infrastructure from the previous section is what makes the Monitor and Observer positions safe. Without that infrastructure, high autonomy is high risk. With it, you can deliberately choose where to trust the system and where to stay in the loop.
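The reversibility principle can be made explicit as a small policy function that caps each component's autonomy. This is a sketch of one possible encoding; the enum values mirror the five slider positions above, and the policy thresholds are an assumption, not a standard.

```python
from enum import Enum

class Autonomy(Enum):
    OPERATOR = 1
    COLLABORATOR = 2
    APPROVER = 3
    MONITOR = 4
    OBSERVER = 5

def max_autonomy(reversible: bool, high_consequence: bool) -> Autonomy:
    """Reversibility principle: irreversible, high-consequence actions
    cap out at Approver; everything else can run at higher autonomy."""
    if not reversible and high_consequence:
        return Autonomy.APPROVER
    if not reversible or high_consequence:
        return Autonomy.MONITOR
    return Autonomy.OBSERVER

# Per-component settings (hypothetical components, for illustration).
COMPONENTS = {
    "report drafting": max_autonomy(reversible=True, high_consequence=False),
    "alert publishing": max_autonomy(reversible=False, high_consequence=True),
}
```

Keeping the mapping in code (or config) makes the per-component nature of the decision visible and reviewable, instead of an implicit system-wide setting.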
Orchestration Patterns¶
For the decisions where you've set the slider to Approver — the irreversible, high-consequence ones — orchestration patterns introduce structured disagreement to improve decision quality.
Team Lead + Specialists: A central agent decomposes a complex assessment into sub-tasks and delegates to specialized agents. One specialist analyzes snowpack stability, another analyzes weather trends, a third evaluates historical patterns. The lead synthesizes their results, resolving conflicts and explaining its reasoning.
Debate and Consensus: Two or more agents independently analyze the same data, then a reviewer agent compares their assessments. Agreement means high confidence. Disagreement triggers escalation for human review — with both assessments and their reasoning visible to you.
These patterns are the structural equivalent of anti-obedience techniques — they counter the obedience problem (AI defaults to agreement rather than flagging when something is wrong) by designing disagreement into the system rather than relying on a single agent to be both producer and critic.
| Pattern | Best For | Tradeoff |
|---|---|---|
| Team Lead + Specialists | Complex assessments requiring multiple data sources | Quality depends on the lead's synthesis |
| Debate and Consensus | High-stakes decisions where you need confidence | More compute, slower |
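Stripped of any particular agent framework, both patterns reduce to a simple control flow. In this sketch the agents are stand-in callables (in practice each would be a model or tool call); the function names and return shapes are illustrative only.

```python
def team_lead(specialists, synthesize, data):
    """Team Lead + Specialists: delegate sub-tasks to specialist agents,
    then let the lead synthesize their results."""
    results = {name: agent(data) for name, agent in specialists.items()}
    return synthesize(results)

def debate_and_consensus(analysts, reviewer, data):
    """Debate and Consensus: independent analyses of the same data.
    Agreement -> high-confidence decision; disagreement -> escalate to a
    human with both assessments and the reviewer's comparison visible."""
    assessments = [analyze(data) for analyze in analysts]
    if all(a == assessments[0] for a in assessments):
        return {"decision": assessments[0], "escalate": False}
    return {
        "decision": None,
        "escalate": True,
        "assessments": assessments,
        "review": reviewer(assessments),
    }
```

The structural point is visible in the code: neither pattern trusts a single agent to be both producer and critic. Disagreement is a designed-in output, not an error.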
Team Discussion: From Director to Disruptor¶
Format: Team Discussion
Time: ~3 minutes
Think about the system you're about to build: a multi-center avalanche operations platform where each pair on the team takes one center's data and builds their own pipeline, alerting rules, dashboard, and skills. Eventually, you combine into a unified platform.
Discuss:

- Each concept in this section was designed for one practitioner directing one system. What changes when two pairs are independently building pipelines that need to merge? Skills — whose conventions win? Context architecture — who decides the shared rules? Quality gates — whose eval harness is the standard?
- The Black Diamond track ends at "Director" — you design systems that do the work and know where to intervene. The Double Black Diamond track pushes to "Disruptor" — you design systems that transform how your organization builds software. What's the gap between those two? What needs to be true for the individual director's toolkit to become an organizational capability?
- When you look at the platform you're about to build, where on the autonomy slider should the cross-center normalization sit? The unified alerting? The compliance scanning? Do you and your teammates agree?
Key Insight¶
The director's toolkit — skills, context architecture, worktrees, quality gates, the autonomy slider, orchestration patterns — works for one practitioner directing one system. The disruptor's challenge is making that toolkit work for an organization. Skills become a shared library. Context becomes a team-wide architecture. Quality gates become organizational standards. The autonomy slider becomes a governance framework. The same tools, applied at a different level of abstraction. That's the journey this track takes you on.