Owning the Quality Bar¶
You Define Production-Ready¶
With multiple worktrees, auto-created issues, and agents picking up work items, you're no longer reviewing every line of code. That's the point — you've designed systems to scale beyond what you can personally supervise.
But "I can't review everything" is only safe when the pipeline enforces what you'd catch in review. The pipeline needs to know your quality bar — what standards the code must meet before it can be deployed. That's not something the pipeline figures out on its own. You define it.
Quality Gates: Your Automated Standards¶
A quality gate is an automated check that blocks progress if the output doesn't meet a defined standard. You've been building quality gates throughout this track — tests, evals, and structured logging are all quality gates at different levels.
The full stack for AI-speed development:
| Gate | What It Checks | When It Runs |
|---|---|---|
| Linting | Code style, anti-patterns, common errors | Every change |
| Type checking | Type safety, interface contracts | Every change |
| Unit tests | Individual function behavior | Every change |
| Acceptance tests | Feature-level behavior against user stories | Every change |
| Eval harness | System correctness against golden datasets | Every change that affects analysis, thresholds, or skills |
| Deployment gate | All of the above must pass | Before any deployment |
The deployment gate is the critical addition. Without it, a worktree that passes its own tests but fails the eval suite can still be deployed. With it, no code reaches production unless it passes your full quality bar.
The single verification command pattern makes this concrete. Instead of remembering which checks to run, you configure one command that runs everything — linting, type checking, tests, and evals. Pass means deploy. Fail means fix. No ambiguity.
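The single-command pattern can be sketched as a small runner script. This is a minimal sketch, assuming hypothetical gate commands — `ruff`, `mypy`, and `pytest` here are illustrative stand-ins for whatever linter, type checker, test suite, and eval harness your project actually uses:

```python
import subprocess
import sys

# Gate commands are illustrative -- substitute your project's own
# linter, type checker, test runner, and eval harness invocations.
GATES = [
    ("lint", ["ruff", "check", "."]),
    ("types", ["mypy", "src"]),
    ("tests", ["pytest", "tests"]),
    ("evals", ["pytest", "evals"]),
]

def run_gates(gates, runner=subprocess.run):
    """Run each gate in order; return the name of the first gate
    that fails, or None if every gate passes."""
    for name, cmd in gates:
        result = runner(cmd)
        if result.returncode != 0:
            return name  # this gate blocks the deploy
    return None  # all gates green

if __name__ == "__main__":
    failed = run_gates(GATES)
    if failed:
        sys.exit(f"FAIL: {failed} gate blocked the deploy")
    print("PASS: all gates green -- safe to deploy")
```

Wiring this script up as `verify` (a Makefile target, npm script, or shell alias) gives you the no-ambiguity property: one command, one exit code, pass or fail.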
Evals as Deployment Guards¶
Here's where Lift 2's investment compounds. Your golden datasets aren't just diagnostic tools — they're deployment guards. When the eval harness runs as part of the deployment pipeline, it blocks any change that would cause the deterministic pipeline to misparse a forecast or the AI analysis layer to generate an inaccurate briefing.
This creates a ratchet effect: every eval failure you fix becomes a permanent regression guard. Today your golden dataset has seven scenarios. After a week of development, you might have twenty — each one representing a case where the system once got the wrong answer and was corrected. The system gets strictly better over time because it can never regress past a fixed failure.
The ratchet works in both directions:
- Blocking bad changes. A skill refinement that fixes Ogden but breaks Salt Lake gets caught before deployment.
- Expanding coverage. Every time you discover a new edge case (two avalanche problems on the same aspect, a zone with no SNOTEL data), you add a golden dataset scenario. Future changes are automatically tested against it.
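The ratchet can be sketched as an append-only golden dataset. The shapes below are hypothetical — `Scenario` and `analyze` stand in for your pipeline's real golden-dataset records and analysis entry point:

```python
from dataclasses import dataclass, field

# Hypothetical record shape: one entry per fixed failure.
@dataclass
class Scenario:
    name: str       # e.g. "ogden-two-problems-same-aspect"
    forecast: dict  # raw input captured when the bug was found
    expected: dict  # the corrected output, frozen as the standard

def analyze(forecast):
    # Placeholder for the real deterministic pipeline + AI layer.
    return {"danger": forecast.get("danger", "unknown")}

def run_evals(scenarios, analyze_fn=analyze):
    """Return the names of scenarios whose output regressed."""
    return [s.name for s in scenarios
            if analyze_fn(s.forecast) != s.expected]

# Each fixed failure appends a Scenario here. The list only ever
# grows, so a change can never regress past an already-fixed case.
GOLDEN = [
    Scenario("baseline-danger",
             {"danger": "considerable"},
             {"danger": "considerable"}),
]
```

The ratchet lives in the last comment: scenarios are added, never removed, so `run_evals(GOLDEN)` returning an empty list is a strictly stronger guarantee each week.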
Building Guardrails, Not Features¶
The shift at this level is in where you invest your time. Earlier in the track, you spent time building features — data ingestion, analysis logic, alert templates. Now the higher-leverage work is building the guardrails that ensure features work correctly at scale.
| Feature Work | Guardrail Work |
|---|---|
| Build the Skyline zone ingestion pipeline | Write the golden dataset scenario for Skyline (which has no SNOTEL data) |
| Add a new alert template for rapid-change conditions | Add an eval scenario that verifies the alert fires at the right threshold |
| Improve the analysis engine's problem weighting | Add golden dataset scenarios for the edge cases where weighting matters |
The guardrails are what let you delegate the feature work to agents with confidence. An agent can build the Skyline pipeline in a worktree. If the golden dataset for Skyline passes, the pipeline works correctly. If it doesn't, the deployment gate blocks it. You don't need to read the code — the quality bar does the reviewing.
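The "quality bar does the reviewing" idea can be sketched as a pure decision function. The names are hypothetical; in a real pipeline, `run_tests` and `run_evals` would shell out to the actual suites:

```python
def review(change, run_tests, run_evals):
    """Judge a change by its gates alone, never by reading its code.

    run_tests(change) -> bool: the worktree's own test suite.
    run_evals(change) -> list: names of regressed golden scenarios.
    """
    if not run_tests(change):
        return "blocked: tests failed"
    regressions = run_evals(change)
    if regressions:
        return "blocked: eval regressions in " + ", ".join(regressions)
    return "approved: deploy"
```

Note that `review` never inspects `change` itself — that is the delegation contract in miniature: the verdict depends only on whether the gates pass.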
Team Activity: Design Your Quality Pipeline¶
Format: Mob Session. Time: ~3 minutes. Setup: One person drives, everyone contributes.
As a team, sketch out the quality gate pipeline for your project. Think about what runs on every change vs. what runs only at deployment time.
Discuss:

- What's on your minimum quality bar? (Linting, type checking, tests?)
- Which eval scenarios are must-pass vs. nice-to-have?
- What's your single verification command — what does it run?
- What happens when a worktree branch passes its own tests but fails the full eval suite after merging?
The Wall You're About to Hit¶
You've automated a lot. Eval failures become work items. Agents pick them up in isolated worktrees. Quality gates block deployment when evals regress. The system is increasingly capable of operating without your direct supervision.
But: some decisions are happening without you that you're not comfortable with. The alert engine just sent a notification based on its own analysis. Was that the right call? A worktree agent refactored the analysis engine and the evals pass — but the approach is completely different from what you would have chosen. Is that a problem?
If you step in for everything, you've defeated the purpose of the system you built. If you don't step in for anything, you've abdicated responsibility for decisions that matter.
The question isn't "should I be in the loop?" It's "where should I be in the loop?" That's what Lift 4 addresses.
Team Discussion: Where Should Humans Stay?¶
Format: Team Discussion. Time: ~2 minutes.
Think about the system as you've built it: auto-created issues, worktrees, quality gates, eval-gated deployments.
Discuss:

- What decisions are you comfortable letting the system make autonomously?
- What decisions do you want a human to review before they execute?
- Is there a difference between "the eval passed" and "I'm confident this is right"? Where's the line?
Key Insight¶
Your quality bar is the contract between you and the system you've built. Quality gates — linting, tests, evals, deployment guards — enforce that contract automatically. The ratchet effect means every fixed eval failure becomes a permanent regression guard, so the system gets strictly better over time. At this level, the highest-leverage work isn't building features — it's building the guardrails that let agents build features safely. Your role is defining what "production-ready" means. The pipeline's role is enforcing it.