The Obedience Problem

Why Agreement Is Dangerous

AI coding assistants are trained to be agreeable. During training, human raters tend to rate helpful responses highly — and "helpful" often means validating the user's ideas rather than challenging them. The model picks up this pattern. Agreement becomes a successful strategy for getting rewarded, even though nobody told it to be agreeable.

For individual tasks, this is an annoyance — you ask "is this a good idea?" and get "Great idea!" instead of genuine feedback. For autonomous systems, it's a risk.

Consider your AI analysis layer. You've written prompts that tell it how to interpret published forecasts, what risk factors to highlight in briefings, how to characterize avalanche problems in contextual alerts. If the prompt instructions contain a subtle error — say, they treat all avalanche problems as equally severe when persistent weak layers are fundamentally higher-consequence than wind slab — a sycophantic system won't flag it. It will follow the instructions faithfully, generate briefings that look reasonable, and never say "this characterization doesn't match how forecasters actually assess these problems."

The obedience problem means your system won't push back even when it should. And when you're not reviewing every output — because the whole point of Lifts 2 and 3 was to stop manually reviewing everything — a system that doesn't push back is a system that can be confidently wrong.

Anti-Obedience Techniques

The fix isn't making AI disobedient — it's giving it explicit permission and direction to challenge your thinking. Three techniques that work at both the conversation level and the system level:

| Technique | In conversation | In skills/system |
| --- | --- | --- |
| Ground rules | "Tell me things I need to know, even if I don't want to hear them" | Add to your project context: "Flag any analysis result that contradicts the golden dataset pattern" |
| Explicit challenge | "Steelman the opposite position" | Build a critic skill that reviews analysis output before it ships |
| Structured disagreement | "What am I not seeing? What could go wrong?" | Use multi-agent patterns where one agent evaluates another's work |

At the conversation level, these techniques counter the default agreeableness. At the system level, they become architectural — you design disagreement into the system rather than relying on a single agent to be both producer and critic.
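The producer/critic split can be sketched in a few lines. This is a minimal illustration, not a real implementation: `critique` stands in for a second agent (in practice an LLM call with its own critic prompt), and the two ground rules it checks are hypothetical examples drawn from the avalanche scenario above.

```python
from dataclasses import dataclass, field

@dataclass
class Briefing:
    danger_level: str               # e.g. "Moderate"
    problems: list[str]             # e.g. ["wind slab"]
    characterization: str           # prose summary the producer generated
    concerns: list[str] = field(default_factory=list)

def critique(briefing: Briefing) -> list[str]:
    """Critic skill: reviews the producer's output against ground rules
    instead of trusting it. Returns objections; empty list means pass."""
    concerns = []
    # Hypothetical ground rule: persistent weak layers are higher-consequence
    # than wind slab, so they must be called out explicitly.
    if "persistent weak layer" in briefing.problems and \
            "high-consequence" not in briefing.characterization:
        concerns.append("persistent weak layer not flagged as high-consequence")
    # Hypothetical ground rule: Moderate danger must not read as elevated risk.
    if briefing.danger_level == "Moderate" and "elevated" in briefing.characterization:
        concerns.append("Moderate danger characterized as elevated risk")
    return concerns

def review_pipeline(briefing: Briefing) -> Briefing:
    """Structured disagreement: output only ships clean if the critic
    has no objections; otherwise concerns travel with it for review."""
    briefing.concerns = critique(briefing)
    return briefing

flagged = review_pipeline(Briefing("Moderate", ["wind slab"], "elevated risk"))
print(flagged.concerns)  # the critic objects instead of agreeing
```

The design point is the separation: the producer never grades its own work, so its agreeableness cannot suppress the objection.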

Evals as Pushback

Here's the connection to Lift 2: your eval harness is the anti-obedience mechanism at the system level.

When a sycophantic agent follows flawed prompt instructions and generates a briefing that mischaracterizes conditions, the eval harness catches it — not because the agent pushed back, but because the golden dataset doesn't agree. The forecaster identified Moderate danger with low-likelihood problems, and the AI-generated briefing characterized it as elevated risk. The eval doesn't care how confident the system was. It compares against ground truth.
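A single eval check for this case might look like the sketch below. The golden-dataset entry, date, and field names are all illustrative; the point is only the shape of the comparison: ground truth in, AI characterization in, pass/fail out, with no reference to how confident the generating agent was.

```python
# Hypothetical golden dataset: forecaster-assessed ground truth keyed by date.
GOLDEN = {
    "2024-02-18": {"danger": "Moderate", "likelihood": "low"},
}

def eval_briefing(date: str, ai_danger: str, ai_risk_phrase: str) -> bool:
    """Return True if the AI briefing matches the forecaster's assessment.
    The eval never sees the agent's confidence, only its output."""
    truth = GOLDEN[date]
    # A Moderate-danger, low-likelihood day must not be described as elevated risk.
    if truth["danger"] == "Moderate" and truth["likelihood"] == "low":
        return "elevated" not in ai_risk_phrase.lower()
    return ai_danger == truth["danger"]

print(eval_briefing("2024-02-18", "Moderate", "elevated risk"))  # False: the eval flags it
```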

This is why evals matter beyond quality assurance. They're the structural counter to AI obedience. An agent that never questions its instructions is fine as long as something else questions its outputs. The golden dataset is that something else.

But evals are reactive — they catch wrong answers after the fact. For the decisions where you can't afford to be wrong even once (sending an alert to field teams about an avalanche threat), you need proactive disagreement. That's where orchestration patterns come in — Lift 4, Section 3.

Team Discussion: Where Does Obedience Create Risk?

Format: Team Discussion
Time: ~2 minutes

Think about the system you've built: data ingestion, analysis engine, alert generator, dashboard.

Discuss:

Where in your system would blind obedience be most dangerous?
If the analysis skill has a flawed instruction and the agent follows it perfectly, what's the worst outcome?
Where would you want the system to say "I followed the instructions, but the result doesn't look right" instead of silently producing output?
Is there a difference between obedience in implementation (doing what you asked) and obedience in judgment (agreeing with your assessment)?

Key Insight

AI obedience isn't a flaw in individual interactions — it's a structural risk in autonomous systems. When you stop reviewing every output, you need mechanisms that push back on your behalf. Evals are the reactive mechanism: they catch wrong answers against the golden dataset. Anti-obedience techniques — ground rules, explicit challenges, structured disagreement — are the proactive mechanism. The combination means your system doesn't just follow instructions well; it flags when following instructions produces the wrong result.