System Health & Observability

From Data Point Traces to Factory-Level Metrics

In Lift 1, you learned structured logging with trace IDs — a unique identifier that follows a data point from ingestion through analysis to alert decision. When an eval fails, you search for that trace ID and see exactly what happened at each step. That observability operates at the data level: one forecast, one trace, one diagnosis.

Factory-level observability operates at the system level. The question isn't "what happened to this data point?" but "is the factory healthy?" A factory producing correct output today can be degrading in ways that won't surface until tomorrow — and by then, the remediation cost has compounded.

Three Forms of Drift

Drift is the gradual divergence between what the system does and what it should do. In Lift 2, you learned about agentic drift — parallel workstreams encoding incompatible assumptions that survive clean merges. At the factory level, three forms of drift threaten system health:

| Drift Type | What Changes | How It Manifests | Detection Strategy |
| --- | --- | --- | --- |
| Data drift | Input distributions shift | CAIC changes their data format, adds new fields, or alters how they encode danger levels | Schema validation; distribution monitoring on incoming data |
| Context drift | Context files become stale | A recent refactor changed the normalization approach, but the subdirectory context still describes the old one | Git commit analysis against context file references; context staleness detection |
| Agentic drift | Parallel workstreams encode divergent assumptions | The CAIC pipeline and UAC pipeline independently evolve their error handling in incompatible directions | Cross-pipeline eval harnesses that test unified output, not just individual pipeline correctness |

The system-level implication: you need observability at three different levels simultaneously — data health (are inputs still what you expect?), context health (are instructions still accurate?), and output health (are results still correct?). Any one of these can degrade while the others remain stable.
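One of these checks can be sketched concretely. The snippet below is a minimal illustration of the data-health level only, assuming a hypothetical flat-dictionary report format and an invented expected field set; real schema validation would use a richer schema language.

```python
# Hypothetical expected schema for an incoming CAIC report.
# The field names here are illustrative assumptions, not the real CAIC format.
EXPECTED_CAIC_FIELDS = {"zone", "danger_level", "issued_at", "elevation_band"}

def check_data_health(report: dict) -> list[str]:
    """Data-drift check: flag missing or unexpected fields in a report.

    An empty result means the report still matches the expected schema;
    any entries are early warnings that the upstream format has shifted.
    """
    fields = set(report)
    problems = []
    for missing in sorted(EXPECTED_CAIC_FIELDS - fields):
        problems.append(f"missing field: {missing}")
    for extra in sorted(fields - EXPECTED_CAIC_FIELDS):
        problems.append(f"unexpected field: {extra}")
    return problems
```

A report that drops `elevation_band` and adds a new `wind_slab` field would produce two findings, surfacing the drift before any downstream eval fails.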

Pipeline Health Metrics

Traditional software metrics (uptime, latency, error rates) don't capture factory health. The metrics that matter for an AI-assisted development factory are:

| Metric | What It Measures | Healthy Signal | Degrading Signal |
| --- | --- | --- | --- |
| Eval pass rate trend | Are evals passing more or less often over time? | Stable or improving (the ratchet) | Declining: new failures appear faster than fixes |
| Remediation cycle time | How long from eval failure to confirmed fix? | Shrinking: the factory gets faster at self-repair | Growing: the remediation queue is backing up |
| Quality gate failure rate | What percentage of changes fail gates? | Stable and low | Spiking: either code quality is degrading or gates are miscalibrated |
| Drift detection frequency | How often does drift detection fire? | Occasional, quickly resolved | Frequent and persistent: upstream changes or architectural rot |
| Golden dataset coverage | What percentage of the system's behavior is covered by evals? | Growing as new scenarios are added | Stagnant: new features ship without eval coverage |
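The first metric, the eval pass rate trend, can be computed with a simple rolling-window comparison. This is a sketch under assumptions: the run records (`passed`/`total` counts per eval run) and the window size are invented for illustration, not a prescribed format.

```python
def pass_rate(runs: list[dict]) -> float:
    """Aggregate pass rate across a list of eval runs."""
    passed = sum(r["passed"] for r in runs)
    total = sum(r["total"] for r in runs)
    return passed / total if total else 0.0

def pass_rate_trend(history: list[dict], window: int = 5) -> str:
    """Compare the most recent window of runs against the window before it.

    The ratchet is healthy when the rate holds steady or improves;
    a drop means new failures are appearing faster than fixes.
    """
    if len(history) < 2 * window:
        return "insufficient data"
    recent = pass_rate(history[-window:])
    prior = pass_rate(history[-2 * window:-window])
    return "healthy" if recent >= prior else "degrading"
```

The point of the trend (rather than a single snapshot) is that a factory passing 90% of evals today can still be degrading if it passed 95% last week.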

The DORA research reinforces this: the 2024 report associates a 25% increase in AI adoption with a 7.2% decrease in delivery stability. Faster output without pipeline-level quality measurement produces the illusion of productivity while technical debt accumulates beneath the surface.

Leading vs. Lagging Indicators

The metrics above split into two categories that determine whether you're reacting or anticipating:

Lagging indicators tell you what already happened: eval failures, production incidents, drift that reached users. They're important for accountability but too late for prevention.

Leading indicators tell you what's about to happen: increasing remediation cycle time (the queue is backing up), declining golden dataset coverage (new features ship without eval protection), rising quality gate failure rates (code quality is trending down). These are the metrics that let you intervene before the factory degrades.

The factory-level discipline: instrument leading indicators, alert on lagging indicators, and investigate the gap between them.
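A leading-indicator check follows the same rolling-window shape. The sketch below flags a growing remediation queue; the cycle-time series, window size, and tolerance threshold are all illustrative assumptions you would tune for your own factory.

```python
from statistics import mean

def remediation_trend(cycle_hours: list[float],
                      window: int = 5,
                      tolerance: float = 1.2) -> str:
    """Leading-indicator check on remediation cycle time.

    Flags "intervene" when the mean of the most recent window exceeds
    the prior window's mean by more than `tolerance` (20% by default),
    i.e. the remediation queue is backing up before users notice.
    """
    if len(cycle_hours) < 2 * window:
        return "insufficient data"
    recent = mean(cycle_hours[-window:])
    prior = mean(cycle_hours[-2 * window:-window])
    return "intervene" if recent > tolerance * prior else "stable"
```

Alerting on this trend rather than on individual failures is the difference between anticipating degradation and confirming it after the fact.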

Team Discussion: What Does Factory Health Look Like?

Format: Team Discussion
Time: ~3 minutes

Your multi-center platform has two center pipelines (CAIC and UAC), a shared normalization layer, a unified dashboard, and a shared skills library. Each component can independently be healthy or degrading.

Discuss:

- What metrics would you put on a factory health dashboard? Which are leading indicators (predicting problems) and which are lagging (confirming problems)?
- If the CAIC pipeline's eval pass rate drops while the UAC pipeline's stays stable, what does that tell you, and what's the first thing you investigate?
- Now consider the harder question: what if both pipelines pass their individual evals but the unified dashboard shows incorrect cross-center comparisons? Where in the observability stack does that failure surface?

Key Insight

Factory-level observability extends the trace ID concept from individual data points to the factory itself. Three forms of drift — data, context, and agentic — can independently degrade system health, and each requires a different detection strategy. The metrics that matter are not just whether the factory is working now, but whether it's getting better or worse — the trend is the signal. Leading indicators let you intervene before degradation reaches users; lagging indicators confirm what you should already know.