
Observability Across Agents

The Tracing Problem

Your pipeline has multiple stages: data ingestion pulls from UAC, NWS, and SNOTEL; the analysis engine correlates that data; the alert generator makes dispatch decisions. When you're running these stages across parallel agents, a question surfaces: when something goes wrong, how do you find where it went wrong?

An eval tells you the Salt Lake danger rating is wrong. But was it the forecast data that was malformed? Did the analysis engine misweight the snowpack data? Did the alert threshold logic apply the wrong rule? Without visibility into the pipeline, an eval failure is a black box — you know the answer is wrong, but not why.

Structured Logging

The solution is structured logging — machine-readable log entries (typically JSON) that include enough metadata to trace a data point through the entire system.

Unstructured log:

Processing Salt Lake zone... done.
Analysis complete: Considerable.
Alert generated.

Structured log:

{
  "timestamp": "2026-03-06T07:15:00Z",
  "trace_id": "abc-123-def",
  "span": "analysis",
  "zone": "salt-lake",
  "danger_rating": "Considerable",
  "danger_level": 3,
  "problems_identified": ["Wind Drifted Snow", "New Snow", "Persistent Weak Layer"],
  "data_sources": {
    "forecast": "success",
    "weather": "success",
    "snowpack": "success"
  },
  "duration_ms": 450
}

The structured version is queryable: filter by zone, find all entries with a specific trace ID, aggregate duration across stages, identify which data sources failed. The unstructured version is only useful to a human reading it sequentially.
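A minimal sketch of what emitting and querying such entries can look like in Python. The helper name `log_event` and the fields shown are illustrative, not a specific library's API; the entries mirror the JSON example above.

```python
import json
import sys
from datetime import datetime, timezone

def log_event(trace_id, span, **fields):
    # Illustrative helper: emit one structured entry as a JSON line on stdout.
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "span": span,
        **fields,
    }
    print(json.dumps(entry), file=sys.stdout)
    return entry

entries = [
    log_event("abc-123-def", "ingest.forecast", zone="salt-lake", duration_ms=120),
    log_event("abc-123-def", "analysis", zone="salt-lake", danger_level=3, duration_ms=450),
    log_event("xyz-789", "analysis", zone="ogden", danger_level=2, duration_ms=390),
]

# Because every entry is machine-readable, filtering is a one-liner:
salt_lake = [e for e in entries if e.get("zone") == "salt-lake"]
total_ms = sum(e["duration_ms"] for e in salt_lake)
```

The same filter works for trace IDs, failed data sources, or any other field, which is exactly what the unstructured version cannot offer.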

Trace IDs: Following a Data Point

A trace ID is a unique identifier assigned at the start of a workflow that follows a data point through every stage. When the Salt Lake forecast enters the system, it gets a trace ID. That same ID appears on the ingestion log, the analysis log, and the alert decision log. When an eval fails, you search for that trace ID and see exactly what happened at each step.

The concept maps directly to how modern observability systems work. OpenTelemetry — the industry standard for telemetry data — defines spans (individual operations) connected by trace IDs into a complete picture of a request's journey through a system. Each span records what happened, how long it took, and whether it succeeded.
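A hand-rolled sketch of the same idea, not the OpenTelemetry API itself: a trace ID generated at workflow start, with each stage wrapped in a span that records its name, duration, and outcome. The `span` context manager and the `log` list are illustrative assumptions.

```python
import time
import uuid
from contextlib import contextmanager

def new_trace_id():
    # A unique ID assigned once, at the start of the workflow.
    return str(uuid.uuid4())

@contextmanager
def span(trace_id, name, log):
    # Record one operation: what happened, how long it took, whether it succeeded.
    start = time.monotonic()
    record = {"trace_id": trace_id, "span": name}
    try:
        yield record  # stages attach extra fields to the record
        record["status"] = "success"
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = round((time.monotonic() - start) * 1000, 1)
        log.append(record)

log = []
trace_id = new_trace_id()

# The same trace_id flows through every stage of the pipeline.
with span(trace_id, "ingest.forecast", log) as rec:
    rec["zone"] = "salt-lake"
with span(trace_id, "analysis", log) as rec:
    rec["danger_level"] = 3
```

Searching `log` for one `trace_id` then yields the complete journey of that data point, span by span.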

For your project, the key spans might be:

| Span | What It Captures |
| --- | --- |
| ingest.forecast | Did the UAC API return valid data? How long did it take? |
| ingest.weather | Did NWS return data for the right coordinates? |
| ingest.snowpack | Did SNOTEL return readings? (Not all zones have stations) |
| analyze.danger | What danger rating did the parser extract? What did the AI analysis layer generate? |
| alert.decide | What alert action was taken? Which threshold rule matched? |

When an eval says the AI-generated briefing for Ogden mischaracterizes conditions — it highlights elevated risk when the forecaster said "Moderate" — you pull the trace and see: the SNOTEL snowpack data showed a rapid depth increase (29" to 37" in a week), and the AI analysis prompt interpreted that as an escalation signal without checking the forecaster's published assessment. Now you know the problem — and you know what to fix.
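That diagnosis step can be sketched as a simple query over the log file. The log lines below are hypothetical examples in the shape of the Ogden scenario, and `pull_trace` is an illustrative helper, assuming logs are stored as JSON lines.

```python
import json

# Hypothetical log lines, as they might appear in a JSON-lines log file.
raw_logs = """\
{"timestamp": "2026-03-06T07:14:58Z", "trace_id": "abc-123-def", "span": "ingest.snowpack", "zone": "ogden", "depth_in": [29, 37]}
{"timestamp": "2026-03-06T07:15:00Z", "trace_id": "abc-123-def", "span": "analyze.danger", "zone": "ogden", "rating": "Elevated"}
{"timestamp": "2026-03-06T07:14:55Z", "trace_id": "other-456", "span": "ingest.forecast", "zone": "moab"}
"""

def pull_trace(lines, trace_id):
    # Return every entry for one trace, ordered by timestamp.
    entries = [json.loads(line) for line in lines.splitlines() if line.strip()]
    matched = [e for e in entries if e["trace_id"] == trace_id]
    return sorted(matched, key=lambda e: e["timestamp"])

trace = pull_trace(raw_logs, "abc-123-def")
for entry in trace:
    print(entry["span"], entry.get("rating") or entry.get("depth_in"))
```

Reading the trace in order shows the rapid depth increase entering at `ingest.snowpack` and the escalated rating leaving `analyze.danger`, which localizes the fault to the analysis prompt rather than the ingestion stage.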

Team Discussion: Where Would You Trace?

Format: Team Discussion. Time: ~2 minutes.

Think about the system you're building: data ingestion from three APIs, an analysis engine, and an alert generator.

Discuss: If an eval fails — the AI analysis layer generates a briefing that overstates conditions compared to the forecaster's published assessment — where in the pipeline would you want visibility? What metadata would you log at each stage? What would help you diagnose the failure fastest? What happens when one of the three data sources is unavailable (Skyline and Moab have no SNOTEL stations)?

Key Insight

Evals tell you what is wrong — the system produced the wrong answer. Structured logging with trace IDs tells you where and why it went wrong. Without observability, every eval failure requires manual investigation across the entire pipeline. With trace IDs, you follow the data point from ingestion through analysis to alert decision and pinpoint exactly where the system diverged from the expected output. The investment in structured logging pays for itself the first time a parallel workstream produces a wrong answer and you need to find out why in minutes, not hours.