The Agent Observability Stack: What to Measure Before Production
Agent logs are not enough. Production teams need traces, policy decisions, tool latency, cost attribution, and eval outcomes in one timeline.

Logs are the floor, not the system
The first version of most agent observability programs is a prompt log and a response log. That is useful for debugging a bad answer, but it is not enough to operate agents that can touch production systems.
A real observability stack treats the agent run as a distributed trace. The model call is one span. Tool selection, retrieval, policy checks, retries, and downstream API calls belong in the same timeline.
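To make the trace idea concrete, here is a minimal sketch that uses only the Python standard library. The names `RunTrace` and `Span`, the attribute keys, and the example tool and model names are all hypothetical, chosen for illustration. A production system would more likely use an established tracing library than hand-rolled classes.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    """One timed step of an agent run (hypothetical structure)."""
    name: str
    run_id: str
    span_id: str
    parent_id: Optional[str]
    start: float
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)


class RunTrace:
    """Collects every span of one agent run under a single run id."""

    def __init__(self) -> None:
        self.run_id = uuid.uuid4().hex
        self.spans: list[Span] = []
        self._stack: list[Span] = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, **attributes) -> Iterator[Span]:
        # The innermost open span, if any, becomes the parent.
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(
            name=name,
            run_id=self.run_id,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent_id,
            start=time.monotonic(),
            attributes=dict(attributes),
        )
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            # Close the span even if the wrapped step raised.
            current.end = time.monotonic()
            self._stack.pop()


# Illustrative run: the model call is one span among several.
trace = RunTrace()
with trace.span("agent_run", request="refund status"):
    with trace.span("model_call", model="hypothetical-model"):
        pass  # call the model here
    with trace.span("policy_check", decision="allow"):
        pass  # evaluate policy here
    with trace.span("tool_call", tool="orders_api", attempt=1):
        pass  # call the downstream API here
```

Because every span carries the same `run_id` and a `parent_id`, the model call, policy check, and tool call can be rendered as one timeline rather than as three unrelated log lines.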
The five signals that matter
AIRMY teams measure outcome quality, tool latency, policy decisions, token cost, and user-visible completion time.
An HTTP 200 from a tool call does not mean the agent chose the right data source. Observability has to capture both infrastructure behavior and agent behavior.
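One way to picture the gap between infrastructure health and agent behavior is a per-run record that holds all five signals side by side. The sketch below is an assumption about how such a record might look. The field names, the `data_source` attribute, and the example values are hypothetical and are not taken from any AIRMY schema.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    http_status: int
    latency_ms: float
    data_source: str  # which backend the agent actually read from


@dataclass
class RunSignals:
    """The five signals for one run (hypothetical field names)."""
    run_id: str
    outcome_passed: bool          # outcome quality, e.g. an eval verdict
    tool_calls: list              # tool latency lives on each ToolCall
    policy_decisions: list        # e.g. ["allow", "deny"]
    token_cost_usd: float
    completion_time_s: float      # user-visible completion time


def infra_healthy(run: RunSignals) -> bool:
    """Infrastructure view: every tool call returned a 2xx status."""
    return all(200 <= c.http_status < 300 for c in run.tool_calls)


def wrong_source_calls(run: RunSignals, expected_source: str) -> list:
    """Agent view: calls that succeeded or failed against the wrong source."""
    return [c for c in run.tool_calls if c.data_source != expected_source]


# Illustrative values only: a call that returns 200 from the wrong backend.
run = RunSignals(
    run_id="run-1",
    outcome_passed=False,
    tool_calls=[ToolCall("orders_lookup", 200, 84.0, "orders_archive")],
    policy_decisions=["allow"],
    token_cost_usd=0.012,
    completion_time_s=2.3,
)
```

In this example `infra_healthy(run)` is true while `wrong_source_calls(run, "orders_db")` still returns the offending call, which is the case a status-code dashboard alone would miss.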
From traces to operating decisions
Once every run has a stable ID, the product, finance, security, and engineering teams can all reason from the same timeline.
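As a rough sketch of what a shared timeline could mean in practice, the snippet below joins events emitted by different teams' systems on a common run id and orders them by timestamp. The event shape, the `source` and `kind` values, and the run ids are hypothetical.

```python
from collections import defaultdict

# Illustrative events, deliberately out of order, as they might arrive
# from separate systems. Only run_id and ts are needed to join them.
events = [
    {"run_id": "run-1", "ts": 1.5, "source": "finance", "kind": "token_cost"},
    {"run_id": "run-1", "ts": 0.0, "source": "engineering", "kind": "model_call"},
    {"run_id": "run-2", "ts": 0.2, "source": "engineering", "kind": "model_call"},
    {"run_id": "run-1", "ts": 1.6, "source": "product", "kind": "eval_outcome"},
    {"run_id": "run-1", "ts": 0.4, "source": "security", "kind": "policy_decision"},
    {"run_id": "run-1", "ts": 0.9, "source": "engineering", "kind": "tool_call"},
]


def timeline_by_run(all_events: list) -> dict:
    """Group events by run id, then sort each group by timestamp."""
    runs = defaultdict(list)
    for event in all_events:
        runs[event["run_id"]].append(event)
    return {
        run_id: sorted(group, key=lambda e: e["ts"])
        for run_id, group in runs.items()
    }


timelines = timeline_by_run(events)
for event in timelines["run-1"]:
    print(f'{event["ts"]:>4} {event["source"]:<12} {event["kind"]}')
```

The join only works if every system records the same run id, which is why the id has to be stable and propagated through each tool call and policy check.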
The teams that succeed ask what happened across the whole run, not only what the model said.
Elena Park
Observability Lead, AIRMY. Writes about production-grade agent infrastructure, governance, and platform operations.