Observability for AI Agents: Tracing, Metrics, Logs and Evals
AI agent observability makes the inner workings of an autonomous agent visible: through tracing (spans across reasoning and tool calls), metrics (latency, tokens, cost, success rate), structured logs and continuous evals. It answers why an agent made a particular decision – and it is the prerequisite for being able to debug, secure and audit multi-step agents in production at all.
Key Takeaways
- Observability for agents rests on four pillars: tracing, metrics, logs and evals. Without tracing across reasoning and tool-call spans, multi-step agents are effectively un-debuggable.
- An agent trace is hierarchical: one root span per request, with nested spans below it for each LLM call, tool call, retrieval step and guardrail – including prompt, response, token count, latency and cost per span.
- Tool landscape (as of 2026): LangSmith (close to LangChain/LangGraph), Langfuse (open source, self-hostable in the EU), Arize Phoenix, as well as the vendor-neutral path via OpenTelemetry GenAI conventions and OpenInference.
- For DACH B2B, the data residency of the observability backend is a compliance topic in its own right: Langfuse self-hosted in the EU, Datadog EU or Honeycomb EU keep prompts and responses within the EU.
- Evals belong in observability: success rate, tool-call correctness and response quality are tagged against pinned model and prompt versions – mandatory for AI Act-relevant high-risk audits.
- Tracing is also a security signal: it provides the audit log for an agent's large blast radius and complements egress control and service-account hygiene from the security pillar.
Unlike a classic web service, an agent does not make a single request-response decision but instead goes through several rounds of reasoning per task with intermediate tool calls. It is precisely this multi-step nature that makes conventional monitoring blind: an HTTP 200 says nothing about whether the right tool was called with the right arguments or whether the model took a wrong turn halfway through.
The three key points up front:
- Observability for agents consists of four pillars – tracing, metrics, logs and evals. They interlock: the trace provides the path, metrics provide the aggregation, logs provide the context, evals provide the quality verdict.
- Without tracing, agents are un-debuggable. Only a hierarchical trace across all reasoning and tool-call spans shows at which step a multi-step chain broke.
- Data residency is a mandatory criterion. Since traces contain prompts and responses – and therefore potentially customer data – the observability backend is itself a compliance object for DACH B2B (Langfuse self-hosted in the EU, Datadog EU, Honeycomb EU).
Why agents are un-debuggable without tracing
An agent is non-deterministic. The same request can lead to different tool calls, different numbers of reasoning rounds and different results. If a run fails or produces a substantively incorrect answer, there is no reliable way to isolate the cause without tracing. Possible sources of error within a single request:
- The LLM reasoned incorrectly and chose an unsuitable tool.
- The tool was called with faulty arguments.
- The retrieval step (RAG) pulled the wrong documents.
- A guardrail or PII filter intervened and altered the context.
- A model version changed server-side (managed APIs update models on the provider's schedule).
A flat output log does not distinguish between these cases. Tracing makes each of these steps visible as its own span, ordered temporally and causally. This shifts debugging from "guessing based on the final result" to "finding the specific span that broke". The research dossier makes the same point at the architecture level: service meshes provide observability at the network level, and agent stacks made up of many microservices (orchestrator, tool server, memory store, vector DB, retrieval service, guardrail service) can only be managed with an end-to-end observability layer.
The four pillars of agent observability
Tracing – spans across reasoning and tool calls
A trace represents a complete agent request as a tree. The root span encompasses the entire request; nested below it hang spans for each LLM call, each tool call, each retrieval step and each guardrail check. Every span carries inputs and outputs as well as attributes such as model name, prompt and completion tokens, latency and – derived from these – cost. This hierarchical structure is the decisive difference from the classic, flat request log.
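The tree structure described above can be sketched as a small data type. This is an illustrative schema only, not the format of any particular tracing backend: the `Span` class, its field names and the prices in `PRICE_PER_1K_USD` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider and model.
PRICE_PER_1K_USD = {"prompt": 0.01, "completion": 0.03}


@dataclass
class Span:
    """One node of an agent trace (illustrative schema, not a vendor format)."""

    name: str
    kind: str  # e.g. "agent", "llm", "tool", "retrieval", "guardrail"
    latency_ms: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    children: List["Span"] = field(default_factory=list)

    def walk(self) -> Iterator["Span"]:
        """Yield this span and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def total_tokens(self) -> int:
        """Sum prompt and completion tokens over the whole subtree."""
        return sum(s.prompt_tokens + s.completion_tokens for s in self.walk())

    def cost_usd(self) -> float:
        """Derive the subtree cost from token counts and the price table."""
        prompt = sum(s.prompt_tokens for s in self.walk())
        completion = sum(s.completion_tokens for s in self.walk())
        return (prompt * PRICE_PER_1K_USD["prompt"]
                + completion * PRICE_PER_1K_USD["completion"]) / 1000


# Root span for the request, with one LLM call and one tool call nested below it.
root = Span("request", "agent", latency_ms=800, children=[
    Span("llm.reason", "llm", latency_ms=620, prompt_tokens=540, completion_tokens=80),
    Span("tool.crm_lookup", "tool", latency_ms=180),
])
print(root.total_tokens(), round(root.cost_usd(), 4))
```

Because cost is computed from the token attributes rather than stored separately, a price change only touches the price table, and any subtree can be costed on its own.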
Metrics – latency, tokens, cost, success rate
Metrics aggregate across many traces. Four signals form the core:
- Latency, end-to-end and per tool-call round. Relevant because, according to the dossier, an agent in Frankfurt calling a US-East API adds around 80–130 ms per direction – and with several tool rounds this multiplies.
- Tokens (prompt and completion) per span as the basis for cost attribution.
- Cost per request, per team and per use case.
- Success rate – the share of correctly completed tasks.
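A minimal sketch of how these four signals could be aggregated from per-trace summaries. The record layout in `TRACES`, the nearest-rank percentile and the helper names are assumptions for illustration. `network_overhead_ms` additionally assumes that every tool round costs one full round trip, so three rounds at 80–130 ms per direction would add roughly 480–780 ms.

```python
import math
from collections import defaultdict

# Hypothetical per-trace summaries, as a tracing backend might export them.
TRACES = [
    {"team": "support", "latency_ms": 4210, "cost_usd": 0.041, "success": True},
    {"team": "support", "latency_ms": 3900, "cost_usd": 0.038, "success": True},
    {"team": "sales", "latency_ms": 6100, "cost_usd": 0.060, "success": False},
    {"team": "sales", "latency_ms": 2800, "cost_usd": 0.025, "success": True},
]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct % of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces):
    """Aggregate latency, cost per team and success rate across traces."""
    cost_per_team = defaultdict(float)
    for t in traces:
        cost_per_team[t["team"]] += t["cost_usd"]
    latencies = [t["latency_ms"] for t in traces]
    return {
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
        "success_rate": sum(t["success"] for t in traces) / len(traces),
        "cost_per_team_usd": dict(cost_per_team),
    }


def network_overhead_ms(tool_rounds, one_way_ms):
    """Added latency if every tool round costs one full round trip (two directions)."""
    return tool_rounds * 2 * one_way_ms


summary = summarize(TRACES)
print(summary)
print(network_overhead_ms(3, 80), network_overhead_ms(3, 130))
```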
Logs – structured context
Structured logs supplement spans with detailed context: raw tool responses, retry attempts, guardrail triggers, truncated contexts. The key is correlation: every log entry should be attributable to a specific span via the trace ID, otherwise a blind spot reappears.
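One way to get this correlation with Python's standard `logging` module is to attach the IDs of the active span to every log entry. The sketch below is an assumption-laden example, not a backend integration: the logger name, the JSON field names and the span ID `span-05` are hypothetical, and an in-memory stream stands in for a real log shipper.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying trace and span IDs."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("agent")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output out of the root logger

# A LoggerAdapter adds the IDs of the active span to every entry it emits.
span_logger = logging.LoggerAdapter(
    logger, {"trace_id": "ag-7f3c", "span_id": "span-05"}
)
span_logger.info("retry %d after timeout", 1)

entry = json.loads(stream.getvalue().splitlines()[0])
print(entry)
```

With the IDs in every entry, a retry or guardrail hit found in the logs can be looked up in the trace, and a suspicious span can be expanded into its log lines.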
Evals – quality as part of observability
Evals are the signal that pure tracing does not provide: the systematic assessment of output quality. They typically assess success rate, tool-call correctness (was the right tool called with the right arguments?) and response quality, using heuristics, a reference dataset or "LLM-as-a-judge". Decisive for regulated contexts: evals and prompt versions are tagged against pinned model versions. The dossier names this pattern explicitly – prompt versions and evals are marked against specific model versions in the gateway or observability stack, among other things to prepare for AI Act high-risk audits.
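A minimal sketch of a tool-call correctness eval whose results are tagged with model and prompt version. The strict equality check, the `EvalResult` fields and the sample data are assumptions for illustration. Real harnesses often compare arguments more leniently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    trace_id: str
    tool_call_correct: bool
    model_version: str   # pinned model version the run used
    prompt_version: str  # prompt template version the run used


def check_tool_calls(expected, actual):
    """Pass only if the same tools were called, in order, with equal arguments."""
    return expected == actual


def pass_rate_by_version(results):
    """Share of passing evals per (model_version, prompt_version) pair."""
    totals, passed = defaultdict(int), defaultdict(int)
    for r in results:
        key = (r.model_version, r.prompt_version)
        totals[key] += 1
        passed[key] += r.tool_call_correct
    return {key: passed[key] / totals[key] for key in totals}


# Reference behaviour: void exactly this invoice (hypothetical test case).
expected = [("invoice_void", {"invoice": "RE-2026-0317"})]
wrong_args = [("invoice_void", {"invoice": "RE-2026-0318"})]

results = [
    EvalResult("t1", check_tool_calls(expected, expected), "gpt-4.x", "v7"),
    EvalResult("t2", check_tool_calls(expected, wrong_args), "gpt-4.x", "v7"),
    EvalResult("t3", check_tool_calls(expected, expected), "gpt-4.x", "v6"),
]
rates = pass_rate_by_version(results)
print(rates)
```

Grouping by version pair is what makes a regression attributable: a drop in pass rate can be traced to a specific prompt or model change rather than to the agent as a whole.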
Signal-to-tool matrix
The following table maps each observability signal to its measured quantity and a typical tool (tool mentions from the research dossier; as of 2026).
| Signal | What to measure | Tool (examples) |
|---|---|---|
| Tracing | Span tree across reasoning and tool-call steps; prompt/response, nesting, duration per span | LangSmith, Langfuse, Arize Phoenix; vendor-neutral via OpenTelemetry GenAI conventions / OpenInference |
| Latency | End-to-end and per tool-call round; tail latency | Langfuse, LangSmith, Datadog EU / Honeycomb EU |
| Tokens | Prompt and completion tokens per span | Langfuse, LangSmith (token-level cost attribution) |
| Cost | Cost per request, team, use case | Langfuse, LangSmith; aggregation often in the AI gateway (LiteLLM, Portkey) |
| Success rate / quality | Eval scores, tool-call correctness, response quality against a pinned model version | Eval harness in Langfuse / Arize Phoenix / LangSmith |
| Logs | Raw tool responses, retries, guardrail hits, truncated contexts | Structured logs, correlated via trace ID |
| Guardrails / PII | Triggered filters, redaction events | AI gateway (Portkey, LiteLLM) plus trace annotation |
The LLM observability stack: tools at a glance
For the observability layer, the research dossier names a clear selection (as of 2026):
- LangSmith – the tracing and eval platform from the LangChain/LangGraph ecosystem, as a managed service. Strong when the agent is already built on this framework; listed in the dossier as a managed option alongside Datadog and Honeycomb.
- Langfuse – open source and therefore self-hostable in the EU (your own data centre or an EU region). In the dossier, it is listed both for the regulated sovereign architecture (customer-controlled observability) and for the lean cloud startup pattern ("Langfuse self-hosted") as a GDPR-compliant path.
- Arize Phoenix – open-source observability and evaluation, named in the dossier as a self-hostable alternative alongside Langfuse.
- OpenTelemetry for LLMs / OpenInference – the vendor-neutral trace standard. Instrumented via GenAI semantic conventions, the backend remains interchangeable – an important argument against lock-in. Hugging Face's TGI, for example, already exported OpenTelemetry and Prometheus before it entered maintenance mode in December 2025.
- Datadog EU / Honeycomb EU – established APM backends with an EU data residency option, listed in the dossier as managed paths for DACH residency.
An important architectural point: the AI gateway (LiteLLM, Portkey, Kong) often already carries part of the observability load. According to the dossier, gateways handle multi-provider failover, virtual key management, team budgets, observability, guardrails and PII redaction. In practice, the gateway attributes tokens and cost, while the tracing platform holds the reasoning path and the evals – the two together produce the complete picture.
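How the two views combine can be sketched as a join on a shared identifier. This assumes the gateway and the tracing platform both record the same trace ID, which is an assumption of this example rather than a documented behaviour of any of the products named above. The record layouts are hypothetical as well.

```python
# Hypothetical export from the AI gateway: tokens and cost per request.
gateway_usage = [
    {"trace_id": "ag-7f3c", "team": "support", "tokens": 3320, "cost_usd": 0.041},
    {"trace_id": "ag-9a11", "team": "sales", "tokens": 1200, "cost_usd": 0.015},
]

# Hypothetical export from the tracing platform: the span path per trace.
trace_paths = {
    "ag-7f3c": ["llm.reason", "tool.crm_lookup", "retrieval.vector",
                "llm.reason", "tool.invoice_void", "guardrail.pii", "llm.compose"],
}


def merge_views(usage_records, paths):
    """Attach the reasoning path to each gateway usage record by trace ID."""
    merged = []
    for record in usage_records:
        # An empty path marks a request the tracing platform never saw.
        merged.append({**record, "path": paths.get(record["trace_id"], [])})
    return merged


merged = merge_views(gateway_usage, trace_paths)
untraced = [m["trace_id"] for m in merged if not m["path"]]
print(untraced)
```

Requests that appear at the gateway but have no trace are themselves a useful signal: they point to uninstrumented code paths.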
Example trace: an agent run, broken down
The following pseudo-example shows how a single support-agent run looks as a span tree. The numbers are illustrative; the latency magnitude for transatlantic calls (80–130 ms/direction) comes from the dossier.
```
TRACE id=ag-7f3c "Customer request: cancel invoice" total: 4,210 ms | 3,320 tokens | 0.041 USD
├─ SPAN llm.reason model=gpt-4.x 620 ms | in 540 / out 80 tok "Plan: look up customer+invoice"
├─ SPAN tool.crm_lookup mTLS, in-VPC 180 ms | status=200 args={customer_id:8842}
├─ SPAN retrieval.vector qdrant-eu 95 ms | 4 hits query="B2B cancellation policy"
├─ SPAN llm.reason model=gpt-4.x 710 ms | in 1,980 / out 130 tok "Cancellation permitted, call tool"
├─ SPAN tool.invoice_void in-VPC 240 ms | status=200 args={invoice:RE-2026-0317}
├─ SPAN guardrail.pii redaction 40 ms | 0 hits
└─ SPAN llm.compose model=gpt-4.x 2,325 ms | in 380 / out 210 tok response text to customer
EVAL tool_call_correct=PASS | answer_quality=0.92 | tagged: model=gpt-4.x, prompt=v7
```
What this trace achieves that an output log cannot: had the cancellation failed, the tree would immediately show whether tool.invoice_void returned an error status, whether retrieval.vector pulled the wrong policy or whether the second llm.reason span decided incorrectly. The largest latency item (llm.compose, 2,325 ms) is immediately recognisable as an optimisation candidate. And the eval line ties the quality verdict to the model and prompt version – the basis for rollback and audit.
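The two questions from this paragraph, which span failed and which span is slowest, can be answered mechanically once the trace is available as data. The sketch below encodes the span names and latencies from the example trace above. The dictionary layout and the rule "status of 400 or higher counts as failed" are assumptions for illustration.

```python
# Spans of the example trace; "status" is only set where the trace shows one.
SPANS = [
    {"name": "llm.reason", "latency_ms": 620},
    {"name": "tool.crm_lookup", "latency_ms": 180, "status": 200},
    {"name": "retrieval.vector", "latency_ms": 95},
    {"name": "llm.reason", "latency_ms": 710},
    {"name": "tool.invoice_void", "latency_ms": 240, "status": 200},
    {"name": "guardrail.pii", "latency_ms": 40},
    {"name": "llm.compose", "latency_ms": 2325},
]


def slowest_span(spans):
    """Return the span with the highest latency."""
    return max(spans, key=lambda s: s["latency_ms"])


def first_failed_span(spans):
    """Return the first span with an error status (>= 400), or None."""
    for span in spans:
        if span.get("status", 0) >= 400:
            return span
    return None


total_ms = sum(s["latency_ms"] for s in SPANS)
print(total_ms, slowest_span(SPANS)["name"], first_failed_span(SPANS))

# Simulated failure: the same run, but the void call returns HTTP 500.
failed_run = [dict(s, status=500) if s["name"] == "tool.invoice_void" else s
              for s in SPANS]
print(first_failed_span(failed_run)["name"])
```

The per-span latencies add up to the 4,210 ms reported on the root line of the trace, which is a useful consistency check when spans are recorded sequentially.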
Relationship to monitoring in the security pillar
Tracing is not only a debugging signal but also a security signal. Agents have an unusually large "blast radius": a compromised agent can call many tools. The end-to-end trace provides exactly the audit log that makes this attack surface traceable – which agent called which tool with which identity and when. This complements the controls from the security and identity context: deny-by-default egress with an allowlist and logging at the gateway, one service account per agent-tool pair instead of shared accounts, and the binding of every call back to the user identity via a token-exchange chain. Observability at the network level via a service mesh (mTLS, workload identity, traffic shaping) closes the loop. Detailed measurement and control patterns for egress, identity and monitoring are covered in the security pillar; the observability layer provides the telemetry basis for this. The specific token economics and cost modelling, in turn, are the subject of the FinOps pillar.
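As a rough sketch, the audit view can be derived from tool spans by reducing each one to a who-called-what-when record and checking it against a deny-by-default allowlist. The span fields (`agent`, `user`, `tool`, `started_at`), the agent name and the allowlist contents are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical allowlist: which tools each agent identity may call.
ALLOWED_TOOLS = {"support-agent": {"crm_lookup", "invoice_void"}}


def audit_records(tool_spans):
    """Reduce tool spans to audit records and flag allowlist violations."""
    records = []
    for span in tool_spans:
        # Deny by default: an unknown agent has an empty allowlist.
        allowed = span["tool"] in ALLOWED_TOOLS.get(span["agent"], set())
        records.append({
            "timestamp": span["started_at"].isoformat(),
            "agent": span["agent"],
            "user": span["user"],
            "tool": span["tool"],
            "allowed": allowed,
        })
    return records


start = datetime(2026, 3, 17, 9, 0, tzinfo=timezone.utc)
tool_spans = [
    {"agent": "support-agent", "user": "u-8842", "tool": "invoice_void", "started_at": start},
    {"agent": "support-agent", "user": "u-8842", "tool": "payroll_export", "started_at": start},
]
records = audit_records(tool_spans)
violations = [r for r in records if not r["allowed"]]
print(violations)
```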
For agencies and B2B decision-makers
For marketing agencies building agent workflows for clients, observability is the deliverable that separates a pilot from an auditable production system: only trace, eval and an EU-compliant backend make an agent operable, billable and defensible with regard to client data protection. For DACH B2B decision-makers, the rule is: choose the observability backend just as deliberately as the cloud region – self-hosted Langfuse, Datadog EU or Honeycomb EU keep prompts and responses within the EU, and vendor-neutral OpenTelemetry tracing protects against lock-in. Blck Alpaca from Vienna designs and operates agent infrastructure with this observability layer from the outset – including tracing, an eval harness and a GDPR-compliant backend. Get in touch if you want to move an agent out of pilot status into auditable production operation.
FAQ
What is the difference between observability for AI agents and classic APM?
Why are AI agents un-debuggable without tracing?
Which observability tools are suitable for DACH companies with data residency requirements?
Do evaluations (evals) belong to observability, or are they separate disciplines?
Which metrics should you capture at a minimum for a production agent?