10.4 · Intermediate · 8 min

Observability for AI Agents: Tracing, Metrics, Logs and Evals

Blck Alpaca · 9 June 2026

Definition

AI agent observability makes the inner workings of an autonomous agent visible: through tracing (spans across reasoning and tool calls), metrics (latency, tokens, cost, success rate), structured logs and continuous evals. It answers why an agent made a particular decision, and it is the prerequisite for being able to debug, secure and audit multi-step agents in production at all.

Key Takeaways

✓ Observability for agents rests on four pillars: tracing, metrics, logs and evals. Without tracing across reasoning and tool-call spans, multi-step agents are effectively un-debuggable.
✓ An agent trace is hierarchical: one root span per request, with nested spans below it for each LLM call, tool call, retrieval step and guardrail, each carrying prompt, response, token count, latency and cost.
✓ Tool landscape (as of 2026): LangSmith (close to LangChain/LangGraph), Langfuse (open source, self-hostable in the EU), Arize Phoenix, plus the vendor-neutral path via OpenTelemetry GenAI conventions and OpenInference.
✓ For DACH B2B, the data residency of the observability backend is a compliance topic in its own right: Langfuse self-hosted in the EU, Datadog EU or Honeycomb EU keep prompts and responses within the EU.
✓ Evals belong in observability: success rate, tool-call correctness and response quality are tagged against pinned model and prompt versions, which is mandatory for AI Act-relevant high-risk audits.
✓ Tracing is also a security signal: it provides the audit log for an agent's large blast radius and complements egress control and service-account hygiene from the security pillar.

Unlike a classic web service, an agent does not make a single request-response decision but instead goes through several rounds of reasoning per task with intermediate tool calls. It is precisely this multi-step nature that makes conventional monitoring blind: an HTTP 200 says nothing about whether the right tool was called with the right arguments or whether the model took a wrong turn halfway through.

The three key points up front:

Observability for agents consists of four pillars: tracing, metrics, logs and evals. They interlock: the trace provides the path, metrics provide the aggregation, logs provide the context, evals provide the quality verdict.
Without tracing, agents are un-debuggable. Only a hierarchical trace across all reasoning and tool-call spans shows at which step a multi-step chain broke.
Data residency is a mandatory criterion. Since traces contain prompts and responses, and therefore potentially customer data, the observability backend is itself a compliance object for DACH B2B (Langfuse self-hosted in the EU, Datadog EU, Honeycomb EU).

Why agents are un-debuggable without tracing

An agent is non-deterministic. The same request can lead to different tool calls, different numbers of reasoning rounds and different results. If a run fails or produces a substantively incorrect answer, there is no reliable way to isolate the cause without tracing. Possible sources of error within a single request:

The LLM reasoned incorrectly and chose an unsuitable tool.
The tool was called with faulty arguments.
The retrieval step (RAG) pulled the wrong documents.
A guardrail or PII filter intervened and altered the context.
A model version changed server-side (managed APIs update models on the provider's schedule).

A flat output log does not distinguish between these cases. Tracing makes each of these steps visible as its own span, ordered temporally and causally. This shifts debugging from "guessing based on the final result" to "finding the specific span that broke". The research dossier also anchors this view architecturally: service meshes provide observability at the network level, and agent stacks made up of many microservices (orchestrator, tool server, memory store, vector DB, retrieval service, guardrail service) can only be managed with an end-to-end observability layer.

The four pillars of agent observability

Tracing: spans across reasoning and tool calls

A trace represents a complete agent request as a tree. The root span encompasses the entire request; nested below it hang spans for each LLM call, each tool call, each retrieval step and each guardrail check. Every span carries inputs and outputs as well as attributes such as model name, prompt and completion tokens, latency and, derived from these, cost. This hierarchical structure is the decisive difference from the classic, flat request log.
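The tree structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the schema of any particular tracing product: the `Span` class, its field names and the per-1k-token prices are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One node in a trace tree (hypothetical structure, not a vendor schema)."""
    name: str
    kind: str  # e.g. "root", "llm", "tool", "retrieval", "guardrail"
    latency_ms: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    children: list["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        """Tokens of this span plus all nested spans."""
        own = self.prompt_tokens + self.completion_tokens
        return own + sum(child.total_tokens() for child in self.children)

    def cost_usd(self, usd_per_1k_prompt: float, usd_per_1k_completion: float) -> float:
        """Cost derived from token counts; the prices are supplied by the caller."""
        own = (self.prompt_tokens * usd_per_1k_prompt
               + self.completion_tokens * usd_per_1k_completion) / 1000
        return own + sum(
            child.cost_usd(usd_per_1k_prompt, usd_per_1k_completion)
            for child in self.children
        )

    def walk(self, depth: int = 0):
        """Yield (depth, span) pairs in tree order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


root = Span("support_request", "root", children=[
    Span("llm.reason", "llm", latency_ms=620, prompt_tokens=540, completion_tokens=80),
    Span("tool.crm_lookup", "tool", latency_ms=180),
    Span("llm.compose", "llm", latency_ms=2325, prompt_tokens=380, completion_tokens=210),
])

for depth, span in root.walk():
    print("  " * depth + f"{span.name} ({span.kind}, {span.latency_ms} ms)")
```

Because tokens and cost roll up recursively, the root span answers "what did this request cost?" while each child still shows where the cost arose.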

Metrics: latency, tokens, cost, success rate

Metrics aggregate across many traces. Four signals form the core:

Latency, end-to-end and per tool-call round. Relevant because, according to the dossier, an agent in Frankfurt calling a US-East API adds around 80–130 ms per direction, and with several tool rounds this multiplies.
Tokens (prompt and completion) per span as the basis for cost attribution.
Cost per request, per team and per use case.
Success rate: the share of correctly completed tasks.
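A rough sketch of how these four signals might be aggregated from per-request summaries follows. The record layout, team names and figures are invented for illustration; only the 80–130 ms per direction comes from the dossier, and the overhead function assumes that each tool round is exactly one network round trip.

```python
import math
from collections import defaultdict

# Hypothetical per-request summaries, e.g. exported from a tracing backend.
runs = [
    {"team": "support", "latency_ms": 4210, "tokens": 3320, "cost_usd": 0.041, "success": True},
    {"team": "support", "latency_ms": 6900, "tokens": 5100, "cost_usd": 0.063, "success": False},
    {"team": "sales", "latency_ms": 2800, "tokens": 1900, "cost_usd": 0.022, "success": True},
    {"team": "sales", "latency_ms": 3100, "tokens": 2100, "cost_usd": 0.025, "success": True},
]


def percentile(values, pct):
    """Nearest-rank percentile; adequate for a sketch, not for large-scale statistics."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def success_rate(items):
    """Share of runs flagged as successfully completed."""
    return sum(1 for r in items if r["success"]) / len(items)


def cost_by_team(items):
    """Sum request cost per team for cost attribution."""
    totals = defaultdict(float)
    for r in items:
        totals[r["team"]] += r["cost_usd"]
    return dict(totals)


def network_overhead_ms(tool_rounds, one_way_ms):
    """Network latency added by remote tool calls: two directions per round."""
    return tool_rounds * 2 * one_way_ms


latencies = [r["latency_ms"] for r in runs]
print("p50 latency:", percentile(latencies, 50), "ms")
print("p95 latency:", percentile(latencies, 95), "ms")
print("success rate:", success_rate(runs))
print("cost by team:", cost_by_team(runs))
print("3 transatlantic rounds:", network_overhead_ms(3, 80), "to", network_overhead_ms(3, 130), "ms")
```

Under that assumption, three transatlantic tool rounds add between 480 and 780 ms of pure network time, which is why per-round latency deserves its own metric next to the end-to-end figure.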

Logs: structured context

Structured logs supplement spans with detailed context: raw tool responses, retry attempts, guardrail triggers, truncated contexts. The key is correlation: every log entry should be attributable to a specific span via the trace ID, otherwise a blind spot reappears.
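One way to get that correlation with nothing but Python's standard `logging` module is sketched below. The field names `trace_id`, `span_id` and `event` are assumptions for this example; a real setup would take the IDs from the active trace context of whichever tracing library is in use.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line that carries trace and span IDs."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # IDs arrive via `extra=`; a missing ID stays visible as null.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "event": getattr(record, "event", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("agent")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output in one place

logger.info(
    "tool call retried",
    extra={"trace_id": "ag-7f3c", "span_id": "tool.crm_lookup", "event": "retry"},
)

entries = [json.loads(line) for line in stream.getvalue().splitlines()]
print(entries[0])
```

Emitting `null` for a missing trace ID, rather than dropping the field, makes uncorrelated log lines easy to find and fix.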

Evals: quality as part of observability

Evals are the signal that pure tracing does not provide: the systematic assessment of output quality. What is typically assessed is success rate, tool-call correctness (was the right tool called with the right arguments?) and response quality, judged via heuristics, a reference dataset or "LLM-as-a-judge". Decisive for regulated contexts: evals and prompt versions are tagged against pinned model versions. The dossier names this pattern explicitly: prompt versions and evals are marked against specific model versions in the gateway or observability stack, among other things to prepare for AI Act high-risk audits.
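A minimal sketch of a tool-call correctness eval whose results are tagged against model and prompt versions might look as follows. The class names, the exact-match rule and the trace IDs other than `ag-7f3c` are assumptions for illustration; real eval harnesses usually offer fuzzier matching and judge models.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class EvalResult:
    trace_id: str
    tool_call_correct: bool
    model_version: str   # pinned model identifier
    prompt_version: str  # pinned prompt identifier


def evaluate(trace_id, expected, actual, model_version, prompt_version):
    """Exact-match check: the right tool, called with the right arguments."""
    correct = expected.name == actual.name and expected.args == actual.args
    return EvalResult(trace_id, correct, model_version, prompt_version)


def pass_rate_by_version(results):
    """Group pass rates by (model_version, prompt_version)."""
    groups = {}
    for r in results:
        key = (r.model_version, r.prompt_version)
        passed, total = groups.get(key, (0, 0))
        groups[key] = (passed + int(r.tool_call_correct), total + 1)
    return {key: passed / total for key, (passed, total) in groups.items()}


expected = ToolCall("invoice_void", {"invoice": "RE-2026-0317"})
wrong_args = ToolCall("invoice_void", {"invoice": "RE-2026-0371"})

results = [
    evaluate("ag-7f3c", expected, expected, "gpt-4.x", "v7"),
    evaluate("ag-8a10", expected, wrong_args, "gpt-4.x", "v7"),
    evaluate("ag-9b22", expected, expected, "gpt-4.x", "v8"),
]
print(pass_rate_by_version(results))
```

Grouping by version pair is what makes a regression attributable: a drop in pass rate shows up against a specific prompt or model version rather than as an unexplained average.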

Signal-to-tool matrix

The following table maps each observability signal to its measured quantity and a typical tool (tool mentions from the research dossier; as of 2026).

Signal	What to measure	Tool (examples)
Tracing	Span tree across reasoning and tool-call steps; prompt/response, nesting, duration per span	LangSmith, Langfuse, Arize Phoenix; vendor-neutral via OpenTelemetry GenAI conventions / OpenInference
Latency	End-to-end and per tool-call round; tail latency	Langfuse, LangSmith, Datadog EU / Honeycomb EU
Tokens	Prompt and completion tokens per span	Langfuse, LangSmith (token-level cost attribution)
Cost	Cost per request, team, use case	Langfuse, LangSmith; aggregation often in the AI gateway (LiteLLM, Portkey)
Success rate / quality	Eval scores, tool-call correctness, response quality against a pinned model version	Eval harness in Langfuse / Arize Phoenix / LangSmith
Logs	Raw tool responses, retries, guardrail hits, truncated contexts	Structured logs, correlated via trace ID
Guardrails / PII	Triggered filters, redaction events	AI gateway (Portkey, LiteLLM) plus trace annotation

The LLM observability stack: tools at a glance

For the observability layer, the research dossier names a clear selection (as of 2026):

LangSmith: the tracing and eval platform from the LangChain/LangGraph ecosystem, as a managed service. Strong when the agent is already built on this framework; listed in the dossier as a managed option alongside Datadog and Honeycomb.
Langfuse: open source and therefore self-hostable in the EU (your own data centre or an EU region). In the dossier, it is listed both for the regulated sovereign architecture (customer-controlled observability) and for the lean cloud startup pattern ("Langfuse self-hosted") as a GDPR-compliant path.
Arize Phoenix: open-source observability and evaluation, named in the dossier as a self-hostable alternative alongside Langfuse.
OpenTelemetry for LLMs / OpenInference: the vendor-neutral trace standard. When instrumentation follows the GenAI semantic conventions, the backend remains interchangeable, which is an important argument against lock-in. Hugging Face's TGI, for example, already exported OpenTelemetry traces and Prometheus metrics before it entered maintenance mode in December 2025.
Datadog EU / Honeycomb EU: established APM backends with an EU data residency option, listed in the dossier as managed paths for DACH residency.
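To make the vendor-neutral path more concrete, the sketch below maps an internal span record onto attribute keys in the style of the OpenTelemetry GenAI semantic conventions. It deliberately uses no SDK: the input record layout is hypothetical, and the attribute names reflect the conventions as commonly documented, which are still evolving and should be checked against the current specification before use.

```python
def to_genai_attributes(span: dict) -> dict:
    """Map a hypothetical internal span record to GenAI-style attribute keys.

    The keys follow the OpenTelemetry GenAI semantic conventions as commonly
    documented; verify them against the spec version your backend expects.
    """
    return {
        "gen_ai.operation.name": span["operation"],
        "gen_ai.request.model": span["model"],
        "gen_ai.usage.input_tokens": span["prompt_tokens"],
        "gen_ai.usage.output_tokens": span["completion_tokens"],
    }


internal_span = {
    "operation": "chat",
    "model": "gpt-4.x",
    "prompt_tokens": 540,
    "completion_tokens": 80,
}

attributes = to_genai_attributes(internal_span)
for key, value in attributes.items():
    print(f"{key} = {value}")
```

Keeping this mapping in one place means a change of backend, or of convention version, touches a single function instead of every instrumented call site.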

An important architectural point: the AI gateway (LiteLLM, Portkey, Kong) is often already a carrier of parts of the observability. According to the dossier, gateways handle multi-provider failover, virtual key management, team budgets, observability, guardrails and PII redaction. In practice, the gateway attributes tokens and cost, while the tracing platform holds the reasoning path and the evals: the two together produce the complete picture.

Example trace: an agent run, broken down

The following pseudo-example shows how a single support-agent run looks as a span tree. The numbers are illustrative; the latency magnitude for transatlantic calls (80–130 ms/direction) comes from the dossier.

```
TRACE id=ag-7f3c "Customer request: cancel invoice" total: 4,210 ms | 3,320 tokens | 0.041 USD
├─ SPAN llm.reason model=gpt-4.x 620 ms | in 540 / out 80 tok "Plan: look up customer+invoice"
├─ SPAN tool.crm_lookup mTLS, in-VPC 180 ms | status=200 args={customer_id:8842}
├─ SPAN retrieval.vector qdrant-eu 95 ms | 4 hits query="B2B cancellation policy"
├─ SPAN llm.reason model=gpt-4.x 710 ms | in 1,980 / out 130 tok "Cancellation permitted, call tool"
├─ SPAN tool.invoice_void in-VPC 240 ms | status=200 args={invoice:RE-2026-0317}
├─ SPAN guardrail.pii redaction 40 ms | 0 hits
└─ SPAN llm.compose model=gpt-4.x 2,325 ms | in 380 / out 210 tok response text to customer
EVAL tool_call_correct=PASS | answer_quality=0.92 | tagged: model=gpt-4.x, prompt=v7
```

What this trace achieves that an output log cannot: had the cancellation failed, the tree would immediately show whether tool.invoice_void returned an error status, whether retrieval.vector pulled the wrong policy or whether the second llm.reason span decided incorrectly. The largest latency item (llm.compose, 2,325 ms of 4,210 ms) stands out as an optimisation candidate. And the eval line ties the quality verdict to the model and prompt version, which is the basis for rollback and audit.
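The two questions asked of the trace (where did it break, and where did the time go) can be expressed as simple queries over the spans. The sketch below reuses the illustrative latencies from the example trace; the flat list layout and the `status` field are assumptions, not the export format of a specific tool.

```python
# Spans from the illustrative trace, in execution order.
spans = [
    {"name": "llm.reason#1", "latency_ms": 620, "status": "ok"},
    {"name": "tool.crm_lookup", "latency_ms": 180, "status": "ok"},
    {"name": "retrieval.vector", "latency_ms": 95, "status": "ok"},
    {"name": "llm.reason#2", "latency_ms": 710, "status": "ok"},
    {"name": "tool.invoice_void", "latency_ms": 240, "status": "ok"},
    {"name": "guardrail.pii", "latency_ms": 40, "status": "ok"},
    {"name": "llm.compose", "latency_ms": 2325, "status": "ok"},
]


def first_failure(items):
    """Return the first span whose status is not "ok", or None if all passed."""
    return next((s for s in items if s["status"] != "ok"), None)


def slowest_span(items):
    """Return the span with the highest latency."""
    return max(items, key=lambda s: s["latency_ms"])


total_ms = sum(s["latency_ms"] for s in spans)
slowest = slowest_span(spans)
share = slowest["latency_ms"] / total_ms

print(f"total: {total_ms} ms")
print(f"slowest: {slowest['name']} ({share:.0%} of total)")
print(f"first failure: {first_failure(spans)}")

# Simulate a failed cancellation to see the debugging path.
failed = [dict(s, status="error") if s["name"] == "tool.invoice_void" else s for s in spans]
print(f"first failure after simulated error: {first_failure(failed)['name']}")
```

In this example llm.compose accounts for a little over half of the total latency, so shortening the final response generation would pay off more than tuning any tool call.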

Relationship to monitoring in the security pillar

Tracing is not only a debugging signal but also a security signal. Agents have an unusually large "blast radius": a compromised agent can call many tools. The end-to-end trace provides exactly the audit log that makes this attack surface traceable: which agent called which tool, with which identity, and when. This complements the controls from the security and identity context: deny-by-default egress with an allowlist and logging at the gateway, one service account per agent-tool pair instead of shared accounts, and the binding of every call back to the user identity via a token-exchange chain. Observability at the network level via a service mesh (mTLS, workload identity, traffic shaping) closes the loop. Detailed measurement and control patterns for egress, identity and monitoring are covered in the security pillar; the observability layer provides the telemetry basis for this. The specific token economics and cost modelling, in turn, are the subject of the FinOps pillar.
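As a rough illustration of the audit-log idea, the sketch below turns tool-call spans into time-ordered audit records and flags calls outside a per-agent allowlist. All names here (agents, tools, service accounts, user IDs, timestamps and the allowlist itself) are invented for the example and do not come from the dossier.

```python
from datetime import datetime

# Hypothetical per-agent tool allowlist (deny by default).
ALLOWED_TOOLS = {"support-agent": {"crm_lookup", "invoice_void"}}

# Hypothetical tool-call spans as they might be exported from a trace backend.
tool_spans = [
    {"agent": "support-agent", "tool": "invoice_void", "user": "u-1001",
     "service_account": "sa-support-invoice", "ts": "2026-03-17T09:15:04+00:00"},
    {"agent": "support-agent", "tool": "crm_lookup", "user": "u-1001",
     "service_account": "sa-support-crm", "ts": "2026-03-17T09:15:02+00:00"},
    {"agent": "support-agent", "tool": "payroll_export", "user": "u-1001",
     "service_account": "sa-support-crm", "ts": "2026-03-17T09:15:09+00:00"},
]


def audit_records(spans):
    """Order tool calls by time and mark whether each was on the allowlist."""
    ordered = sorted(spans, key=lambda s: datetime.fromisoformat(s["ts"]))
    return [
        {**s, "allowed": s["tool"] in ALLOWED_TOOLS.get(s["agent"], set())}
        for s in ordered
    ]


records = audit_records(tool_spans)
for r in records:
    flag = "ok" if r["allowed"] else "NOT ON ALLOWLIST"
    print(f'{r["ts"]} {r["agent"]} -> {r["tool"]} as {r["service_account"]} for {r["user"]} [{flag}]')

violations = [r for r in records if not r["allowed"]]
```

The same records serve both purposes named above: the ordered list is the audit trail, and the flagged entries are candidates for alerting at the gateway.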

For agencies and B2B decision-makers

For marketing agencies building agent workflows for clients, observability is the deliverable that separates a pilot from an auditable production system: only tracing, evals and an EU-compliant backend make an agent operable, billable and defensible with regard to client data protection. For DACH B2B decision-makers, the rule is: choose the observability backend just as deliberately as the cloud region. Self-hosted Langfuse, Datadog EU or Honeycomb EU keep prompts and responses within the EU, and vendor-neutral OpenTelemetry tracing protects against lock-in. Blck Alpaca from Vienna designs and operates agent infrastructure with this observability layer from the outset, including tracing, an eval harness and a GDPR-compliant backend. Get in touch if you want to move an agent out of pilot status into auditable production operation.

FAQ

What is the difference between observability for AI agents and classic APM?

Classic application performance monitoring measures requests, error rates and latency at the service level. AI agent observability goes deeper: it records the entire reasoning path for each request as a trace, covering every LLM call, every tool call and every retrieval step, including prompt, response, token consumption and cost per span. In addition, it assesses the quality of the output via evals, not just its technical availability. An agent can run technically flawlessly (HTTP 200) and still make the wrong decision, and it is precisely this gap that agent observability closes.

Why are AI agents un-debuggable without tracing?

An agent is non-deterministic and multi-step: a request is often followed by several rounds of reasoning with intermediate tool calls. If the final result fails, a mere output log does not tell you whether the LLM reasoned incorrectly, a tool returned a faulty response, the retrieval step pulled the wrong documents or a guardrail intervened. Tracing makes each of these stages visible as its own span with inputs and outputs. Only then can you isolate the specific step at which the chain broke.

Which observability tools are suitable for DACH companies with data residency requirements?

According to the research dossier, the main candidates for GDPR-compliant setups are Langfuse self-hosted (open source, operable in an EU region or in your own data centre), Datadog EU and Honeycomb EU. Managed options such as LangSmith or Datadog are more convenient but must be checked against the data residency of prompts and responses, since traces can contain customer data. Vendor-neutral instrumentation via OpenTelemetry GenAI conventions keeps the backend interchangeable (as of 2026).

Do evaluations (evals) belong to observability, or are they separate disciplines?

In the agent context, the two merge. Evals, meaning the systematic assessment of success rate, tool-call correctness and response quality, are the quality signal that pure tracing does not provide. In practice, evals are attached directly to the traced runs and tagged against pinned model and prompt versions. The research dossier names exactly this pattern: prompt versions and evals are marked against specific model versions in the gateway or observability stack, among other things as preparation for AI Act high-risk audits.

Which metrics should you capture at a minimum for a production agent?

Four core signals: latency (end-to-end and per tool-call round, since transatlantic API calls add 80–130 ms per direction according to the dossier), token consumption (prompt and completion tokens per span as the basis for cost attribution), cost (per request, per team, per use case) and success rate (the share of correctly completed tasks). In addition: tool-call error rate, number of reasoning rounds per request and guardrail triggers. Detailed token economics and cost modelling are covered separately in the FinOps pillar.
