16.9 · Intermediate · 8 min

AI Agent Monitoring with LangSmith and Langfuse: Observability for Secure AI Agents

Blck Alpaca
Definition

AI agent monitoring (agent observability) is the end-to-end capture and analysis of what an AI agent does: traces, tool calls, token costs, latency, errors and eval scores. Tools such as LangSmith and Langfuse make an agent's decision paths traceable and are therefore a prerequisite for security, debugging and compliance.

Key Takeaways

  • Monitoring for agents differs fundamentally from classic APM: what needs to be tracked are traces, tool calls, tokens/costs, latency, errors and eval scores along multi-step reasoning chains.
  • LangSmith (commercial, tightly coupled to the LangChain ecosystem) and Langfuse (open source, self-hostable, EU region available) are the two observability anchors named explicitly in the research; for DACH data residency, self-hosting or EU hosting is the central selection criterion.
  • Observability is security infrastructure: without trace and provenance logging, OWASP agentic threats such as Goal Hijack (ASI01), Tool Misuse (ASI02) or Memory Poisoning (ASI06) cannot be detected.
  • Compliance hinges on logging: ISO/IEC 42001 (A.6.2.6, A.6.2.8), the EU AI Act (Art. 15, Art. 14, Art. 72) and, in the financial sector, DORA as well as the BaFin guidance require traceable, tamper-proof logs.
  • Standard observability stacks are not enough in 2026: they often fail to capture agent-specific signals (reasoning drift, memory-write provenance, inter-agent integrity); detection practice for agents is still immature.
  • Minimum log content per agent action: full prompt, model version/config hash, tool-call sequence with arguments, retrieval queries, output and rationale, human-override and memory events, costs and latency.

Unlike classic application monitoring, agent monitoring must map not individual requests but entire multi-step reasoning chains.

The three most important points up front:

  • What to track: traces (the complete reasoning chain), tool calls with arguments, token consumption and costs, latency per step, errors and eval scores for answer quality.
  • Which tools: LangSmith (commercial, tightly coupled to LangChain) and Langfuse (open source, self-hostable, EU region), complemented by Arize Phoenix, Weights & Biases Weave, Datadog LLM Observability and OpenTelemetry for GenAI traces.
  • Why it is critical: without observability, OWASP agentic threats such as Goal Hijack or Memory Poisoning remain invisible - and compliance obligations from ISO 42001, the EU AI Act and DORA cannot be met.

Why agents need their own observability

Classic language-model applications react: prompt in, answer out. Agentic systems, by contrast, plan, reason recursively, choose tools, write to persistent memory and act with minimal step-by-step approval. This shift expands the attack and error surface along three axes: autonomy (multi-step plans, self-modification of context), tool use (file system, APIs, databases, code sandboxes, MCP servers) and persistence (long-lived memory, vector databases, agent-to-agent trust chains).

This is precisely why a request log is not enough. You must be able to reconstruct the entire trajectory of an agent - otherwise neither a bug, nor an attack, nor a hallucination can be traced back. In the MAESTRO reference architecture of the Cloud Security Alliance, observability forms its own layer (Layer 5: Evaluation & Observability), which itself becomes an attack target: poisoned observability data, monitoring evasion and compromised evaluation are listed as threats there.

What specifically needs to be tracked

The OWASP source defines a minimum log content per agent action for forensic completeness. This list is the practical core of any monitoring strategy:

  • Full prompt - user, system and injected context (crucial for detecting indirect prompt injection).
  • Model version and configuration hash - reproducibility and proof of change.
  • Tool-call sequence with arguments - which tool was called when and with which parameters.
  • Retrieval queries and returned document IDs - traceability of the RAG basis.
  • Output and decision rationale - including chain-of-thought, where available.
  • Human-override events - every human approval or correction.
  • Memory write and read accesses - critical for detecting memory poisoning.
  • Costs and latency - per step, for cost-efficiency and anomaly detection.
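The minimum log content above can be sketched as a single record type. This is only an illustration: the class and field names (`AgentActionLog`, `ToolCall`, `config_hash`) are assumptions made for this example and are not taken from the OWASP source, LangSmith or Langfuse.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def config_hash(config: dict) -> str:
    """Stable SHA-256 over a model configuration (keys sorted for determinism)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class AgentActionLog:
    """One log record per agent action (hypothetical schema)."""
    # Full prompt, split by origin so injected context stays distinguishable.
    system_prompt: str
    user_prompt: str
    injected_context: list[str]
    # Reproducibility and proof of change.
    model_version: str
    model_config_hash: str
    # Tool-call sequence in call order, with arguments.
    tool_calls: list[ToolCall] = field(default_factory=list)
    # RAG basis.
    retrieval_queries: list[str] = field(default_factory=list)
    retrieved_doc_ids: list[str] = field(default_factory=list)
    # Output and decision rationale.
    output: str = ""
    rationale: str = ""
    # Human approvals or corrections, and memory reads/writes.
    human_overrides: list[str] = field(default_factory=list)
    memory_events: list[dict] = field(default_factory=list)
    # Costs and latency for this step.
    cost_usd: float = 0.0
    latency_s: float = 0.0

    def to_json(self) -> str:
        """Serialise the record, including nested tool calls, as JSON."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping system, user and injected context in separate fields is a design choice for this sketch: it makes it easier to see later whether an instruction came from data rather than from the system prompt.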

In addition, eval scores belong in monitoring: automated or model-based assessments of answer quality (correctness, groundedness, hallucination rate) that are compared across versions. LangSmith and Langfuse both support such evaluation pipelines, which allow regressions to be detected before production deployment.
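A version comparison of this kind could look roughly like the following sketch. It assumes that every metric is scored so that higher is better (a hallucination rate would have to be inverted first) and that scores were already exported from the evaluation tool; the function name `eval_gate` and the tolerance `max_drop` are hypothetical.

```python
from statistics import mean


def eval_gate(baseline: dict[str, list[float]],
              candidate: dict[str, list[float]],
              max_drop: float = 0.02) -> list[str]:
    """Return the metrics whose mean score dropped by more than max_drop."""
    regressions = []
    for metric, base_scores in baseline.items():
        cand_scores = candidate.get(metric)
        if not cand_scores:
            # A metric missing from the candidate run is treated as a failure.
            regressions.append(metric)
            continue
        if mean(base_scores) - mean(cand_scores) > max_drop:
            regressions.append(metric)
    return regressions


# Example scores (invented for illustration).
baseline_run = {"correctness": [0.90, 0.92], "groundedness": [0.88, 0.90]}
candidate_run = {"correctness": [0.91, 0.93], "groundedness": [0.80, 0.82]}
failed_metrics = eval_gate(baseline_run, candidate_run)
```

An empty result would let the release proceed; any listed metric would block it until the regression is explained.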

For retention, WORM storage (write-once-read-many, i.e. immutable logs) and cryptographic signing for tamper detection are recommended. Retention periods depend on the sector: the source cites 10 years for banks and insurers, BfArM/Swissmedic requirements in healthcare, and archival law in the public sector.
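Tamper detection can be illustrated with a hash chain in which every entry carries a MAC over its own content and the MAC of the previous entry. Note the assumptions: HMAC is a symmetric keyed MAC and stands in here for a real digital signature, the names `append_entry` and `verify_chain` are hypothetical, and the in-memory list does not replace WORM storage.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


def _mac(key: bytes, prev_mac: str, body: str) -> str:
    """HMAC-SHA256 over the previous MAC concatenated with the entry body."""
    return hmac.new(key, (prev_mac + body).encode("utf-8"), hashlib.sha256).hexdigest()


def append_entry(chain: list[dict], payload: dict, key: bytes) -> dict:
    """Append a log payload, linking it to the previous entry's MAC."""
    prev_mac = chain[-1]["mac"] if chain else GENESIS
    body = json.dumps(payload, sort_keys=True)
    entry = {"prev": prev_mac, "body": body, "mac": _mac(key, prev_mac, body)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict], key: bytes) -> bool:
    """Recompute every MAC; any edited, reordered or dropped-from-the-front entry fails."""
    prev_mac = GENESIS
    for entry in chain:
        expected = _mac(key, prev_mac, entry["body"])
        if entry["prev"] != prev_mac or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev_mac = entry["mac"]
    return True
```

A chain like this does not notice entries removed from the end; for that, the latest MAC would have to be anchored somewhere outside the log, which is beyond this sketch.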

The tool landscape: focus and hosting

LangSmith and Langfuse are the two anchors of the observability landscape named explicitly in the source. For DACH decision-makers, in addition to feature scope, hosting in particular is decisive - data residency in the EU or Switzerland is often the deciding criterion in regulated sectors.

| Tool | Focus | Hosting / EU suitability (as of 2026) |
| --- | --- | --- |
| LangSmith | Tracing, eval, debugging; tightly coupled to the LangChain/LangGraph ecosystem | Commercial, primarily as a managed cloud; enterprise self-hosting available |
| Langfuse | Tracing, token/cost tracking, evaluation, prompt management; framework-agnostic | Open source, fully self-hostable; dedicated EU region in the managed cloud - favourable for GDPR data residency |
| Arize Phoenix | Open-source observability and evaluation, RAG/embedding analysis | Open source, self-hostable |
| Weights & Biases Weave | Tracing and evaluation, ML-experiment-adjacent stack | Managed cloud, self-hosting for enterprise |
| Datadog LLM Observability | LLM tracing integrated into existing APM/SIEM | Managed; EU region available within the Datadog estate |
| OpenTelemetry (GenAI) | Open trace standard, vendor-neutral instrumentation | Self-hostable; basis for vendor-independent pipelines |

In addition, the source names Honeycomb AI and Splunk AI Assistant Tracing as building blocks in the broader observability stack. To avoid lock-in, instrument via OpenTelemetry and forward the traces to the platform of your choice.

A note relevant to DACH: in the adjacent guardrail market, Lakera is an active Swiss provider - evidence that European data residency is also feasible in the security tooling around agents. With all providers, the rule is: self-published benchmark and detection rates should be treated as marketing until they are independently verified.

Why monitoring is security and compliance infrastructure

Observability is not a downstream nice-to-have but the foundation on which detection can work at all. Almost without exception, the OWASP agentic threats can be detected only through monitoring signals:

  • Agent Goal Hijack (ASI01): atypical outbound URLs in agent outputs, tool calls that do not match the user request, sudden topic switches in the reasoning trace - all only visible if the trace is available without gaps.
  • Tool Misuse (ASI02): anomalous tool-call frequencies, unusual call sequences, destructive operations shortly after ingesting external content.
  • Memory & Context Poisoning (ASI06): behavioural drift without any code or model change, non-verifiable memory entries, semantic outliers in the vector store.
  • Cascading Failures (ASI08): rapid fan-out (one decision triggers many downstream agents within seconds), oscillating retry loops.
  • Rogue Agents (ASI10): behavioural drift against the baseline, access to resources outside the normal scope.
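One of the signals listed above, anomalous tool-call frequency, can be approximated with a simple z-score against a baseline. This is a deliberately naive sketch: the function name, the fixed time windows and the threshold of 3 standard deviations are assumptions, and a production detector would need a more robust baseline.

```python
from statistics import mean, pstdev


def is_frequency_anomaly(baseline_counts: list[int],
                         current_count: int,
                         threshold: float = 3.0) -> bool:
    """Flag a window whose tool-call count deviates strongly from the baseline.

    baseline_counts: calls per time window observed during normal operation.
    current_count:   calls in the window being checked.
    """
    mu = mean(baseline_counts)
    sigma = pstdev(baseline_counts)
    if sigma == 0:
        # Perfectly constant baseline: any deviation counts as anomalous.
        return current_count != mu
    return abs(current_count - mu) / sigma > threshold


# Calls per minute to one tool during normal operation (invented numbers).
normal_windows = [4, 5, 6, 5, 4, 6, 5]
quiet_minute = is_frequency_anomaly(normal_windows, 5)
burst_minute = is_frequency_anomaly(normal_windows, 20)
```

The same pattern could be applied per tool and per agent, so that a burst in one destructive tool is not averaged away by harmless calls elsewhere.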

On the cost side, monitoring addresses the threat of unbounded resource consumption (Unbounded Consumption, LLM10) - so-called denial-of-wallet attacks. Multi-step plans multiply token consumption; anomaly detection on the token-usage time series plus hard cost caps with circuit breakers are the countermeasures.
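A hard cost cap with a circuit breaker might be sketched as follows. The class names and the per-1,000-token prices passed in are hypothetical; real prices depend on the model and provider.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run exceeds its cost cap."""


class CostCircuitBreaker:
    """Accumulates cost per agent run and refuses further steps once the cap is hit."""

    def __init__(self, max_cost_usd: float):
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0
        self.open = False  # "open" means the circuit is tripped

    def record(self, input_tokens: int, output_tokens: int,
               usd_per_1k_input: float, usd_per_1k_output: float) -> float:
        """Add one step's cost; raise once the cumulative cost exceeds the cap."""
        if self.open:
            raise BudgetExceeded("circuit open: cost cap already reached")
        step_cost = (input_tokens / 1000) * usd_per_1k_input \
            + (output_tokens / 1000) * usd_per_1k_output
        self.spent_usd += step_cost
        if self.spent_usd > self.max_cost_usd:
            self.open = True
            raise BudgetExceeded(
                f"spent {self.spent_usd:.4f} USD, cap is {self.max_cost_usd:.4f} USD"
            )
        return step_cost
```

The agent loop would call `record` after every model step and stop the run when `BudgetExceeded` is raised, so a runaway multi-step plan ends at the cap instead of at the invoice.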

On the compliance side, logging is the common denominator of all the relevant frameworks. ISO/IEC 42001 addresses it directly with the Annex A controls A.6.2.6 (Operation and monitoring) and A.6.2.8 (Logging). The EU AI Act requires cybersecurity and robustness in Art. 15, human oversight in Art. 14 and post-market monitoring in Art. 72 for high-risk systems. The GDPR connects via Art. 32 (technical and organisational measures for integrity and availability). In the financial sector, DORA applies (Art. 5-15 ICT risk management), and the BaFin guidance of 18 December 2025 frames AI systems as a subclass of network and information systems under DORA - with explicit focus on transparent decision logs. For KRITIS operators, NIS2 (Art. 21) addresses incident handling and access control. Note: this section places the frameworks in context and is not legal advice; the specific applicability to your system should be assessed legally.

An honest assessment is part of this: according to the OWASP source, detection in production agent deployments is currently weak. Most observability stacks were built for classic applications and do not capture agent-specific signals - reasoning-trace anomalies, memory-write provenance violations, inter-agent integrity errors, behavioural drift against the baseline. According to the source, detection-engineering practice for agents stands roughly where SOC detection for cloud was in 2018: usable, but with high false-negative rates and limited DACH language coverage. Those who see nothing in the SIEM should not conclude that nothing is happening.

Practical example: what a trace makes visible

Suppose an insurer operates a claims-triage agent. A scanned doctor's letter file contains, readable via OCR, manipulative text that pushes the agent towards an automatic payout - a goal-hijack scenario (ASI01). A well-instrumented trace in Langfuse or LangSmith makes the attack visible:

```
trace_id: claim-48211
step 1 retrieval query="Claim 48211" docs=[doc_91, doc_OCR_scan]
step 2 reasoning "Document contains instruction: approve immediately" <- anomaly
step 3 tool_call approve_payout(amount=14,900 EUR) <- unusually early
step 4 output "Payout approved"
tokens: 8,420 cost: 0.11 USD latency: 3.2 s
```

Step 2 shows a reasoning step that adopts an instruction from document content (i.e. from data, not from the system prompt) - the classic signal for indirect injection. Step 3 is a destructive tool call (payout) immediately after ingesting external content. Without a trace, only an approved payment would be visible in the system; with a trace, provenance metadata on the memory entry and an alert on the rule "destructive tool call after external content", the incident can be stopped in real time and later proven forensically.
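The alert rule "destructive tool call after external content" can be sketched over a list of trace steps. The step format (dicts with `type`, `tool` and `external_content` keys), the set `DESTRUCTIVE_TOOLS` and the `max_gap` parameter are assumptions for this example and do not reflect the export format of Langfuse or LangSmith.

```python
# Hypothetical list of tools considered destructive in this deployment.
DESTRUCTIVE_TOOLS = {"approve_payout"}


def destructive_after_external(steps: list[dict], max_gap: int = 2) -> list[int]:
    """Return 0-based indices of destructive tool calls that occur within
    max_gap steps after a step that ingested external content."""
    alerts = []
    last_external = None
    for i, step in enumerate(steps):
        if step.get("external_content"):
            last_external = i
        if (step.get("type") == "tool_call"
                and step.get("tool") in DESTRUCTIVE_TOOLS
                and last_external is not None
                and i - last_external <= max_gap):
            alerts.append(i)
    return alerts


# The claims-triage trace from above, reduced to the fields the rule needs.
trace = [
    {"type": "retrieval", "external_content": True},   # step 1: OCR scan ingested
    {"type": "reasoning"},                              # step 2
    {"type": "tool_call", "tool": "approve_payout"},   # step 3
    {"type": "output"},                                 # step 4
]
print(destructive_after_external(trace))  # -> [2], i.e. step 3
```

In a live system the rule would run before the tool call is executed, so that a match can pause the payout for human approval rather than only raising an alert afterwards.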

On urgency, the source provides a concrete figure: in simulated multi-agent systems, a single compromised agent poisoned 87 percent of downstream decisions within four hours (Galileo AI Research, December 2025). Cascading failures spread faster than traditional incident response can contain them - which makes continuous, deep monitoring of inter-agent flows a must.

For agencies and B2B decision-makers

Those who build AI agents for clients or deploy them in their own operations should plan observability as a mandatory component from day one, not as a later add-on. In practice, this means: instrument framework-agnostically via OpenTelemetry, self-host Langfuse or run it in the EU region when data residency matters, capture the minimum log content listed above in full, and use eval scores as a quality gate ahead of every release. For agencies, transparent trace and cost monitoring is both a trust signal and a sales argument: it demonstrates to the client that the agent is controllable, auditable and budgetable. Blck Alpaca from Vienna supports DACH companies in building and securing such agent stacks - from observability architecture through to alignment with ISO 42001 and the EU AI Act.

FAQ

What is the difference between AI agent monitoring and classic application monitoring?
Classic APM monitors deterministic requests and responses. Agents, by contrast, plan, choose tools, write to memory and act in a multi-step, non-deterministic manner. Monitoring for agents must therefore map the entire reasoning chain as a trace: every tool call with arguments, retrieval queries, memory accesses, token costs, latency per step and eval scores for answer quality. Only this depth makes misbehaviour and attacks visible.
LangSmith or Langfuse - which is a better fit for DACH companies?
Both cover traces, tool calls, tokens/costs, latency and evaluation. The main difference lies in hosting: Langfuse is open source and self-hostable and offers an EU region, which makes GDPR-compliant data residency easier (as of 2026). LangSmith is a commercial platform tightly coupled to the LangChain ecosystem. For strictly data-protection- or sector-regulated DACH scenarios, self-hosting or EU hosting is usually the decisive criterion. This is not legal advice.
Why is monitoring central to the security of AI agents?
Many OWASP agentic threats can only be detected through observability. Goal Hijack (ASI01) shows up as atypical tool calls and sudden topic switches in the reasoning trace, Tool Misuse (ASI02) as anomalous call frequencies, Memory Poisoning (ASI06) as behavioural drift without any code change. Without gap-free trace and provenance logging, a compromised agent remains invisible - according to a figure cited in the OWASP source (Galileo AI Research, December 2025), a single compromised agent in simulated multi-agent systems poisoned 87 percent of downstream decisions within four hours.
What is the minimum data I must log per agent action?
For forensic completeness, the OWASP source recommends per action: the full prompt (user, system and injected context), model version and configuration hash, tool-call sequence with arguments, retrieval queries and returned document IDs, output including decision rationale, human-override events, memory write and read accesses as well as costs and latency. WORM storage and cryptographic signing are recommended for tamper detection.
Is an existing observability tool like Datadog sufficient for AI agents?
Only to a limited extent. According to the OWASP source, most observability stacks were built for classic applications and do not capture the signals required for agentic threats - such as reasoning-trace anomalies, violations of memory-write provenance or integrity errors in inter-agent communication. Platforms such as Datadog LLM Observability or Splunk can be part of the stack, but should be supplemented with agent-specific tools like LangSmith or Langfuse and with behavioural monitoring.
