You shipped your AI agent. It works in testing. Then a user reports a wrong answer, your LLM costs spike 3x overnight, and you have no idea which tool call caused it. Without observability, production AI is a black box.
LangSmith is the observability and evaluation platform for LangChain/LangGraph agents. After running it across every agent I've shipped — BandiFinder (procurement matching), Pellemoda (inventory forecasting), RevAgent (autonomous RevOps), and Holding Morelli (compliance monitoring) — here's how to use it effectively.
Setup: Two Lines
```python
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "revagent-prod"
```

That's it. Every LangChain/LangGraph operation is now traced — LLM calls, tool invocations, retrieval steps, state transitions. No code changes to your agent.
Tracing: See What Your Agent Thinks
Every agent run produces a trace — a tree of operations showing exactly what happened:
```
Trace: "Score deal risk for Acme Corp"
├── LLM Call (gpt-4o-mini) — 340 tokens, 0.8s, $0.0003
│   └── Decision: "Need CRM data before scoring"
├── Tool Call: fetch_deal_data("acme-corp")
│   └── Result: {stage: "negotiation", amount: 85000, days_open: 45}
├── Tool Call: analyze_email_threads("acme-corp")
│   └── Result: {sentiment: 0.3, ghosting_detected: true, last_reply: "2026-03-20"}
├── LLM Call (gpt-4o-mini) — 520 tokens, 1.2s, $0.0005
│   └── Decision: "High risk — ghosting + declining sentiment"
└── Output: {risk_score: 0.82, signals: ["ghosting", "sentiment_decline", "stalled_stage"]}
```
You see every decision the agent made, with what context, at what cost. When something goes wrong, you don't guess — you look at the trace.
Token and Cost Breakdown
LangSmith automatically tracks costs across three categories:
- Input tokens: Prompt tokens sent to the model (including cache reads, image tokens)
- Output tokens: Generated response tokens (including reasoning tokens)
- Other: Tool calls, retrieval steps, custom runs
Costs are broken down per-run in the trace tree, aggregated per-project in dashboards, and viewable per-thread for conversation-level analysis. For RevAgent, this showed me that the email analysis tool was consuming 60% of total tokens — one optimization cut costs by 40%.
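That 60% figure came from aggregating token counts across child runs. A minimal sketch of the same analysis, assuming you have already pulled child-run records (e.g. via `client.list_runs`) and reduced each to a name plus token count (the field names here are illustrative):

```python
from collections import defaultdict

def token_share_by_step(runs: list[dict]) -> dict[str, float]:
    """Compute each step's share of total tokens across a set of runs.

    `runs` is assumed to be a list of child-run records reduced to
    {"name": ..., "total_tokens": ...} dicts.
    """
    totals = defaultdict(int)
    for run in runs:
        totals[run["name"]] += run["total_tokens"]
    grand_total = sum(totals.values()) or 1  # avoid division by zero
    return {name: tokens / grand_total for name, tokens in totals.items()}

# Example: spotting a token-hungry tool
runs = [
    {"name": "analyze_email_threads", "total_tokens": 600},
    {"name": "fetch_deal_data", "total_tokens": 150},
    {"name": "score_risk_llm", "total_tokens": 250},
]
shares = token_share_by_step(runs)
# analyze_email_threads accounts for 0.6 of all tokens
```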
Custom Cost Tracking
For self-hosted models or custom operations, submit costs manually:
```python
from langsmith import traceable

@traceable(
    run_type="llm",
    metadata={"ls_model_name": "my-fine-tuned-model", "ls_provider": "custom"},
)
def call_custom_model(prompt: str) -> str:
    # LangSmith tracks this as a traced LLM call
    result = my_model.generate(prompt)
    return result
```

Offline Evaluation: Test Before You Ship
Offline evaluation runs your agent against a curated dataset before deployment. This is your CI/CD gate — if evaluation scores drop, don't deploy.
Create a Dataset
```python
from langsmith import Client

client = Client()
dataset = client.create_dataset("deal-risk-scoring-v2")

# Add examples — input + expected output
client.create_examples(
    inputs=[
        {"deal": {"stage": "negotiation", "days_open": 90, "sentiment": 0.2}},
        {"deal": {"stage": "closed_won", "days_open": 15, "sentiment": 0.9}},
        {"deal": {"stage": "discovery", "days_open": 5, "sentiment": 0.7}},
    ],
    outputs=[
        {"risk_score_range": [0.7, 1.0], "expected_signals": ["stalled", "low_sentiment"]},
        {"risk_score_range": [0.0, 0.2], "expected_signals": []},
        {"risk_score_range": [0.1, 0.4], "expected_signals": []},
    ],
    dataset_id=dataset.id,
)
```

Build datasets from:
- Manually curated test cases — edge cases you know matter
- Historical production traces — real queries that failed or succeeded
- Synthetic data — LLM-generated variations for coverage
Define Evaluators
Four types of evaluators:
1. Code rules — deterministic, fast:
```python
def risk_score_in_range(run, example):
    """Check if the risk score falls in the expected range."""
    actual = run.outputs["risk_score"]
    expected_range = example.outputs["risk_score_range"]
    return {"score": expected_range[0] <= actual <= expected_range[1]}
```

2. LLM-as-judge — nuanced quality assessment:
```python
from langsmith.evaluation import LangChainStringEvaluator

relevance_evaluator = LangChainStringEvaluator(
    "criteria",
    config={"criteria": "Is the risk assessment well-reasoned and based on the provided signals?"},
)
```

3. Human review — annotation queues for manual scoring
4. Pairwise comparison — A/B test two agent versions side by side
Run Experiments
```python
from langsmith.evaluation import evaluate

results = evaluate(
    risk_scoring_agent.invoke,
    data="deal-risk-scoring-v2",
    evaluators=[risk_score_in_range, relevance_evaluator],
    experiment_prefix="risk-agent-v2.1",
    num_repetitions=3,  # run each example 3 times for consistency
)
```

Compare experiments across versions. If v2.1 scores lower than v2.0, investigate before deploying.
Online Evaluation: Monitor in Production
Online evaluators run automatically on live traffic — no datasets needed. They catch issues in real-time.
Set Up Online Evaluators
Configure rules that fire on every production trace:
- Safety checks: Flag responses containing PII, toxic content, or prompt injection attempts
- Format validation: Verify structured outputs match expected schemas
- Quality heuristics: Check response length, language consistency, citation presence
- LLM-as-judge: Score quality on a sample of production traces (with sampling to control costs)
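Format validation is the cheapest of these to run on every trace. A hedged sketch of what such a check might look like for the risk scorer above (the key names and expected ranges are assumptions about this agent's output contract, not a LangSmith API):

```python
def valid_risk_output(outputs: dict) -> dict:
    """Hypothetical format check for the risk scorer's structured output.

    Verifies the keys and value ranges a downstream pipeline would expect,
    and returns a feedback-style dict with a 0/1 score.
    """
    ok = (
        isinstance(outputs.get("risk_score"), (int, float))
        and 0.0 <= outputs["risk_score"] <= 1.0
        and isinstance(outputs.get("signals"), list)
    )
    return {"key": "format_valid", "score": int(ok)}

good = valid_risk_output({"risk_score": 0.82, "signals": ["ghosting"]})
bad = valid_risk_output({"risk_score": "high"})
```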
Sampling and Filtering
You don't need to evaluate every trace — that's expensive. Apply filters:
```
# Evaluate 10% of production traces
sampling_rate: 0.1

# Only evaluate traces from the risk-scoring agent
filter: agent_name == "risk_scorer"

# Only evaluate traces with latency > 5s (investigate slow runs)
filter: latency > 5000
```
Alerts
Set alerts for anomalies:
- Average latency exceeds 3s → Slack notification
- Error rate exceeds 5% → PagerDuty alert
- Cost per trace exceeds $0.05 → Email warning
- Evaluation score drops below 0.7 → Block deployment
Dashboards: The Operations View
LangSmith dashboards aggregate traces into operational metrics:
| Metric | What to Monitor | Alert Threshold |
|---|---|---|
| P50/P95 latency | Response time distribution | P95 > 5s |
| Error rate | Failed traces / total | > 5% |
| Token usage | Input + output per trace | Sudden 2x spike |
| Cost per day | Daily spend by model | Exceeds budget |
| Evaluation scores | Quality trends over time | Downward trend |
| Trace volume | Requests per hour | Unexpected drop (outage?) |
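LangSmith computes these percentiles for you, but it helps to know what the numbers mean. A stdlib sketch of the nearest-rank P50/P95 over trace latencies (the sample values are illustrative):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (p in [0, 100]) over a list of latencies."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [800, 950, 1100, 1200, 1400, 1600, 2100, 2500, 4800, 6200]
p50 = percentile(latencies_ms, 50)  # 1400
p95 = percentile(latencies_ms, 95)  # 6200
alert = p95 > 5000  # breaches the P95 > 5s threshold
```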
For RevAgent, I built a dashboard per agent (Risk, Forecast, Hygiene, Follow-up) — each with its own latency, cost, and quality metrics. When the Forecast agent's latency doubled, the dashboard caught it before any user complained.
The Feedback Loop
The most valuable pattern: production failures → dataset → offline evaluation → fix → redeploy.
- Online evaluator flags a bad trace in production
- Add that trace to your evaluation dataset
- Write a targeted evaluator that catches the specific failure
- Fix the agent (prompt, tool, logic)
- Run offline evaluation to confirm the fix works AND doesn't break other cases
- Deploy with confidence
This is how agents get better over time. Without this loop, you're fixing symptoms. With it, you're building regression coverage.
For BandiFinder, I added every mismatched tender (user rejected the agent's recommendation) to the evaluation dataset. After 3 months, the matching accuracy improved from 72% to 91% — entirely driven by this feedback loop.
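Step 2 of the loop, promoting a flagged trace into the dataset, is a small transformation. A hedged sketch: the run fields (`corrected_output`, `id`) and the final `client.create_example` wiring are assumptions about how you store corrections, not a fixed LangSmith schema:

```python
def run_to_example(run: dict) -> dict:
    """Turn a flagged production run into a dataset example dict.

    Keeps the original inputs and records the corrected output
    (e.g. the recommendation the user actually accepted) as the target.
    """
    return {
        "inputs": run["inputs"],
        "outputs": {
            "expected": run["corrected_output"],
            "source": "production-failure",
            "run_id": run["id"],
        },
    }

flagged = {
    "id": "run-123",
    "inputs": {"deal": {"stage": "negotiation", "days_open": 90}},
    "corrected_output": {"risk_score": 0.9},
}
example = run_to_example(flagged)
# In practice, attach it to the dataset (requires a LangSmith client):
# client.create_example(dataset_id=dataset.id, **example)
```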
PII Protection in Traces
Production traces capture real user data. For GDPR compliance, mask sensitive fields:
```python
import os
from langsmith import Client

# Option 1: Mask specific fields
client = Client(
    hide_inputs=lambda inputs: redact_pii(inputs),
    hide_outputs=lambda outputs: redact_pii(outputs),
)

# Option 2: Hide everything (debugging becomes harder)
os.environ["LANGSMITH_HIDE_INPUTS"] = "true"
os.environ["LANGSMITH_HIDE_OUTPUTS"] = "true"
```

I use Option 1 — mask PII but keep the rest visible. See my GDPR-Compliant AI post for the full approach.
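The `redact_pii` helper is yours to supply. A minimal regex-based sketch, with patterns that are illustrative only (a production redactor needs a fuller set: names, IBANs, locale-specific phone formats, and so on):

```python
import re

# Illustrative patterns only; extend for your own data
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact_pii(data: dict) -> dict:
    """Recursively mask PII-looking strings before they reach the trace."""
    def scrub(value):
        if isinstance(value, str):
            for label, pattern in PII_PATTERNS.items():
                value = pattern.sub(f"[{label}-redacted]", value)
            return value
        if isinstance(value, dict):
            return {k: scrub(v) for k, v in value.items()}
        if isinstance(value, list):
            return [scrub(v) for v in value]
        return value
    return scrub(data)

masked = redact_pii({"query": "Email mario.rossi@acme.it or call +39 333 123 4567"})
```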
Production Checklist
- Separate projects per environment: `my-agent-dev`, `my-agent-staging`, `my-agent-prod`
- Evaluation datasets from day one: Don't wait for failures — seed with known test cases
- Online evaluators on all production projects: At minimum, safety + format validation
- Cost alerts: Set a daily budget alarm — one bad loop can burn through your budget
- Dashboards per agent: Don't aggregate everything — each agent has different baselines
- Feedback loop: Route failed production traces into evaluation datasets weekly
- PII masking: Redact before tracing, not after — once PII is logged, it's a compliance issue
Related Posts
- Building AI Agents with LangGraph: From Prototype to Production — the orchestration framework that generates these traces
- GDPR-Compliant AI: Building Guardrails for EU AI Act Readiness — protecting PII in traces and audit logging for compliance
Need observability for your AI agents? LangSmith is what I run on every production deployment. Get in touch or book a call.