LangSmith in Production: Observability, Evaluation, and Debugging AI Agents

Abhishek Chauhan · 6 min read

You shipped your AI agent. It works in testing. Then a user reports a wrong answer, your LLM costs spike 3x overnight, and you have no idea which tool call caused it. Without observability, production AI is a black box.

LangSmith is the observability and evaluation platform for LangChain/LangGraph agents. After running it across every agent I've shipped — BandiFinder (procurement matching), Pellemoda (inventory forecasting), RevAgent (autonomous RevOps), and Holding Morelli (compliance monitoring) — here's how to use it effectively.

Setup: Three Environment Variables

import os
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "lsv2_..."  # from your LangSmith settings
os.environ["LANGSMITH_PROJECT"] = "revagent-prod"

That's it. Every LangChain/LangGraph operation is now traced — LLM calls, tool invocations, retrieval steps, state transitions. No code changes to your agent.

Tracing: See What Your Agent Thinks

Every agent run produces a trace — a tree of operations showing exactly what happened:

Trace: "Score deal risk for Acme Corp"
├── LLM Call (gpt-4o-mini) — 340 tokens, 0.8s, $0.0003
│   └── Decision: "Need CRM data before scoring"
├── Tool Call: fetch_deal_data("acme-corp")
│   └── Result: {stage: "negotiation", amount: 85000, days_open: 45}
├── Tool Call: analyze_email_threads("acme-corp")
│   └── Result: {sentiment: 0.3, ghosting_detected: true, last_reply: "2026-03-20"}
├── LLM Call (gpt-4o-mini) — 520 tokens, 1.2s, $0.0005
│   └── Decision: "High risk — ghosting + declining sentiment"
└── Output: {risk_score: 0.82, signals: ["ghosting", "sentiment_decline", "stalled_stage"]}

You see every decision the agent made, with what context, at what cost. When something goes wrong, you don't guess — you look at the trace.

Token and Cost Breakdown

LangSmith automatically tracks token usage and cost at three levels: per-run in the trace tree, aggregated per-project in dashboards, and per-thread for conversation-level analysis. For RevAgent, this showed me that the email analysis tool was consuming 60% of total tokens — one optimization cut costs by 40%.
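The per-run dollar figures come straight from token counts multiplied by model prices, which makes any bill easy to sanity-check by hand. A minimal sketch of that arithmetic, with illustrative prices (check your provider's current pricing page) and an assumed input/output split for the two calls in the trace tree above:

```python
# Illustrative per-million-token prices in USD; not current list prices.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM call in USD."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Rough check of the two LLM calls above (the input/output split is assumed)
total = run_cost("gpt-4o-mini", 300, 40) + run_cost("gpt-4o-mini", 450, 70)
print(f"${total:.6f}")
```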

Custom Cost Tracking

For self-hosted models or custom operations, wrap the call in a traceable function and report token usage yourself:

from langsmith import traceable

@traceable(
    run_type="llm",
    metadata={"ls_model_name": "my-fine-tuned-model", "ls_provider": "custom"},
)
def call_custom_model(prompt: str) -> dict:
    # my_model is a stand-in for your self-hosted model client,
    # assumed here to return the text plus token counts
    text, usage = my_model.generate(prompt)
    # Including usage_metadata in the output lets LangSmith attribute
    # token counts (and cost, if the model has a price entry) to this run
    return {
        "content": text,
        "usage_metadata": {
            "input_tokens": usage["input_tokens"],
            "output_tokens": usage["output_tokens"],
            "total_tokens": usage["input_tokens"] + usage["output_tokens"],
        },
    }

Offline Evaluation: Test Before You Ship

Offline evaluation runs your agent against a curated dataset before deployment. This is your CI/CD gate — if evaluation scores drop, don't deploy.
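The gate itself can be a few lines in your CI script. A sketch, assuming you have collected the per-example evaluator scores from an experiment into a plain list:

```python
def passes_gate(scores: list[float], threshold: float = 0.85) -> bool:
    """Deployment gate: mean evaluator score must meet the threshold."""
    if not scores:
        return False  # no results means no deploy
    return sum(scores) / len(scores) >= threshold

# 9 passing and 1 failing boolean evaluator results -> mean score 0.9
print(passes_gate([1.0] * 9 + [0.0], threshold=0.85))  # True
```

In CI, exit nonzero when the gate fails so the deploy step never runs.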

Create a Dataset

from langsmith import Client
 
client = Client()
 
dataset = client.create_dataset("deal-risk-scoring-v2")
 
# Add examples — input + expected output
client.create_examples(
    inputs=[
        {"deal": {"stage": "negotiation", "days_open": 90, "sentiment": 0.2}},
        {"deal": {"stage": "closed_won", "days_open": 15, "sentiment": 0.9}},
        {"deal": {"stage": "discovery", "days_open": 5, "sentiment": 0.7}},
    ],
    outputs=[
        {"risk_score_range": [0.7, 1.0], "expected_signals": ["stalled", "low_sentiment"]},
        {"risk_score_range": [0.0, 0.2], "expected_signals": []},
        {"risk_score_range": [0.1, 0.4], "expected_signals": []},
    ],
    dataset_id=dataset.id,
)

Build datasets from:

  - Production traces: real inputs that already failed (the feedback loop below feeds these back in)
  - Hand-written edge cases the agent must never get wrong
  - User-rejected outputs, like the mismatched tenders in BandiFinder

Define Evaluators

Four types of evaluators:

1. Code rules — deterministic, fast:

def risk_score_in_range(run, example):
    """Check if risk score falls in expected range."""
    actual = run.outputs["risk_score"]
    expected_range = example.outputs["risk_score_range"]
    return {"score": expected_range[0] <= actual <= expected_range[1]}

2. LLM-as-judge — nuanced quality assessment:

from langsmith.evaluation import LangChainStringEvaluator
 
relevance_evaluator = LangChainStringEvaluator(
    "criteria",
    config={"criteria": "Is the risk assessment well-reasoned and based on the provided signals?"}
)

3. Human review — annotation queues for manual scoring

4. Pairwise comparison — A/B test two agent versions side by side

Run Experiments

from langsmith.evaluation import evaluate
 
results = evaluate(
    risk_scoring_agent.invoke,
    data="deal-risk-scoring-v2",
    evaluators=[risk_score_in_range, relevance_evaluator],
    experiment_prefix="risk-agent-v2.1",
    num_repetitions=3,  # Run each example 3 times for consistency
)

Compare experiments across versions. If v2.1 scores lower than v2.0, investigate before deploying.

Online Evaluation: Monitor in Production

Online evaluators run automatically on live traffic — no curated datasets needed. They catch issues in real time.

Set Up Online Evaluators

Configure rules that fire on sampled production traces: an LLM-as-judge scoring answer quality or flagging hallucinations, code rules validating output structure, or classifiers tagging traces for human review. Each evaluator attaches its score to the trace as feedback.

Sampling and Filtering

You don't need to evaluate every trace — that's expensive. Apply filters:

# Evaluate 10% of production traces
sampling_rate: 0.1

# Only evaluate traces from the risk-scoring agent
filter: agent_name == "risk_scorer"

# Only evaluate traces with latency > 5s (investigate slow runs)
filter: latency > 5000
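These rules are configured in LangSmith itself, but the selection policy is simple enough to reason about locally. A sketch of the same policy in plain Python (field names like `agent_name` and `latency_ms` are assumptions about your trace metadata):

```python
import random

def should_evaluate(trace: dict, sampling_rate: float = 0.1, slow_ms: int = 5000) -> bool:
    """Evaluate a 10% sample of risk_scorer traces, plus every slow run."""
    if trace.get("agent_name") != "risk_scorer":
        return False
    if trace.get("latency_ms", 0) > slow_ms:
        return True  # slow runs always get a look
    return random.random() < sampling_rate
```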

Alerts

Set alerts for anomalies: error-rate spikes, latency regressions, daily cost overruns, and downward trends in evaluation scores.

Dashboards: The Operations View

LangSmith dashboards aggregate traces into operational metrics:

| Metric | What to Monitor | Alert Threshold |
| --- | --- | --- |
| P50/P95 latency | Response time distribution | P95 > 5s |
| Error rate | Failed traces / total | > 5% |
| Token usage | Input + output per trace | Sudden 2x spike |
| Cost per day | Daily spend by model | Exceeds budget |
| Evaluation scores | Quality trends over time | Downward trend |
| Trace volume | Requests per hour | Unexpected drop (outage?) |

For RevAgent, I built a dashboard per agent (Risk, Forecast, Hygiene, Follow-up) — each with its own latency, cost, and quality metrics. When the Forecast agent's latency doubled, the dashboard caught it before any user complained.

The Feedback Loop

The most valuable pattern: production failures → dataset → offline evaluation → fix → redeploy.

  1. Online evaluator flags a bad trace in production
  2. Add that trace to your evaluation dataset
  3. Write a targeted evaluator that catches the specific failure
  4. Fix the agent (prompt, tool, logic)
  5. Run offline evaluation to confirm the fix works AND doesn't break other cases
  6. Deploy with confidence

This is how agents get better over time. Without this loop, you're fixing symptoms. With it, you're building regression coverage.
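Step 2 is worth scripting so that flagged runs reliably become regression cases. A sketch, assuming `client` is a langsmith `Client` and `run` is the flagged run object from the trace:

```python
def add_failure_to_dataset(client, run, dataset_name: str):
    """Turn a flagged production run into a permanent regression example."""
    dataset = client.read_dataset(dataset_name=dataset_name)
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,  # or, better, a corrected reference output
        dataset_id=dataset.id,
    )
```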

For BandiFinder, I added every mismatched tender (user rejected the agent's recommendation) to the evaluation dataset. After 3 months, the matching accuracy improved from 72% to 91% — entirely driven by this feedback loop.

PII Protection in Traces

Production traces capture real user data. For GDPR compliance, mask sensitive fields:

# Option 1: Mask specific fields
from langsmith import Client
 
client = Client(
    hide_inputs=lambda inputs: redact_pii(inputs),
    hide_outputs=lambda outputs: redact_pii(outputs),
)
 
# Option 2: Hide everything (debugging becomes harder)
os.environ["LANGSMITH_HIDE_INPUTS"] = "true"
os.environ["LANGSMITH_HIDE_OUTPUTS"] = "true"

I use Option 1 — mask PII but keep the rest visible. See my GDPR-Compliant AI post for the full approach.
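The `redact_pii` helper above is yours to write. A minimal regex-based sketch that masks emails and phone-like numbers (a real deployment should use a dedicated PII library such as Presidio):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(payload: dict) -> dict:
    """Recursively mask email addresses and phone-like numbers in string values."""
    def scrub(value):
        if isinstance(value, str):
            return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", value))
        if isinstance(value, dict):
            return {k: scrub(v) for k, v in value.items()}
        if isinstance(value, list):
            return [scrub(v) for v in value]
        return value
    return scrub(payload)
```

Plug it into the `hide_inputs`/`hide_outputs` hooks shown above and the masked version is what reaches LangSmith.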

Production Checklist

  1. Tracing enabled, with a separate LangSmith project per environment
  2. PII masking on trace inputs and outputs
  3. A curated dataset and offline evaluation gate wired into CI
  4. Online evaluators sampling live traffic
  5. Alerts on error rate, latency, cost, and evaluation scores
  6. A dashboard per agent
  7. A feedback loop turning flagged traces into regression examples


Need observability for your AI agents? LangSmith is what I run on every production deployment. Get in touch or book a call.