You shipped your AI agent. It works in testing. Then a user reports a wrong answer, your LLM costs spike 3x overnight, and you have no idea which tool call caused it. Without observability, production AI is a black box.
LangSmith is the observability and evaluation platform for LangChain/LangGraph agents. After running it across every agent I've shipped — BandiFinder (procurement matching), Pellemoda (inventory forecasting), RevAgent (autonomous RevOps), and Holding Morelli (compliance monitoring) — here's how to use it effectively.
Setup: Two Lines
```python
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "revagent-prod"
```

That's it. Every LangChain/LangGraph operation is now traced — LLM calls, tool invocations, retrieval steps, state transitions. No code changes to your agent.
Tracing: See What Your Agent Thinks
Every agent run produces a trace — a tree of operations showing exactly what happened:
```
Trace: "Score deal risk for Acme Corp"
├── LLM Call (gpt-4o-mini) — 340 tokens, 0.8s, $0.0003
│   └── Decision: "Need CRM data before scoring"
├── Tool Call: fetch_deal_data("acme-corp")
│   └── Result: {stage: "negotiation", amount: 85000, days_open: 45}
├── Tool Call: analyze_email_threads("acme-corp")
│   └── Result: {sentiment: 0.3, ghosting_detected: true, last_reply: "2026-03-20"}
├── LLM Call (gpt-4o-mini) — 520 tokens, 1.2s, $0.0005
│   └── Decision: "High risk — ghosting + declining sentiment"
└── Output: {risk_score: 0.82, signals: ["ghosting", "sentiment_decline", "stalled_stage"]}
```
You see every decision the agent made, with what context, at what cost. When something goes wrong, you don't guess — you look at the trace.
Token and Cost Breakdown
LangSmith automatically tracks costs across three categories:
- Input tokens: Prompt tokens sent to the model (including cache reads, image tokens)
- Output tokens: Generated response tokens (including reasoning tokens)
- Other: Tool calls, retrieval steps, custom runs
Costs are broken down per-run in the trace tree, aggregated per-project in dashboards, and viewable per-thread for conversation-level analysis. For RevAgent, this showed me that the email analysis tool was consuming 60% of total tokens — one optimization cut costs by 40%.
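That 60% figure came from aggregating token counts across child runs. A minimal sketch of the same analysis, assuming you have already pulled child-run records (e.g. via `client.list_runs`) and reduced each to a name plus token count (the field names here are illustrative):

```python
from collections import defaultdict

def token_share_by_step(runs: list[dict]) -> dict[str, float]:
    """Compute each step's share of total tokens across a set of runs.

    `runs` is assumed to be a list of child-run records reduced to
    {"name": ..., "total_tokens": ...} dicts.
    """
    totals = defaultdict(int)
    for run in runs:
        totals[run["name"]] += run["total_tokens"]
    grand_total = sum(totals.values()) or 1  # avoid division by zero
    return {name: tokens / grand_total for name, tokens in totals.items()}

# Example: spotting a token-hungry tool
runs = [
    {"name": "analyze_email_threads", "total_tokens": 600},
    {"name": "fetch_deal_data", "total_tokens": 150},
    {"name": "score_risk_llm", "total_tokens": 250},
]
shares = token_share_by_step(runs)
# analyze_email_threads accounts for 0.6 of all tokens
```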
Custom Cost Tracking
For self-hosted models or custom operations, submit costs manually:
```python
from langsmith import traceable

@traceable(
    run_type="llm",
    metadata={"ls_model_name": "my-fine-tuned-model", "ls_provider": "custom"},
)
def call_custom_model(prompt: str) -> str:
    # LangSmith tracks this as a traced LLM call
    result = my_model.generate(prompt)
    return result
```

Offline Evaluation: Test Before You Ship
Offline evaluation runs your agent against a curated dataset before deployment. This is your CI/CD gate — if evaluation scores drop, don't deploy.
Create a Dataset
```python
from langsmith import Client

client = Client()
dataset = client.create_dataset("deal-risk-scoring-v2")

# Add examples — input + expected output
client.create_examples(
    inputs=[
        {"deal": {"stage": "negotiation", "days_open": 90, "sentiment": 0.2}},
        {"deal": {"stage": "closed_won", "days_open": 15, "sentiment": 0.9}},
        {"deal": {"stage": "discovery", "days_open": 5, "sentiment": 0.7}},
    ],
    outputs=[
        {"risk_score_range": [0.7, 1.0], "expected_signals": ["stalled", "low_sentiment"]},
        {"risk_score_range": [0.0, 0.2], "expected_signals": []},
        {"risk_score_range": [0.1, 0.4], "expected_signals": []},
    ],
    dataset_id=dataset.id,
)
```

Build datasets from:
- Manually curated test cases — edge cases you know matter
- Historical production traces — real queries that failed or succeeded
- Synthetic data — LLM-generated variations for coverage
Define Evaluators
Four types of evaluators:
1. Code rules — deterministic, fast:
```python
def risk_score_in_range(run, example):
    """Check if the risk score falls in the expected range."""
    actual = run.outputs["risk_score"]
    expected_range = example.outputs["risk_score_range"]
    return {"score": expected_range[0] <= actual <= expected_range[1]}
```

2. LLM-as-judge — nuanced quality assessment:
```python
from langsmith.evaluation import LangChainStringEvaluator

relevance_evaluator = LangChainStringEvaluator(
    "criteria",
    config={"criteria": "Is the risk assessment well-reasoned and based on the provided signals?"},
)
```

3. Human review — annotation queues for manual scoring
4. Pairwise comparison — A/B test two agent versions side by side
Run Experiments
```python
from langsmith.evaluation import evaluate

results = evaluate(
    risk_scoring_agent.invoke,
    data="deal-risk-scoring-v2",
    evaluators=[risk_score_in_range, relevance_evaluator],
    experiment_prefix="risk-agent-v2.1",
    num_repetitions=3,  # run each example 3 times for consistency
)
```

Compare experiments across versions. If v2.1 scores lower than v2.0, investigate before deploying.
Online Evaluation: Monitor in Production
Online evaluators run automatically on live traffic — no datasets needed. They catch issues in real-time.
Set Up Online Evaluators
Configure rules that fire on every production trace:
- Safety checks: Flag responses containing PII, toxic content, or prompt injection attempts
- Format validation: Verify structured outputs match expected schemas
- Quality heuristics: Check response length, language consistency, citation presence
- LLM-as-judge: Score quality on a sample of production traces (with sampling to control costs)
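Format validation is the cheapest of these to run on every trace. A hedged sketch of what such a check might look like for the risk scorer above (the key names and expected ranges are assumptions about this agent's output contract, not a LangSmith API):

```python
def valid_risk_output(outputs: dict) -> dict:
    """Hypothetical format check for the risk scorer's structured output.

    Verifies the keys and value ranges a downstream pipeline would expect,
    and returns a feedback-style dict with a 0/1 score.
    """
    ok = (
        isinstance(outputs.get("risk_score"), (int, float))
        and 0.0 <= outputs["risk_score"] <= 1.0
        and isinstance(outputs.get("signals"), list)
    )
    return {"key": "format_valid", "score": int(ok)}

good = valid_risk_output({"risk_score": 0.82, "signals": ["ghosting"]})
bad = valid_risk_output({"risk_score": "high"})
```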
Sampling and Filtering
You don't need to evaluate every trace — that's expensive. Apply filters:
```
# Evaluate 10% of production traces
sampling_rate: 0.1

# Only evaluate traces from the risk-scoring agent
filter: agent_name == "risk_scorer"

# Only evaluate traces with latency > 5s (investigate slow runs)
filter: latency > 5000
```
Alerts
Set alerts for anomalies:
- Average latency exceeds 3s → Slack notification
- Error rate exceeds 5% → PagerDuty alert
- Cost per trace exceeds $0.05 → Email warning
- Evaluation score drops below 0.7 → Block deployment
Dashboards: The Operations View
LangSmith dashboards aggregate traces into operational metrics:
| Metric | What to Monitor | Alert Threshold |
|---|---|---|
| P50/P95 latency | Response time distribution | P95 > 5s |
| Error rate | Failed traces / total | > 5% |
| Token usage | Input + output per trace | Sudden 2x spike |
| Cost per day | Daily spend by model | Exceeds budget |
| Evaluation scores | Quality trends over time | Downward trend |
| Trace volume | Requests per hour | Unexpected drop (outage?) |
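LangSmith computes these percentiles for you, but it helps to know what the numbers mean. A stdlib sketch of the nearest-rank P50/P95 over trace latencies (the sample values are illustrative):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (p in [0, 100]) over a list of latencies."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [800, 950, 1100, 1200, 1400, 1600, 2100, 2500, 4800, 6200]
p50 = percentile(latencies_ms, 50)  # 1400
p95 = percentile(latencies_ms, 95)  # 6200
alert = p95 > 5000  # breaches the P95 > 5s threshold
```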
For RevAgent, I built a dashboard per agent (Risk, Forecast, Hygiene, Follow-up) — each with its own latency, cost, and quality metrics. When the Forecast agent's latency doubled, the dashboard caught it before any user complained.
The Feedback Loop
The most valuable pattern: production failures → dataset → offline evaluation → fix → redeploy.
- Online evaluator flags a bad trace in production
- Add that trace to your evaluation dataset
- Write a targeted evaluator that catches the specific failure
- Fix the agent (prompt, tool, logic)
- Run offline evaluation to confirm the fix works AND doesn't break other cases
- Deploy with confidence
This is how agents get better over time. Without this loop, you're fixing symptoms. With it, you're building regression coverage.
For BandiFinder, I added every mismatched tender (user rejected the agent's recommendation) to the evaluation dataset. After 3 months, the matching accuracy improved from 72% to 91% — entirely driven by this feedback loop.
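Step 2 of the loop, promoting a flagged trace into the dataset, is a small transformation. A hedged sketch: the run fields (`corrected_output`, `id`) and the final `client.create_example` wiring are assumptions about how you store corrections, not a fixed LangSmith schema:

```python
def run_to_example(run: dict) -> dict:
    """Turn a flagged production run into a dataset example dict.

    Keeps the original inputs and records the corrected output
    (e.g. the recommendation the user actually accepted) as the target.
    """
    return {
        "inputs": run["inputs"],
        "outputs": {
            "expected": run["corrected_output"],
            "source": "production-failure",
            "run_id": run["id"],
        },
    }

flagged = {
    "id": "run-123",
    "inputs": {"deal": {"stage": "negotiation", "days_open": 90}},
    "corrected_output": {"risk_score": 0.9},
}
example = run_to_example(flagged)
# In practice, attach it to the dataset (requires a LangSmith client):
# client.create_example(dataset_id=dataset.id, **example)
```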
PII Protection in Traces
Production traces capture real user data. For GDPR compliance, mask sensitive fields:
```python
import os
from langsmith import Client

# Option 1: Mask specific fields
client = Client(
    hide_inputs=lambda inputs: redact_pii(inputs),
    hide_outputs=lambda outputs: redact_pii(outputs),
)

# Option 2: Hide everything (debugging becomes harder)
os.environ["LANGSMITH_HIDE_INPUTS"] = "true"
os.environ["LANGSMITH_HIDE_OUTPUTS"] = "true"
```

I use Option 1 — mask PII but keep the rest visible. See my GDPR-Compliant AI post for the full approach.
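The `redact_pii` helper is yours to supply. A minimal regex-based sketch, with patterns that are illustrative only (a production redactor needs a fuller set: names, IBANs, locale-specific phone formats, and so on):

```python
import re

# Illustrative patterns only; extend for your own data
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact_pii(data: dict) -> dict:
    """Recursively mask PII-looking strings before they reach the trace."""
    def scrub(value):
        if isinstance(value, str):
            for label, pattern in PII_PATTERNS.items():
                value = pattern.sub(f"[{label}-redacted]", value)
            return value
        if isinstance(value, dict):
            return {k: scrub(v) for k, v in value.items()}
        if isinstance(value, list):
            return [scrub(v) for v in value]
        return value
    return scrub(data)

masked = redact_pii({"query": "Email mario.rossi@acme.it or call +39 333 123 4567"})
```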
Production Checklist
- Separate projects per environment: `my-agent-dev`, `my-agent-staging`, `my-agent-prod`
- Evaluation datasets from day one: Don't wait for failures — seed with known test cases
- Online evaluators on all production projects: At minimum, safety + format validation
- Cost alerts: Set a daily budget alarm — one bad loop can burn through your budget
- Dashboards per agent: Don't aggregate everything — each agent has different baselines
- Feedback loop: Route failed production traces into evaluation datasets weekly
- PII masking: Redact before tracing, not after — once PII is logged, it's a compliance issue
Related Posts
- Building AI Agents with LangGraph: From Prototype to Production — the orchestration framework that generates these traces
- GDPR-Compliant AI: Building Guardrails for EU AI Act Readiness — protecting PII in traces and audit logging for compliance
Need observability for your AI agents? LangSmith is what I run on every production deployment. Get in touch or book a call.