A fraud detection agent ships with 94% precision. Six weeks later it's at 81%. The model hasn't changed. The training data hasn't changed. The prompts haven't changed. Every unit test still passes.

The agent is just making worse decisions every week, and the only signal is a slow decline in business KPIs that someone notices on a Tuesday.

This is agent drift. Until recently, "drift" in the context of AI agents was a vague complaint with no formal definition, no measurement, and no agreed taxonomy. A January 2026 paper - Measuring and Mitigating Agent Drift in LLM-Based Systems - changed that by introducing the Agent Stability Index (ASI), a composite metric framework that quantifies drift across 12 behavioral dimensions. Around the same time, two reliability benchmarks dropped data that reframes how bad this problem actually is: GPT-4o goes from 61% accuracy on a single attempt to 25% across 8 repeated runs (ReliabilityBench, arXiv:2601.06112), and even the best models show only 73% consistency on paraphrased questions (arXiv:2602.16666).

This post covers the taxonomy, the math, the statistical detection methods, and working Python code you can drop into your agent today. The code that uses the Syrin SDK only does what the SDK actually does - no vaporware.

What is Agent Drift?

Agent drift is the progressive degradation of AI agent behavior, decision quality, and inter-agent coherence over extended interactions in production. Unlike traditional software bugs, agent drift doesn't throw exceptions or trigger alerts. The agent keeps responding - HTTP 200, latency normal, error rate zero - while its outputs silently become wrong.

Agent accuracy declines over 30 turns while the error rate stays flat at zero. The gap between the two lines is where agent drift lives.

The chart above is the core problem: your monitoring is green while your agent is degrading. Error rate is a lagging indicator at best, and for agent drift it's useless - the agent isn't failing, it's producing plausible-but-wrong outputs.

Agent Drift vs Classical ML Drift

Classical ML monitoring has three well-understood drift types:

Data drift (covariate shift) - input distributions change while the feature-target relationship holds. Fix: retrain or re-engineer features.
Concept drift - the relationship between inputs and outputs changes. Fraudsters adapt, customer preferences shift. Fix: retrain on recent samples.
Label drift (prior probability shift) - class proportions move, shifting the optimal decision threshold. Fix: rebalance, retrain.

Agent drift is the fourth type that classical monitoring misses entirely. The model itself may be unchanged, but the derived context the agent reads at decision time - RAG retrievals, conversation history, tool responses, session state - has degraded.

Classical monitoring instruments the model. Agent drift happens downstream, in the agent logic and tools/context layer that classical tools never touch.

Drift type	What changes	Root cause	Fix
Data drift	Input distributions	New traffic, upstream pipeline changes	Retrain, feature re-engineering
Concept drift	Input-output relationship	External factors, adversarial adaptation	Retrain on recent samples
Label drift	Target distribution	Class proportions fluctuate	Rebalance, retrain
Agent drift	What the agent reads and does at decision time	Stale context, prompt degradation, goal distribution shift, tool changes	Runtime detection + control

The practical implication: you can have clean Datadog dashboards, green Grafana panels, and 0% error rates while your agent's task completion rate is quietly collapsing. Classical monitoring instruments the model. Agent drift happens in the layers downstream - context retrieval, tool chains, session behavior - that classical tools never see.

Six Types of Agent Drift

The ASI paper identifies three primary types (semantic, behavioral, coordination). Production adds three more (goal, context, prompt). Each attacks a different layer of the agent stack:

The six drift types mapped to the agent pipeline: Goal Drift at the user goal layer, Context Drift at retrieval, Prompt Drift at instructions, Behavioral Drift and Semantic Drift at LLM reasoning, and Coordination Drift at inter-agent handoffs.

1. Semantic drift - The agent deviates from its original instructions. At turn 3, it follows refund policy correctly. By turn 15, after emotional escalation in the conversation, it starts making exceptions nobody authorized. The system prompt didn't change - recent context gradually overrode it.

2. Behavioral drift - The agent develops unintended strategies. A support agent gives increasingly verbose responses because longer answers correlated with higher satisfaction scores in training. Thorough at turn 5 becomes rambling by turn 20.

3. Coordination drift - In multi-agent systems, handoff quality degrades. The router becomes less precise about which specialist to call. Each individual agent passes its own evals; the system's coherence erodes at the seams.

4. Goal drift - The distribution of task types in production diverges from the eval distribution. An agent evaluated on balanced query types hits production traffic skewed 45% toward one category. Accuracy on underrepresented types quietly degrades.

5. Context drift - Retrieved context degrades over time. Knowledge bases go stale. APIs change response formats. Vector embeddings shift as a domain evolves. The model is unchanged - the world it reads has moved.

6. Prompt drift - Instruction templates evolve inconsistently across team members without version control. Small phrasing changes produce dramatically different behavior, compounding over weeks.

Drift Type	What Degrades	Signal to watch
Semantic	Instruction adherence	Compliance score drops after turn 10
Behavioral	Response strategy	Response length 3× by turn 20
Coordination	Multi-agent handoffs	Handoff accuracy drops
Goal	Task distribution match	Chi-squared p < 0.05 vs eval distribution
Context	Retrieved data quality	Document freshness exceeds TTL
Prompt	Template consistency	Output variance increases week-over-week

The Agent Stability Index: 12 Dimensions

The ASI framework measures drift across 12 behavioral dimensions. Don't try to track all 12 on day one - start with the three highest-signal ones.

Dimension	What It Measures	Priority
Confidence calibration	Stated certainty vs actual accuracy	Start here
Goal adherence	Does the agent stay on task?	Start here
Tool usage patterns	Which tools called and in what sequence	Start here
Response consistency	Same input → same output over time	Medium
Reasoning pathway stability	Chain-of-thought stability	Medium
Policy compliance	Does the agent follow its rules?	Medium
Inter-agent agreement	Do agents in a fleet converge?	Medium
Output format stability	Response structure consistency	Low
Latency patterns	Processing time changes	Low
Token efficiency	Tokens per successful outcome	Low
Error recovery patterns	How failures are handled	Low
Context utilization	Use of available context	Low

⚠Warning

Confidence calibration is the most predictive single dimension. When an agent's stated confidence increases while its actual accuracy decreases, a production failure is typically 3–5 days away. Track this first.
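One way to track this is to log each decision's stated confidence next to its later-verified outcome and watch the gap week over week. The sketch below is a minimal illustration, not the ASI paper's method: the `Outcome` record, the function names, and the three weeks of sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    confidence: float  # agent's stated confidence in [0, 1]
    correct: bool      # verified after the fact (review, downstream signal)


def calibration_gap(outcomes: list[Outcome]) -> float:
    """Mean stated confidence minus observed accuracy.

    Positive values mean the agent is overconfident.
    """
    if not outcomes:
        raise ValueError("need at least one verified outcome")
    mean_conf = sum(o.confidence for o in outcomes) / len(outcomes)
    accuracy = sum(o.correct for o in outcomes) / len(outcomes)
    return mean_conf - accuracy


def gap_is_widening(weekly_gaps: list[float], min_weeks: int = 3) -> bool:
    """True if the gap grew at every step across the last `min_weeks` values."""
    recent = weekly_gaps[-min_weeks:]
    if len(recent) < min_weeks:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical data: confidence creeps up while accuracy slides down.
week1 = [Outcome(0.90, True)] * 9 + [Outcome(0.90, False)] * 1
week2 = [Outcome(0.92, True)] * 8 + [Outcome(0.92, False)] * 2
week3 = [Outcome(0.95, True)] * 7 + [Outcome(0.95, False)] * 3

gaps = [calibration_gap(week) for week in (week1, week2, week3)]
print([f"{gap:+.2f}" for gap in gaps])
print("widening:", gap_is_widening(gaps))
```

The gap here is a single number per week, which is easy to alert on. A binned calibration curve would tell you more, at the cost of needing more verified outcomes per bin.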

The Math of Compounding Drift

The February 2026 reliability paper (arXiv:2602.16666) found that reliability improves at roughly half the rate of accuracy improvements. In customer service contexts, it's one-seventh. That 15% accuracy gain from upgrading your model? Expect maybe 2% reliability improvement in production.

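ReliabilityBench gives the starkest numbers: GPT-4o succeeds 61% of the time on a single attempt but only 25% of the time when all 8 of 8 repeated runs must succeed. That second figure is the all-runs-succeed metric, usually written pass^k; under the more familiar pass@k, where one success in k attempts is enough, the score could only rise with more runs. Every demo you run is pass^1. Production is pass^1000.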
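The difference between "at least one of k runs succeeds" and "all k runs succeed" is easy to see in code. The sketch below uses the standard combinatorial estimators for both, given `c` successes observed in `n` recorded runs of the same task. The function names and the 5-of-8 sample are illustrative assumptions, not taken from ReliabilityBench.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that at least one of k sampled runs succeeds."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that all k sampled runs succeed (pass^k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Hypothetical task: 5 successes out of 8 recorded runs.
n, c = 8, 5
for k in (1, 2, 4, 8):
    print(f"k={k}  pass@k={pass_at_k(n, c, k):.3f}  pass^k={pass_hat_k(n, c, k):.3f}")
# k=1  pass@k=0.625  pass^k=0.625
# k=2  pass@k=0.893  pass^k=0.357
# k=4  pass@k=1.000  pass^k=0.071
# k=8  pass@k=1.000  pass^k=0.000
```

Both metrics start from the same single-attempt score and then move in opposite directions. For an agent that runs unattended, the second column is the one that matches what users experience.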

Here's how this compounds in a multi-component agent system:

python

def system_reliability_with_drift(components, drift_rate, num_turns):
    """
    Compute system-level reliability at each turn given per-component
    drift. Components multiply because they're sequential dependencies.
 
    components: list of base reliability values per component
    drift_rate: reliability loss per turn (1% = 0.01)
    """
    reliabilities = []
    for turn in range(1, num_turns + 1):
        r = 1.0
        for base in components:
            r *= max(base * (1 - drift_rate * turn), 0)
        reliabilities.append(r)
    return reliabilities
 
# A typical 3-component agent: LLM call (90%), tool execution (85%), memory (97%)
# 1% drift rate per turn - conservative
curve = system_reliability_with_drift([0.90, 0.85, 0.97], 0.01, 30)
 
# Turn 1:  72.0%
# Turn 10: 54.1%
# Turn 20: 38.0%
# Turn 30: 25.5%

System reliability collapses from 72.0% at turn 1 to 25.5% at turn 30 while the error rate stays flat near zero the entire time.

The 1% per-turn drift rate is conservative. A December 2025 paper on long-horizon agent tasks (arXiv:2512.12791) measured memory recall dropping to 13.1% in complex scenarios. At that rate, you're below useful reliability within 5 turns.
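Written out, the `system_reliability_with_drift` function above computes the following. The symbols are introduced here for illustration: b_i is the base reliability of component i, δ is the per-turn drift rate, t is the turn number, and n is the number of sequential components.

```latex
R(t) = \prod_{i=1}^{n} \max\bigl(b_i\,(1 - \delta t),\, 0\bigr)
     = (1 - \delta t)^{n} \prod_{i=1}^{n} b_i
     \qquad \text{for } \delta t \le 1
```

The exponent n is where the compounding comes from: the same per-component loss is paid once for every sequential dependency. At δt = 0.1, a three-component chain keeps 0.9³ ≈ 0.73 of its undrifted reliability, not 0.9.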

A Real Example: Goal Drift in an Order Management Agent

This table shows what happens when production traffic doesn't match your eval distribution - no model changes, no code changes, just a shift in who's using the agent and what they're asking for.

Task Type	Eval Dist	Eval Acc	Month 1 Dist	Month 1 Acc	Month 2 Dist	Month 2 Acc
Query status	25%	95%	20%	95%	15%	81%
Update order	25%	70%	45%	70%	40%	70%
Cancel order	25%	98%	25%	98%	20%	98%
Return/refund	25%	99%	10%	99%	15%	99%
Other (unsupported)	0%	-	0%	-	10%	0%
Overall	100%	90.5%	100%	84.9%	100%	74.6%

Month 1: Production skews toward update requests (45% vs 25% in eval). Since update accuracy is 70%, the weighted overall drops to 84.9%. The eval score is still 90.5%.

Month 2: An "Other" category appears - requests the agent was never designed to handle. "Query status" accuracy drops from 95% to 81% because the knowledge base wasn't refreshed. Overall: 74.6%.

Your eval says 90.5%. Production is at 74.6%. That gap is agent drift.

Why Standard Evals Don't Catch This

Three structural reasons evals miss drift:

Evals are short. Standard benchmarks test 3–5 turn conversations. An agent that drifts at turn 20 scores perfectly in eval. This isn't a flaw you can patch - it's a fundamental mismatch between how evals are designed and how production agents run.

Evals are path-independent. Drift is path-dependent. A conversation with emotional escalation, topic-switching, or ambiguous requests triggers drift that a clean eval query never would. You'd need to reproduce exact production conversation paths to catch it in eval.

Evals only check outputs. IBM's research on process drift (via their AI Safety and Governance work) found that after a model update, agents can skip critical steps - compliance checks, validation calls - while producing outputs that look plausible. You're passing eval on the final text while the decision chain is broken.

✖Error

Eval score: 90.5%. Production accuracy: 74.6%. No eval catches this because no eval runs long enough, with real traffic distributions, tracking the full tool call sequence.
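The overall figures in the order management example are traffic-weighted averages: each task type's accuracy multiplied by its share of requests. The sketch below shows that calculation. The per-type accuracies are the eval column from the table; the production mix is a hypothetical one chosen for illustration, and the function name is an assumption.

```python
def weighted_accuracy(mix: dict[str, float], accuracy: dict[str, float]) -> float:
    """Overall accuracy = sum of (traffic share x per-type accuracy)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    # Task types missing from `accuracy` (unsupported requests) score 0.
    return sum(share * accuracy.get(task, 0.0) for task, share in mix.items())


# Per-type eval accuracies from the order management table.
accuracy = {"query": 0.95, "update": 0.70, "cancel": 0.98, "return": 0.99}

eval_mix = {"query": 0.25, "update": 0.25, "cancel": 0.25, "return": 0.25}
# Hypothetical production mix: heavy on updates, 10% unsupported requests.
prod_mix = {"query": 0.10, "update": 0.50, "cancel": 0.20, "return": 0.10, "other": 0.10}

print(f"eval: {weighted_accuracy(eval_mix, accuracy):.1%}")  # eval: 90.5%
print(f"prod: {weighted_accuracy(prod_mix, accuracy):.1%}")  # prod: 74.0%
```

No per-type accuracy changed between the two lines. The whole gap comes from the mix, which is why re-running the eval suite cannot reveal it.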

How to Detect Drift: Statistical Methods

Distribution-level tests

Kolmogorov-Smirnov test - nonparametric test for whether two continuous distributions are the same. Use for response length, token count, latency shifts week-over-week.

Chi-squared test - for categorical distributions. Compare task type counts in production against the eval set. A p-value below 0.05 means the difference is unlikely to be sampling noise. Run it on raw counts, not percentages, because the result depends on sample size. This is your primary signal for goal drift.

Population Stability Index (PSI) - industry-standard from credit risk modeling. PSI < 0.1 = stable, 0.1–0.2 = monitor closely, > 0.2 = action required. Use for prompt template usage patterns.

KL divergence - measures how far a distribution has moved from a reference. Use for embedding-space drift in RAG systems: compare retrieval vector distributions week-over-week.
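PSI and KL divergence both reduce to a few lines once the data is binned. The sketch below is a dependency-free version over matching bin counts. The function names, the `eps` floor for empty bins, and the sample counts (read as usage of four prompt templates) are assumptions for illustration.

```python
import math


def _proportions(counts: list[int], eps: float = 1e-6) -> list[float]:
    """Turn bin counts into proportions, floored at eps so logs stay finite."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one observation")
    return [max(count / total, eps) for count in counts]


def psi(reference: list[int], current: list[int]) -> float:
    """Population Stability Index over matching bins."""
    if len(reference) != len(current):
        raise ValueError("reference and current must use the same bins")
    ref, cur = _proportions(reference), _proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def kl_divergence(reference: list[int], current: list[int]) -> float:
    """KL(current || reference) in nats over matching bins."""
    if len(reference) != len(current):
        raise ValueError("reference and current must use the same bins")
    ref, cur = _proportions(reference), _proportions(current)
    return sum(c * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical weekly usage counts for four prompt templates.
last_month = [25, 25, 25, 25]
this_week = [20, 45, 25, 10]

print(f"PSI: {psi(last_month, this_week):.3f}")            # PSI: 0.266
print(f"KL:  {kl_divergence(last_month, this_week):.3f}")  # KL:  0.128
```

A PSI of 0.266 lands in the "action required" band from the thresholds above. PSI is symmetric in its two arguments and KL is not, so keep the reference window fixed when you compare KL values across weeks.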

Behavioral-level signals

Confidence calibration - log stated confidence vs verified accuracy on a rolling basis. A widening gap is your earliest warning signal.
Tool call sequence validation - log the expected tool call order per task type and flag deviations. This catches process drift that output-only monitoring misses entirely.
Session completion rate - track successful outcomes over time. The curve bends before users start complaining.
Response consistency - run a fixed set of baseline queries weekly, measure output similarity. Cosine similarity below 0.85 on your baseline set warrants investigation.
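Of these signals, tool call sequence validation is the most mechanical to implement. The sketch below checks each observed sequence against a set of expected ones and separately flags skipped mandatory steps. The task types, tool names, and function names are hypothetical placeholders for your own.

```python
# Hypothetical expected sequences per task type; tool names are illustrative.
EXPECTED_SEQUENCES = {
    "refund": [("lookup_order", "check_policy", "issue_refund")],
    "status": [("lookup_order",), ("lookup_order", "track_shipment")],
}
# Steps that must never be skipped, whatever order the agent picks.
REQUIRED_TOOLS = {"refund": {"check_policy"}}


def check_tool_sequence(task_type: str, observed: list[str]) -> dict:
    """Compare one observed tool call sequence against the expected ones."""
    allowed = EXPECTED_SEQUENCES.get(task_type, [])
    missing = sorted(REQUIRED_TOOLS.get(task_type, set()) - set(observed))
    known = tuple(observed) in allowed
    return {
        "known_sequence": known,
        "missing_required": missing,
        "flag": (not known) or bool(missing),
    }


def deviation_rate(samples: list[tuple[str, list[str]]]) -> float:
    """Fraction of (task_type, tool sequence) samples that were flagged."""
    if not samples:
        return 0.0
    flagged = sum(check_tool_sequence(task, seq)["flag"] for task, seq in samples)
    return flagged / len(samples)


window = [
    ("refund", ["lookup_order", "check_policy", "issue_refund"]),
    ("refund", ["lookup_order", "issue_refund"]),  # skipped the policy check
    ("status", ["lookup_order"]),
    ("status", ["lookup_order", "track_shipment"]),
]
print(check_tool_sequence(*window[1]))
print(deviation_rate(window))  # 0.25
```

Tracking `deviation_rate` per task type over time gives a process-level signal that does not depend on judging the final text.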

Context instrumentation

Freshness at read time - log the timestamp of retrieved documents at decision time. Track p99 freshness. If context is 30 minutes old for a sub-second decision in a rapidly changing domain, that's context drift.
Tool response schema validation - validate tool response schemas before the agent acts on them. APIs change their response format without warning and without breaking your HTTP calls.
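Both context checks can run inline, just before the agent acts. The sketch below assumes each retrieved document is a dict with an `id` and a timezone-aware `updated_at`, and that a tool schema can be written as a mapping from field name to Python type. Those shapes, the function names, and the sample data are assumptions, not a prescribed format.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_documents(docs: list[dict], ttl: timedelta,
                    now: Optional[datetime] = None) -> list[str]:
    """Return ids of retrieved documents older than `ttl` at read time."""
    now = now or datetime.now(timezone.utc)
    return [doc["id"] for doc in docs if now - doc["updated_at"] > ttl]


def schema_violations(payload: dict, schema: dict[str, type]) -> list[str]:
    """List missing fields and type mismatches in a tool response."""
    problems = []
    for field, expected in schema.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


# Hypothetical retrieval result and tool response.
now = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
docs = [
    {"id": "refund-policy", "updated_at": now - timedelta(days=2)},
    {"id": "shipping-times", "updated_at": now - timedelta(days=45)},
]
print(stale_documents(docs, ttl=timedelta(days=30), now=now))
# ['shipping-times']

ORDER_SCHEMA = {"order_id": str, "status": str, "total": float}
payload = {"order_id": "A-1001", "status": "shipped", "total": "19.99"}
print(schema_violations(payload, ORDER_SCHEMA))
# ['total: expected float, got str']
```

The second case is the one that slips past HTTP-level monitoring: the call returns 200 and the field is present, but it has changed from a number to a string.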

Measuring Drift in Code

The code below is generic Python - no Syrin dependency. These are the statistical methods you'd run as part of a weekly monitoring job.

Establish a behavioral baseline

Run your most representative queries against the agent multiple times and record the outputs, confidence scores, tool sequences, and token counts. This is your ground truth for future drift comparisons.

python

BASELINE_QUERIES = [
    "What's your refund policy?",
    "My order hasn't arrived yet",
    "Can I change my subscription plan?",
    "I was charged twice for the same order",
]
 
def establish_baseline(queries, runs_per_query=10):
    baseline = {}
    for query in queries:
        responses = []
        for _ in range(runs_per_query):
            response = run_agent(query)
            responses.append({
                "text": response.text,
                "confidence": response.confidence,
                "tools_used": response.tools_called,
                "tool_call_order": response.tool_sequence,
                "tokens": response.token_count,
            })
        baseline[query] = responses
    return baseline

Detect goal drift with chi-squared test

python

from scipy.stats import chi2_contingency
import numpy as np
 
def detect_goal_drift(eval_dist: dict, prod_dist: dict, threshold=0.05) -> dict:
    """
    Compare task type distributions between eval and production.
    Returns drift_detected=True if distributions are meaningfully different.
    """
    categories = sorted(set(eval_dist) | set(prod_dist))
    eval_counts = [eval_dist.get(c, 0) for c in categories]
    prod_counts = [prod_dist.get(c, 0) for c in categories]
 
    _, p_value, _, _ = chi2_contingency(np.array([eval_counts, prod_counts]))
    return {"p_value": round(p_value, 4), "drift_detected": p_value < threshold}
 
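# Eval was balanced. Production skewed heavily toward updates.
# Pass raw request counts, not percentages: the chi-squared statistic
# scales with sample size, so the same proportions at n=1000 give a
# much smaller p-value than at n=100.
result = detect_goal_drift(
    eval_dist={"query": 25, "update": 25, "cancel": 25, "return": 25},
    prod_dist={"query": 20, "update": 45, "cancel": 25, "return": 10},
)
# → {"p_value": 0.0053, "drift_detected": True}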

Measure behavioral drift against baseline

python

from difflib import SequenceMatcher
 
def measure_drift(baseline: dict, current: dict) -> dict:
    report = {}
    for query, base_responses in baseline.items():
        curr = current.get(query, [])
        if not curr:
            continue
 
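        # Semantic drift: how similar are current responses to baseline?
        # SequenceMatcher compares characters, so this is a lexical proxy;
        # swap in embedding cosine similarity for a true semantic comparison.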
        sim = [
            max(SequenceMatcher(None, c["text"], b["text"]).ratio()
                for b in base_responses)
            for c in curr
        ]
 
        # Confidence drift: are confidence scores moving?
        base_conf = sum(r["confidence"] for r in base_responses) / len(base_responses)
        curr_conf = sum(r["confidence"] for r in curr) / len(curr)
 
        # Process drift: is the tool call sequence changing?
        base_seqs = [r.get("tool_call_order", []) for r in base_responses]
        curr_seqs = [r.get("tool_call_order", []) for r in curr]
        seq_match = [any(cs == bs for bs in base_seqs) for cs in curr_seqs]
 
        report[query] = {
            "semantic_drift":   round(1 - sum(sim) / len(sim), 3),
            "confidence_drift": round(curr_conf - base_conf, 3),
            "process_drift":    round(1 - sum(seq_match) / len(seq_match), 3) if seq_match else 0,
        }
    return report

Compute a single ASI score

python

def compute_asi(drift_report: dict) -> float:
    """
    Weighted average of the three most predictive drift dimensions.
    ASI > 0.10 = moderate drift. ASI > 0.20 = critical.
    """
    weights = {
        "semantic_drift":   0.35,
        "confidence_drift": 0.35,
        "process_drift":    0.30,
    }
    scores = [
        sum(abs(metrics[dim]) * weight for dim, weight in weights.items())
        for metrics in drift_report.values()
    ]
    if not scores:
        return 0.0
 
    asi = sum(scores) / len(scores)
 
    if asi > 0.20:
        print(f"ASI {asi:.3f} - critical drift, intervene now.")
    elif asi > 0.10:
        print(f"ASI {asi:.3f} - moderate drift, monitor closely.")
    else:
        print(f"ASI {asi:.3f} - stable.")
 
    return asi

Logging Drift Signals with Syrin

The detection code above gives you an ASI score. What you do with that score is where instrumentation matters. Syrin SDK doesn't do drift detection - it gives you the observability layer to log drift signals, push config changes without redeploying, and save state before risky operations.

Here's what that actually looks like:

python

import logging
 
import syrin_sdk
from syrin_sdk import GovernanceStopError
 
logger = logging.getLogger(__name__)
 
# Two lines. Every LLM call is now instrumented.
syrin_sdk.init(api_key="syrin_...")
 
 
def run_turn(client, user_id, messages):
    # `client` is your existing OpenAI-style client; `messages` is the chat history.
    # Wrap agent calls in a session context so events group correctly
    # on the dashboard timeline.
    with syrin_sdk.context(user_id=user_id, agent="support-agent", window="day") as ctx:
        # Model and temperature are remotely configurable from the dashboard.
        # Push an override live - no redeploy, no restart.
        model = syrin_sdk.cfg("llm.model", "gpt-4o")
        temperature = syrin_sdk.cfg("llm.temperature", 0.7, ge=0.0, le=2.0)
 
        try:
            return client.chat.completions.create(
                model=model,
                temperature=temperature,
                messages=messages,
            )
        except GovernanceStopError as e:
            # Backend governance rule fired (cost limit, loop detected, etc.)
            # e.drift_score contains the score that triggered the stop.
            logger.warning("Agent stopped by governance: %s", e.reason)
            return {"error": "blocked", "reason": e.reason}

When your weekly ASI job detects drift, log it as a structured event:

python

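drift_report = measure_drift(baseline, current_responses)
asi_score = compute_asi(drift_report)
 
def mean_dimension(report: dict, dim: str) -> float | None:
    """Average one drift dimension across all baseline queries."""
    values = [metrics[dim] for metrics in report.values()]
    return round(sum(values) / len(values), 4) if values else None
 
syrin_sdk.log(
    "drift_measurement",
    level="warning" if asi_score > 0.10 else "info",
    metadata={
        "asi": round(asi_score, 4),
        # drift_report is keyed by query, so aggregate before logging.
        "semantic_drift": mean_dimension(drift_report, "semantic_drift"),
        "confidence_drift": mean_dimension(drift_report, "confidence_drift"),
        "process_drift": mean_dimension(drift_report, "process_drift"),
    }
)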

This event appears on the session timeline in the dashboard. When drift is high, you go to the dashboard and push a config override - lower temperature, swap model, update system prompt - without touching code.

Three Mitigation Strategies

1. Episodic memory consolidation

In-context information degrades as the context window fills. Early, precise instructions get buried under layers of tool results and conversation history. Periodically re-injecting critical facts counters this.

python

def consolidate_memory(history: list, interval: int = 10) -> dict | None:
    """
    Every `interval` turns, extract critical facts and re-inject them
    as a system message. This counteracts context window dilution.
    """
    # Skip empty histories too: 0 % interval == 0 would otherwise trigger.
    if not history or len(history) % interval != 0:
        return None
    facts = extract_critical_facts(history)
    return {
        "role": "system",
        "content": "CONTEXT ANCHOR - established facts for this session:\n"
                   + "\n".join(f"- {f}" for f in facts),
    }
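The `extract_critical_facts` helper is left undefined above because what counts as critical depends on your application. One simple option, sketched here under a loud assumption, is to have your own code attach a `facts` list to a message whenever it confirms something (an order ID, a policy decision), and then collect those. The `facts` key is a hypothetical convention, not part of any chat API or SDK.

```python
def extract_critical_facts(history: list[dict], max_facts: int = 5) -> list[str]:
    """Collect facts the application explicitly attached to messages.

    Hypothetical convention: a message dict may carry a "facts" list.
    Duplicates are dropped and only the most recent `max_facts` are kept.
    """
    seen = set()
    facts = []
    for message in history:
        for fact in message.get("facts", []):
            if fact not in seen:
                seen.add(fact)
                facts.append(fact)
    return facts[-max_facts:]


history = [
    {"role": "user", "content": "Where is order A-1001?",
     "facts": ["order id: A-1001"]},
    {"role": "assistant", "content": "Let me check."},
    {"role": "tool", "content": "{...}",
     "facts": ["order id: A-1001", "refund window: 30 days"]},
]
print(extract_critical_facts(history))
# ['order id: A-1001', 'refund window: 30 days']
```

Tagging facts at write time keeps the consolidation step cheap and deterministic. Asking a model to summarize the history is the obvious alternative, though the summary itself can then drift.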

2. Drift-aware routing

In a multi-agent fleet, route incoming tasks to the agent with the lowest current ASI score. Agent A at ASI 0.18 and Agent B at ASI 0.03 - route to B, and use the breathing room to diagnose A.

python

def drift_aware_route(agents: list[dict], get_asi_fn) -> str:
    scored = [(a["id"], get_asi_fn(a["id"])) for a in agents]
    return min(scored, key=lambda x: x[1])[0]
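Picking the minimum still routes traffic somewhere when every agent in the fleet is drifting. A possible variant, sketched below, adds a ceiling so that agents above the critical threshold are excluded and the caller can fall back to a human queue or a safe default. The function name, the `ceiling` parameter, and the stub scores are assumptions for illustration.

```python
from typing import Callable, Optional


def route_with_ceiling(agent_ids: list[str],
                       get_asi_fn: Callable[[str], float],
                       ceiling: float = 0.20) -> Optional[str]:
    """Return the lowest-ASI agent at or below `ceiling`, or None if none qualify."""
    scores = {agent_id: get_asi_fn(agent_id) for agent_id in agent_ids}
    eligible = {a: s for a, s in scores.items() if s <= ceiling}
    if not eligible:
        return None  # caller decides: escalate, queue, or use a fallback
    return min(eligible, key=eligible.get)


# Stub scores standing in for a real ASI lookup.
current_asi = {"agent-a": 0.18, "agent-b": 0.03}
print(route_with_ceiling(["agent-a", "agent-b"], current_asi.get))  # agent-b

degraded = {"agent-a": 0.31, "agent-b": 0.27}
print(route_with_ceiling(["agent-a", "agent-b"], degraded.get))  # None
```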

3. Checkpoints before risky operations

Save conversation state before any operation that could send the agent in a direction that's hard to undo. The Syrin SDK has a checkpoint API built for exactly this:

python

# Save state before calling an external tool that might change context
checkpoint = syrin_sdk.create_checkpoint(messages, label="pre-tool-call")
 
try:
    tool_result = call_external_tool()
    messages.append({"role": "tool", "content": tool_result})
except Exception:
    # Tool failed or returned unexpected schema - restore to known-good state
    messages = checkpoint.messages
    syrin_sdk.log(
        "checkpoint_restored",
        level="warning",
        metadata={"checkpoint_id": checkpoint.checkpoint_id, "label": checkpoint.label},
    )

The Research

Paper	Key result	arXiv
Measuring and Mitigating Agent Drift	ASI framework, 3 primary drift types, 12 measurement dimensions	2601.04170
The Science of Agent Reliability	73% consistency on paraphrased questions; reliability improves at half the rate of accuracy	2602.16666
ReliabilityBench	GPT-4o: 61% on one attempt → 25% when all 8 runs must succeed (pass^8)	2601.06112
Beyond Task Completion	Memory recall drops to 13.1% in complex long-horizon tasks	2512.12791

FAQ

Is agent drift the same as hallucination?

No. Hallucination is an acute failure in a single response - the model fabricates a fact. Agent drift is longitudinal degradation - behavior shifts over time across many interactions. A drifting agent can produce factually grounded responses while systematically deviating from its original goal. Drift is harder to catch precisely because each individual response looks reasonable.

My evals pass. How do I know if I have drift?

Run your baseline queries against production traffic samples, not a fixed test set. Compare task type distributions week-over-week with a chi-squared test. Track your session completion rate over time - not accuracy, which requires ground truth labels you often don't have in production. A downward bend in completion rate precedes user complaints by days.

What's a reasonable ASI threshold to alert on?

The ASI paper sets 0.10 as moderate and 0.20 as critical based on their evaluation across several agent types. Financial and healthcare domains should alert at 0.05. Support and content generation agents have more tolerance. Set your thresholds by running a calibration period and correlating ASI scores with known degradation events.

How often should I measure?

Confidence calibration and session completion rate: continuously. Behavioral baseline comparison: weekly. Goal distribution drift: weekly. Context freshness: continuously with automatic TTL enforcement on your knowledge base. If your agent handles high-stakes decisions, run daily.

If you're building agents in production and haven't thought about drift yet, you have drift. The question is how long it's been accumulating.

Syrin instruments your agent pipeline - every LLM call, tool call, and agent handoff - so the signals to detect drift exist. The detection and mitigation code above is yours to run.
