
AI Production Reliability: SRE Practices for Keeping AI Systems Running

AI systems in production fail differently from traditional software. Here's how UK businesses are adapting site reliability engineering for AI — from model monitoring and drift detection to incident response playbooks.

Rod Hill·12 February 2026·9 min read

Your AI system worked brilliantly in testing. It passed evaluation benchmarks, impressed stakeholders in the demo, and deployed smoothly to production. Then, three weeks later, it started giving strange answers. Customer complaints trickled in. Nobody noticed for days because the system wasn't technically "down" — it was just quietly getting worse.

This is the fundamental challenge of AI reliability. Traditional software either works or it doesn't. A database query returns the right result or throws an error. AI systems occupy a treacherous middle ground: they can degrade gradually, confidently producing worse and worse outputs while every health check shows green.

Site reliability engineering (SRE) needs a rethink for AI. Here's what that looks like in 2026.

Why AI Systems Fail Differently

Silent Degradation

A REST API returning HTTP 500 is an obvious failure. An LLM that starts hallucinating 15% more often than last week is invisible to traditional monitoring. The system is "up." Latency is normal. No error codes. But quality has fallen off a cliff.

This is model drift — the gradual divergence between what your AI was trained or configured to do and what it actually does in production. It happens because:

  • Data distribution shifts — the inputs your users send change over time
  • Upstream model updates — your API provider ships a new model version
  • Context accumulation — long-running agent sessions build up context that skews behaviour
  • Integration drift — the systems your AI connects to change their APIs, data formats, or behaviour

Cascading Agent Failures

Modern AI deployments aren't single models. They're agent systems — chains of LLM calls, tool invocations, database queries, and API calls that form complex workflows. A failure in step three of an eight-step agent pipeline doesn't always surface cleanly. The agent might retry, hallucinate a workaround, or produce a plausible-looking result that's actually wrong.

Stochastic Behaviour

Traditional software is deterministic. Same input, same output. AI systems are probabilistic. The same prompt can produce different results each time. This makes debugging harder, reproduction harder, and "was that a bug or just variance?" a constant question.

The AI SRE Toolkit

1. Output Quality Monitoring

Forget uptime dashboards — you need quality dashboards. Key metrics:

  • Hallucination rate — how often does the AI state things that aren't grounded in provided context?
  • Task completion rate — for agent systems, what percentage of tasks complete successfully end-to-end?
  • Tool call success rate — are the AI's function calls working, or is it hitting errors and retrying?
  • User feedback signals — thumbs up/down, regeneration requests, escalation to humans
  • Semantic consistency — are answers to similar questions remaining consistent over time?

The tooling for this has matured significantly. Platforms like Langfuse, Arize, Helicone, and Braintrust now offer production-grade monitoring specifically designed for LLM applications. They capture traces, score outputs, and alert on quality degradation before users complain.
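As a sketch of what a quality dashboard tracks underneath, here is a minimal rolling-window monitor in Python. The metric names mirror the list above; how each response gets scored (grounded or not, task completed or not) is assumed to come from an upstream evaluator or user feedback, and the `QualityMonitor` class is hypothetical rather than any particular platform's API.

```python
from collections import deque

class QualityMonitor:
    """Rolling-window quality metrics for an LLM application (illustrative sketch)."""

    def __init__(self, window_size: int = 500):
        # keep only recent traffic so the metrics reflect current behaviour
        self.events = deque(maxlen=window_size)

    def record(self, *, grounded: bool, task_completed: bool,
               tool_calls_ok: bool = True):
        """Record one scored response; scoring is assumed to happen upstream."""
        self.events.append({
            "grounded": grounded,
            "task_completed": task_completed,
            "tool_calls_ok": tool_calls_ok,
        })

    def _rate(self, key: str) -> float:
        if not self.events:
            return 0.0
        return sum(1 for e in self.events if e[key]) / len(self.events)

    @property
    def hallucination_rate(self) -> float:
        return 1.0 - self._rate("grounded")

    @property
    def task_completion_rate(self) -> float:
        return self._rate("task_completed")
```

The point of the rolling window is that an alert should fire on what the system is doing now, not its lifetime average.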

2. Trace-Based Observability

For agent systems, you need distributed tracing adapted for AI. Every agent run should produce a trace showing:

  • Each LLM call (prompt, completion, model, tokens, latency)
  • Each tool invocation (function name, parameters, result, duration)
  • Decision points (why the agent chose path A over path B)
  • Total cost and token consumption
  • Final output quality score

This is the AI equivalent of application performance monitoring (APM). When something goes wrong, you need to replay the exact sequence of decisions the agent made — not just see that it produced a bad output.
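A minimal trace model along these lines might look as follows. This is a hand-rolled sketch, not any particular tracing library's API; in practice you would emit OpenTelemetry-style spans or use one of the platforms mentioned above.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an agent run: an LLM call, a tool invocation, or a decision."""
    kind: str        # "llm_call" | "tool_call" | "decision"
    name: str        # model name, function name, or decision label
    attributes: dict
    latency_ms: float = 0.0

@dataclass
class AgentTrace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    spans: list = field(default_factory=list)

    def add_llm_call(self, model, prompt_tokens, completion_tokens, latency_ms):
        self.spans.append(Span("llm_call", model,
                               {"prompt_tokens": prompt_tokens,
                                "completion_tokens": completion_tokens},
                               latency_ms))

    def add_tool_call(self, function, params, result, latency_ms):
        self.spans.append(Span("tool_call", function,
                               {"params": params, "result": result}, latency_ms))

    def add_decision(self, label, reason):
        # record *why* the agent chose a path, for later replay
        self.spans.append(Span("decision", label, {"reason": reason}))

    @property
    def total_tokens(self) -> int:
        return sum(s.attributes.get("prompt_tokens", 0) +
                   s.attributes.get("completion_tokens", 0)
                   for s in self.spans if s.kind == "llm_call")
```

Capturing decision points as first-class spans is what makes replaying an agent's reasoning possible after the fact.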

3. Drift Detection

Set up automated drift detection that runs continuously:

  • Input drift — are users sending queries that differ significantly from your test distribution?
  • Output drift — have the characteristics of your AI's outputs changed (length, tone, structure, confidence)?
  • Behavioural drift — is the AI exhibiting different tool-calling patterns than it did last week?
  • Cost drift — has token consumption per task changed? (Often an early signal of the AI looping or generating verbose outputs)

Statistical tests (KL divergence, population stability index) applied to embeddings of inputs and outputs can catch drift that human review would miss.
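As an example of the statistical side, here is a population stability index over a one-dimensional statistic, assuming you have already reduced inputs or outputs to a scalar (distance of an embedding to a reference centroid, output length, and so on):

```python
import numpy as np

def population_stability_index(baseline, current, bins: int = 10) -> float:
    """PSI between a baseline sample and a production sample of a 1-D statistic.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 worth a look, > 0.25 significant drift."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    # bin edges come from the baseline; clip so shifted values still land in a bin
    edges = np.histogram_bin_edges(baseline, bins=bins)
    current = np.clip(current, edges[0], edges[-1])
    b_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c_pct = np.histogram(current, bins=edges)[0] / len(current)
    # floor proportions so empty bins don't produce log(0)
    b_pct = np.clip(b_pct, 1e-6, None)
    c_pct = np.clip(c_pct, 1e-6, None)
    return float(np.sum((c_pct - b_pct) * np.log(c_pct / b_pct)))
```

Run this daily against a frozen baseline sample and alert when the index crosses the 0.25 rule-of-thumb threshold.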

4. Circuit Breakers and Fallbacks

Borrow the circuit breaker pattern from microservices:

  • Quality circuit breaker — if output quality scores drop below a threshold over a rolling window, automatically route traffic to a fallback (simpler model, cached responses, human queue)
  • Cost circuit breaker — if a single agent run exceeds a token budget, kill it and escalate
  • Latency circuit breaker — if response times spike, switch to a faster model or return a "processing" response
  • Error rate circuit breaker — if tool calls are failing above threshold, disable that tool and use alternatives

The goal is graceful degradation rather than silent failure. Better to tell a user "I'm having trouble with that right now, let me connect you with a human" than to confidently produce garbage.
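A quality circuit breaker of the kind described above can be sketched as follows, assuming quality scores in [0, 1] arrive from an automated judge or heuristic scorer:

```python
from collections import deque

class QualityCircuitBreaker:
    """Routes traffic to a fallback when rolling mean quality drops below a
    threshold. The threshold and window sizes are illustrative defaults."""

    def __init__(self, threshold: float = 0.7, window: int = 100,
                 min_samples: int = 20):
        self.threshold = threshold
        self.min_samples = min_samples
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> None:
        self.scores.append(score)

    @property
    def is_open(self) -> bool:
        # never trip on too little evidence
        if len(self.scores) < self.min_samples:
            return False
        return sum(self.scores) / len(self.scores) < self.threshold

    def route(self) -> str:
        return "fallback" if self.is_open else "primary"
```

The `min_samples` guard matters: a breaker that trips on two bad responses will flap constantly on a stochastic system.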

5. Canary Deployments for AI

When updating prompts, model versions, or agent configurations, don't just ship it. Run canaries:

  • Route 5-10% of traffic to the new configuration
  • Compare output quality metrics between canary and production
  • Automatically roll back if quality degrades
  • Gradually increase canary traffic as confidence builds

This applies to prompt changes too, not just model updates. A "minor" prompt tweak can cause unexpected behaviour changes across edge cases you didn't test.
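The routing and rollback decisions above can be sketched like this; the hash-based split and the simple mean comparison are illustrative assumptions, and a production gate would also check sample sizes and statistical significance.

```python
import hashlib

def assign_variant(user_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministic traffic split: hashing the user id means the same user
    always sees the same configuration, which keeps comparisons clean."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "production"

def should_roll_back(prod_scores, canary_scores, max_drop: float = 0.05) -> bool:
    """Roll back if the canary's mean quality score trails production by
    more than `max_drop`."""
    prod = sum(prod_scores) / len(prod_scores)
    canary = sum(canary_scores) / len(canary_scores)
    return (prod - canary) > max_drop
```

Raising `canary_fraction` in stages (5%, 25%, 50%, 100%) as the comparison stays clean gives you the gradual ramp described above.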

Incident Response for AI Systems

The AI Incident Taxonomy

Not all AI failures are created equal. Classify them:

Severity 1 — Safety/Trust Failures:

  • AI producing harmful, biased, or legally problematic outputs
  • Data leakage through AI responses
  • AI taking actions it shouldn't (wrong API calls, unauthorized operations)
  • Response: Immediate traffic halt, human review of all recent outputs

Severity 2 — Quality Degradation:

  • Measurable drop in output quality metrics
  • Increased hallucination or task failure rates
  • Model provider outage or degradation
  • Response: Activate fallback, investigate root cause

Severity 3 — Performance Issues:

  • Elevated latency or cost
  • Capacity limits hit
  • Non-critical tool failures
  • Response: Monitor, optimise, no immediate user impact

The AI Postmortem

Traditional postmortems ask "what broke and why?" AI postmortems need additional questions:

  • When did quality actually start degrading? (Often days before detection)
  • How many users were affected by bad outputs? (Not just "how long was it down")
  • What was the detection gap? (Time between degradation start and alert firing)
  • Could we reproduce the failure? (Stochastic systems make this harder)
  • Did our evaluations cover this failure mode? (Update test suites accordingly)

Runbooks for Common AI Failures

Build runbooks for the failures you'll see most:

"Model provider changed something":

  1. Check provider status page and changelog
  2. Run evaluation suite against new behaviour
  3. If quality dropped: pin to previous model version if available
  4. If not available: adjust prompts to compensate, test, deploy

"AI is hallucinating more than usual":

  1. Check input distribution — are users asking novel questions?
  2. Check RAG retrieval quality — is the knowledge base returning relevant results?
  3. Check context length — are conversations exceeding effective context windows?
  4. Check for upstream data changes affecting grounding documents

"Agent is stuck in loops":

  1. Check tool availability — is a dependency down?
  2. Check for prompt regression — was a recent change made?
  3. Review traces for the loop pattern
  4. Apply token budget limits, kill stuck sessions

Building an AI Reliability Culture

Quality as an SLO

Define service level objectives for AI quality, not just availability:

  • "95% of customer-facing responses will score above 4/5 on our quality rubric"
  • "Agent task completion rate will remain above 85%"
  • "Mean time to detect quality degradation will be under 2 hours"

These SLOs should sit alongside traditional uptime SLOs and carry equal weight.
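A quality SLO like the first one above reduces to a simple attainment and error-budget calculation. This sketch assumes responses are scored on a 1 to 5 rubric; the function name and return shape are hypothetical.

```python
def quality_slo_status(scores, target: float = 0.95,
                       passing=lambda s: s >= 4) -> dict:
    """Attainment against an SLO like '95% of responses score >= 4/5'.
    error_budget_used > 1.0 means the budget is blown."""
    attainment = sum(1 for s in scores if passing(s)) / len(scores)
    allowed_failure = 1.0 - target
    return {
        "attainment": attainment,
        "met": attainment >= target,
        "error_budget_used": (1.0 - attainment) / allowed_failure,
    }
```

Tracking error budget burn, rather than a pass/fail flag, tells you how much room remains for risky changes like prompt experiments.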

Evaluation-Driven Development

Every prompt change, model update, or configuration adjustment should go through an evaluation pipeline before reaching production:

  1. Unit evaluations — specific test cases for known edge cases
  2. Regression evaluations — broad test suite checking for unexpected behaviour changes
  3. Adversarial evaluations — red-team prompts testing safety and robustness
  4. Human evaluations — periodic human review of production outputs

Automate steps 1-3, schedule step 4. Make it impossible to ship changes that haven't been evaluated.
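The automated gate for steps 1 to 3 might look like this in outline, where `generate` stands in for your full prompt-plus-model pipeline and each suite is a list of (prompt, check) pairs. All names here are hypothetical.

```python
def evaluation_gate(generate, suites: dict, thresholds: dict):
    """Run each evaluation suite against a candidate `generate` function;
    block the deploy if any suite's pass rate falls below its threshold."""
    pass_rates = {}
    for name, cases in suites.items():
        passed = sum(1 for prompt, check in cases if check(generate(prompt)))
        pass_rates[name] = passed / len(cases)
    # missing thresholds default to 1.0, i.e. fail closed
    ship = all(pass_rates[name] >= thresholds.get(name, 1.0) for name in suites)
    return ship, pass_rates
```

Run this in CI so that a prompt change physically cannot merge without passing the unit, regression, and adversarial suites.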

On-Call for AI

Your AI on-call rotation needs people who understand both the infrastructure and the AI layer. An SRE who can diagnose a Kubernetes pod failure but can't interpret a model quality dashboard isn't enough. An ML engineer who can fine-tune a model but can't read a trace isn't enough either.

The emerging role of AI reliability engineer combines:

  • Traditional SRE skills (monitoring, incident response, capacity planning)
  • ML understanding (model behaviour, evaluation, prompt engineering)
  • Product context (what "good" looks like for your specific use case)

Cost of Getting This Wrong

AI reliability failures are more expensive than traditional outages because they're often trust failures. When a website goes down, users are annoyed but understand. When an AI confidently gives wrong medical advice, incorrect financial figures, or inappropriate responses, users lose trust — and trust is much harder to rebuild than uptime.

UK businesses deploying AI in customer-facing roles — chatbots, advisory tools, decision support systems — need to treat AI reliability as a first-class concern, not an afterthought bolted on after launch.

Where to Start

If you're running AI in production today without dedicated reliability practices, start here:

  1. Instrument everything — capture traces for every AI interaction, not just errors
  2. Define quality metrics — what does "good output" mean for your use case? Measure it.
  3. Set up alerts — quality degradation alerts, cost anomaly alerts, completion rate alerts
  4. Build one runbook — for your most likely failure mode, document the response procedure
  5. Review weekly — look at quality trends, not just uptime graphs

The businesses that treat AI reliability seriously will be the ones whose AI systems earn and keep user trust. Everyone else will learn the hard way that a confident AI giving wrong answers is worse than no AI at all.


Need help building reliability practices for your AI systems? Get in touch to discuss AI operations strategy for your organisation.

Tags

ai reliability, sre, ai monitoring, model drift, ai incident response, mlops, ai production, ai operations

Rod Hill

The Caversham Digital team brings 20+ years of hands-on experience across AI implementation, technology strategy, process automation, and digital transformation for UK businesses.
