AI Infrastructure

Context Windows and Prompt Caching: The Hidden Keys to AI Cost Control in 2026

Understanding context windows, prompt caching, and token economics — the technical fundamentals that determine whether your AI deployment costs £100 or £10,000 per month.

Rod Hill·5 February 2026·8 min read


Most businesses adopting AI focus on the headline capabilities: reasoning, code generation, analysis. But the factor that most determines your monthly bill isn't what the model can do — it's how much context you feed it and how efficiently you manage that context.

Understanding context windows and prompt caching is the difference between a £500/month AI deployment and a £5,000 one doing the same work.

Context Windows: Your AI's Working Memory

A context window is the total amount of text (measured in tokens — roughly ¾ of a word) that an AI model can process in a single interaction. Think of it as the model's working memory.

Where We Are in 2026

| Model | Context Window | Rough Equivalent |
| --- | --- | --- |
| Claude Opus 4.5 | 200K tokens | ~150,000 words (3 novels) |
| GPT-4o | 128K tokens | ~96,000 words |
| Gemini 2.0 | 2M tokens | ~1.5 million words |
| Claude Sonnet 4 | 200K tokens | ~150,000 words |

These are enormous compared to 2023's 4K-8K limits. But bigger isn't always better — and filling these windows carelessly is the fastest way to burn money.

Why Context Windows Matter for Business

Every token in the context window costs money. Both input tokens (what you send) and output tokens (what you receive) are billed. The economics:

  • Claude Opus 4.5: ~$15 per million input tokens, ~$75 per million output tokens
  • Claude Sonnet 4: ~$3 per million input tokens, ~$15 per million output tokens
  • GPT-4o: ~$2.50 per million input tokens, ~$10 per million output tokens

If you're sending 50K tokens of context with every API call (a common pattern with RAG systems), and making 1,000 calls per day:

  • At Opus rates: $750/day on input tokens alone
  • At Sonnet rates: $150/day
  • At GPT-4o rates: $125/day
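The arithmetic above can be sketched in a few lines, using the illustrative per-million-token rates quoted earlier (actual provider pricing changes over time, so treat the numbers as assumptions):

```python
# Rough daily input-token spend for a fixed context size and call volume.
CONTEXT_TOKENS = 50_000   # tokens of context sent with every call
CALLS_PER_DAY = 1_000

# Illustrative $ per million input tokens, from the list above.
RATES_PER_M = {"opus": 15.00, "sonnet": 3.00, "gpt-4o": 2.50}

def daily_input_cost(rate_per_m: float) -> float:
    """Dollars per day spent on input tokens alone."""
    return CONTEXT_TOKENS * CALLS_PER_DAY * rate_per_m / 1_000_000

costs = {model: daily_input_cost(rate) for model, rate in RATES_PER_M.items()}
```

Running this reproduces the figures above: $750/day at Opus rates, $150 at Sonnet, $125 at GPT-4o.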

This is where most businesses get caught out. The cost per query looks cheap; the cumulative context cost is not.

The Context Window Trap

Here's the pattern we see repeatedly:

  1. Business builds an AI assistant with access to company knowledge
  2. They stuff the entire knowledge base into every prompt (or retrieve too many chunks via RAG)
  3. The assistant works brilliantly in testing
  4. The monthly bill arrives and someone has a difficult conversation

The fix isn't to use less context — it's to use context intelligently.

Smart Context Management

Tiered retrieval: Don't retrieve everything. Build a hierarchy:

  • First pass: semantic search for the most relevant 2-3 chunks
  • Only expand if the model indicates it needs more information
  • Never dump entire documents when a paragraph would suffice
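The tiered approach can be sketched as follows. The `search()` helper is a placeholder standing in for whatever vector-store query your RAG stack uses; the chunk counts are illustrative:

```python
# Sketch of tiered retrieval: start with a small top-k, expand only on request.
def search(query: str, k: int) -> list[str]:
    # Placeholder corpus; in practice this would be a semantic search
    # against your vector store.
    corpus = [f"chunk-{i}" for i in range(20)]
    return corpus[:k]

def retrieve(query: str, needs_more: bool = False) -> list[str]:
    chunks = search(query, k=3)            # first pass: only the top 2-3 chunks
    if needs_more:                         # expand only if the model asked for more
        chunks += search(query, k=10)[3:]  # next tier, skipping what we already have
    return chunks
```

The key design choice is that the expensive wide retrieval only happens when the first pass proves insufficient, not on every call.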

Context compression: Use smaller, faster models to summarise retrieved context before feeding it to the main model. A £0.01 summarisation call that reduces your context by 80% saves £0.50 on the main call.

Session management: For conversational AI, don't replay the entire chat history. Summarise older messages and keep only the recent 5-10 exchanges in full detail.
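A minimal sketch of that trimming logic, where `summarise()` is a stand-in for a call to a small, cheap summarisation model:

```python
# Keep the most recent exchanges verbatim; collapse everything older
# into a single summary message.
def summarise(messages: list[dict]) -> dict:
    # Placeholder: in practice, send these turns to a cheap model
    # and use its summary as the content.
    return {"role": "system",
            "content": f"[Summary of {len(messages)} earlier messages]"}

def trim_history(history: list[dict], keep_recent: int = 10) -> list[dict]:
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarise(older)] + recent
```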

Selective system prompts: A 2,000-token system prompt included in every call adds up. If you're making 10,000 calls/day, that's 20 million tokens just in system prompts — £30-60/day depending on the model.

Prompt Caching: The Game Changer

Prompt caching is the most impactful cost optimisation technique available in 2026, and most businesses aren't using it.

How It Works

When you send a prompt to an AI model, the model processes every token from scratch — even if 90% of your prompt is identical to the last call. Prompt caching changes this:

  1. First call: The model processes your full prompt and caches the processed result
  2. Subsequent calls: If the prompt starts with the same content, the cached portion is reused
  3. You pay reduced rates for cached tokens (typically 90% less)

The Numbers

With Anthropic's prompt caching (available on Claude models):

| Token Type | Standard Cost | Cached Cost | Saving |
| --- | --- | --- | --- |
| Input (cache miss) | $15/M (Opus) | N/A | |
| Input (cache hit) | N/A | $1.50/M (Opus) | 90% |
| Cache write | $18.75/M (Opus) | N/A | Initial overhead |

For a typical business AI deployment:

  • System prompt: 2,000 tokens (same every call) → cache it
  • Knowledge base context: 10,000 tokens (mostly stable) → cache it
  • User query: 200 tokens (changes every call) → not cached

Without caching: 12,200 tokens × $15/M = $0.183 per call

With caching: 200 fresh tokens at $15/M + 12,000 cached tokens at $1.50/M = $0.021 per call

That's an 88% cost reduction on input processing. At 1,000 calls/day, you're saving ~$160/day, or roughly £4,000/month.
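The savings arithmetic is easy to verify, again using the illustrative Opus rates from the table above:

```python
def cost_per_call(fresh_tokens: int, cached_tokens: int,
                  base_rate: float = 15.0, cached_rate: float = 1.5) -> float:
    """Input cost in dollars per call; rates are $ per million tokens."""
    return (fresh_tokens * base_rate + cached_tokens * cached_rate) / 1_000_000

uncached = cost_per_call(12_200, 0)    # every token processed at full rate
cached = cost_per_call(200, 12_000)    # stable prefix served from cache
saving = 1 - cached / uncached         # ~0.885, i.e. roughly 88%
```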

When Prompt Caching Works Best

Caching is most effective when your prompts have a stable prefix — content at the start that doesn't change between calls:

Perfect for caching:

  • System prompts and instructions
  • Company knowledge base / reference documents
  • Few-shot examples
  • Tool definitions and schemas
  • Conversation context (growing, but prefix-stable)

Can't benefit from caching:

  • Unique, one-off queries with no shared prefix
  • Prompts where the variable content comes first
  • Very short prompts (overhead exceeds benefit)

Implementation Tips

  1. Structure prompts with stable content first. Put your system prompt, knowledge base, and examples before the variable user input.

  2. Keep cached content above the minimum threshold. Most providers require at least 1,024-2,048 tokens for caching to activate.

  3. Monitor cache hit rates. If you're paying cache write costs but not getting hits, your prompts aren't structured correctly.

  4. Use explicit cache breakpoints where supported (Anthropic allows marking specific points in your prompt for caching).
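Putting tips 1 and 4 together, here is a sketch of a request body structured for Anthropic's prompt caching: stable content first, with a `cache_control` breakpoint after the last stable block. The model name, prompt contents, and token budget are illustrative; check the current Anthropic documentation for exact field names and minimum cacheable sizes:

```python
SYSTEM_PROMPT = "You are the support assistant for Acme Ltd. ..."  # stable every call
KNOWLEDGE_BASE = "Product FAQ: ..."                                # stable, mostly

def build_request(user_query: str) -> dict:
    """Assemble a Messages API request with a cache breakpoint on the prefix."""
    return {
        "model": "claude-sonnet-4",  # illustrative model name
        "max_tokens": 512,
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT},
            {"type": "text", "text": KNOWLEDGE_BASE,
             # Breakpoint: everything up to and including this block is cached.
             "cache_control": {"type": "ephemeral"}},
        ],
        # Only the user query varies between calls, so it comes last.
        "messages": [{"role": "user", "content": user_query}],
    }
```

Because the variable content sits after the breakpoint, every call with a different query still hits the cache for the system prompt and knowledge base.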

Beyond Caching: The Full Cost Optimisation Toolkit

Model Routing

Not every query needs your most expensive model. Build a router:

User query → Classification (fast, cheap model)
  ├─ Simple question → Small model (Haiku/GPT-4o-mini)
  ├─ Standard task → Medium model (Sonnet/GPT-4o)
  └─ Complex reasoning → Large model (Opus)

With good classification, 60-70% of queries can be handled by cheaper models. That alone cuts costs by 40-50%.
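A minimal sketch of such a router. The `classify()` heuristic here is a placeholder for the fast, cheap classifier-model call described above, and the tier-to-model mapping is illustrative:

```python
# Route each query to the cheapest model that can plausibly handle it.
def classify(query: str) -> str:
    # Stand-in heuristic; in production this would be a small-model call.
    if len(query.split()) < 12 and "?" in query:
        return "simple"
    if any(word in query.lower() for word in ("prove", "design", "architecture")):
        return "complex"
    return "standard"

MODEL_FOR_TIER = {"simple": "haiku", "standard": "sonnet", "complex": "opus"}

def route(query: str) -> str:
    return MODEL_FOR_TIER[classify(query)]
```

Even a crude classifier like this captures the core idea: the expensive model is the exception path, not the default.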

Batching

If your use case isn't real-time, batch API calls. Most providers offer 50% discounts on batch processing:

  • Monthly report generation — batch, not real-time
  • Document analysis — batch overnight
  • Content generation — queue and process in bulk
  • Data classification — batch operations

Output Token Management

Output tokens typically cost 3-5x more than input tokens. Control output length:

  • Set max_tokens appropriate to the task
  • Be specific about desired format ("Respond in 2-3 sentences" vs. letting the model write an essay)
  • Use structured output (JSON schemas) to prevent verbose responses
  • Use stop sequences to cut generation at the right point

Local Models for Repetitive Tasks

For high-volume, lower-complexity tasks, local models eliminate per-token costs entirely:

  • Document classification — Llama 3 running locally
  • Sentiment analysis — Small fine-tuned model
  • Data extraction — Structured output from quantised models
  • Embeddings — Local embedding models for RAG

The upfront compute cost is fixed, making it dramatically cheaper at scale.

Building a Cost-Conscious AI Architecture

Here's what a well-optimised business AI stack looks like:

Layer 1: Smart Routing

Every request hits a lightweight classifier that routes to the appropriate model based on complexity, urgency, and required capability.

Layer 2: Context Management

Retrieved context is compressed, deduplicated, and structured with stable prefixes for maximum cache utilisation.

Layer 3: Caching

Prompt caching enabled on all models that support it. Cache hit rates monitored and optimised weekly.

Layer 4: Model Selection

Task-appropriate models: expensive models for hard problems, cheap models for routine tasks, local models for high-volume processing.

Layer 5: Monitoring

Real-time dashboards showing:

  • Cost per query (by model, by use case)
  • Cache hit rates
  • Token efficiency (output quality vs. tokens consumed)
  • Monthly spend projections

The Counterintuitive Truth

Here's what surprises most businesses: spending more on architecture saves more on operations.

A day spent implementing prompt caching and model routing can save thousands per month. A week building proper context management can save tens of thousands per year.

The businesses running AI effectively in 2026 aren't the ones with the biggest budgets. They're the ones who understood token economics early and built their systems accordingly.

Quick Wins: Start Here

  1. Audit your current prompts — How many tokens are you sending per call? How much is stable vs. variable?
  2. Enable prompt caching — If your provider supports it and you have stable prefixes, this is the single biggest win
  3. Implement model routing — Even a basic "simple/complex" split saves 30-40%
  4. Set output limits — Stop paying for 500 tokens when 100 would do
  5. Monitor costs daily — You can't optimise what you don't measure

The AI revolution isn't just about capability. It's about sustainable capability — building systems that deliver real value without costs that scale faster than the value they create.


Want help optimising your AI deployment costs? Get in touch — we specialise in building cost-efficient AI architectures that scale sustainably.

Tags

context windows · prompt caching · token economics · ai costs · llm optimization · ai infrastructure · cost optimization

Rod Hill

The Caversham Digital team brings 20+ years of hands-on experience across AI implementation, technology strategy, process automation, and digital transformation for UK businesses.

