AI Infrastructure

Context Windows and Prompt Caching: The Hidden Keys to AI Cost Control in 2026

Understanding context windows, prompt caching, and token economics — the technical fundamentals that determine whether your AI deployment costs £100 or £10,000 per month.

Rod Hill·5 February 2026·8 min read


Most businesses adopting AI focus on the headline capabilities: reasoning, code generation, analysis. But the factor that most determines your monthly bill isn't what the model can do — it's how much context you feed it and how efficiently you manage that context.

Understanding context windows and prompt caching is the difference between a £500/month AI deployment and a £5,000 one doing the same work.

Context Windows: Your AI's Working Memory

A context window is the total amount of text (measured in tokens — roughly ¾ of a word) that an AI model can process in a single interaction. Think of it as the model's working memory.

Where We Are in 2026

| Model | Context Window | Rough Equivalent |
| --- | --- | --- |
| Claude Opus 4.5 | 200K tokens | ~150,000 words (3 novels) |
| GPT-4o | 128K tokens | ~96,000 words |
| Gemini 2.0 | 2M tokens | ~1.5 million words |
| Claude Sonnet 4 | 200K tokens | ~150,000 words |

These are enormous compared to 2023's 4K-8K limits. But bigger isn't always better — and filling these windows carelessly is the fastest way to burn money.

Why Context Windows Matter for Business

Every token in the context window costs money. Both input tokens (what you send) and output tokens (what you receive) are billed. The economics:

  • Claude Opus 4.5: ~$15 per million input tokens, ~$75 per million output tokens
  • Claude Sonnet 4: ~$3 per million input tokens, ~$15 per million output tokens
  • GPT-4o: ~$2.50 per million input tokens, ~$10 per million output tokens

If you're sending 50K tokens of context with every API call (a common pattern with RAG systems), and making 1,000 calls per day:

  • At Opus rates: $750/day on input tokens alone
  • At Sonnet rates: $150/day
  • At GPT-4o rates: $125/day
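The arithmetic above can be sketched in a few lines, using the illustrative per-million-token rates quoted earlier (actual provider pricing changes over time, so treat the numbers as assumptions):

```python
# Rough daily input-token spend for a fixed context size and call volume.
CONTEXT_TOKENS = 50_000   # tokens of context sent with every call
CALLS_PER_DAY = 1_000

# Illustrative $ per million input tokens, from the list above.
RATES_PER_M = {"opus": 15.00, "sonnet": 3.00, "gpt-4o": 2.50}

def daily_input_cost(rate_per_m: float) -> float:
    """Dollars per day spent on input tokens alone."""
    return CONTEXT_TOKENS * CALLS_PER_DAY * rate_per_m / 1_000_000

costs = {model: daily_input_cost(rate) for model, rate in RATES_PER_M.items()}
```

Running this reproduces the figures above: $750/day at Opus rates, $150 at Sonnet, $125 at GPT-4o.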

This is where most businesses get caught out. The cost per query looks cheap; the cumulative context cost is not.

The Context Window Trap

Here's the pattern we see repeatedly:

  1. Business builds an AI assistant with access to company knowledge
  2. They stuff the entire knowledge base into every prompt (or retrieve too many chunks via RAG)
  3. The assistant works brilliantly in testing
  4. The monthly bill arrives and someone has a difficult conversation

The fix isn't to use less context — it's to use context intelligently.

Smart Context Management

Tiered retrieval: Don't retrieve everything. Build a hierarchy:

  • First pass: semantic search for the most relevant 2-3 chunks
  • Only expand if the model indicates it needs more information
  • Never dump entire documents when a paragraph would suffice
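The tiered approach can be sketched as follows. The `search()` helper is a placeholder standing in for whatever vector-store query your RAG stack uses; the chunk counts are illustrative:

```python
# Sketch of tiered retrieval: start with a small top-k, expand only on request.
def search(query: str, k: int) -> list[str]:
    # Placeholder corpus; in practice this would be a semantic search
    # against your vector store.
    corpus = [f"chunk-{i}" for i in range(20)]
    return corpus[:k]

def retrieve(query: str, needs_more: bool = False) -> list[str]:
    chunks = search(query, k=3)            # first pass: only the top 2-3 chunks
    if needs_more:                         # expand only if the model asked for more
        chunks += search(query, k=10)[3:]  # next tier, skipping what we already have
    return chunks
```

The key design choice is that the expensive wide retrieval only happens when the first pass proves insufficient, not on every call.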

Context compression: Use smaller, faster models to summarise retrieved context before feeding it to the main model. A £0.01 summarisation call that reduces your context by 80% saves £0.50 on the main call.

Session management: For conversational AI, don't replay the entire chat history. Summarise older messages and keep only the recent 5-10 exchanges in full detail.
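A minimal sketch of that trimming logic, where `summarise()` is a stand-in for a call to a small, cheap summarisation model:

```python
# Keep the most recent exchanges verbatim; collapse everything older
# into a single summary message.
def summarise(messages: list[dict]) -> dict:
    # Placeholder: in practice, send these turns to a cheap model
    # and use its summary as the content.
    return {"role": "system",
            "content": f"[Summary of {len(messages)} earlier messages]"}

def trim_history(history: list[dict], keep_recent: int = 10) -> list[dict]:
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarise(older)] + recent
```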

Selective system prompts: A 2,000-token system prompt included in every call adds up. If you're making 10,000 calls/day, that's 20 million tokens just in system prompts — £30-60/day depending on the model.

Prompt Caching: The Game Changer

Prompt caching is the most impactful cost optimisation technique available in 2026, and most businesses aren't using it.

How It Works

When you send a prompt to an AI model, the model processes every token from scratch — even if 90% of your prompt is identical to the last call. Prompt caching changes this:

  1. First call: The model processes your full prompt and caches the processed result
  2. Subsequent calls: If the prompt starts with the same content, the cached portion is reused
  3. You pay reduced rates for cached tokens (typically 90% less)

The Numbers

With Anthropic's prompt caching (available on Claude models):

| Token Type | Standard Cost | Cached Cost | Saving |
| --- | --- | --- | --- |
| Input (cache miss) | $15/M (Opus) | N/A | |
| Input (cache hit) | N/A | $1.50/M (Opus) | 90% |
| Cache write | $18.75/M (Opus) | N/A | Initial overhead |

For a typical business AI deployment:

  • System prompt: 2,000 tokens (same every call) → cache it
  • Knowledge base context: 10,000 tokens (mostly stable) → cache it
  • User query: 200 tokens (changes every call) → not cached

Without caching: 12,200 tokens × $15/M = $0.183 per call

With caching: 200 fresh tokens at $15/M + 12,000 cached tokens at $1.50/M = $0.021 per call

That's an 88% cost reduction on input processing. At 1,000 calls/day, you're saving ~$160/day, or roughly £4,000/month.
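The savings arithmetic is easy to verify, again using the illustrative Opus rates from the table above:

```python
def cost_per_call(fresh_tokens: int, cached_tokens: int,
                  base_rate: float = 15.0, cached_rate: float = 1.5) -> float:
    """Input cost in dollars per call; rates are $ per million tokens."""
    return (fresh_tokens * base_rate + cached_tokens * cached_rate) / 1_000_000

uncached = cost_per_call(12_200, 0)    # every token processed at full rate
cached = cost_per_call(200, 12_000)    # stable prefix served from cache
saving = 1 - cached / uncached         # ~0.885, i.e. roughly 88%
```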

When Prompt Caching Works Best

Caching is most effective when your prompts have a stable prefix — content at the start that doesn't change between calls:

Perfect for caching:

  • System prompts and instructions
  • Company knowledge base / reference documents
  • Few-shot examples
  • Tool definitions and schemas
  • Conversation context (growing, but prefix-stable)

Can't benefit from caching:

  • Unique, one-off queries with no shared prefix
  • Prompts where the variable content comes first
  • Very short prompts (overhead exceeds benefit)

Implementation Tips

  1. Structure prompts with stable content first. Put your system prompt, knowledge base, and examples before the variable user input.

  2. Keep cached content above the minimum threshold. Most providers require at least 1,024-2,048 tokens for caching to activate.

  3. Monitor cache hit rates. If you're paying cache write costs but not getting hits, your prompts aren't structured correctly.

  4. Use explicit cache breakpoints where supported (Anthropic allows marking specific points in your prompt for caching).
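Putting tips 1 and 4 together, here is a sketch of a request body structured for Anthropic's prompt caching: stable content first, with a `cache_control` breakpoint after the last stable block. The model name, prompt contents, and token budget are illustrative; check the current Anthropic documentation for exact field names and minimum cacheable sizes:

```python
SYSTEM_PROMPT = "You are the support assistant for Acme Ltd. ..."  # stable every call
KNOWLEDGE_BASE = "Product FAQ: ..."                                # stable, mostly

def build_request(user_query: str) -> dict:
    """Assemble a Messages API request with a cache breakpoint on the prefix."""
    return {
        "model": "claude-sonnet-4",  # illustrative model name
        "max_tokens": 512,
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT},
            {"type": "text", "text": KNOWLEDGE_BASE,
             # Breakpoint: everything up to and including this block is cached.
             "cache_control": {"type": "ephemeral"}},
        ],
        # Only the user query varies between calls, so it comes last.
        "messages": [{"role": "user", "content": user_query}],
    }
```

Because the variable content sits after the breakpoint, every call with a different query still hits the cache for the system prompt and knowledge base.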

Beyond Caching: The Full Cost Optimisation Toolkit

Model Routing

Not every query needs your most expensive model. Build a router:

User query → Classification (fast, cheap model)
  ├─ Simple question → Small model (Haiku/GPT-4o-mini)
  ├─ Standard task → Medium model (Sonnet/GPT-4o)
  └─ Complex reasoning → Large model (Opus)

With good classification, 60-70% of queries can be handled by cheaper models. That alone cuts costs by 40-50%.
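A minimal sketch of such a router. The `classify()` heuristic here is a placeholder for the fast, cheap classifier-model call described above, and the tier-to-model mapping is illustrative:

```python
# Route each query to the cheapest model that can plausibly handle it.
def classify(query: str) -> str:
    # Stand-in heuristic; in production this would be a small-model call.
    if len(query.split()) < 12 and "?" in query:
        return "simple"
    if any(word in query.lower() for word in ("prove", "design", "architecture")):
        return "complex"
    return "standard"

MODEL_FOR_TIER = {"simple": "haiku", "standard": "sonnet", "complex": "opus"}

def route(query: str) -> str:
    return MODEL_FOR_TIER[classify(query)]
```

Even a crude classifier like this captures the core idea: the expensive model is the exception path, not the default.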

Batching

If your use case isn't real-time, batch API calls. Most providers offer 50% discounts on batch processing:

  • Monthly report generation — batch, not real-time
  • Document analysis — batch overnight
  • Content generation — queue and process in bulk
  • Data classification — batch operations

Output Token Management

Output tokens typically cost 3-5x more than input tokens. Control output length:

  • Set max_tokens appropriate to the task
  • Be specific about desired format ("Respond in 2-3 sentences" vs. letting the model write an essay)
  • Use structured output (JSON schemas) to prevent verbose responses
  • Use stop sequences to cut generation at the right point

Local Models for Repetitive Tasks

For high-volume, lower-complexity tasks, local models eliminate per-token costs entirely:

  • Document classification — Llama 3 running locally
  • Sentiment analysis — Small fine-tuned model
  • Data extraction — Structured output from quantised models
  • Embeddings — Local embedding models for RAG

The upfront compute cost is fixed, making it dramatically cheaper at scale.

Building a Cost-Conscious AI Architecture

Here's what a well-optimised business AI stack looks like:

Layer 1: Smart Routing

Every request hits a lightweight classifier that routes to the appropriate model based on complexity, urgency, and required capability.

Layer 2: Context Management

Retrieved context is compressed, deduplicated, and structured with stable prefixes for maximum cache utilisation.

Layer 3: Caching

Prompt caching enabled on all models that support it. Cache hit rates monitored and optimised weekly.

Layer 4: Model Selection

Task-appropriate models: expensive models for hard problems, cheap models for routine tasks, local models for high-volume processing.

Layer 5: Monitoring

Real-time dashboards showing:

  • Cost per query (by model, by use case)
  • Cache hit rates
  • Token efficiency (output quality vs. tokens consumed)
  • Monthly spend projections

The Counterintuitive Truth

Here's what surprises most businesses: spending more on architecture saves more on operations.

A day spent implementing prompt caching and model routing can save thousands per month. A week building proper context management can save tens of thousands per year.

The businesses running AI effectively in 2026 aren't the ones with the biggest budgets. They're the ones who understood token economics early and built their systems accordingly.

Quick Wins: Start Here

  1. Audit your current prompts — How many tokens are you sending per call? How much is stable vs. variable?
  2. Enable prompt caching — If your provider supports it and you have stable prefixes, this is the single biggest win
  3. Implement model routing — Even a basic "simple/complex" split saves 30-40%
  4. Set output limits — Stop paying for 500 tokens when 100 would do
  5. Monitor costs daily — You can't optimise what you don't measure

The AI revolution isn't just about capability. It's about sustainable capability — building systems that deliver real value without costs that scale faster than the value they create.


Want help optimising your AI deployment costs? Get in touch — we specialise in building cost-efficient AI architectures that scale sustainably.

Tags

context windows · prompt caching · token economics · ai costs · llm optimization · ai infrastructure · cost optimization

Rod Hill

The Caversham Digital team brings 20+ years of hands-on experience across AI implementation, technology strategy, process automation, and digital transformation for UK businesses.

