Prompt Caching
Reduce latency and cost by caching frequently reused prompt content. When the same content appears across requests, the provider can skip reprocessing cached tokens and charge a reduced rate.
Prompt caching is handled by the underlying providers. Pomex passes through caching directives and reports cache usage in the response.
How It Works
Each provider implements prompt caching differently, but the core idea is the same: prefix content that is identical across requests is processed once and reused from cache on subsequent calls.
| Provider | Mechanism | Minimum Cacheable Tokens | Cache Lifetime |
|---|---|---|---|
| Claude | Explicit breakpoints via cache_control | 1,024 tokens (Haiku 4.5, Opus 4.5, Opus 4.6: 4,096) | 5 minutes (default) or 1 hour (where supported) |
| Gemini | Automatic — longest matching prefix is cached | Provider-managed | Automatic (provider-managed) |
| GPT | Automatic prefix caching; optional prompt_cache_key for explicit control | 1,024 tokens | Automatic (minutes); up to 24 hours with prompt_cache_retention on supported models |
Claude: Explicit Cache Breakpoints
Claude uses explicit cache_control breakpoints. You mark the point in your prompt where the cache should end using cache_control on content blocks, system blocks, and tool definitions. Everything up to and including the marked block becomes a cacheable prefix.
Supported Locations
cache_control can be placed on:
- System text blocks
- Tool definitions
- User message content blocks (text, image, document, tool_result)
- Assistant message content blocks (text, tool_use)
Pomex applies cache_control as per-block breakpoints on individual content blocks, system blocks, and tools. This approach is supported by all Claude backends (AWS Bedrock, GCP Vertex, and direct Anthropic API). The top-level cache_control convenience field (which auto-applies a breakpoint to the last cacheable block) is not used, as it is not supported on Bedrock and Vertex.
cache_control Object
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Required | "ephemeral" |
| ttl | string | Optional | "5m" (default) or "1h" (where supported by the model and backend) |
Anthropic Format (/v1/messages)
Add cache_control directly on content blocks, system blocks, and tools. Pomex passes these through to Claude.
Caching a system prompt
curl https://api.pomex.ai/v1/messages \
-H "x-api-key: $POMEX_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-H "Content-Type: application/json" \
-d '{
"model": "anthropic/claude-opus-4.6",
"max_tokens": 1024,
"system": [
{
"type": "text",
"text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{"role": "user", "content": "Summarize the key terms."}
]
}'

from anthropic import Anthropic
client = Anthropic(
base_url="https://api.pomex.ai",
api_key="YOUR_API_KEY",
)
message = client.messages.create(
model="anthropic/claude-opus-4.6",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
"cache_control": {"type": "ephemeral"},
}
],
messages=[
{"role": "user", "content": "Summarize the key terms."}
],
)
print(message.content[0].text)

Caching tools and conversation history
{
"model": "anthropic/claude-opus-4.6",
"max_tokens": 1024,
"tools": [
{
"name": "search_database",
"description": "Search the product database",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string"}
},
"required": ["query"]
},
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Here is the product catalog: ... (large text) ...",
"cache_control": {"type": "ephemeral"}
},
{
"type": "text",
"text": "Find products related to wireless charging."
}
]
}
]
}

Using 1-hour TTL
For content that should remain cached longer, set the TTL to "1h" (where supported by the model and backend):
{
"type": "text",
"text": "Reference documentation that rarely changes...",
"cache_control": {"type": "ephemeral", "ttl": "1h"}
}

OpenAI Format (/v1/chat/completions)
The OpenAI-compatible endpoint (/v1/chat/completions) now supports cache_control as a Claude-only request extension. When targeting Claude models, you can place cache_control on content parts, tool definitions, and tool calls just as you would with the Anthropic format. Non-Claude providers ignore these fields.
Supported Locations
cache_control can be placed on the following locations in a /v1/chat/completions request:
- System message content parts — array content format with type: "text"
- User message content parts — text, image_url, and document parts
- Assistant message content parts — text only (non-text parts are silently dropped)
- Tool definitions — top-level cache_control on the tool object
- Tool calls — top-level cache_control on the tool_call object
- Tool results — array content with text, image_url, and document parts
Claude-only extensions: cache_control, document content parts, is_error on tool results, and assistant/tool array content are Claude-only extensions to the OpenAI format. Non-Claude providers may ignore or degrade these fields.
String content format (e.g., "content": "Hello") does not support cache_control. You must use the array content format to attach cache breakpoints.
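For example, a system message given in string form must be rewritten into the equivalent array form before a breakpoint can be attached. String form (no cache_control possible):

```json
{"role": "system", "content": "You are a helpful assistant."}
```

Array form carrying the same content, with a cache breakpoint:

```json
{
  "role": "system",
  "content": [
    {
      "type": "text",
      "text": "You are a helpful assistant.",
      "cache_control": {"type": "ephemeral"}
    }
  ]
}
```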
Content Type Support by Role
| Role | Supported type Values | Notes |
|---|---|---|
| system | text | Array content format required for cache_control |
| user | text, image_url, document | |
| assistant | text | Non-text parts silently dropped |
| tool (result) | text, image_url, document | Array content for multi-part results |
Caching a system prompt
{
"model": "anthropic/claude-opus-4.6",
"max_tokens": 1024,
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
"cache_control": {"type": "ephemeral"}
}
]
},
{"role": "user", "content": "Summarize the key terms."}
]
}

Caching tools and user content
You can place cache_control on both tool definitions and content parts within the same request:
{
"model": "anthropic/claude-opus-4.6",
"max_tokens": 1024,
"tools": [
{
"type": "function",
"function": {
"name": "search_database",
"description": "Search the product database",
"parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}
},
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Here is the product catalog: ... (large text) ...",
"cache_control": {"type": "ephemeral"}
},
{
"type": "text",
"text": "Find products related to wireless charging."
}
]
}
]
}

Document content parts and tool results
The document content type and is_error on tool results are Claude-only extensions supported in the OpenAI format:
{
"model": "anthropic/claude-opus-4.6",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": [
{
"type": "document",
"source": {"type": "base64", "media_type": "application/pdf", "data": "..."},
"cache_control": {"type": "ephemeral", "ttl": "1h"}
},
{"type": "text", "text": "Summarize this document."}
]
}
]
}

Gemini: Automatic Caching
Gemini models automatically cache the longest matching prefix of your prompt. No explicit markup is needed. If successive requests share the same leading content (system prompt, initial messages, etc.), Gemini reuses the cached prefix.
Best Practices
- Place static content (system prompts, reference documents) at the beginning of the conversation
- Keep the order of messages consistent across requests
- Append new messages at the end rather than modifying earlier ones
Example
curl https://api.pomex.ai/v1/chat/completions \
-H "Authorization: Bearer $POMEX_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemini-2.5-pro",
"messages": [
{"role": "system", "content": "You are an expert code reviewer. Here is the full codebase: ... (large context) ..."},
{"role": "user", "content": "Review the authentication module."}
]
}'

On the next request, keep the system message identical and only change the user message to benefit from automatic caching:
curl https://api.pomex.ai/v1/chat/completions \
-H "Authorization: Bearer $POMEX_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemini-2.5-pro",
"messages": [
{"role": "system", "content": "You are an expert code reviewer. Here is the full codebase: ... (large context) ..."},
{"role": "user", "content": "Now review the database layer."}
]
}'

Explicit Cached Content (Gemini API)
On the Generate Content endpoint you can also pass a pre-created Gemini cached-content resource via the top-level cachedContent field:
{
"contents": [{"role": "user", "parts": [{"text": "Summarize chapter 3."}]}],
"cachedContent": "projects/my-project/cachedContents/abc123"
}

cachedContent is preserved only on the native Gemini dispatch path. If you use the Gemini endpoint with a Claude or GPT model (cross-provider dispatch), the field is silently dropped — those providers use the caching mechanisms documented above.
GPT: Automatic and Explicit Caching
GPT models automatically cache prompt prefixes of 1,024 tokens or more. No explicit markup is needed for basic caching. Cached input tokens are billed at a reduced rate (up to 90% less than base input price) with no separate cache-write fee.
For more control, Pomex passes through two optional parameters:
| Field | Type | Description |
|---|---|---|
| prompt_cache_key | string | A hint to improve cache routing. Requests with the same key are more likely to hit the same cached prefix, but cache hits still require an exact prefix match. |
| prompt_cache_retention | string | Cache retention policy: "in_memory" (default automatic behavior) or "24h" (retain for up to 24 hours, on supported models). |
Best Practices
- Structure prompts so that static content (system instructions, few-shot examples, reference documents) comes first
- Keep the shared prefix identical across requests — even small changes invalidate automatic caching
- Dynamic content (the actual user query) should go at the end
- Use prompt_cache_key to improve cache routing for related requests that reuse the same exact prompt prefix
- Set prompt_cache_retention to "24h" for workloads where the same prompt is reused over longer periods (on supported models)
Example: Automatic Caching
curl https://api.pomex.ai/v1/chat/completions \
-H "Authorization: Bearer $POMEX_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4.1",
"messages": [
{"role": "system", "content": "You are a customer support agent. Here are the support policies: ... (large document) ..."},
{"role": "user", "content": "A customer wants to return a product after 45 days."}
]
}'

Example: Explicit Cache Key with 24h Retention
curl https://api.pomex.ai/v1/chat/completions \
-H "Authorization: Bearer $POMEX_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4.1",
"prompt_cache_key": "support-policies-v2",
"prompt_cache_retention": "24h",
"messages": [
{"role": "system", "content": "You are a customer support agent. Here are the support policies: ... (large document) ..."},
{"role": "user", "content": "A customer wants to return a product after 45 days."}
]
}'

Cache Usage in Responses
Pomex reports cache hit and creation metrics in the response usage object so you can verify caching is working.
OpenAI Format
Cache usage appears in usage.prompt_tokens_details:
{
"usage": {
"prompt_tokens": 12500,
"completion_tokens": 150,
"total_tokens": 12650,
"prompt_tokens_details": {
"cached_tokens": 12000
}
}
}

| Field | Description |
|---|---|
| prompt_tokens | Total input tokens (inclusive of cached and cache-creation tokens) |
| prompt_tokens_details.cached_tokens | Tokens read from cache (reduced cost) |
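If you monitor cache effectiveness programmatically, the ratio of cached to total prompt tokens is the signal to watch. A minimal sketch against the usage shape above (the helper name is ours, not part of any SDK):

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from cache (OpenAI response format)."""
    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# Usage object from the example response above
usage = {
    "prompt_tokens": 12500,
    "completion_tokens": 150,
    "total_tokens": 12650,
    "prompt_tokens_details": {"cached_tokens": 12000},
}
print(f"{cache_hit_ratio(usage):.0%} of prompt tokens were cached")
```

A ratio near zero on repeated requests usually means the prefix is changing between calls or is below the minimum cacheable size.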
Anthropic Format
Cache usage appears directly on the usage object:
{
"usage": {
"input_tokens": 500,
"output_tokens": 150,
"cache_creation_input_tokens": 12000,
"cache_read_input_tokens": 0
}
}

On the first request, cache_creation_input_tokens reflects the tokens written to cache. Subsequent requests with the same prefix will show cache_read_input_tokens instead:
{
"usage": {
"input_tokens": 500,
"output_tokens": 140,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 12000,
"cache_creation": {
"ephemeral_5m_input_tokens": 0,
"ephemeral_1h_input_tokens": 0
}
}
}

| Field | Description |
|---|---|
| input_tokens | Fresh (non-cached) input tokens |
| cache_creation_input_tokens | Tokens written to cache on this request |
| cache_read_input_tokens | Tokens read from cache (reduced cost) |
| cache_creation.ephemeral_5m_input_tokens | Cache creation tokens with 5-minute TTL |
| cache_creation.ephemeral_1h_input_tokens | Cache creation tokens with 1-hour TTL |
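These fields make it straightforward to tell whether a request warmed the cache or hit it. A small helper sketched against the two usage shapes above (the function and its labels are our own, not an SDK API):

```python
def cache_status(usage: dict) -> str:
    """Classify a Claude response by its cache activity (Anthropic format)."""
    if usage.get("cache_read_input_tokens", 0):
        return "warm"      # prefix was served from cache
    if usage.get("cache_creation_input_tokens", 0):
        return "cold"      # prefix was written to cache on this request
    return "uncached"      # no cache activity (e.g. below the minimum token count)

# Usage objects from the two example responses above
first = {"input_tokens": 500, "output_tokens": 150,
         "cache_creation_input_tokens": 12000, "cache_read_input_tokens": 0}
second = {"input_tokens": 500, "output_tokens": 140,
          "cache_creation_input_tokens": 0, "cache_read_input_tokens": 12000}
print(cache_status(first), cache_status(second))  # cold warm
```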
Sticky Routing
Pomex automatically improves prompt cache hit rates using sticky routing. When a request is served successfully, Pomex remembers which provider instance handled it. Subsequent requests with similar content are routed to the same instance, maximizing the chance that the provider's prompt cache is warm.
Sticky routing is fully automatic and requires no configuration. It works across all providers and both API formats (/v1/chat/completions and /v1/messages).
How It Works
- Content hashing — Pomex computes a one-way hash from the system/developer message and the first user message in the request. Only the short hash is stored — your prompt content is never saved. If prompt_cache_key is set, it is included in the hash for more precise routing.
- Cache lookup — Before routing, Pomex checks if a previous request with the same content hash (scoped to your organization and model) was served by a specific provider instance.
- Preferred routing — On a cache hit, the request is directed to the same instance. If that instance is unavailable, Pomex falls back to normal routing.
- Cache update — After a successful response, the routing preference is stored (or refreshed) with a short TTL.
When Sticky Routing Helps
Sticky routing is most beneficial when your application makes multiple requests that share a common prompt prefix — for example, a chatbot with a long system prompt, or an agent that sends the same tool definitions on every turn. By routing these requests to the same backend instance, the provider is more likely to serve them from its prompt cache.
This is especially valuable with multi-account provider pools, where round-robin load balancing would otherwise spread requests across instances and reduce cache hit rates.
Interaction with prompt_cache_key
For GPT models, if you set prompt_cache_key, it is included in the sticky routing hash. This means requests with the same cache key are routed to the same instance, combining OpenAI's explicit cache routing with Pomex's instance-level routing for maximum cache efficiency.
Pricing Impact
Cached tokens are billed at reduced rates compared to fresh input tokens. The exact discount depends on the provider:
| Provider | Cache Write Cost | Cache Read Cost |
|---|---|---|
| Claude | 25% more than base input price | 90% less than base input price |
| Gemini | No separate write fee | Varies by caching mode; implicit cache hits are typically discounted (check Google pricing for current rates) |
| GPT | No separate write fee | Up to 90% less than base input price |
For high-volume use cases with long, repeated prompts, caching can significantly reduce both cost and latency.
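To make the Claude numbers concrete, here is a back-of-the-envelope calculation. The base per-token price is a placeholder (check current pricing); only the 1.25x write and 0.10x read multipliers come from the table above.

```python
base = 3.00              # $ per million input tokens (placeholder price)
write = base * 1.25      # cache write: 25% surcharge over base input
read = base * 0.10       # cache read: 90% discount from base input

prefix, fresh, n = 12_000, 500, 100  # cached prefix, per-request tokens, requests

uncached = n * (prefix + fresh) * base / 1e6
cached = (prefix * write                # request 1 writes the prefix
          + (n - 1) * prefix * read    # requests 2..n read it from cache
          + n * fresh * base) / 1e6    # dynamic tokens always at full price

print(f"${uncached:.2f} uncached vs ${cached:.2f} cached "
      f"({1 - cached / uncached:.0%} saved)")
```

With these assumed numbers, 100 requests inside the cache TTL cost roughly 85% less than the same traffic without caching, since the 12,000-token prefix is paid at full price (plus the write surcharge) only once.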
Tips
- Put static content first. System prompts, reference documents, and tool definitions should appear before dynamic user queries.
- Avoid modifying cached prefixes. Even a single character change in a cached block invalidates the cache for that block and everything after it.
- Use longer TTLs for stable content. For Claude, set "ttl": "1h" on content that remains constant across many requests (where supported by the model and backend). For GPT, set prompt_cache_retention to "24h" on supported models.
- Place breakpoints strategically. For Claude, put cache_control on the last block you want included in the cached prefix. You can use up to 4 breakpoints per request.
- Monitor cache usage. Check the cached_tokens (OpenAI format) or cache_read_input_tokens (Anthropic format) fields to verify caching is working.
- Keep system prompts consistent. Sticky routing hashes the system message and first user message. Keeping your system prompt identical across requests ensures they are routed to the same backend instance for better cache hits.