Prompt Caching
Reduce latency and cost by caching frequently reused prompt content. When the same content appears across requests, the provider can skip reprocessing cached tokens and charge a reduced rate.
Prompt caching is handled by the underlying providers. Pomex passes through caching directives and reports cache usage in the response.
How It Works
Each provider implements prompt caching differently, but the core idea is the same: prefix content that is identical across requests is processed once and reused from cache on subsequent calls.
| Provider | Mechanism | Minimum Cacheable Tokens | Cache Lifetime |
|---|---|---|---|
| Claude | Explicit breakpoints via cache_control | 1,024 tokens (Haiku 4.5, Opus 4.5, Opus 4.6: 4,096) | 5 minutes (default) or 1 hour (where supported) |
| Gemini | Automatic — longest matching prefix is cached | Provider-managed | Automatic (provider-managed) |
| GPT | Automatic prefix caching; optional prompt_cache_key for explicit control | 1,024 tokens | Automatic (minutes); up to 24 hours with prompt_cache_retention on supported models |
Claude: Explicit Cache Breakpoints
Claude uses explicit cache_control breakpoints. You mark the point in your prompt where the cache should end using cache_control on content blocks, system blocks, and tool definitions. Everything up to and including the marked block becomes a cacheable prefix.
Supported Locations
cache_control can be placed on:
- System text blocks
- Tool definitions
- User message content blocks (text, image, document, tool_result)
- Assistant message content blocks (text, tool_use)
Pomex applies cache_control as per-block breakpoints on individual content blocks, system blocks, and tools. This approach is supported by all Claude backends (AWS Bedrock, GCP Vertex, and direct Anthropic API). The top-level cache_control convenience field (which auto-applies a breakpoint to the last cacheable block) is not used, as it is not supported on Bedrock and Vertex.
cache_control Object
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Required | "ephemeral" |
| ttl | string | Optional | "5m" (default) or "1h" (where supported by the model and backend) |
Anthropic Format (/v1/messages)
Add cache_control directly on content blocks, system blocks, and tools. Pomex passes these through to Claude.
Caching a system prompt
curl https://api.pomex.ai/v1/messages \
-H "x-api-key: $POMEX_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-H "Content-Type: application/json" \
-d '{
"model": "anthropic/claude-opus-4.6",
"max_tokens": 1024,
"system": [
{
"type": "text",
"text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{"role": "user", "content": "Summarize the key terms."}
]
}'

from anthropic import Anthropic
client = Anthropic(
base_url="https://api.pomex.ai",
api_key="YOUR_API_KEY",
)
message = client.messages.create(
model="anthropic/claude-opus-4.6",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
"cache_control": {"type": "ephemeral"},
}
],
messages=[
{"role": "user", "content": "Summarize the key terms."}
],
)
print(message.content[0].text)

Caching tools and conversation history
{
"model": "anthropic/claude-opus-4.6",
"max_tokens": 1024,
"tools": [
{
"name": "search_database",
"description": "Search the product database",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string"}
},
"required": ["query"]
},
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Here is the product catalog: ... (large text) ...",
"cache_control": {"type": "ephemeral"}
},
{
"type": "text",
"text": "Find products related to wireless charging."
}
]
}
]
}

Using 1-hour TTL
For content that should remain cached longer, set the TTL to "1h" (where supported by the model and backend):
{
"type": "text",
"text": "Reference documentation that rarely changes...",
"cache_control": {"type": "ephemeral", "ttl": "1h"}
}

OpenAI Format (/v1/chat/completions)
The OpenAI-compatible endpoint (/v1/chat/completions) now supports cache_control as a Claude-only request extension. When targeting Claude models, you can place cache_control on content parts, tool definitions, and tool calls just as you would with the Anthropic format. Non-Claude providers ignore these fields.
Supported Locations
cache_control can be placed on the following locations in a /v1/chat/completions request:
- System message content parts — array content format with type: "text"
- User message content parts — text, image_url, and document parts
- Assistant message content parts — text only (non-text parts are silently dropped)
- Tool definitions — top-level cache_control on the tool object
- Tool calls — top-level cache_control on the tool_call object
- Tool results — array content with text, image_url, and document parts
Claude-only extensions: cache_control, document content parts, is_error on tool results, and assistant/tool array content are Claude-only extensions to the OpenAI format. Non-Claude providers may ignore or degrade these fields.
String content format (e.g., "content": "Hello") does not support cache_control. You must use the array content format to attach cache breakpoints.
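For example, a system message given in string form must be rewritten into the equivalent array form before a breakpoint can be attached. String form (no cache_control possible):

```json
{"role": "system", "content": "You are a helpful assistant."}
```

Array form carrying the same content, with a cache breakpoint:

```json
{
  "role": "system",
  "content": [
    {
      "type": "text",
      "text": "You are a helpful assistant.",
      "cache_control": {"type": "ephemeral"}
    }
  ]
}
```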
Content Type Support by Role
| Role | Supported type Values | Notes |
|---|---|---|
| system | text | Array content format required for cache_control |
| user | text, image_url, document | |
| assistant | text | Non-text parts silently dropped |
| tool (result) | text, image_url, document | Array content for multi-part results |
Caching a system prompt
{
"model": "anthropic/claude-opus-4.6",
"max_tokens": 1024,
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
"cache_control": {"type": "ephemeral"}
}
]
},
{"role": "user", "content": "Summarize the key terms."}
]
}

Caching tools and user content
You can place cache_control on both tool definitions and content parts within the same request:
{
"model": "anthropic/claude-opus-4.6",
"max_tokens": 1024,
"tools": [
{
"type": "function",
"function": {
"name": "search_database",
"description": "Search the product database",
"parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}
},
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Here is the product catalog: ... (large text) ...",
"cache_control": {"type": "ephemeral"}
},
{
"type": "text",
"text": "Find products related to wireless charging."
}
]
}
]
}

Document content parts and tool results
The document content type and is_error on tool results are Claude-only extensions supported in the OpenAI format:
{
"model": "anthropic/claude-opus-4.6",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": [
{
"type": "document",
"source": {"type": "base64", "media_type": "application/pdf", "data": "..."},
"cache_control": {"type": "ephemeral", "ttl": "1h"}
},
{"type": "text", "text": "Summarize this document."}
]
}
]
}

Gemini: Automatic Caching
Gemini models automatically cache the longest matching prefix of your prompt. No explicit markup is needed. If successive requests share the same leading content (system prompt, initial messages, etc.), Gemini reuses the cached prefix.
Best Practices
- Place static content (system prompts, reference documents) at the beginning of the conversation
- Keep the order of messages consistent across requests
- Append new messages at the end rather than modifying earlier ones
Example
curl https://api.pomex.ai/v1/chat/completions \
-H "Authorization: Bearer $POMEX_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemini-2.5-pro",
"messages": [
{"role": "system", "content": "You are an expert code reviewer. Here is the full codebase: ... (large context) ..."},
{"role": "user", "content": "Review the authentication module."}
]
}'

On the next request, keep the system message identical and only change the user message to benefit from automatic caching:
curl https://api.pomex.ai/v1/chat/completions \
-H "Authorization: Bearer $POMEX_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemini-2.5-pro",
"messages": [
{"role": "system", "content": "You are an expert code reviewer. Here is the full codebase: ... (large context) ..."},
{"role": "user", "content": "Now review the database layer."}
]
}'

Explicit Cached Content (Gemini API)
On the Generate Content endpoint you can also pass a pre-created Gemini cached-content resource via the top-level cachedContent field:
{
"contents": [{"role": "user", "parts": [{"text": "Summarize chapter 3."}]}],
"cachedContent": "projects/my-project/cachedContents/abc123"
}

cachedContent is preserved only on the native Gemini dispatch path. If you use the Gemini endpoint with a Claude or GPT model (cross-provider dispatch), the field is silently dropped — those providers use the caching mechanisms documented above.
GPT: Automatic and Explicit Caching
GPT models automatically cache prompt prefixes of 1,024 tokens or more. No explicit markup is needed for basic caching. Cached input tokens are billed at a reduced rate (up to 90% less than base input price) with no separate cache-write fee.
For more control, Pomex passes through two optional parameters:
| Field | Type | Description |
|---|---|---|
| prompt_cache_key | string | A hint to improve cache routing. Requests with the same key are more likely to hit the same cached prefix, but cache hits still require an exact prefix match. |
| prompt_cache_retention | string | Cache retention policy: "in_memory" (default automatic behavior) or "24h" (retain for up to 24 hours, on supported models). |
Best Practices
- Structure prompts so that static content (system instructions, few-shot examples, reference documents) comes first
- Keep the shared prefix identical across requests — even small changes invalidate automatic caching
- Dynamic content (the actual user query) should go at the end
- Use prompt_cache_key to improve cache routing for related requests that reuse the same exact prompt prefix
- Set prompt_cache_retention to "24h" for workloads where the same prompt is reused over longer periods (on supported models)
Example: Automatic Caching
curl https://api.pomex.ai/v1/chat/completions \
-H "Authorization: Bearer $POMEX_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4.1",
"messages": [
{"role": "system", "content": "You are a customer support agent. Here are the support policies: ... (large document) ..."},
{"role": "user", "content": "A customer wants to return a product after 45 days."}
]
}'

Example: Explicit Cache Key with 24h Retention
curl https://api.pomex.ai/v1/chat/completions \
-H "Authorization: Bearer $POMEX_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4.1",
"prompt_cache_key": "support-policies-v2",
"prompt_cache_retention": "24h",
"messages": [
{"role": "system", "content": "You are a customer support agent. Here are the support policies: ... (large document) ..."},
{"role": "user", "content": "A customer wants to return a product after 45 days."}
]
}'

Cache Usage in Responses
Pomex reports cache hit and creation metrics in the response usage object so you can verify caching is working.
OpenAI Format
Cache usage appears in usage.prompt_tokens_details:
{
"usage": {
"prompt_tokens": 12500,
"completion_tokens": 150,
"total_tokens": 12650,
"prompt_tokens_details": {
"cached_tokens": 12000
}
}
}

| Field | Description |
|---|---|
| prompt_tokens | Total input tokens (inclusive of cached and cache-creation tokens) |
| prompt_tokens_details.cached_tokens | Tokens read from cache (reduced cost) |
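If you monitor cache effectiveness programmatically, the ratio of cached to total prompt tokens is the signal to watch. A minimal sketch against the usage shape above (the helper name is ours, not part of any SDK):

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from cache (OpenAI response format)."""
    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# Usage object from the example response above
usage = {
    "prompt_tokens": 12500,
    "completion_tokens": 150,
    "total_tokens": 12650,
    "prompt_tokens_details": {"cached_tokens": 12000},
}
print(f"{cache_hit_ratio(usage):.0%} of prompt tokens were cached")
```

A ratio near zero on repeated requests usually means the prefix is changing between calls or is below the minimum cacheable size.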
Anthropic Format
Cache usage appears directly on the usage object:
{
"usage": {
"input_tokens": 500,
"output_tokens": 150,
"cache_creation_input_tokens": 12000,
"cache_read_input_tokens": 0
}
}

On the first request, cache_creation_input_tokens reflects the tokens written to cache. Subsequent requests with the same prefix will show cache_read_input_tokens instead:
{
"usage": {
"input_tokens": 500,
"output_tokens": 140,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 12000,
"cache_creation": {
"ephemeral_5m_input_tokens": 0,
"ephemeral_1h_input_tokens": 0
}
}
}

| Field | Description |
|---|---|
| input_tokens | Fresh (non-cached) input tokens |
| cache_creation_input_tokens | Tokens written to cache on this request |
| cache_read_input_tokens | Tokens read from cache (reduced cost) |
| cache_creation.ephemeral_5m_input_tokens | Cache creation tokens with 5-minute TTL |
| cache_creation.ephemeral_1h_input_tokens | Cache creation tokens with 1-hour TTL |
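These fields make it straightforward to tell whether a request warmed the cache or hit it. A small helper sketched against the two usage shapes above (the function and its labels are our own, not an SDK API):

```python
def cache_status(usage: dict) -> str:
    """Classify a Claude response by its cache activity (Anthropic format)."""
    if usage.get("cache_read_input_tokens", 0):
        return "warm"      # prefix was served from cache
    if usage.get("cache_creation_input_tokens", 0):
        return "cold"      # prefix was written to cache on this request
    return "uncached"      # no cache activity (e.g. below the minimum token count)

# Usage objects from the two example responses above
first = {"input_tokens": 500, "output_tokens": 150,
         "cache_creation_input_tokens": 12000, "cache_read_input_tokens": 0}
second = {"input_tokens": 500, "output_tokens": 140,
          "cache_creation_input_tokens": 0, "cache_read_input_tokens": 12000}
print(cache_status(first), cache_status(second))  # cold warm
```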
Sticky Routing
Pomex automatically improves prompt cache hit rates using sticky routing. When a request is served successfully, Pomex remembers which provider instance handled it. Subsequent requests with similar content are routed to the same instance, maximizing the chance that the provider's prompt cache is warm.
Sticky routing is fully automatic and requires no configuration. It works across all providers and both API formats (/v1/chat/completions and /v1/messages).
How It Works
- Content hashing — Pomex computes a one-way hash from the system/developer message and the first user message in the request. Only the short hash is stored — your prompt content is never saved. If prompt_cache_key is set, it is included in the hash for more precise routing.
- Cache lookup — Before routing, Pomex checks if a previous request with the same content hash (scoped to your organization and model) was served by a specific provider instance.
- Preferred routing — On a cache hit, the request is directed to the same instance. If that instance is unavailable, Pomex falls back to normal routing.
- Cache update — After a successful response, the routing preference is stored (or refreshed) with a short TTL.
When Sticky Routing Helps
Sticky routing is most beneficial when your application makes multiple requests that share a common prompt prefix — for example, a chatbot with a long system prompt, or an agent that sends the same tool definitions on every turn. By routing these requests to the same backend instance, the provider is more likely to serve them from its prompt cache.
This is especially valuable with multi-account provider pools, where round-robin load balancing would otherwise spread requests across instances and reduce cache hit rates.
Interaction with prompt_cache_key
For GPT models, if you set prompt_cache_key, it is included in the sticky routing hash. This means requests with the same cache key are routed to the same instance, combining OpenAI's explicit cache routing with Pomex's instance-level routing for maximum cache efficiency.
Pricing Impact
Cached tokens are billed at reduced rates compared to fresh input tokens. The exact discount depends on the provider:
| Provider | Cache Write Cost | Cache Read Cost |
|---|---|---|
| Claude | 25% more than base input price | 90% less than base input price |
| Gemini | No separate write fee | Varies by caching mode; implicit cache hits are typically discounted (check Google pricing for current rates) |
| GPT | No separate write fee | Up to 90% less than base input price |
For high-volume use cases with long, repeated prompts, caching can significantly reduce both cost and latency.
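To make the Claude numbers concrete, here is a back-of-the-envelope calculation. The base per-token price is a placeholder (check current pricing); only the 1.25x write and 0.10x read multipliers come from the table above.

```python
base = 3.00              # $ per million input tokens (placeholder price)
write = base * 1.25      # cache write: 25% surcharge over base input
read = base * 0.10       # cache read: 90% discount from base input

prefix, fresh, n = 12_000, 500, 100  # cached prefix, per-request tokens, requests

uncached = n * (prefix + fresh) * base / 1e6
cached = (prefix * write                # request 1 writes the prefix
          + (n - 1) * prefix * read    # requests 2..n read it from cache
          + n * fresh * base) / 1e6    # dynamic tokens always at full price

print(f"${uncached:.2f} uncached vs ${cached:.2f} cached "
      f"({1 - cached / uncached:.0%} saved)")
```

With these assumed numbers, 100 requests inside the cache TTL cost roughly 85% less than the same traffic without caching, since the 12,000-token prefix is paid at full price (plus the write surcharge) only once.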
Tips
- Put static content first. System prompts, reference documents, and tool definitions should appear before dynamic user queries.
- Avoid modifying cached prefixes. Even a single character change in a cached block invalidates the cache for that block and everything after it.
- Use longer TTLs for stable content. For Claude, set "ttl": "1h" on content that remains constant across many requests (where supported by the model and backend). For GPT, set prompt_cache_retention to "24h" on supported models.
- Place breakpoints strategically. For Claude, put cache_control on the last block you want included in the cached prefix. You can use up to 4 breakpoints per request.
- Monitor cache usage. Check the cached_tokens (OpenAI format) or cache_read_input_tokens (Anthropic format) fields to verify caching is working.
- Keep system prompts consistent. Sticky routing hashes the system message and first user message. Keeping your system prompt identical across requests ensures they are routed to the same backend instance for better cache hits.