Prompt Caching

Reduce latency and cost by caching frequently reused prompt content. When the same content appears across requests, the provider can skip reprocessing cached tokens and charge a reduced rate.

Prompt caching is handled by the underlying providers. Pomex passes through caching directives and reports cache usage in the response.


How It Works

Each provider implements prompt caching differently, but the core idea is the same: prefix content that is identical across requests is processed once and reused from cache on subsequent calls.

| Provider | Mechanism | Minimum cacheable tokens | Cache lifetime |
| --- | --- | --- | --- |
| Claude | Explicit breakpoints via cache_control | 1,024 tokens (Haiku 4.5, Opus 4.5, Opus 4.6: 4,096) | 5 minutes (default) or 1 hour (where supported) |
| Gemini | Automatic; the longest matching prefix is cached | Provider-managed | Automatic (provider-managed) |
| GPT | Automatic prefix caching; optional prompt_cache_key for explicit control | 1,024 tokens | Automatic (minutes); up to 24 hours with prompt_cache_retention on supported models |

Claude: Explicit Cache Breakpoints

Claude uses explicit cache_control breakpoints. You mark the point in your prompt where the cache should end using cache_control on content blocks, system blocks, and tool definitions. Everything up to and including the marked block becomes a cacheable prefix.

Supported Locations

cache_control can be placed on:

  - individual content blocks within messages
  - system blocks
  - tool definitions

Pomex applies cache_control as per-block breakpoints on individual content blocks, system blocks, and tools. This approach is supported by all Claude backends (AWS Bedrock, GCP Vertex, and the direct Anthropic API). The top-level cache_control convenience field (which auto-applies a breakpoint to the last cacheable block) is not used, as it is not supported on Bedrock and Vertex.

cache_control Object

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Required | "ephemeral" |
| ttl | string | Optional | "5m" (default) or "1h" (where supported by the model and backend) |

Anthropic Format (/v1/messages)

Add cache_control directly on content blocks, system blocks, and tools. Pomex passes these through to Claude.

Caching a system prompt

curl https://api.pomex.ai/v1/messages \
  -H "x-api-key: $POMEX_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-opus-4.6",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {"role": "user", "content": "Summarize the key terms."}
    ]
  }'

The same request using the Anthropic Python SDK:

from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.pomex.ai",
    api_key="YOUR_API_KEY",
)

message = client.messages.create(
    model="anthropic/claude-opus-4.6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Summarize the key terms."}
    ],
)
print(message.content[0].text)

Caching tools and conversation history

{
  "model": "anthropic/claude-opus-4.6",
  "max_tokens": 1024,
  "tools": [
    {
      "name": "search_database",
      "description": "Search the product database",
      "input_schema": {
        "type": "object",
        "properties": {
          "query": {"type": "string"}
        },
        "required": ["query"]
      },
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Here is the product catalog: ... (large text) ...",
          "cache_control": {"type": "ephemeral"}
        },
        {
          "type": "text",
          "text": "Find products related to wireless charging."
        }
      ]
    }
  ]
}

Using 1-hour TTL

For content that should remain cached longer, set the TTL to "1h" (where supported by the model and backend):

{
  "type": "text",
  "text": "Reference documentation that rarely changes...",
  "cache_control": {"type": "ephemeral", "ttl": "1h"}
}

OpenAI Format (/v1/chat/completions)

The OpenAI-compatible endpoint (/v1/chat/completions) supports cache_control as a Claude-only request extension. When targeting Claude models, you can place cache_control on content parts, tool definitions, and tool calls just as you would with the Anthropic format. Non-Claude providers ignore these fields.

Supported Locations

cache_control can be placed on the following locations in a /v1/chat/completions request:

  - content parts in messages (array content format)
  - tool definitions
  - tool calls on assistant messages

Claude-only extensions: cache_control, document content parts, is_error on tool results, and assistant/tool array content are Claude-only extensions to the OpenAI format. Non-Claude providers may ignore or degrade these fields.

String content format (e.g., "content": "Hello") does not support cache_control. You must use the array content format to attach cache breakpoints.
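
Converting string content to the array form can be sketched as follows (the helper name is illustrative, not part of any SDK):

```python
# Illustrative: convert OpenAI-style string content into the array content
# format, which is required before a cache_control breakpoint can be attached.
def to_cacheable_content(message: dict) -> dict:
    content = message["content"]
    if isinstance(content, str):
        # "content": "Hello" -> [{"type": "text", "text": "Hello"}]
        content = [{"type": "text", "text": content}]
    content = [dict(part) for part in content]          # copy before mutating
    content[-1]["cache_control"] = {"type": "ephemeral"}
    return {**message, "content": content}

msg = to_cacheable_content({"role": "system", "content": "Long policy text..."})
```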

Content Type Support by Role

| Role | Supported type values | Notes |
| --- | --- | --- |
| system | text | Array content format required for cache_control |
| user | text, image_url, document | |
| assistant | text | Non-text parts are silently dropped |
| tool (result) | text, image_url, document | Array content format for multi-part results |

Caching a system prompt

{
  "model": "anthropic/claude-opus-4.6",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a legal assistant. Here is the full text of the contract to analyze: ... (long document) ...",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {"role": "user", "content": "Summarize the key terms."}
  ]
}

Caching tools and user content

You can place cache_control on both tool definitions and content parts within the same request:

{
  "model": "anthropic/claude-opus-4.6",
  "max_tokens": 1024,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "search_database",
        "description": "Search the product database",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}
      },
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Here is the product catalog: ... (large text) ...",
          "cache_control": {"type": "ephemeral"}
        },
        {
          "type": "text",
          "text": "Find products related to wireless charging."
        }
      ]
    }
  ]
}

Document content parts and tool results

The document content type and is_error on tool results are Claude-only extensions supported in the OpenAI format:

{
  "model": "anthropic/claude-opus-4.6",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "document",
          "source": {"type": "base64", "media_type": "application/pdf", "data": "..."},
          "cache_control": {"type": "ephemeral", "ttl": "1h"}
        },
        {"type": "text", "text": "Summarize this document."}
      ]
    }
  ]
}

Gemini: Automatic Caching

Gemini models automatically cache the longest matching prefix of your prompt. No explicit markup is needed. If successive requests share the same leading content (system prompt, initial messages, etc.), Gemini reuses the cached prefix.

Best Practices

  - Put large, stable content (system prompts, reference documents, codebases) at the start of the prompt.
  - Keep the leading content identical across requests and vary only the trailing messages, so the longest matching prefix stays long.
Example

curl https://api.pomex.ai/v1/chat/completions \
  -H "Authorization: Bearer $POMEX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-2.5-pro",
    "messages": [
      {"role": "system", "content": "You are an expert code reviewer. Here is the full codebase: ... (large context) ..."},
      {"role": "user", "content": "Review the authentication module."}
    ]
  }'

On the next request, keep the system message identical and only change the user message to benefit from automatic caching:

curl https://api.pomex.ai/v1/chat/completions \
  -H "Authorization: Bearer $POMEX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-2.5-pro",
    "messages": [
      {"role": "system", "content": "You are an expert code reviewer. Here is the full codebase: ... (large context) ..."},
      {"role": "user", "content": "Now review the database layer."}
    ]
  }'

Explicit Cached Content (Gemini API)

On the Generate Content endpoint you can also pass a pre-created Gemini cached-content resource via the top-level cachedContent field:

{
  "contents": [{"role": "user", "parts": [{"text": "Summarize chapter 3."}]}],
  "cachedContent": "projects/my-project/cachedContents/abc123"
}

cachedContent is preserved only on the native Gemini dispatch path. If you use the Gemini endpoint with a Claude or GPT model (cross-provider dispatch), the field is dropped silently — those providers use the caching mechanisms documented above.


GPT: Automatic and Explicit Caching

GPT models automatically cache prompt prefixes of 1,024 tokens or more. No explicit markup is needed for basic caching. Cached input tokens are billed at a reduced rate (up to 90% less than base input price) with no separate cache-write fee.

For more control, Pomex passes through two optional parameters:

| Field | Type | Description |
| --- | --- | --- |
| prompt_cache_key | string | A hint to improve cache routing. Requests with the same key are more likely to hit the same cached prefix, but cache hits still require an exact prefix match. |
| prompt_cache_retention | string | Cache retention policy: "in_memory" (default automatic behavior) or "24h" (retain for up to 24 hours, on supported models). |
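
Both fields sit at the top level of the request body, alongside model and messages. A minimal sketch (the helper is illustrative, not part of any SDK):

```python
# Illustrative: assemble a /v1/chat/completions body that opts into
# explicit GPT cache routing via prompt_cache_key and prompt_cache_retention.
def build_cached_request(system_text: str, user_text: str,
                         cache_key: str, retention: str = "24h") -> dict:
    return {
        "model": "openai/gpt-4.1",
        "prompt_cache_key": cache_key,        # routing hint; exact prefix match still required
        "prompt_cache_retention": retention,  # "in_memory" or "24h" (supported models)
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }

payload = build_cached_request(
    "You are a customer support agent. Policies: ...",
    "A customer wants to return a product after 45 days.",
    cache_key="support-policies-v2",
)
```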

Best Practices

  - Structure prompts with stable content first, so the shared prefix reaches the 1,024-token minimum.
  - Use prompt_cache_key when distinct workloads share the same prefix, to steer requests toward the same cached prefix.
  - Set prompt_cache_retention to "24h" for prompts reused over longer intervals (on supported models).

Example: Automatic Caching

curl https://api.pomex.ai/v1/chat/completions \
  -H "Authorization: Bearer $POMEX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4.1",
    "messages": [
      {"role": "system", "content": "You are a customer support agent. Here are the support policies: ... (large document) ..."},
      {"role": "user", "content": "A customer wants to return a product after 45 days."}
    ]
  }'

Example: Explicit Cache Key with 24h Retention

curl https://api.pomex.ai/v1/chat/completions \
  -H "Authorization: Bearer $POMEX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4.1",
    "prompt_cache_key": "support-policies-v2",
    "prompt_cache_retention": "24h",
    "messages": [
      {"role": "system", "content": "You are a customer support agent. Here are the support policies: ... (large document) ..."},
      {"role": "user", "content": "A customer wants to return a product after 45 days."}
    ]
  }'

Cache Usage in Responses

Pomex reports cache hit and creation metrics in the response usage object so you can verify caching is working.

OpenAI Format

Cache usage appears in usage.prompt_tokens_details:

{
  "usage": {
    "prompt_tokens": 12500,
    "completion_tokens": 150,
    "total_tokens": 12650,
    "prompt_tokens_details": {
      "cached_tokens": 12000
    }
  }
}

| Field | Description |
| --- | --- |
| prompt_tokens | Total input tokens (inclusive of cached and cache-creation tokens) |
| prompt_tokens_details.cached_tokens | Tokens read from cache (billed at a reduced rate) |
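
To verify caching is working in practice, you can compute the share of input tokens served from cache. A small sketch over a usage object shaped like the example above:

```python
# Fraction of prompt tokens served from cache, given an OpenAI-format
# usage object. A ratio near zero means the prefix is not being reused.
def cache_hit_ratio(usage: dict) -> float:
    prompt = usage["prompt_tokens"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

usage = {
    "prompt_tokens": 12500,
    "completion_tokens": 150,
    "total_tokens": 12650,
    "prompt_tokens_details": {"cached_tokens": 12000},
}
ratio = cache_hit_ratio(usage)  # 12000 / 12500 = 0.96
```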

Anthropic Format

Cache usage appears directly on the usage object:

{
  "usage": {
    "input_tokens": 500,
    "output_tokens": 150,
    "cache_creation_input_tokens": 12000,
    "cache_read_input_tokens": 0
  }
}

On the first request, cache_creation_input_tokens reflects the tokens written to cache. Subsequent requests with the same prefix will show cache_read_input_tokens instead:

{
  "usage": {
    "input_tokens": 500,
    "output_tokens": 140,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 12000,
    "cache_creation": {
      "ephemeral_5m_input_tokens": 0,
      "ephemeral_1h_input_tokens": 0
    }
  }
}

| Field | Description |
| --- | --- |
| input_tokens | Fresh (non-cached) input tokens |
| cache_creation_input_tokens | Tokens written to cache on this request |
| cache_read_input_tokens | Tokens read from cache (billed at a reduced rate) |
| cache_creation.ephemeral_5m_input_tokens | Cache-creation tokens with a 5-minute TTL |
| cache_creation.ephemeral_1h_input_tokens | Cache-creation tokens with a 1-hour TTL |
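
These fields can be combined into an effective billed-input figure. A sketch using the Claude multipliers from the Pricing Impact section (cache writes cost 1.25x the base input price, cache reads 0.10x):

```python
# Effective billed input, in units of the base input-token price, from an
# Anthropic-format usage object. Multipliers: write = 1.25x, read = 0.10x.
def effective_input_tokens(usage: dict) -> float:
    return (usage["input_tokens"]
            + 1.25 * usage.get("cache_creation_input_tokens", 0)
            + 0.10 * usage.get("cache_read_input_tokens", 0))

first = {"input_tokens": 500, "output_tokens": 150,
         "cache_creation_input_tokens": 12000, "cache_read_input_tokens": 0}
later = {"input_tokens": 500, "output_tokens": 140,
         "cache_creation_input_tokens": 0, "cache_read_input_tokens": 12000}

effective_input_tokens(first)  # 500 + 1.25 * 12000 = 15500.0
effective_input_tokens(later)  # 500 + 0.10 * 12000 = 1700.0
```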

Sticky Routing

Pomex automatically improves prompt cache hit rates using sticky routing. When a request is served successfully, Pomex remembers which provider instance handled it. Subsequent requests with similar content are routed to the same instance, maximizing the chance that the provider's prompt cache is warm.

Sticky routing is fully automatic and requires no configuration. It works across all providers and both API formats (/v1/chat/completions and /v1/messages).

How It Works

  1. Content hashing — Pomex computes a one-way hash from the system/developer message and the first user message in the request. Only the short hash is stored — your prompt content is never saved. If prompt_cache_key is set, it is included in the hash for more precise routing.
  2. Cache lookup — Before routing, Pomex checks if a previous request with the same content hash (scoped to your organization and model) was served by a specific provider instance.
  3. Preferred routing — On a cache hit, the request is directed to the same instance. If that instance is unavailable, Pomex falls back to normal routing.
  4. Cache update — After a successful response, the routing preference is stored (or refreshed) with a short TTL.

When Sticky Routing Helps

Sticky routing is most beneficial when your application makes multiple requests that share a common prompt prefix — for example, a chatbot with a long system prompt, or an agent that sends the same tool definitions on every turn. By routing these requests to the same backend instance, the provider is more likely to serve them from its prompt cache.

This is especially valuable with multi-account provider pools, where round-robin load balancing would otherwise spread requests across instances and reduce cache hit rates.

Interaction with prompt_cache_key

For GPT models, if you set prompt_cache_key, it is included in the sticky routing hash. This means requests with the same cache key are routed to the same instance, combining OpenAI's explicit cache routing with Pomex's instance-level routing for maximum cache efficiency.


Pricing Impact

Cached tokens are billed at reduced rates compared to fresh input tokens. The exact discount depends on the provider:

| Provider | Cache write cost | Cache read cost |
| --- | --- | --- |
| Claude | 25% more than the base input price | 90% less than the base input price |
| Gemini | No separate write fee | Varies by caching mode; implicit cache hits are typically discounted (check Google pricing for current rates) |
| GPT | No separate write fee | Up to 90% less than the base input price |
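
With Claude's multipliers, the arithmetic favors caching after a single reuse: for a prefix of P tokens reused n times, caching costs 1.25·P + 0.10·P·n versus P·(1 + n) without it. A worked sketch:

```python
# Worked example with the Claude multipliers from the table above:
# cache writes cost 1.25x the base input price, cache reads 0.10x.
P = 10_000                                   # cached prefix length in tokens

def cost_with_cache(n: int) -> float:        # in base-input-token units
    return 1.25 * P + 0.10 * P * n           # one write, then n reads

def cost_without_cache(n: int) -> float:
    return P * (1 + n)                       # full prefix reprocessed each time

cost_with_cache(1)     # 13500.0 -- already cheaper than
cost_without_cache(1)  # 20000.0
```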

For high-volume use cases with long, repeated prompts, caching can significantly reduce both cost and latency.


Tips