Prompt Caching - AI Stats Docs

Use prompt caching when the same large context appears across many requests. Mark stable instructions, documents, examples, tool outputs, or tool definitions as cacheable so that supported providers can reuse that context on later calls.

Prompt caching is different from response caching. Prompt caching still runs inference but can reduce the cost and latency of reprocessing repeated input. Response caching returns a previously generated answer for an identical request.

Prompt caching is provider- and model-specific. Providers that do not support it ignore cache hints, or the request is routed without cached pricing. Check the pricing table on the model page for cache read and write rates.

What to cache

Cache content that stays stable across requests:

long system instructions
reused RAG documents
few-shot examples
tool definitions
large tool results that are reused in the next turn

Avoid caching content that changes on every request, short one-off user input, or sensitive data that your policy does not allow the selected provider to store.

Cache controls

AI Stats accepts Anthropic/OpenRouter-style cache controls on supported text content blocks:

{
  "cache_control": {
    "type": "ephemeral",
    "ttl": "5m"
  }
}

Use ttl: "5m" for short-lived shared context and ttl: "1h" when the provider and model support longer-lived prompt cache entries. You can also apply a default Anthropic cache policy through provider_options:

{
  "provider_options": {
    "anthropic": {
      "cache_control": {
        "type": "ephemeral",
        "ttl": "5m",
        "scope": "last_user_message"
      }
    }
  }
}

Supported scope values:

Scope	Behavior
`all_text`	Add cache control to system text and user text/image blocks that do not already have cache control.
`last_user_message`	Add cache control only to the latest user message.
`none`	Do not apply a default cache policy.

Per-block cache_control wins over the default policy.
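To make the precedence rule concrete, the following is a minimal local sketch of how a default policy could be merged into a message list. It only illustrates the behavior described in the scope table; it is not the AI Stats implementation. The function name `apply_default_cache_policy` is hypothetical, and two details are assumptions: plain string content is wrapped into a single text block, and `last_user_message` marks every unmarked block in the latest user message.

```python
import copy
from typing import Any


def apply_default_cache_policy(
    messages: list[dict[str, Any]], policy: dict[str, Any]
) -> list[dict[str, Any]]:
    """Return a copy of messages with a default cache policy applied.

    Illustrative only: mirrors the documented scope table, not the gateway code.
    """
    scope = policy.get("scope", "none")
    # Everything except "scope" becomes the cache_control value.
    control = {k: v for k, v in policy.items() if k != "scope"}
    result = copy.deepcopy(messages)

    if scope == "none":
        return result
    if scope == "last_user_message":
        user_indexes = [i for i, m in enumerate(result) if m.get("role") == "user"]
        targets = [result[user_indexes[-1]]] if user_indexes else []
    elif scope == "all_text":
        targets = [m for m in result if m.get("role") in ("system", "user")]
    else:
        raise ValueError(f"unknown scope: {scope!r}")

    for message in targets:
        content = message.get("content") or []
        if isinstance(content, str):
            # Assumption: plain strings are normalized to one text block.
            content = [{"type": "text", "text": content}]
        for block in content:
            # Per-block cache_control wins over the default policy.
            block.setdefault("cache_control", dict(control))
        message["content"] = content
    return result
```

Calling it with `{"type": "ephemeral", "ttl": "5m", "scope": "last_user_message"}` leaves a block that already carries `"ttl": "1h"` untouched and adds the 5 minute control only to unmarked blocks in the latest user message.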

Chat Completions

Use /v1/chat/completions with OpenAI-compatible chat clients.

curl https://api.phaseo.app/v1/chat/completions \
  -H "Authorization: Bearer $AI_STATS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4",
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "You are a support assistant. Follow the company policy exactly.",
            "cache_control": { "type": "ephemeral", "ttl": "1h" }
          }
        ]
      },
      {
        "role": "user",
        "content": "Summarise the latest ticket."
      }
    ]
  }'

For OpenAI-routed requests, you can also pass OpenAI cache retention options:

{
  "provider_options": {
    "openai": {
      "prompt_cache_retention": "24h"
    }
  }
}
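For clients that do not use curl, the same Chat Completions request can be sketched with the Python standard library. This is an illustration under stated assumptions: the helper name `build_chat_request` is hypothetical, the endpoint and payload shape are copied from the curl example in this section, and the function only builds the request so it can be inspected before sending.

```python
import json
import urllib.request
from typing import Any, Optional

# Endpoint taken from the curl example in this section.
API_URL = "https://api.phaseo.app/v1/chat/completions"


def build_chat_request(
    api_key: str,
    model: str,
    system_text: str,
    user_text: str,
    provider_options: Optional[dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a Chat Completions request with a cached system block."""
    payload: dict[str, Any] = {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": system_text,
                        # Stable instructions carry the cache marker.
                        "cache_control": {"type": "ephemeral", "ttl": "1h"},
                    }
                ],
            },
            # Per-request input stays unmarked.
            {"role": "user", "content": user_text},
        ],
    }
    if provider_options:
        # Optional pass-through, e.g. {"openai": {"prompt_cache_retention": "24h"}}.
        payload["provider_options"] = provider_options
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. Keeping construction separate from sending makes it easy to log or test the exact payload, including where the cache marker sits.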

Responses

Use /v1/responses for new OpenAI-compatible text integrations and agent flows.

curl https://api.phaseo.app/v1/responses \
  -H "Authorization: Bearer $AI_STATS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4",
    "input": [
      {
        "role": "user",
        "content": [
          {
            "type": "input_text",
            "text": "Reference document: Refunds are available for 30 days when...",
            "cache_control": { "type": "ephemeral", "ttl": "5m" }
          },
          {
            "type": "input_text",
            "text": "Answer this customer: Can I return an item after 20 days?"
          }
        ]
      }
    ]
  }'
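The Responses request above separates a stable reference document from the per-request question. A minimal sketch of a payload builder that keeps that split is shown here. The function name `build_responses_payload` is hypothetical, and the TTL check covers only the two values this page mentions ("5m" and "1h"); a provider may accept others.

```python
from typing import Any

# The two TTL values mentioned on this page; treated as an assumption, not a full list.
KNOWN_TTLS = ("5m", "1h")


def build_responses_payload(
    model: str, reference_doc: str, question: str, ttl: str = "5m"
) -> dict[str, Any]:
    """Build a /v1/responses payload with a cacheable reference block."""
    if ttl not in KNOWN_TTLS:
        raise ValueError(f"ttl must be one of {KNOWN_TTLS}, got {ttl!r}")
    return {
        "model": model,
        "input": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": reference_doc,
                        # Reused document: marked as cacheable.
                        "cache_control": {"type": "ephemeral", "ttl": ttl},
                    },
                    # One-off question: left unmarked.
                    {"type": "input_text", "text": question},
                ],
            }
        ],
    }
```

The marked block comes first so that the reused text forms a stable start to the input, while the question that changes on each call follows it.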

When you already have a Google Gemini cached content resource, pass it through provider_options.google.cached_content:

{
  "provider_options": {
    "google": {
      "cached_content": "cachedContents/abc123"
    }
  }
}

Anthropic Messages

Use /v1/messages when your client is Anthropic-compatible.

curl https://api.phaseo.app/v1/messages \
  -H "Authorization: Bearer $AI_STATS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4",
    "max_tokens": 512,
    "system": [
      {
        "type": "text",
        "text": "You are a careful support assistant. Use the policy below.",
        "cache_control": { "type": "ephemeral", "ttl": "1h" }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Policy: refunds are available for 30 days when...",
            "cache_control": { "type": "ephemeral", "ttl": "5m" }
          },
          {
            "type": "text",
            "text": "Can this customer return an item after 20 days?"
          }
        ]
      }
    ],
    "tools": [
      {
        "name": "lookup_order",
        "description": "Look up order status.",
        "input_schema": {
          "type": "object",
          "properties": {
            "order_id": { "type": "string" }
          },
          "required": ["order_id"]
        },
        "cache_control": { "type": "ephemeral", "ttl": "5m" }
      }
    ]
  }'

Anthropic Messages supports cache control on:

system text blocks
message text and image blocks
tool result blocks
tool definitions
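Because cache control can sit in several places in a Messages payload, it can help to list every marker before sending. The following is a small auditing sketch; the helper name `find_cache_markers` and the path strings it returns are hypothetical conventions, not part of any API.

```python
from typing import Any


def find_cache_markers(payload: dict[str, Any]) -> list[str]:
    """List the locations in a Messages payload that carry cache_control."""
    markers: list[str] = []

    # System text blocks (only when "system" is a list of blocks).
    system = payload.get("system")
    if isinstance(system, list):
        for i, block in enumerate(system):
            if "cache_control" in block:
                markers.append(f"system[{i}]")

    # Message blocks: text, image, and tool result blocks live here.
    for m, message in enumerate(payload.get("messages", [])):
        content = message.get("content")
        if isinstance(content, list):
            for b, block in enumerate(content):
                if "cache_control" in block:
                    markers.append(f"messages[{m}].content[{b}] ({block.get('type')})")

    # Tool definitions.
    for t, tool in enumerate(payload.get("tools", [])):
        if "cache_control" in tool:
            markers.append(f"tools[{t}] ({tool.get('name')})")

    return markers
```

Run against the request body from the curl example in this section, it reports one system block, one message text block, and the `lookup_order` tool definition.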

Usage and pricing fields

When a provider returns cache usage, AI Stats normalizes it into common usage fields.

Field	Meaning
`input_tokens_details.cached_tokens`	Input tokens read from cache.
`output_tokens_details.cached_tokens`	Input tokens written to provider prompt cache.
`cached_write_text_tokens_5m`	Cache write tokens for a 5 minute TTL when the provider reports TTL-specific writes.
`cached_write_text_tokens_1h`	Cache write tokens for a 1 hour TTL when the provider reports TTL-specific writes.

Cache writes are usually more expensive than normal input tokens. Cache reads are usually cheaper. Exact pricing depends on the provider, model, and TTL.
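The fields in the table can be pulled out of a usage object with a short helper. This is a sketch under assumptions: `summarize_cache_usage` is a hypothetical name, the TTL-specific fields are assumed to sit at the top level of the usage object, and missing fields are treated as zero because providers do not always report them.

```python
from typing import Any


def _dig(data: Any, dotted_path: str, default: int = 0) -> Any:
    """Follow a dotted path through nested dicts, returning default if absent."""
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return default if current is None else current


def summarize_cache_usage(usage: dict[str, Any]) -> dict[str, int]:
    """Collect the cache-related usage fields listed in the table above."""
    return {
        "read_tokens": _dig(usage, "input_tokens_details.cached_tokens"),
        "write_tokens": _dig(usage, "output_tokens_details.cached_tokens"),
        # Assumption: TTL-specific write counts are top-level usage keys.
        "write_tokens_5m": _dig(usage, "cached_write_text_tokens_5m"),
        "write_tokens_1h": _dig(usage, "cached_write_text_tokens_1h"),
    }
```

A non-zero `read_tokens` value on a repeated call suggests the provider reused cached input; a non-zero write count on the first call suggests the cache was created.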

Practical checks

After you add prompt caching:

Send one request to create or warm the cache.
Send a second request with the same cacheable content.
Check the response usage and request details for cached read/write fields.
Compare latency and cost over repeated calls, not just the first call.
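The first three checks can be scripted. The sketch below is illustrative: `check_cache_warmup` is a hypothetical helper, and `send` is any callable you supply that posts a payload and returns the parsed JSON response, so the helper itself makes no network assumptions. It reads only the `input_tokens_details.cached_tokens` usage field described earlier.

```python
import time
from typing import Any, Callable


def cached_read_tokens(response: dict[str, Any]) -> int:
    """Extract input tokens read from cache, treating missing fields as zero."""
    usage = response.get("usage") or {}
    details = usage.get("input_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


def check_cache_warmup(
    send: Callable[[dict[str, Any]], dict[str, Any]], payload: dict[str, Any]
) -> list[dict[str, Any]]:
    """Send the same payload twice and record latency and cached reads for each call."""
    report = []
    for label in ("warm", "repeat"):
        started = time.perf_counter()
        response = send(payload)
        elapsed = time.perf_counter() - started
        report.append(
            {
                "call": label,
                "seconds": elapsed,
                "cached_read_tokens": cached_read_tokens(response),
            }
        )
    return report
```

If the second entry still shows zero cached read tokens, the usual suspects are content that differs between calls, an expired TTL, or a provider that does not support caching for the chosen model. Two calls are only a smoke test; cost and latency comparisons need many repeated calls.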
