Use response caching with presets

Use this recipe when you send repeated requests with stable prompts and want lower latency and lower inference cost on the repeats.

1. Start with a preset, not ad-hoc requests

Create a preset when the following should stay stable:

This keeps the cache key stable because the request stays stable.
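Why a stable request gives a stable key can be sketched in a few lines. This is only an illustration, not the service's actual key derivation: it assumes the key is a hash of a canonical serialization of the request, and the `fingerprint` helper and the field names (`model`, `temperature`, `prompt`) are hypothetical.

```python
import hashlib
import json


def fingerprint(request: dict) -> str:
    """Hash a canonical JSON form of the request (hypothetical scheme)."""
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical request bodies; the field names are assumptions.
base = {"model": "example-model", "temperature": 0, "prompt": "Summarize the ticket."}
reordered = {"prompt": "Summarize the ticket.", "temperature": 0, "model": "example-model"}
drifted = {**base, "temperature": 0.2}

# Same content in a different order produces the same key.
same_key = fingerprint(base) == fingerprint(reordered)
# Changing a single field produces a different key, so the cache cannot reuse the entry.
drift_changes_key = fingerprint(base) != fingerprint(drifted)
```

Under this assumption, any field that a preset pins can no longer drift between callers, which is what lets repeated requests land on the same entry.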

Configure:

Good defaults:

Response caching is strongest when you avoid unnecessary request drift. Avoid changing:

If one caller changes these constantly, move that caller to a different preset instead of defeating cache reuse for everyone else.

The request details should show cache information for requests that pass through the cache path. Verify:

Recommended patterns:

Use one preset for deterministic structured outputs with caching on.
Use a second preset for exploratory or higher-temperature requests with caching off.

This keeps the cache filled with entries that can actually be reused, instead of mixing in traffic that will never produce a hit.
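The two-preset split can be modeled on the caller side roughly as follows. This is a sketch under stated assumptions: the `Preset` class, its fields, the preset names, and the temperature values are all hypothetical and do not reflect the product's real configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    """Hypothetical local description of a preset; not the product's schema."""
    name: str
    temperature: float
    cache_enabled: bool


# Deterministic, structured traffic: caching on.
STRUCTURED = Preset(name="structured-cached", temperature=0.0, cache_enabled=True)
# Exploratory, higher-temperature traffic: caching off.
EXPLORATORY = Preset(name="exploratory-uncached", temperature=0.9, cache_enabled=False)


def choose_preset(deterministic: bool) -> Preset:
    """Route a caller to the preset that matches its traffic type."""
    return STRUCTURED if deterministic else EXPLORATORY
```

Routing by traffic type at the call site means exploratory callers never write into, or read from, the cache that the deterministic callers depend on.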

If you expect hits but keep getting misses, compare:

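Widening the TTL will not fix a fingerprint mismatch. TTL only controls how long an entry lives; a request with a different fingerprint never matches that entry in the first place.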
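When chasing an unexpected miss, a field-by-field comparison of a request that hit against one that missed is often the quickest check. The helper below is a minimal sketch: `differing_fields` and the request field names are hypothetical, and it assumes you can export both request bodies as flat dictionaries.

```python
_MISSING = "<missing>"


def differing_fields(hit_request: dict, miss_request: dict) -> dict:
    """Return {field: (value_in_hit, value_in_miss)} for every field that differs."""
    diffs = {}
    for key in sorted(hit_request.keys() | miss_request.keys()):
        before = hit_request.get(key, _MISSING)
        after = miss_request.get(key, _MISSING)
        if before != after:
            diffs[key] = (before, after)
    return diffs


# Hypothetical request bodies: the second has a trailing space in the prompt.
hit = {"model": "example-model", "temperature": 0, "prompt": "Summarize the ticket."}
miss = {"model": "example-model", "temperature": 0, "prompt": "Summarize the ticket. "}

diff = differing_fields(hit, miss)
# diff == {"prompt": ("Summarize the ticket.", "Summarize the ticket. ")}
```

Even a whitespace-only difference like the one above yields a different request, which is why a longer TTL has no effect on this kind of miss.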

Do not enable response caching by default for:

Last modified on May 19, 2026