Architecture
- Isolate provider logic. Wrap Gateway calls in a dedicated service or SDK so changes stay confined.
- Stream responses when possible. Reduce latency for assistants by streaming tokens to the UI as they arrive.
- Batch non-critical jobs. Queue background generations or evaluations to smooth out traffic spikes.
Security
- Store API keys in secret managers (AWS Secrets Manager, Doppler, 1Password) instead of
.envfiles in production. - Enable request signing on your own APIs so client applications never expose Gateway keys.
- Log only truncated keys (for example
sk_prod_abcd…) to avoid leaking credentials.
Reliability
- Implement exponential backoff with jitter for all retryable errors (HTTP 429 or 5xx).
- Respect the
Retry-Afterheader before retrying rate-limited requests. - Monitor provider metadata to detect automatic failovers from one vendor to another.
Observability
| Signal | Recommended action |
|---|---|
| Latency | Track p95 and p99 timings to catch regressions early. |
| Token usage | Correlate costs with business metrics to evaluate ROI. |
| Error rates | Alert when failures exceed 1% for a sustained period. |
| Provider mix | Ensure traffic is distributed as expected across vendors. |
Collaboration
- Document prompt templates, input parameters, and output handling in your internal wiki.
- Share dashboards that highlight benchmark movements and how they affect your product.
- Encourage regular reviews with product, research, and support teams to keep behaviour aligned.
Further reading
- Learn how to Handle errors gracefully.
- Explore Rate limits to plan for scaling.
- Check our Examples for reference implementations.