Use this workflow to choose quantization variants safely.
Step 1: define your success criteria
Before choosing a quant:
- Set quality thresholds (task accuracy, pass rates, structured output validity).
- Set latency and cost targets.
- Define unacceptable failure modes.
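
One way to make these criteria concrete is to write them down as data before any variant is tested. The sketch below is illustrative only: the class, the field names, the metric keys, and the threshold values are all assumptions, not part of any provider API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriteria:
    # Hypothetical fields; pick the metrics that matter for your workload.
    min_task_accuracy: float         # fraction of eval tasks answered correctly
    min_structured_validity: float   # fraction of outputs that parse as valid structured output
    max_p95_latency_ms: float        # tail-latency budget in milliseconds
    max_cost_per_1k_requests: float  # cost budget, in whatever currency you track


def meets_criteria(metrics: dict, criteria: SuccessCriteria) -> bool:
    """Return True only if every threshold is satisfied."""
    return (
        metrics["task_accuracy"] >= criteria.min_task_accuracy
        and metrics["structured_validity"] >= criteria.min_structured_validity
        and metrics["p95_latency_ms"] <= criteria.max_p95_latency_ms
        and metrics["cost_per_1k_requests"] <= criteria.max_cost_per_1k_requests
    )


# Placeholder thresholds, not recommendations.
criteria = SuccessCriteria(
    min_task_accuracy=0.90,
    min_structured_validity=0.99,
    max_p95_latency_ms=1500.0,
    max_cost_per_1k_requests=2.00,
)
```

Unacceptable failure modes (for example, any invalid JSON on a payments endpoint) usually need their own hard checks on top of aggregate thresholds like these.
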
Step 2: pick a high-quality baseline
Start with a high-precision baseline (often BF16 or FP16) and measure:
- Output quality
- Median and tail latency
- Token/cost profile
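
As a rough sketch of the latency part of that measurement, median and tail percentiles can be computed from recorded request timings with the Python standard library. The function name and the returned keys are assumptions made for illustration; the input is assumed to be a list of at least two per-request latencies in milliseconds.

```python
import statistics


def summarize_latency(latencies_ms: list[float]) -> dict:
    """Summarize median and tail latency from raw per-request timings."""
    # With n=100, quantiles() returns 99 cut points:
    # index 94 is the 95th percentile and index 98 is the 99th.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Recording the same summary for every variant you test later keeps the comparison against the baseline like-for-like.
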
Step 3: step down gradually
Evaluate progressively lower precision:
- FP8 or INT8
- FP4 or INT4, only if needed for cost/throughput
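
The step-down order can be sketched as a small loop over precision tiers. This is a minimal illustration, not a prescribed algorithm: `LADDER`, `step_down`, and the two callbacks are hypothetical names, and the callbacks stand in for whatever quality evaluation and cost/throughput check you actually run.

```python
from typing import Callable, Optional, Sequence

# Tiers ordered from highest to lowest precision, mirroring the order above.
LADDER = [("BF16", "FP16"), ("FP8", "INT8"), ("FP4", "INT4")]


def step_down(
    ladder: Sequence[Sequence[str]],
    passes_quality: Callable[[str], bool],
    meets_budget: Callable[[str], bool],
) -> Optional[str]:
    """Return the lowest-precision variant that is needed and still passes quality."""
    chosen = None
    for tier in ladder:
        passing = [quant for quant in tier if passes_quality(quant)]
        if not passing:
            break  # quality no longer holds; keep the previous tier's choice
        chosen = passing[0]
        if meets_budget(chosen):
            break  # cost/throughput target reached; no need to go lower
    return chosen
```

Stopping as soon as the budget is met reflects the "only if needed" condition: the lowest tier is tried only when the tiers above it do not reach the cost or throughput target.
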
Step 4: evaluate real workloads
Use production-like prompts:
- Long prompts
- Multi-step reasoning
- Tool calling
- Edge cases from support tickets
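
When scoring such a mixed evaluation set, it can help to report pass rates per prompt category, so that a regression in one category (say, tool calling) is not hidden by a good overall average. A minimal sketch, assuming each evaluation result is a `(category, passed)` pair; the function name and category labels are hypothetical:

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """results: iterable of (category, passed) pairs from one evaluation run."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed count, total count]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {category: p / t for category, (p, t) in totals.items()}
```
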
Step 5: decide per endpoint/use case
One quantization does not need to fit everything. Examples:
- Customer support and critical decisioning: higher precision.
- Large-volume classification/summarization: lower precision can be acceptable.
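
A per-use-case choice can be captured in a simple lookup that falls back to higher precision for anything unlisted. The use-case keys, the specific variants assigned to them, and the function name below are illustrative assumptions, not validated recommendations.

```python
# Hypothetical mapping; each entry should come from your own evaluation results.
QUANT_BY_USE_CASE = {
    "customer_support": "BF16",
    "critical_decisioning": "BF16",
    "bulk_classification": "INT8",
    "bulk_summarization": "INT8",
}

# Unknown use cases default to the high-precision option.
DEFAULT_QUANT = "BF16"


def quant_for(use_case: str) -> str:
    """Look up the quantization variant configured for a use case."""
    return QUANT_BY_USE_CASE.get(use_case, DEFAULT_QUANT)
```

Defaulting to the higher-precision variant means a new or misnamed use case errs on the side of quality rather than cost.
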
Suggested starting points
| Workload type | Good starting quant |
|---|---|
| Quality-critical reasoning | BF16 / FP16 |
| Balanced cost + quality | FP8 / INT8 |
| High-throughput, cost-sensitive | INT4 / FP4 (after validation) |
Deployment policy
- Keep a fallback model/quantization configured.
- Monitor quality drift after rollout.
- Re-test when provider runtime or model revision changes.
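
As one possible shape for the monitoring and fallback items, the sketch below tracks a rolling pass rate from ongoing quality checks and signals when to switch to the fallback configuration. The class name, window size, and threshold are assumptions chosen for illustration; how a check is scored and how the switch is carried out depend on your own stack.

```python
from collections import deque


class DriftMonitor:
    """Rolling pass-rate monitor over the most recent quality checks."""

    def __init__(self, window: int, min_pass_rate: float):
        self.results = deque(maxlen=window)  # old results drop off automatically
        self.min_pass_rate = min_pass_rate

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    def should_fall_back(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet to judge drift
        return sum(self.results) / len(self.results) < self.min_pass_rate
```

Resetting the monitor and re-running the full evaluation set after a provider runtime or model revision change keeps the rolling window from mixing results across versions.
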