This page is the entry point for a full guide series on quantization and inference parameters. If you are deciding between model variants (for example, BF16 vs FP8) or tuning request settings (temperature, top_p, token budgets), start here and work through the linked guides.
What is quantization?
Quantization represents model weights and activations with lower-precision numbers to reduce memory usage and speed up inference. In general:
- Higher precision: better quality/stability, higher cost.
- Lower precision: better throughput/cost, higher quality risk.
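To make the memory side of this trade-off concrete, here is a rough sketch that estimates weight-only memory at a few precisions. The 7-billion-parameter model size and the `weight_memory_gb` helper are illustrative assumptions, not part of any specific library, and the estimate ignores activations, KV cache, and any per-block scaling metadata a real quantization format may add.

```python
# Approximate bytes needed to store one parameter at each precision.
# These are nominal widths; real formats may add scale/zero-point overhead.
BYTES_PER_PARAM = {"FP32": 4.0, "BF16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Return approximate weight memory in GB (10^9 bytes) for a given format."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


if __name__ == "__main__":
    num_params = 7e9  # hypothetical 7B-parameter model
    for fmt in BYTES_PER_PARAM:
        print(f"{fmt}: {weight_memory_gb(num_params, fmt):.1f} GB")
```

Under these assumptions, moving from BF16 to FP8 halves the weight footprint. Whether the quality cost is acceptable depends on the model and workload, which is what the guides below help you evaluate.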
Guide series
- Quantization Formats
- Quantization Methods
- Choosing a Quantization Strategy
- Inference Parameters
- Sampling and Decoding Parameters
- Context and Token Budgeting
Recommended order
- Read formats and methods first.
- Use the selection guide to pick an initial variant.
- Tune inference parameters on top of that baseline.
- Finalize token budgets with production prompt traces.
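The last step can be sketched as follows: take token counts observed in production traces and derive budgets from a high percentile. This is a minimal illustration, not a prescribed method. The `suggest_budgets` helper, the 95th-percentile choice, and the field names are all assumptions made for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_budgets(prompt_tokens, completion_tokens, context_window, pct=95):
    """Suggest prompt and output token budgets from observed trace counts.

    prompt_tokens / completion_tokens: per-request token counts from traces.
    context_window: the model's total context limit (hypothetical input).
    """
    prompt_budget = percentile(prompt_tokens, pct)
    output_budget = percentile(completion_tokens, pct)
    return {
        "prompt_budget": prompt_budget,
        "max_output_tokens": output_budget,
        # True when both budgets fit inside the context window together.
        "fits_context": prompt_budget + output_budget <= context_window,
    }


if __name__ == "__main__":
    # Hypothetical trace data for illustration only.
    prompts = [320, 410, 275, 980, 505, 450, 390, 1200, 610, 330]
    completions = [120, 95, 240, 310, 150, 180, 90, 400, 210, 130]
    print(suggest_budgets(prompts, completions, context_window=4096))
```

A percentile is used rather than the maximum so that a single outlier request does not set the budget. Adjust `pct` to match how much truncation risk you are willing to accept.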