This guide explains common precision formats you will see in model variants and inference runtimes.

Quick reference

Format | Typical usage                            | Main benefit                                    | Main risk
FP32   | Baseline / full precision                | Maximum numerical stability                     | Highest memory and compute cost
BF16   | Training + high-quality inference        | Strong quality with better efficiency vs FP32   | Still heavier than 8-bit/4-bit
FP16   | Common inference/training paths          | Good speed and memory reduction                 | Lower dynamic range than BF16
FP8    | Throughput-focused inference/training    | Large speed/memory gains                        | Noticeable quality regressions on difficult prompts
FP4    | Aggressive efficiency modes              | Very high density and low cost                  | Higher output degradation risk
INT8   | Widely supported inference quantization  | Strong efficiency with moderate quality impact  | Calibration/sensitivity issues for some models
INT4   | Cost-sensitive deployment                | Very low memory and high throughput             | Bigger quality and robustness tradeoffs
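
The main practical lever behind the table above is bytes per weight. As a rough sketch (assuming weights dominate memory, and ignoring activations, KV cache, and quantization metadata such as scales), you can estimate a variant's weight footprint like this:

# Rough weight-memory estimate: parameter_count * bits_per_weight / 8.
# Illustrative sketch only; real deployments also need activation and KV-cache memory.
BITS_PER_WEIGHT = {"fp32": 32, "bf16": 16, "fp16": 16, "fp8": 8, "int8": 8, "fp4": 4, "int4": 4}

def weight_memory_gb(param_count: float, fmt: str) -> float:
    """Approximate weight memory in GB for a given precision format."""
    return param_count * BITS_PER_WEIGHT[fmt.lower()] / 8 / 1e9

for fmt in ("fp32", "bf16", "fp8", "int4"):
    print(f"7B model at {fmt}: ~{weight_memory_gb(7e9, fmt):.1f} GB")
# 7B parameters: fp32 ~28 GB, bf16 ~14 GB, fp8 ~7 GB, int4 ~3.5 GB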

BF16 vs FP16

  • BF16 keeps the same 8-bit exponent range as FP32 (trading mantissa precision for it), which helps numerical stability.
  • FP16 is still common and often fast, but its narrower exponent range (maximum finite value around 65,504) can make it less stable for some workloads.
  • For quality-critical production paths, BF16 is often the safer default when available; the sketch below shows the range difference.
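
A minimal illustration of that range difference, assuming PyTorch is installed (torch.float16 and torch.bfloat16 are standard dtypes):

import torch

# A value well beyond FP16's maximum (~65,504) but easily within BF16/FP32 range.
x = torch.tensor([1.0e5], dtype=torch.float32)

print(x.to(torch.float16))   # overflows to inf
print(x.to(torch.bfloat16))  # stays finite, close to 1e5, but with coarser precision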

FP8 and FP4

  • FP8 usually gives a good efficiency jump while keeping usable quality for many tasks.
  • FP4 pushes efficiency further, but degradation is more likely, especially for nuanced reasoning, long contexts, and strict structured outputs; see the scaling sketch after this list.
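
Low-bit float deployments typically rescale tensors so their observed value range fits the format's representable range. A minimal per-tensor sketch, assuming a recent PyTorch build that exposes the torch.float8_e4m3fn dtype (the E4M3 variant's maximum finite value is 448):

import torch

w = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Map the tensor's observed range onto FP8 E4M3's range with one scale factor.
FP8_MAX = 448.0
scale = w.abs().max().float() / FP8_MAX
w_fp8 = (w.float() / scale).to(torch.float8_e4m3fn)

# Dequantize to inspect the error the 8-bit representation introduces.
w_hat = w_fp8.to(torch.float32) * scale
print("mean abs error:", (w_hat - w.float()).abs().mean().item())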

Integer formats (INT8 / INT4)

  • Usually optimized for inference throughput and memory.
  • Quality depends heavily on quantization method, calibration, and model architecture.
  • INT8 is generally easier to deploy safely than INT4; a minimal symmetric scheme is sketched below.
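
To make "calibration" concrete, here is a minimal symmetric per-tensor INT8 scheme (a sketch only; production quantizers are usually per-channel or per-group and use more careful range estimation):

import torch

def quantize_int8(w: torch.Tensor):
    # "Calibration" in its simplest form: derive one scale from the observed range.
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(1024, 1024)
q, scale = quantize_int8(w)
print("mean abs error:", (dequantize_int8(q, scale) - w).abs().mean().item())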

Reading model labels

You may see labels like:
  • bf16, fp16, fp8, int8, q4, 4bit, 8bit
Naming is not fully standardized across providers. Always verify exact variant details in the model/provider documentation.
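
If you need to handle such labels programmatically, a small normalization table can help. The mapping below is purely illustrative, since the same suffix (for example "q4") can refer to different quantization schemes depending on the provider:

# Illustrative mapping of common label spellings to a canonical name.
# Hypothetical convention; always confirm what a given provider actually means.
CANONICAL = {
    "bf16": "bf16", "bfloat16": "bf16",
    "fp16": "fp16", "half": "fp16",
    "fp8": "fp8",
    "int8": "int8", "8bit": "int8", "q8": "int8",
    "int4": "int4", "4bit": "int4", "q4": "int4",
}

def normalize_label(label: str) -> str | None:
    """Return a canonical precision label, or None if unrecognized."""
    return CANONICAL.get(label.strip().lower())

print(normalize_label("Q4"), normalize_label("bfloat16"), normalize_label("awq"))  # int4 bf16 None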