Quantization is one half of model behavior.
Inference parameters are the other half.

Core parameters

| Parameter | What it controls | Typical effect |
| --- | --- | --- |
| temperature | Randomness in token selection | Higher = more creative/variable; lower = more deterministic |
| top_p | Nucleus sampling cutoff | Lower = safer, more focused outputs |
| top_k | Candidate token pool size (where supported) | Lower = tighter, more predictable outputs |
| max_tokens / max_output_tokens | Maximum generated output length | Caps cost and latency; may truncate responses |
| stop | Explicit stopping sequences | Improves output framing and parser reliability |
| seed | Random seed (when supported) | Improves reproducibility for debugging and testing |
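
As a concrete illustration, the sketch below sets all of these parameters on a single request. It assumes a hypothetical OpenAI-compatible endpoint at http://localhost:8000 and a placeholder model name; top_k is only honored by backends that expose it, and exact field names can vary by server.

```python
import requests

# Hypothetical OpenAI-compatible endpoint; adjust host and model to your deployment.
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "my-quantized-model",   # assumption: whatever name your backend serves
    "messages": [
        {"role": "user", "content": "Summarize the release notes in three bullets."}
    ],
    "temperature": 0.3,              # lower = more deterministic
    "top_p": 0.95,                   # nucleus sampling cutoff
    "top_k": 40,                     # only honored by backends that support it
    "max_tokens": 256,               # caps cost/latency; too low causes truncation
    "stop": ["\n\n###"],             # explicit stop sequence for cleaner framing
    "seed": 42,                      # reproducibility, when the backend supports it
}

resp = requests.post(URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```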

Interactions to watch

  • temperature and top_p both affect randomness. Change one at a time while tuning.
  • Aggressive low-bit quantization + high randomness can amplify output instability.
  • Low max_tokens can look like model failure when it is really just truncation (see the check below).
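
One way to tell truncation apart from a genuine model failure is to check the finish reason on the response. The snippet below assumes the same OpenAI-compatible response shape as the request sketch above, where a finish_reason of "length" means the output hit the max_tokens cap.

```python
def check_truncation(response_json: dict) -> str:
    """Return the completion text, warning when it was cut off by max_tokens.

    Assumes an OpenAI-compatible response where choices[0].finish_reason is
    "length" when generation stopped because it hit the token cap.
    """
    choice = response_json["choices"][0]
    text = choice["message"]["content"]
    if choice.get("finish_reason") == "length":
        print("Warning: output truncated by max_tokens, not a model failure.")
    return text
```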

Reasonable starting points

  • temperature: 0.2 to 0.7, depending on how much creativity the task needs
  • top_p: 0.9 to 1.0
  • max_tokens: sized to your expected output class, not a huge blanket value (see the preset sketch below)
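
These starting points can be written down as named presets per output class. The values below are a hypothetical config built from the ranges above, not measured optima for any particular model.

```python
# Hypothetical per-use-case presets using the starting ranges above.
PRESETS = {
    "extraction": {        # short, structured output: favor determinism
        "temperature": 0.2,
        "top_p": 0.9,
        "max_tokens": 256,
        "stop": ["\n\n"],
    },
    "drafting": {          # longer, more creative output
        "temperature": 0.7,
        "top_p": 1.0,
        "max_tokens": 1024,
    },
}

def params_for(use_case: str) -> dict:
    """Return a copy of the preset so callers can override fields safely."""
    return dict(PRESETS[use_case])
```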

Tuning workflow

  1. Fix quantization first.
  2. Lock an eval set.
  3. Sweep one parameter at a time (see the sketch after this list).
  4. Compare quality + latency + token usage.
  5. Keep separate presets per endpoint/use case.
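
A minimal sweep sketch, assuming the same hypothetical OpenAI-compatible endpoint used earlier and a stand-in eval set; the usage block is present on OpenAI-compatible servers, but adjust the field names if yours differs.

```python
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"   # hypothetical endpoint from earlier
EVAL_PROMPTS = [                                     # stand-in for a locked eval set
    "Summarize this paragraph in one sentence: ...",
    "Extract all dates from the following text: ...",
]

def run(prompt: str, temperature: float) -> dict:
    payload = {
        "model": "my-quantized-model",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,   # the single parameter being swept
        "top_p": 1.0,                 # hold everything else fixed
        "max_tokens": 256,
    }
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=60).json()
    latency = time.perf_counter() - start
    return {
        "temperature": temperature,
        "latency_s": round(latency, 2),
        "completion_tokens": resp.get("usage", {}).get("completion_tokens"),
        "text": resp["choices"][0]["message"]["content"],
    }

# Sweep one parameter over the locked eval set; quality is judged separately.
for t in (0.2, 0.5, 0.7):
    for prompt in EVAL_PROMPTS:
        result = run(prompt, t)
        print(result["temperature"], result["latency_s"], result["completion_tokens"])
```

The loop only records latency and token usage; output quality still needs a separate pass (manual review or an automated scorer) before picking a preset.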