Quantization is one half of model behavior.
Inference parameters are the other half.

Core parameters

| Parameter | What it controls | Typical effect |
|---|---|---|
| temperature | Randomness in token selection | Higher = more creative/variable, lower = more deterministic |
| top_p | Nucleus sampling cutoff | Lower = safer/more focused outputs |
| top_k | Candidate token pool size (where supported) | Lower = tighter and more predictable outputs |
| max_tokens / max_output_tokens | Max generated output length | Caps cost and latency, may truncate responses |
| stop | Explicit stopping sequences | Improves output framing and parser reliability |
| seed | Random seed (when supported) | Improves reproducibility for debugging/testing |
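
To make the sampling parameters concrete, the sketch below applies temperature, top_k, and top_p to a small hand-written set of logits. It is an illustration only: the function name `next_token_distribution` and the example logits are made up for this page, and real inference engines may apply these filters in a different order or with different tie-breaking.

```python
import math


def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return {token: probability} after temperature, top_k, and top_p filtering.

    Illustrative only; the filter order here is an assumption, not a standard.
    """
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the best token.
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top_k: keep only the k most likely candidates.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(prob for _, prob in kept)
    return {tok: prob / kept_total for tok, prob in kept}


# Hypothetical logits for four candidate tokens.
logits = {"the": 2.0, "a": 1.5, "cat": 0.5, "zebra": -1.0}

sharp = next_token_distribution(logits, temperature=0.2)
flat = next_token_distribution(logits, temperature=1.5)
focused = next_token_distribution(logits, temperature=1.0, top_p=0.8)

print("temperature=0.2:", sharp)
print("temperature=1.5:", flat)
print("top_p=0.8:      ", focused)
```

Lower temperature concentrates probability on the top token, while a lower top_p or top_k removes unlikely tokens from the pool entirely. Because both mechanisms narrow the distribution, their effects are hard to separate when changed together.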

Interactions to watch

  • temperature and top_p both affect randomness. Change one at a time while tuning.
  • Aggressive low-bit quantization + high randomness can amplify output instability.
  • Low max_tokens can look like model failure when it is just truncation.

Suggested starting ranges

  • temperature: 0.2 to 0.7, depending on how much creativity the task needs
  • top_p: 0.9 to 1.0
  • max_tokens: sized to your expected output class, not a large blanket value
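
One way to tell truncation apart from a genuine model failure is to inspect why generation stopped. The sketch below assumes an OpenAI-compatible response shape, where each choice carries a `finish_reason` and the value `"length"` means the output-token cap was reached. Other providers use different field names and values, so treat the keys here as assumptions to check against your provider's reference. The preset values fall inside the ranges listed above, and the `max_tokens` figure is a placeholder.

```python
# Example preset within the ranges above; max_tokens is a placeholder value.
STARTING_PRESET = {"temperature": 0.4, "top_p": 0.95, "max_tokens": 400}


def was_truncated(response: dict) -> bool:
    """Return True when the first choice stopped because it hit the token cap.

    Assumes an OpenAI-compatible shape: response["choices"][0]["finish_reason"].
    """
    choices = response.get("choices") or []
    if not choices:
        return False
    return choices[0].get("finish_reason") == "length"


# Hand-written sample responses for illustration, not real API output.
cut_off = {
    "choices": [
        {"message": {"content": "The three main rea"}, "finish_reason": "length"}
    ]
}
complete = {
    "choices": [{"message": {"content": "Done."}, "finish_reason": "stop"}]
}

for label, response in (("cut_off", cut_off), ("complete", complete)):
    status = "truncated" if was_truncated(response) else "finished normally"
    print(f"{label}: {status}")
```

Logging this flag next to each response makes it easier to see when a low max_tokens value, rather than the model, is responsible for an incomplete answer.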

Tuning workflow

  1. Fix quantization first.
  2. Lock an eval set.
  3. Sweep one parameter at a time.
  4. Compare quality + latency + token usage.
  5. Keep separate presets per endpoint/use case.
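
The sweep step can be sketched as a small harness that changes one parameter at a time against a fixed baseline. Everything named here is hypothetical: `BASELINE`, `SWEEPS`, `one_at_a_time`, `run_sweep`, and `stub_evaluate` are illustrative, and `stub_evaluate` returns empty metrics as a stand-in for a real run against your locked eval set.

```python
# Hypothetical baseline and sweep values; adjust to your own endpoint.
BASELINE = {"temperature": 0.7, "top_p": 1.0, "max_tokens": 512}
SWEEPS = {
    "temperature": [0.2, 0.4, 0.7],
    "top_p": [0.9, 0.95, 1.0],
}


def one_at_a_time(baseline, sweeps):
    """Yield (parameter_name, config) pairs that differ from baseline in one value."""
    for name, values in sweeps.items():
        for value in values:
            if value == baseline.get(name):
                continue  # The baseline itself is evaluated separately.
            config = dict(baseline)
            config[name] = value
            yield name, config


def run_sweep(evaluate, baseline, sweeps):
    """Evaluate the baseline, then every single-parameter variation.

    `evaluate(config)` is supplied by the caller and should return a dict of
    metrics measured on the same locked eval set for every config.
    """
    rows = [{"changed": None, "config": dict(baseline), **evaluate(baseline)}]
    for name, config in one_at_a_time(baseline, sweeps):
        rows.append({"changed": name, "config": config, **evaluate(config)})
    return rows


def stub_evaluate(config):
    """Placeholder: replace with a real run that measures these metrics."""
    return {"quality": None, "latency_ms": None, "output_tokens": None}


rows = run_sweep(stub_evaluate, BASELINE, SWEEPS)
for row in rows:
    print(row["changed"], row["config"])
```

Keeping quality, latency, and token usage in the same row makes the trade-offs visible per config. The winning config for each endpoint or use case can then be stored as its own preset.
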
Last modified on February 18, 2026