Inference parameters are the other half: once the model and its quantization are fixed, these request-time settings control randomness, output length, and reproducibility.
## Core parameters
| Parameter | What it controls | Typical effect |
|---|---|---|
| `temperature` | Randomness in token selection | Higher = more creative/variable, lower = more deterministic |
| `top_p` | Nucleus sampling cutoff | Lower = safer/more focused outputs |
| `top_k` | Candidate token pool size (where supported) | Lower = tighter and more predictable outputs |
| `max_tokens` / `max_output_tokens` | Maximum generated output length | Caps cost and latency; may truncate responses |
| `stop` | Explicit stopping sequences | Improves output framing and parser reliability |
| `seed` | Random seed (when supported) | Improves reproducibility for debugging/testing |
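
As a concrete reference, here is a minimal sketch of a single request that sets each of these parameters, assuming an OpenAI-compatible chat completions endpoint (for example a local llama.cpp or vLLM server). The URL, model name, and the backend's support for `top_k` and `seed` are assumptions, not guarantees.

```python
# Minimal sketch: one request exercising each core parameter. The endpoint URL,
# model name, and backend support for top_k and seed are assumptions.
import requests

payload = {
    "model": "local-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Explain nucleus sampling in one sentence."}],
    "temperature": 0.3,      # randomness in token selection
    "top_p": 0.9,            # nucleus sampling cutoff
    "top_k": 40,             # candidate pool size (only where the backend supports it)
    "max_tokens": 256,       # hard cap on generated length (cost/latency ceiling)
    "stop": ["###"],         # explicit stopping sequence for cleaner framing
    "seed": 42,              # reproducibility, when the backend honors it
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```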
## Interactions to watch
- `temperature` and `top_p` both affect randomness. Change one at a time while tuning.
- Aggressive low-bit quantization + high randomness can amplify output instability.
- Low `max_tokens` can look like model failure when it is just truncation (see the check below).
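
The truncation case is easy to check programmatically. A minimal sketch, assuming the OpenAI-compatible response schema used above, where a `finish_reason` of `"length"` means the `max_tokens` cap was hit:

```python
# Sketch: tell truncation apart from a genuinely bad answer by inspecting the
# finish reason. Assumes the OpenAI-compatible response schema used above.
def classify_finish(response_json: dict) -> str:
    choice = response_json["choices"][0]
    if choice.get("finish_reason") == "length":
        # The generation hit max_tokens; raise the cap before blaming the model.
        return "truncated"
    return "complete"
```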
## Recommended defaults (starting point)
- `temperature`: 0.2 to 0.7, depending on creativity needs
- `top_p`: 0.9 to 1.0
- `max_tokens`: sized to your expected output class, not a huge blanket value
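
One way to capture these starting points is as named presets, so each use case carries its own values instead of one blanket configuration. The preset names and exact numbers below are illustrative assumptions:

```python
# Sketch: per-use-case presets built from the starting-point ranges above.
# Names and values are illustrative assumptions, not fixed recommendations.
PRESETS = {
    "extraction": {"temperature": 0.2, "top_p": 0.9,  "max_tokens": 256},
    "chat":       {"temperature": 0.5, "top_p": 0.95, "max_tokens": 512},
    "drafting":   {"temperature": 0.7, "top_p": 1.0,  "max_tokens": 1024},
}

def build_request(preset: str, messages: list[dict]) -> dict:
    """Merge a named preset into a request body for an OpenAI-compatible endpoint."""
    return {"model": "local-model", "messages": messages, **PRESETS[preset]}
```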
## Tuning workflow
- Fix quantization first.
- Lock an eval set.
- Sweep one parameter at a time (see the sketch after this list).
- Compare quality + latency + token usage.
- Keep separate presets per endpoint/use case.
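
A minimal sketch of that workflow, sweeping a single parameter over a locked eval set and recording quality, latency, and token usage. The endpoint, model name, eval prompts, and the placeholder `score()` metric are assumptions to replace with your own:

```python
# Sketch: sweep one parameter (temperature) over a locked eval set, comparing
# quality, latency, and token usage per value. Endpoint, model name, eval
# prompts, and the score() stub are assumptions; substitute your own.
import time
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed local server
EVAL_PROMPTS = [
    "Summarize the difference between temperature and top_p in one sentence.",
    "List three causes of truncated model output.",
]

def score(output: str) -> float:
    return 1.0 if output.strip() else 0.0  # placeholder; use a real task metric

def sweep_temperature(values=(0.0, 0.2, 0.5, 0.7)):
    rows = []
    for t in values:
        scores, latencies, total_tokens = [], [], 0
        for prompt in EVAL_PROMPTS:
            body = {
                "model": "local-model",
                "messages": [{"role": "user", "content": prompt}],
                "temperature": t,   # the one parameter being swept
                "top_p": 0.9,       # everything else stays fixed
                "max_tokens": 256,
            }
            start = time.perf_counter()
            data = requests.post(ENDPOINT, json=body, timeout=60).json()
            latencies.append(time.perf_counter() - start)
            text = data["choices"][0]["message"]["content"]
            scores.append(score(text))
            total_tokens += data.get("usage", {}).get("total_tokens", 0)
        rows.append({
            "temperature": t,
            "avg_score": sum(scores) / len(scores),
            "avg_latency_s": sum(latencies) / len(latencies),
            "total_tokens": total_tokens,
        })
    return rows
```

Once a sweep settles on values for a given endpoint and use case, record them as a preset (as in the earlier sketch) rather than reusing one global configuration.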