Benchmarks provide the foundation for comparisons across models and providers. This page explains how data enters the platform and how we maintain consistency.

Data sources

  • Official releases — Provider announcements, technical blogs, and press materials.
  • Academic papers — Peer-reviewed publications and arXiv preprints with reproducible results.
  • Community submissions — Verified contributions from the AI Stats community and partners.
Every score must include a public citation or reproducible evaluation before it is accepted.
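As a rough sketch of that acceptance rule, a submission can be modelled with explicit evidence fields; the names below (ScoreSubmission, citation_url, eval_repo) are illustrative assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreSubmission:
    """A single benchmark score as it arrives from any data source."""
    model: str
    benchmark: str
    score: float
    citation_url: Optional[str] = None   # public announcement, paper, or blog post
    eval_repo: Optional[str] = None      # link to a reproducible evaluation harness


def is_acceptable(submission: ScoreSubmission) -> bool:
    """A score needs at least one form of public evidence before it is accepted."""
    return bool(submission.citation_url or submission.eval_repo)
```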

Normalisation

We standardise benchmark names, metrics, and splits to make cross-model comparisons accurate:
  1. Map each submission to an internal benchmark ID.
  2. Convert reported metrics to a canonical scale (for example, percentages).
  3. Note any deviations, such as few-shot variants or custom evaluation harnesses.
  4. Run automated validation to ensure fields and ranges are sensible.
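A minimal sketch of those four steps, assuming a small alias table and scores reported either as fractions or as percentages; BENCHMARK_IDS and normalise are illustrative names, not the production pipeline.

```python
# Hypothetical alias table; the real mapping is maintained internally.
BENCHMARK_IDS = {
    "MMLU": "mmlu",
    "MMLU (5-shot)": "mmlu",
    "GSM8K": "gsm8k",
    "HumanEval": "humaneval",
}


def normalise(raw: dict) -> dict:
    # 1. Map the submission to an internal benchmark ID.
    benchmark_id = BENCHMARK_IDS[raw["benchmark"]]

    # 2. Convert the reported metric to a canonical 0-100 percentage scale
    #    (simplification: values at or below 1.0 are treated as fractions).
    score = raw["score"] * 100 if raw["score"] <= 1.0 else raw["score"]

    # 3. Note deviations such as few-shot variants or custom harnesses.
    notes = raw.get("notes", "")

    # 4. Automated validation: fields and ranges must be sensible.
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"Score out of range for {benchmark_id}: {score}")

    return {"benchmark_id": benchmark_id, "score": score, "notes": notes}
```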

Update cadence

  • Weekly for popular suites such as MMLU, GSM8K, and HumanEval.
  • Monthly for niche or emerging benchmarks.
  • On-demand when providers or community members submit high-quality evaluations.
Large updates trigger changelog entries so you can audit historical movements.
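To make the cadence concrete, here is a toy scheduling check under the assumption that each suite carries a refresh interval in days; the CADENCE_DAYS table and is_due helper are hypothetical.

```python
from datetime import date, timedelta

# Illustrative cadence table: popular suites weekly, niche ones monthly.
CADENCE_DAYS = {
    "mmlu": 7,
    "gsm8k": 7,
    "humaneval": 7,
}


def is_due(benchmark_id: str, last_updated: date, today: date) -> bool:
    """Return True when a scheduled refresh is due; on-demand submissions bypass this check."""
    interval = CADENCE_DAYS.get(benchmark_id, 30)  # unknown suites default to monthly
    return today - last_updated >= timedelta(days=interval)
```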

Transparency and reproducibility

  • Link to primary sources: users can verify scores quickly.
  • Track last_updated dates: highlights when a benchmark may need review.
  • Document evaluation setup: clarifies prompt format, context windows, and sampling.
  • Flag human-reviewed scores: distinguishes manual validation from automated pipelines.
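One way to picture these practices is as fields stored alongside every score; the BenchmarkRecord fields below are assumptions for illustration rather than the real schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class BenchmarkRecord:
    """Fields that back the transparency practices listed above."""
    benchmark_id: str
    model: str
    score: float
    source_url: str          # link to the primary source
    last_updated: date       # flags records that may need review
    eval_setup: str          # prompt format, context window, sampling settings
    human_reviewed: bool     # manual validation vs. automated pipeline
```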

Feedback loop

If you notice discrepancies or missing data:
  1. Open a GitHub issue with evidence or references.
  2. Submit a pull request using the contribution guides.
  3. Reach out in Discord to coordinate larger data imports.
We review every report and share status updates while the data is being verified.