Data sources
- Official releases — Provider announcements, technical blogs, and press materials.
- Academic papers — Peer-reviewed publications and arXiv preprints with reproducible results.
- Community submissions — Verified contributions from the AI Stats community and partners.
Normalisation
We standardise benchmark names, metrics, and splits to make cross-model comparisons accurate (see the sketch after this list):
- Map each submission to an internal benchmark ID.
- Convert reported metrics to a canonical scale (for example percentages).
- Note any deviations, such as few-shot variants or custom evaluation harnesses.
- Run automated validation to ensure fields and ranges are sensible.
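A minimal sketch of what one normalisation pass could look like. The benchmark IDs, field names, and validation ranges below are illustrative assumptions, not the actual AI Stats pipeline.

```python
# Illustrative normalisation pass (hypothetical aliases, fields, and ranges).
from dataclasses import dataclass

# Hypothetical mapping from reported benchmark names to internal benchmark IDs.
BENCHMARK_ALIASES = {
    "mmlu": "mmlu-5shot",
    "gsm8k": "gsm8k-8shot-cot",
    "humaneval": "humaneval-pass@1",
}

@dataclass
class NormalisedScore:
    benchmark_id: str
    score_pct: float   # canonical scale: 0-100
    notes: str         # deviations such as few-shot variants or custom harnesses

def normalise(benchmark_name: str, raw_score: float, notes: str = "") -> NormalisedScore:
    """Map a submission to an internal benchmark ID and a canonical percentage."""
    key = benchmark_name.strip().lower()
    if key not in BENCHMARK_ALIASES:
        raise ValueError(f"Unknown benchmark: {benchmark_name!r}")
    # Convert fractional scores (0-1) to percentages; leave 0-100 values as-is.
    score_pct = raw_score * 100 if raw_score <= 1.0 else raw_score
    # Automated range validation: reject values outside 0-100.
    if not 0.0 <= score_pct <= 100.0:
        raise ValueError(f"Score out of range: {raw_score}")
    return NormalisedScore(BENCHMARK_ALIASES[key], round(score_pct, 2), notes)
```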
Update cadence
- Weekly for popular suites such as MMLU, GSM8K, and HumanEval.
- Monthly for niche or emerging benchmarks.
- On-demand when providers or community members submit high-quality evaluations.
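One way to express this cadence is as a refresh interval per benchmark; the intervals and helper below are a hypothetical sketch, not a documented configuration.

```python
# Hypothetical refresh schedule, expressed as days between automated checks.
REFRESH_INTERVAL_DAYS = {
    "mmlu-5shot": 7,           # popular suites: weekly
    "gsm8k-8shot-cot": 7,
    "humaneval-pass@1": 7,
    "default": 30,             # niche or emerging benchmarks: monthly
}

def is_stale(benchmark_id: str, days_since_update: int) -> bool:
    """Flag a benchmark whose last update has passed its refresh interval."""
    interval = REFRESH_INTERVAL_DAYS.get(benchmark_id, REFRESH_INTERVAL_DAYS["default"])
    return days_since_update > interval
```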
Transparency and reproducibility
| Practice | Outcome |
|---|---|
| Link to primary sources | Users can verify scores quickly. |
| Track last_updated dates | Highlights when a benchmark may need review. |
| Document evaluation setup | Clarifies prompt format, context windows, and sampling. |
| Flag human-reviewed scores | Distinguishes manual validation from automated pipelines. |
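These practices map naturally onto fields of a score record. The dataclass and values below are purely illustrative assumptions about how such a record might be shaped, not the real schema or a real entry.

```python
# Hypothetical score record illustrating the fields behind these practices.
from dataclasses import dataclass
from datetime import date

@dataclass
class ScoreRecord:
    benchmark_id: str
    score_pct: float
    source_url: str        # link to the primary source so users can verify quickly
    last_updated: date     # highlights when the entry may need review
    eval_setup: str        # prompt format, context window, sampling settings
    human_reviewed: bool   # manual validation vs. automated pipeline

# Placeholder values only; not a real benchmark result.
record = ScoreRecord(
    benchmark_id="mmlu-5shot",
    score_pct=0.0,
    source_url="https://example.com/provider-announcement",
    last_updated=date(2024, 1, 1),
    eval_setup="5-shot, greedy decoding, 8k context",
    human_reviewed=True,
)
```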
Feedback loop
If you notice discrepancies or missing data:
- Open a GitHub issue with evidence or references.
- Submit a pull request using the contribution guides.
- Reach out in Discord to coordinate larger data imports.