Benchmarks provide the foundation for comparisons across models and providers. This page explains how data enters the platform and how we maintain consistency.
## Data sources
- Official releases — Provider announcements, technical blogs, and press materials.
- Academic papers — Peer-reviewed publications and arXiv preprints with reproducible results.
- Community submissions — Verified contributions from the AI Stats community and partners.
Every score must include a public citation or reproducible evaluation before it is accepted.
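The acceptance rule above can be sketched as a simple gate. This is a minimal illustration, not the platform's actual schema: the `Submission` fields and the `is_accepted` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Submission:
    # Illustrative fields only; the real ingestion schema is not documented here.
    benchmark: str
    score: float
    citation_url: Optional[str] = None       # link to a public source
    reproducible_eval: Optional[str] = None  # link to an eval harness or script

def is_accepted(sub: Submission) -> bool:
    """A score is accepted only if it carries a public citation
    or a reproducible evaluation."""
    return bool(sub.citation_url or sub.reproducible_eval)
```

A submission with neither a citation nor a reproducible evaluation would be rejected at intake rather than silently stored.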
## Normalisation
We standardise benchmark names, metrics, and splits to make cross-model comparisons accurate:
- Map each submission to an internal benchmark ID.
- Convert reported metrics to a canonical scale (for example percentages).
- Note any deviations, such as few-shot variants or custom evaluation harnesses.
- Run automated validation to ensure fields and ranges are sensible.
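The steps above can be sketched as one small pipeline. The alias table, canonical percentage scale, and range bounds are assumptions for illustration; the platform's internal benchmark IDs and validation rules are not published.

```python
from typing import Optional

# Hypothetical alias table mapping reported names to internal benchmark IDs.
BENCHMARK_ALIASES = {
    "mmlu": "mmlu",
    "mmlu (5-shot)": "mmlu",  # few-shot variants noted, then mapped to one ID
    "gsm8k": "gsm8k",
}

def normalise(name: str, value: float, unit: str) -> Optional[dict]:
    """Map a submission to an internal benchmark ID, convert the metric to a
    canonical 0-100 percentage scale, and run a basic range validation."""
    bench_id = BENCHMARK_ALIASES.get(name.strip().lower())
    if bench_id is None:
        return None  # unknown benchmark: route to manual review
    score = value * 100 if unit == "fraction" else value
    if not 0.0 <= score <= 100.0:
        return None  # automated validation: score outside sensible range
    return {"benchmark_id": bench_id, "score": score}
```

A submission reported as a 0-1 fraction and one reported as a percentage would both land on the same canonical scale, which is what makes cross-model comparison meaningful.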
## Update cadence
- Weekly for popular suites such as MMLU, GSM8K, and HumanEval.
- Monthly for niche or emerging benchmarks.
- On-demand when providers or community members submit high-quality evaluations.
Large updates trigger changelog entries so you can audit historical movements.
## Transparency and reproducibility
| Practice | Outcome |
|---|---|
| Link to primary sources | Users can verify scores quickly. |
| Track `last_updated` dates | Highlights when a benchmark may need review. |
| Document evaluation setup | Clarifies prompt format, context windows, and sampling. |
| Flag human-reviewed scores | Distinguishes manual validation from automated pipelines. |
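Tracking `last_updated` against the cadence above can be expressed as a staleness check. The tier names and thresholds below are assumptions chosen to mirror the weekly/monthly cadence, not the platform's actual configuration.

```python
from datetime import date, timedelta

# Hypothetical review thresholds mirroring the stated update cadence.
REVIEW_AFTER = {
    "popular": timedelta(days=7),   # e.g. MMLU, GSM8K, HumanEval
    "niche": timedelta(days=30),    # emerging or low-traffic benchmarks
}

def needs_review(last_updated: date, tier: str, today: date) -> bool:
    """Flag a benchmark whose last_updated date exceeds its tier's cadence."""
    return today - last_updated > REVIEW_AFTER[tier]
```
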
## Feedback loop
If you notice discrepancies or missing data:
- Open a GitHub issue with evidence or references.
- Submit a pull request using the contribution guides.
- Reach out in Discord to coordinate larger data imports.
We review every report and provide status updates while the benchmark is verified.

Last modified on February 11, 2026