They provide objective, repeatable metrics that help you compare model performance across tasks, domains, and providers – from reasoning and coding to language understanding and image recognition.
What are benchmarks?
A benchmark is a dataset or evaluation designed to test a model's performance on a specific task. By running multiple models on the same dataset, we can measure their relative strengths and weaknesses. Common examples include:
- MMLU (Massive Multitask Language Understanding) – tests general knowledge and reasoning.
- GSM8K – tests mathematical reasoning and problem-solving.
- HellaSwag – evaluates common sense and contextual understanding.
- ARC-Challenge – measures scientific reasoning and logic.
- HumanEval – assesses programming and code generation skills.
How AI Stats handles benchmarks
AI Stats collects benchmark data from multiple public sources and official reports, then standardises it into a consistent format. Each model's benchmark section includes the following fields (sketched in code after the list):

- Score value – typically expressed as a percentage or on a normalised scale.
- Dataset source – e.g. MMLU, GSM8K, etc.
- Evaluation method – indicates how the score was derived (official report, third-party eval, or community submission).
- Last updated – shows when the benchmark data was last refreshed.
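The public pages don't expose a formal schema, but a minimal TypeScript sketch of one benchmark entry, using hypothetical field names rather than the actual AI Stats data model, might look like this:

```typescript
// Hypothetical shape of a single benchmark entry on a model page.
// Field names and values are illustrative assumptions, not the real AI Stats schema.
type EvaluationMethod = "official-report" | "third-party-eval" | "community-submission";

interface BenchmarkEntry {
  dataset: string;           // e.g. "MMLU", "GSM8K"
  score: number;             // percentage or normalised value
  method: EvaluationMethod;  // how the score was derived
  lastUpdated: string;       // ISO date of the last refresh
}

const example: BenchmarkEntry = {
  dataset: "MMLU",
  score: 85.0,               // made-up value for illustration
  method: "official-report",
  lastUpdated: "2024-06-01",
};
```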
Example benchmark page
Benchmark categories
Benchmarks are grouped into categories that represent the skill being tested:

| Category | Examples | Description |
|---|---|---|
| Reasoning | MMLU, GSM8K, ARC | Tests general logic, multi-step reasoning, and problem-solving. |
| Language Understanding | HellaSwag, BoolQ, PIQA | Measures reading comprehension, common sense, and linguistic ability. |
| Code Generation | HumanEval, MBPP | Evaluates programming and code synthesis accuracy. |
| Math & Science | GSM8K, MATH, SciQ | Focuses on numerical and scientific reasoning. |
| Multimodal | MathVista, MMMU | Combines text and visual comprehension tasks. |
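To illustrate, the grouping in this table can be expressed as a simple lookup. The sketch below only mirrors the table; it is not anything shipped by AI Stats:

```typescript
// Illustrative mapping from category to the benchmarks listed in the table above.
const benchmarkCategories: Record<string, string[]> = {
  "Reasoning": ["MMLU", "GSM8K", "ARC"],
  "Language Understanding": ["HellaSwag", "BoolQ", "PIQA"],
  "Code Generation": ["HumanEval", "MBPP"],
  "Math & Science": ["GSM8K", "MATH", "SciQ"],
  "Multimodal": ["MathVista", "MMMU"],
};

// A benchmark can appear in more than one category (e.g. GSM8K).
function categoriesOf(benchmark: string): string[] {
  return Object.entries(benchmarkCategories)
    .filter(([, benchmarks]) => benchmarks.includes(benchmark))
    .map(([category]) => category);
}

console.log(categoriesOf("GSM8K")); // ["Reasoning", "Math & Science"]
```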
How to interpret scores
| Value Range | Interpretation |
|---|---|
| 90–100% | Exceptional – near state-of-the-art performance. |
| 80–89% | Excellent – capable for most production use cases. |
| 60–79% | Good – reliable for general tasks but may struggle on edge cases. |
| < 60% | Developing – best for experimental or research scenarios. |
AI Stats normalises these where possible to make comparisons more meaningful.
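As a rough illustration, the interpretation bands translate into a small helper. This mirrors the table only and is not AI Stats' actual scoring or normalisation logic:

```typescript
// Map a (normalised) percentage score to the interpretation bands in the table above.
function interpretScore(scorePercent: number): string {
  if (scorePercent >= 90) return "Exceptional – near state-of-the-art performance";
  if (scorePercent >= 80) return "Excellent – capable for most production use cases";
  if (scorePercent >= 60) return "Good – reliable for general tasks, may struggle on edge cases";
  return "Developing – best for experimental or research scenarios";
}

console.log(interpretScore(92)); // "Exceptional – near state-of-the-art performance"
```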
Visualising benchmarks on AI Stats
Benchmark results are visualised using:

- Bar charts for comparing models within the same dataset.
- Trend charts showing model improvements over time.
- Aggregate scores to summarise performance across multiple datasets.
You can filter these visualisations by (see the sketch after this list):

- Model type (chat, code, embedding, etc.)
- Provider
- Release year
- Dataset category
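To make the filtering concrete, here is a hedged sketch of how chart data might be narrowed by those dimensions. The row shape and field names are assumptions for illustration, not the real AI Stats data model:

```typescript
// Hypothetical rows behind the charts; field names are illustrative only.
interface ModelBenchmarkRow {
  model: string;
  provider: string;
  modelType: "chat" | "code" | "embedding" | "multimodal";
  releaseYear: number;
  datasetCategory: string;
  dataset: string;
  score: number;
}

// Optional filters matching the dimensions listed above.
interface Filters {
  modelType?: ModelBenchmarkRow["modelType"];
  provider?: string;
  releaseYear?: number;
  datasetCategory?: string;
}

function filterRows(rows: ModelBenchmarkRow[], filters: Filters): ModelBenchmarkRow[] {
  return rows.filter((row) =>
    (filters.modelType === undefined || row.modelType === filters.modelType) &&
    (filters.provider === undefined || row.provider === filters.provider) &&
    (filters.releaseYear === undefined || row.releaseYear === filters.releaseYear) &&
    (filters.datasetCategory === undefined || row.datasetCategory === filters.datasetCategory)
  );
}
```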
Data sources and methodology
AI Stats aggregates benchmark data from:

- Official research papers and model reports.
- Provider evaluation pages (OpenAI Evals, Anthropic Reports, etc.).
- Open community datasets (e.g. Hugging Face leaderboards).
- Direct submissions from users via GitHub.
See Research & Methodology → Benchmark Methodology for full details.
Example use cases
| Goal | Example |
|---|---|
| Compare reasoning models | "Which model scores highest on GSM8K?" |
| Evaluate cost-performance trade-offs | "Does GPT-4o justify its higher cost compared to Claude 3.5 Sonnet?" |
| Track yearly progress | "How much has MMLU accuracy improved since 2022?" |
| Research benchmarks by category | "Which models lead in multimodal reasoning?" |
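For instance, the first question in the table boils down to sorting results by score. The sketch below uses made-up model names and numbers purely to show the shape of that query:

```typescript
// "Which model scores highest on GSM8K?" – illustrative values only.
interface GsmResult {
  model: string;
  score: number; // GSM8K accuracy as a percentage (made-up numbers)
}

const gsm8k: GsmResult[] = [
  { model: "model-a", score: 91.2 },
  { model: "model-b", score: 87.5 },
  { model: "model-c", score: 79.3 },
];

const best = [...gsm8k].sort((a, b) => b.score - a.score)[0];
console.log(`${best.model} leads GSM8K at ${best.score}%`);
```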
Contributing benchmark data
You can submit new or corrected benchmark scores to help keep AI Stats up to date. Each submission is reviewed before inclusion to ensure consistency and accuracy.
Contribute Benchmark Data
Learn how to add or edit benchmark information safely.
Next steps
Once you understand how benchmark data works, you can explore the API providers that make these models available programmatically.

Explore API Providers
See where and how models are hosted, priced, and accessed.