They provide objective, repeatable metrics that help you compare model performance across tasks, domains, and providers – from reasoning and coding to language understanding and image recognition.
What are benchmarks?
A benchmark is a dataset or evaluation designed to test a model's performance on a specific task. By running multiple models on the same dataset, we can measure their relative strengths and weaknesses. Common examples include:
- MMLU (Massive Multitask Language Understanding) – tests general knowledge and reasoning.
- GSM8K – tests mathematical reasoning and problem-solving.
- HellaSwag – evaluates common sense and contextual understanding.
- ARC-Challenge – measures scientific reasoning and logic.
- HumanEval – assesses programming and code generation skills.
How AI Stats handles benchmarks
AI Stats collects benchmark data from multiple public sources and official reports, then standardises it into a consistent format. Each model's benchmark section includes the following fields (sketched in code after the list):

- Score value – typically expressed as a percentage or on a normalised scale.
- Dataset source – e.g. MMLU, GSM8K, etc.
- Evaluation method – indicates how the score was derived (official report, third-party eval, or community submission).
- Last updated – shows when the benchmark data was last refreshed.
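The public pages don't expose a formal schema, but a minimal TypeScript sketch of one benchmark entry, using hypothetical field names rather than the actual AI Stats data model, might look like this:

```typescript
// Hypothetical shape of a single benchmark entry on a model page.
// Field names and values are illustrative assumptions, not the real AI Stats schema.
type EvaluationMethod = "official-report" | "third-party-eval" | "community-submission";

interface BenchmarkEntry {
  dataset: string;           // e.g. "MMLU", "GSM8K"
  score: number;             // percentage or normalised value
  method: EvaluationMethod;  // how the score was derived
  lastUpdated: string;       // ISO date of the last refresh
}

const example: BenchmarkEntry = {
  dataset: "MMLU",
  score: 85.0,               // made-up value for illustration
  method: "official-report",
  lastUpdated: "2024-06-01",
};
```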
Example benchmark page
Benchmark categories
Benchmarks are grouped into categories that represent the skill being tested:

| Category | Examples | Description |
|---|---|---|
| Reasoning | MMLU, GSM8K, ARC | Tests general logic, multi-step reasoning, and problem-solving. |
| Language Understanding | HellaSwag, BoolQ, PIQA | Measures reading comprehension, common sense, and linguistic ability. |
| Code Generation | HumanEval, MBPP | Evaluates programming and code synthesis accuracy. |
| Math & Science | GSM8K, MATH, SciQ | Focuses on numerical and scientific reasoning. |
| Multimodal | MathVista, MMMU | Combines text and visual comprehension tasks. |
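To illustrate, the grouping in this table can be expressed as a simple lookup. The sketch below only mirrors the table; it is not anything shipped by AI Stats:

```typescript
// Illustrative mapping from category to the benchmarks listed in the table above.
const benchmarkCategories: Record<string, string[]> = {
  "Reasoning": ["MMLU", "GSM8K", "ARC"],
  "Language Understanding": ["HellaSwag", "BoolQ", "PIQA"],
  "Code Generation": ["HumanEval", "MBPP"],
  "Math & Science": ["GSM8K", "MATH", "SciQ"],
  "Multimodal": ["MathVista", "MMMU"],
};

// A benchmark can appear in more than one category (e.g. GSM8K).
function categoriesOf(benchmark: string): string[] {
  return Object.entries(benchmarkCategories)
    .filter(([, benchmarks]) => benchmarks.includes(benchmark))
    .map(([category]) => category);
}

console.log(categoriesOf("GSM8K")); // ["Reasoning", "Math & Science"]
```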
How to interpret scores
| Value Range | Interpretation |
|---|---|
| 90–100% | Exceptional – near state-of-the-art performance. |
| 80–89% | Excellent – capable for most production use cases. |
| 60–79% | Good – reliable for general tasks but may struggle on edge cases. |
| < 60% | Developing – best for experimental or research scenarios. |
AI Stats normalises these where possible to make comparisons more meaningful.
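As a rough illustration, the interpretation bands translate into a small helper. This mirrors the table only and is not AI Stats' actual scoring or normalisation logic:

```typescript
// Map a (normalised) percentage score to the interpretation bands in the table above.
function interpretScore(scorePercent: number): string {
  if (scorePercent >= 90) return "Exceptional – near state-of-the-art performance";
  if (scorePercent >= 80) return "Excellent – capable for most production use cases";
  if (scorePercent >= 60) return "Good – reliable for general tasks, may struggle on edge cases";
  return "Developing – best for experimental or research scenarios";
}

console.log(interpretScore(92)); // "Exceptional – near state-of-the-art performance"
```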
Visualising benchmarks on AI Stats
Benchmark results are visualised using:

- Bar charts for comparing models within the same dataset.
- Trend charts showing model improvements over time.
- Aggregate scores to summarise performance across multiple datasets.
You can filter these visualisations by (see the sketch after this list):

- Model type (chat, code, embedding, etc.)
- Provider
- Release year
- Dataset category
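To make the filtering concrete, here is a hedged sketch of how chart data might be narrowed by those dimensions. The row shape and field names are assumptions for illustration, not the real AI Stats data model:

```typescript
// Hypothetical rows behind the charts; field names are illustrative only.
interface ModelBenchmarkRow {
  model: string;
  provider: string;
  modelType: "chat" | "code" | "embedding" | "multimodal";
  releaseYear: number;
  datasetCategory: string;
  dataset: string;
  score: number;
}

// Optional filters matching the dimensions listed above.
interface Filters {
  modelType?: ModelBenchmarkRow["modelType"];
  provider?: string;
  releaseYear?: number;
  datasetCategory?: string;
}

function filterRows(rows: ModelBenchmarkRow[], filters: Filters): ModelBenchmarkRow[] {
  return rows.filter((row) =>
    (filters.modelType === undefined || row.modelType === filters.modelType) &&
    (filters.provider === undefined || row.provider === filters.provider) &&
    (filters.releaseYear === undefined || row.releaseYear === filters.releaseYear) &&
    (filters.datasetCategory === undefined || row.datasetCategory === filters.datasetCategory)
  );
}
```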
Data sources and methodology
AI Stats aggregates benchmark data from:

- Official research papers and model reports.
- Provider evaluation pages (OpenAI Evals, Anthropic Reports, etc.).
- Open community datasets (e.g. Hugging Face leaderboards).
- Direct submissions from users via GitHub.
See Research & Methodology → Benchmark Methodology for full details.
Example use cases
| Goal | Example |
|---|---|
| Compare reasoning models | "Which model scores highest on GSM8K?" |
| Evaluate cost-performance trade-offs | "Does GPT-4o justify its higher cost compared to Claude 3.5 Sonnet?" |
| Track yearly progress | "How much has MMLU accuracy improved since 2022?" |
| Research benchmarks by category | "Which models lead in multimodal reasoning?" |
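For instance, the first question in the table boils down to sorting results by score. The sketch below uses made-up model names and numbers purely to show the shape of that query:

```typescript
// "Which model scores highest on GSM8K?" – illustrative values only.
interface GsmResult {
  model: string;
  score: number; // GSM8K accuracy as a percentage (made-up numbers)
}

const gsm8k: GsmResult[] = [
  { model: "model-a", score: 91.2 },
  { model: "model-b", score: 87.5 },
  { model: "model-c", score: 79.3 },
];

const best = [...gsm8k].sort((a, b) => b.score - a.score)[0];
console.log(`${best.model} leads GSM8K at ${best.score}%`);
```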
Contributing benchmark data
You can submit new or corrected benchmark scores to help keep AI Stats up to date. Each submission is reviewed before inclusion to ensure consistency and accuracy.
Contribute Benchmark Data
Learn how to add or edit benchmark information safely.
Next steps
Once you understand how benchmark data works, you can explore the API providers that make these models available programmatically.

Explore API Providers
See where and how models are hosted, priced, and accessed.