# Exploring Benchmarks

> Understand how AI model benchmarks are collected, standardised, and visualised on AI Stats.

Benchmarks are at the heart of how AI Stats measures model quality and progress.\
They provide **objective, repeatable metrics** that help you compare model performance across tasks, domains, and providers — from reasoning and coding to language understanding and image recognition.

***

## What are benchmarks?

A **benchmark** is a dataset or evaluation designed to test a model’s performance on a specific task.\
By running multiple models on the same dataset, we can measure relative strengths and weaknesses.

Common examples include:

* 🧠 **MMLU (Massive Multitask Language Understanding)** — tests general knowledge and reasoning.
* 🔢 **GSM8K** — tests mathematical reasoning and problem-solving.
* 💬 **HellaSwag** — evaluates common sense and contextual understanding.
* 🧮 **ARC-Challenge** — measures scientific reasoning and logic.
* 💡 **HumanEval** — assesses programming and code generation skills.

***

## How AI Stats handles benchmarks

AI Stats collects benchmark data from **multiple public sources** and **official reports**, then standardises it into a consistent format.

Each model’s benchmark section includes:

* **Score value** — typically expressed as a percentage or normalised scale.
* **Dataset source** — e.g. MMLU, GSM8K, etc.
* **Evaluation method** — indicates how the score was derived (official report, third-party eval, or community submission).
* **Last updated** — shows when the benchmark data was last refreshed.
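
As a rough illustration, the fields above could be modelled as a small record type. This is a minimal sketch, not AI Stats' actual schema: the class name `BenchmarkResult`, the `model` field, the method labels, and the 0-100 score range are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date

# Assumed labels for the three evaluation methods named above.
EVALUATION_METHODS = {"official_report", "third_party_eval", "community_submission"}


@dataclass(frozen=True)
class BenchmarkResult:
    """Hypothetical record with the four fields listed above, plus a model name."""

    model: str              # hypothetical model identifier
    dataset: str            # dataset source, e.g. "MMLU" or "GSM8K"
    score: float            # score value, assumed here to be a percentage (0-100)
    evaluation_method: str  # one of EVALUATION_METHODS
    last_updated: date      # when the benchmark data was last refreshed


def validate(result: BenchmarkResult) -> None:
    """Raise ValueError if the record falls outside the assumed conventions."""
    if not 0.0 <= result.score <= 100.0:
        raise ValueError(f"score out of range: {result.score}")
    if result.evaluation_method not in EVALUATION_METHODS:
        raise ValueError(f"unknown evaluation method: {result.evaluation_method}")


# Placeholder values for illustration only, not a real benchmark result.
example = BenchmarkResult("example-model", "MMLU", 87.5, "official_report", date(2025, 1, 1))
validate(example)
```

Keeping the evaluation method as an explicit field makes it easy to tell an officially reported score from a community submission when comparing models.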

***

## Benchmark categories

Benchmarks are grouped into categories that represent the skill being tested:

| Category                   | Examples               | Description                                                           |
| -------------------------- | ---------------------- | --------------------------------------------------------------------- |
| **Reasoning**              | MMLU, GSM8K, ARC       | Tests general logic, multi-step reasoning, and problem-solving.       |
| **Language Understanding** | HellaSwag, BoolQ, PIQA | Measures reading comprehension, common sense, and linguistic ability. |
| **Code Generation**        | HumanEval, MBPP        | Evaluates programming and code synthesis accuracy.                    |
| **Math & Science**         | GSM8K, MATH, SciQ      | Focuses on numerical and scientific reasoning.                        |
| **Multimodal**             | MathVista, MMMU        | Combines text and visual comprehension tasks.                         |

***

## How to interpret scores

| Value Range | Interpretation                                                    |
| ----------- | ----------------------------------------------------------------- |
| **90-100%** | Exceptional — near state-of-the-art performance.                  |
| **80-89%**  | Excellent — capable for most production use cases.                |
| **60-79%**  | Good — reliable for general tasks but may struggle on edge cases. |
| **\< 60%**  | Developing — best for experimental or research scenarios.         |

Some benchmarks use **different scoring systems** and vary in difficulty, so treat these bands as a rough guide rather than a strict ranking.\
AI Stats normalises scores where possible to make comparisons more meaningful.
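
To make the table concrete, here is a minimal sketch of how a raw score might be rescaled to a percentage and then mapped to one of the bands above. The linear min-max rescaling is an assumption chosen for illustration, since the page does not specify how AI Stats normalises scores, and the function names `normalise` and `interpret` are hypothetical.

```python
def normalise(raw: float, min_score: float, max_score: float) -> float:
    """Linearly rescale a raw score to a 0-100 percentage (assumed method)."""
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")
    return (raw - min_score) / (max_score - min_score) * 100.0


def interpret(pct: float) -> str:
    """Map a percentage to the interpretation bands in the table above."""
    if not 0.0 <= pct <= 100.0:
        raise ValueError(f"percentage out of range: {pct}")
    if pct >= 90.0:
        return "Exceptional"
    if pct >= 80.0:
        return "Excellent"
    if pct >= 60.0:
        return "Good"
    return "Developing"


# A made-up benchmark scored from 0 to 10: a raw score of 7.5 becomes 75%.
pct = normalise(7.5, 0.0, 10.0)
print(pct, interpret(pct))
```

The sketch treats each band as a half-open interval, so a fractional score such as 89.5% falls into "Excellent" rather than between two rows of the table.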

***

## Visualising benchmarks on AI Stats

Benchmark results are visualised using:

* 📊 **Bar charts** for comparing models within the same dataset.
* 📈 **Trend charts** showing model improvements over time.
* 🧮 **Aggregate scores** to summarise performance across multiple datasets.

Each benchmark dataset on AI Stats includes filters for:

* Model type (chat, code, embedding, etc.)
* Provider
* Release year
* Dataset category

This lets you quickly identify patterns, such as which model families dominate reasoning or coding tasks.
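
The filtering and aggregation described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the row layout, the function names, and every value in `ROWS` are made-up placeholders, and the aggregate is taken to be an unweighted mean, which may differ from how AI Stats computes it.

```python
from statistics import mean

# Hypothetical rows: (model, provider, model_type, release_year, dataset, score_pct).
# All values are placeholders, not real benchmark results.
ROWS = [
    ("model-a", "provider-x", "chat", 2024, "MMLU", 80.0),
    ("model-a", "provider-x", "chat", 2024, "GSM8K", 90.0),
    ("model-b", "provider-y", "chat", 2023, "MMLU", 70.0),
    ("model-b", "provider-y", "chat", 2023, "GSM8K", 60.0),
    ("model-c", "provider-x", "code", 2024, "HumanEval", 75.0),
]


def filter_rows(rows, model_type=None, provider=None, release_year=None):
    """Keep rows matching every filter that is set; None means 'any'."""
    return [
        row
        for row in rows
        if (model_type is None or row[2] == model_type)
        and (provider is None or row[1] == provider)
        and (release_year is None or row[3] == release_year)
    ]


def aggregate_by_model(rows):
    """Unweighted mean score per model across the datasets present in rows."""
    scores = {}
    for model, *_middle, score in rows:
        scores.setdefault(model, []).append(score)
    return {model: mean(values) for model, values in scores.items()}


chat_rows = filter_rows(ROWS, model_type="chat")
print(aggregate_by_model(chat_rows))
```

An unweighted mean is the simplest choice, but it only compares models fairly when they were evaluated on the same set of datasets.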

***

## Data sources and methodology

AI Stats aggregates benchmark data from:

* Official research papers and model reports.
* Provider evaluation pages (OpenAI Evals, Anthropic Reports, etc.).
* Open community datasets (e.g. Hugging Face leaderboards).
* Direct submissions from users via GitHub.

All data is standardised and verified before being published.\
See [Research & Methodology → Benchmark Methodology](../research/benchmark-methodology.mdx) for full details.

***

## Example use cases

| Goal                                 | Example                                                              |
| ------------------------------------ | -------------------------------------------------------------------- |
| Compare reasoning models             | “Which model scores highest on GSM8K?”                               |
| Evaluate cost-performance trade-offs | “Does GPT-4o justify its higher cost compared to Claude 3.5 Sonnet?” |
| Track yearly progress                | “How much has MMLU accuracy improved since 2022?”                    |
| Research benchmarks by category      | “Which models lead in multimodal reasoning?”                         |

***

## Contributing benchmark data

You can submit new or corrected benchmark scores to help keep AI Stats up to date.\
Each submission is reviewed before inclusion to ensure consistency and accuracy.

<Card title="Contribute Benchmark Data" icon="github" href="../research/submitting-scores.mdx" horizontal>
  Learn how to add or edit benchmark information safely.
</Card>

***

## Next steps

Once you understand how benchmark data works, you can explore the **API providers** that make these models available programmatically.

<Card title="Explore API Providers" icon="network" href="./api-providers.mdx" horizontal>
  See where and how models are hosted, priced, and accessed.
</Card>
