Benchmarks are at the heart of how AI Stats measures model quality and progress.
They provide objective, repeatable metrics that help you compare model performance across tasks, domains, and providers – from reasoning and coding to language understanding and image recognition.

What are benchmarks?

A benchmark is a dataset or evaluation designed to test a model's performance on a specific task.
By running multiple models on the same dataset, we can measure relative strengths and weaknesses.
Common examples include:
  • 🧠 MMLU (Massive Multitask Language Understanding) – tests general knowledge and reasoning.
  • 🔢 GSM8K – tests mathematical reasoning and problem-solving.
  • 💬 HellaSwag – evaluates common sense and contextual understanding.
  • 🧮 ARC-Challenge – measures scientific reasoning and logic.
  • 💡 HumanEval – assesses programming and code generation skills.
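
In its simplest form, running a benchmark means asking a model every question in the dataset and scoring its answers against the references. The sketch below shows that loop, assuming a hypothetical `ask_model` call and a plain exact-match metric; real benchmarks use more careful prompting and scoring.

```python
# Minimal benchmark loop: exact-match accuracy over a fixed set of questions.
# `ask_model` is a hypothetical stand-in for a real model API call.
def ask_model(model_name: str, question: str) -> str:
    raise NotImplementedError("replace with a real model call")

def run_benchmark(model_name: str, dataset: list[dict]) -> float:
    """Return accuracy (0-100%) of `model_name` on a list of {question, answer} items."""
    correct = 0
    for item in dataset:
        prediction = ask_model(model_name, item["question"])
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1
    return 100.0 * correct / len(dataset)
```

Because every model sees the same questions and the same scoring rule, the resulting percentages are directly comparable.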

How AI Stats handles benchmarks

AI Stats collects benchmark data from multiple public sources and official reports, then standardises them into a consistent format. Each model's benchmark section includes:
  • Score value – typically expressed as a percentage or normalised scale.
  • Dataset source – e.g. MMLU, GSM8K, etc.
  • Evaluation method – indicates how the score was derived (official report, third-party eval, or community submission).
  • Last updated – shows when the benchmark data was last refreshed.
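
As an illustration of what a standardised record might contain (the field names and values here are hypothetical, not AI Stats' actual schema):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class BenchmarkResult:
    """One standardised benchmark entry for a model (illustrative schema only)."""
    model: str              # model identifier, e.g. "gpt-4o"
    dataset: str            # benchmark dataset, e.g. "MMLU"
    score: float            # score normalised to a 0-100 scale
    evaluation_method: str  # "official report", "third-party eval", or "community submission"
    last_updated: date      # when the figure was last refreshed

# Placeholder values for illustration only.
example = BenchmarkResult(
    model="example-model",
    dataset="MMLU",
    score=87.5,
    evaluation_method="official report",
    last_updated=date(2024, 6, 1),
)
```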

Example benchmark page

(Screenshot: a model's benchmark section as it appears on AI Stats.)

Benchmark categories

Benchmarks are grouped into categories that represent the skill being tested:
  • Reasoning (MMLU, GSM8K, ARC) – tests general logic, multi-step reasoning, and problem-solving.
  • Language Understanding (HellaSwag, BoolQ, PIQA) – measures reading comprehension, common sense, and linguistic ability.
  • Code Generation (HumanEval, MBPP) – evaluates programming and code synthesis accuracy.
  • Math & Science (GSM8K, MATH, SciQ) – focuses on numerical and scientific reasoning.
  • Multimodal (MathVista, MMMU) – combines text and visual comprehension tasks.

How to interpret scores

  • 90-100% – Exceptional: near state-of-the-art performance.
  • 80-89% – Excellent: capable for most production use cases.
  • 60-79% – Good: reliable for general tasks but may struggle on edge cases.
  • Below 60% – Developing: best for experimental or research scenarios.
Keep in mind that some benchmarks have different scoring systems.
AI Stats normalises these where possible to make comparisons more meaningful.
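
As a rough illustration of what such normalisation can look like, the sketch below applies a simple min-max rescale (not necessarily AI Stats' exact method) and maps the result to the interpretation bands above:

```python
def normalise(raw_score: float, min_score: float, max_score: float) -> float:
    """Rescale a raw benchmark score onto a 0-100 scale (simple min-max scaling)."""
    return 100.0 * (raw_score - min_score) / (max_score - min_score)

def interpret(score_pct: float) -> str:
    """Map a percentage score to the interpretation bands listed above."""
    if score_pct >= 90:
        return "Exceptional"
    if score_pct >= 80:
        return "Excellent"
    if score_pct >= 60:
        return "Good"
    return "Developing"

# Example: a benchmark scored on a 0-5 scale, rescaled to a percentage.
pct = normalise(4.3, min_score=0.0, max_score=5.0)
print(f"{pct:.1f}% -> {interpret(pct)}")  # 86.0% -> Excellent
```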

Visualising benchmarks on AI Stats

Benchmark results are visualised using:
  • 📊 Bar charts for comparing models within the same dataset.
  • 📈 Trend charts showing model improvements over time.
  • 🧮 Aggregate scores to summarise performance across multiple datasets.
Each benchmark dataset on AI Stats includes filters for:
  • Model type (chat, code, embedding, etc.)
  • Provider
  • Release year
  • Dataset category
This lets you quickly identify patterns, such as which model families dominate reasoning or coding tasks.
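
Conceptually, once results are in a standardised form, both filtering and aggregate scores reduce to simple operations. The sketch below uses illustrative dicts and made-up values, not AI Stats' internal code:

```python
from statistics import mean

# Benchmark records as simple dicts; field names and values are placeholders.
results = [
    {"model": "model-a", "provider": "ProviderX", "dataset": "MMLU",  "score": 86.4},
    {"model": "model-a", "provider": "ProviderX", "dataset": "GSM8K", "score": 91.2},
    {"model": "model-b", "provider": "ProviderY", "dataset": "MMLU",  "score": 79.8},
]

# Filter: all MMLU results, as the dataset filter on the site would.
mmlu = [r for r in results if r["dataset"] == "MMLU"]

# Aggregate: summarise one model across datasets with a simple mean.
model_a_scores = [r["score"] for r in results if r["model"] == "model-a"]
print(f"model-a aggregate: {mean(model_a_scores):.1f}")  # 88.8
```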

Data sources and methodology

AI Stats aggregates benchmark data from:
  • Official research papers and model reports.
  • Provider evaluation pages (OpenAI Evals, Anthropic Reports, etc.).
  • Open community datasets (e.g. Hugging Face leaderboards).
  • Direct submissions from users via GitHub.
All data is standardised and verified before being published.
See Research & Methodology → Benchmark Methodology for full details.
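
As a hypothetical example of the kind of checks that standardisation and verification imply (not the actual AI Stats pipeline):

```python
REQUIRED_FIELDS = {"model", "dataset", "score", "evaluation_method", "last_updated"}
KNOWN_METHODS = {"official report", "third-party eval", "community submission"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record could be published."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    score = record.get("score")
    if score is not None and not (0 <= score <= 100):
        problems.append(f"score {score} outside the normalised 0-100 range")
    if record.get("evaluation_method") not in KNOWN_METHODS:
        problems.append("unknown evaluation method")
    return problems
```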

Example use cases

  • Compare reasoning models – "Which model scores highest on GSM8K?"
  • Evaluate cost-performance trade-offs – "Does GPT-4o justify its higher cost compared to Claude 3.5 Sonnet?"
  • Track yearly progress – "How much has MMLU accuracy improved since 2022?"
  • Research benchmarks by category – "Which models lead in multimodal reasoning?"
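
The first question above, for example, reduces to a sort over standardised records; the names and scores below are placeholders, not real results:

```python
# Illustrative GSM8K records only; real scores live on the AI Stats benchmark pages.
results = [
    {"model": "model-a", "dataset": "GSM8K", "score": 91.2},
    {"model": "model-b", "dataset": "GSM8K", "score": 84.5},
    {"model": "model-c", "dataset": "GSM8K", "score": 88.0},
]

best = max(results, key=lambda r: r["score"])
print(f"Top GSM8K model: {best['model']} ({best['score']:.1f}%)")  # model-a (91.2%)
```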

Contributing benchmark data

You can submit new or corrected benchmark scores to help keep AI Stats up to date.
Each submission is reviewed before inclusion to ensure consistency and accuracy.

Contribute Benchmark Data

Learn how to add or edit benchmark information safely.

Next steps

Once you understand how benchmark data works, you can explore the API providers that make these models available programmatically.

Explore API Providers

See where and how models are hosted, priced, and accessed.