1. Prepare your data
- Confirm the benchmark exists in `/data/benchmarks`. If not, create an issue to propose it.
- Collect the raw results, including prompt format, evaluation harness, and hardware details (see the example record below).
- Provide a public citation (blog post, paper, or dataset repository) when available.
> ✅ **Tip:** Sharing a reproducible script or notebook speeds up verification significantly.
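For reference, a raw-results record might look something like the sketch below. This is not the repository's required schema; every field name and value here is a placeholder, so match the existing files in `/data/benchmarks`:

```json
{
  "model": "claude-3.5-sonnet",
  "benchmark": "mmlu",
  "split": "test",
  "score": "<your score>",
  "prompt_format": "<e.g. 5-shot, answer-only>",
  "evaluation_harness": "<harness name and version>",
  "hardware": "<GPU type and count>",
  "citation": "<link to blog post, paper, or dataset repo>"
}
```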
2. Fork the repository
- Fork github.com/ai-stats/data.
- Create a new branch (for example `add-claude-3.5-sonnet-mmlu`); a typical git workflow is sketched after this list.
- Update the relevant benchmark JSON with your scores and metadata.
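A typical git workflow for this step, assuming a GitHub account and a local git install (the username and file path below are placeholders):

```bash
# Clone your fork and create a descriptively named branch
git clone https://github.com/<your-username>/data.git
cd data
git checkout -b add-claude-3.5-sonnet-mmlu

# Edit the relevant benchmark file, then commit the change
git add data/benchmarks/mmlu.json
git commit -m "Add claude-3.5-sonnet MMLU score"
```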
3. Run validation
Use the repository tooling to confirm that the file structure and score ranges are valid before opening a pull request.
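The exact command depends on the repository's tooling; the invocation below is hypothetical, so check the repository's README for the actual entry point:

```bash
# Hypothetical validation script; replace with the repo's actual tooling
python scripts/validate.py data/benchmarks/mmlu.json
```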
4. Open a pull request
Include the following in your PR description (a template sketch follows this list):
- Benchmark name and split (for example `GSM8K test`).
- Model ID and provider.
- Links to citations or evaluation logs.
- Summary of how the score was produced.
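A PR description covering these points might look like the following sketch (all values are placeholders):

```markdown
## Add claude-3.5-sonnet MMLU score

- **Benchmark:** MMLU (test split)
- **Model:** claude-3.5-sonnet (Anthropic)
- **Citations:** <link to blog post, paper, or evaluation logs>
- **Methodology:** Evaluated with <harness and version> using <prompt format>
  on <hardware>; score reported as <metric>.
```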