This leaderboard shows performance on the Sudoku-Bench reasoning evaluation dataset.
Sudoku variants are unique, creative puzzles that test whether a reasoning model can think like a human, using meta-reasoning and creativity to find logical break-ins rather than relying on brute-force search. Sudoku-Bench is therefore designed to evaluate models without tool use or code execution, so we omit models such as OpenAI’s o3 and o4-mini and Claude Opus 4 from the present leaderboard. You are welcome to try them yourself: please see Example Prompts for Each Puzzle at the bottom of this page.
Please see our technical report for an introduction to Sudoku variants and their utility in AI reasoning research.
Models are evaluated using one of two configurations: Multi-Step or Single-Shot.
The evaluation measures performance based on two primary metrics: average solve rate (ASR) and average correct placements (ACP).
The benchmark includes 100 puzzles of different grid sizes (15 4x4, 15 6x6, 70 9x9).
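Because the three grid sizes are unevenly represented, an aggregate solve rate depends on how the sizes are combined. The following is a minimal sketch of that arithmetic, not the benchmark's own evaluation code: the `solve_rates` helper, the `(grid_size, solved)` record format, and the example outcomes are all hypothetical, and pooling every puzzle into the overall figure is an assumption about how an "All Puzzles" rate might be computed.

```python
from collections import defaultdict


def solve_rates(outcomes):
    """Return per-size and pooled solve rates, in percent.

    `outcomes` is a hypothetical list of (grid_size, solved) pairs,
    one per puzzle; this record format is an assumption for illustration.
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for size, ok in outcomes:
        total[size] += 1
        solved[size] += int(ok)
    rates = {size: 100.0 * solved[size] / total[size] for size in total}
    # Pooled rate: every puzzle counts equally, so 9x9 puzzles dominate.
    rates["all"] = 100.0 * sum(solved.values()) / sum(total.values())
    return rates


# Made-up outcomes matching the 15 / 15 / 70 split (not real results).
example = (
    [("4x4", True)] * 12 + [("4x4", False)] * 3
    + [("6x6", True)] * 6 + [("6x6", False)] * 9
    + [("9x9", True)] * 7 + [("9x9", False)] * 63
)
rates = solve_rates(example)
print(rates)
```

With these made-up numbers the per-size rates are 80%, 40%, and 10%, while the pooled rate is 25%, well below the unweighted mean of the three sizes (about 43%), since 70 of the 100 puzzles are 9x9.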
Model | Multi-Step 4x4 ASR | Multi-Step 4x4 ACP | Multi-Step 6x6 ASR | Multi-Step 6x6 ACP | Multi-Step 9x9 ASR | Multi-Step 9x9 ACP | Multi-Step All Puzzles Avg Solve Rate | Single-Shot 4x4 ASR | Single-Shot 6x6 ASR | Single-Shot 9x9 ASR | Single-Shot All Puzzles Avg Solve Rate |
---|---|---|---|---|---|---|---|---|---|---|---|
(Note: A ‘-’ indicates insufficient data for reporting thresholds due to cost limitations.)
For a more granular view, the following table details the performance of selected top models on each puzzle. Each cell shows the outcome for that model and puzzle.
Click the emoji to see the model’s response.
Puzzle |
---|
Explore individual puzzles from the challenge_100 subset of the Sudoku-Bench dataset.
For details on the evaluation methodology, data, and code, please refer to the Sudoku-Bench GitHub repository. Please also see our technical report.
For attribution in academic contexts, please cite the technical report:
Seely, J., Imajuku, Y., Zhao, T., Cetin, E., & Jones, L. (2025). Sudoku-Bench: Evaluating creative reasoning with Sudoku variants. arXiv preprint arXiv:2505.16135.
BibTeX citation
@misc{seely2025sudoku,
  title={Sudoku-Bench: Evaluating creative reasoning with Sudoku variants},
  author={Jeffrey Seely and Yuki Imajuku and Tianyu Zhao and Edoardo Cetin and Llion Jones},
  year={2025},
  eprint={2505.16135},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.16135},
}
We release our code for this project here.