Sudoku-Bench Leaderboard

Evaluating creative reasoning with Sudoku variants

This leaderboard shows performance on the Sudoku-Bench reasoning evaluation dataset.

Sudoku variants are unique, creative puzzles that test whether a reasoning model can think like a human, using meta-reasoning and creativity to find logical break-ins rather than relying on brute-force search. As such, Sudoku-Bench is designed to evaluate models without tool use or code execution. Consequently, we omit models such as OpenAI’s o3 and o4-mini and Claude Opus 4 from the present leaderboard. But you are welcome to try it yourself! Please see Example Prompts for Each Puzzle at the bottom of this page.

Please see our technical report for an introduction to Sudoku variants and their utility in AI reasoning research.

Sudoku-Bench Leaderboard:

Models are evaluated using one of two configurations, Multi-Step and Single-Shot, which appear as the two column groups in the table below.

The evaluation measures performance based on two primary metrics, abbreviated in the table below as ASR (average solve rate) and ACP (average correct placements).

The benchmark includes 100 puzzles of different grid sizes (15 4x4, 15 6x6, 70 9x9).
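As a rough illustration of how a solve rate could be aggregated over this puzzle mix, here is a minimal sketch. It assumes that each puzzle attempt yields a simple solved/unsolved outcome and that the "All Puzzles" figure is pooled over all 100 puzzles rather than averaged across grid sizes; neither assumption is confirmed by this page. The function name `solve_rates` and the example outcomes are hypothetical and are not actual leaderboard results.

```python
from collections import defaultdict

# Puzzle counts per grid size, as stated above.
PUZZLE_COUNTS = {"4x4": 15, "6x6": 15, "9x9": 70}


def solve_rates(results):
    """Compute solve rates in percent from (grid_size, solved) pairs.

    Returns one entry per grid size plus an "all" entry pooled over
    every puzzle (an assumption about how the overall figure is formed).
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for size, ok in results:
        total[size] += 1
        solved[size] += int(ok)
    rates = {size: 100.0 * solved[size] / total[size] for size in total}
    rates["all"] = 100.0 * sum(solved.values()) / sum(total.values())
    return rates


# Made-up outcomes for illustration only: 12 of 15 4x4 puzzles solved,
# 3 of 15 6x6 puzzles solved, and 0 of 70 9x9 puzzles solved.
hypothetical_solved = {"4x4": 12, "6x6": 3, "9x9": 0}
example_results = [
    (size, i < hypothetical_solved[size])
    for size, count in PUZZLE_COUNTS.items()
    for i in range(count)
]
example_rates = solve_rates(example_results)
print(example_rates)
```

Because 70 of the 100 puzzles are 9x9, a pooled rate is dominated by 9x9 performance, which is one reason to read the per-size columns alongside the overall figure.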

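Model | Multi-Step 4x4 (ASR, ACP) | Multi-Step 6x6 (ASR, ACP) | Multi-Step 9x9 (ASR, ACP) | Multi-Step All Puzzles (Avg Solve Rate) | Single-Shot 4x4 (ASR) | Single-Shot 6x6 (ASR) | Single-Shot 9x9 (ASR) | Single-Shot All Puzzles (Avg Solve Rate)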

(Note: A ‘-’ indicates insufficient data for reporting thresholds due to cost limitations.)


Results by Puzzle

For a more granular view, the following table details the performance of selected top models on each puzzle. Each cell shows the outcome for that model and puzzle.



Example Prompts for Each Puzzle

Explore individual puzzles from the challenge_100 subset of the Sudoku-Bench dataset.



References

For details on the evaluation methodology, data, and code, please refer to the Sudoku-Bench GitHub repository. Please also see our technical report.


Citation

For attribution in academic contexts, please cite the technical report:

Seely, J., Imajuku, Y., Zhao, T., Cetin, E., & Jones, L. (2025). Sudoku-Bench: Evaluating creative reasoning with Sudoku variants. arXiv preprint arXiv:2505.16135.

BibTeX citation

@misc{seely2025sudoku,
      title={Sudoku-Bench: Evaluating creative reasoning with Sudoku variants}, 
      author={Jeffrey Seely and Yuki Imajuku and Tianyu Zhao and Edoardo Cetin and Llion Jones},
      year={2025},
      eprint={2505.16135},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.16135}, 
}

Open Source Code

We release our code for this project in the Sudoku-Bench GitHub repository.