This leaderboard shows performance on the Sudoku-Bench reasoning evaluation dataset.
Sudoku variants are not like vanilla Sudoku puzzles -- each puzzle author invents new rules specific to that puzzle (scroll down to see examples). These human-crafted puzzles are designed to stump expert solvers at first glance, yet allow progress once the solver finds a non-obvious "break-in." As such, Sudoku variants are an ideal (and possibly the densest) source of "aha" or "eureka" moments in creative problem solving.
This makes Sudoku variants one of the highest-signal domains for benchmarking LLM reasoning. We find that most LLMs struggle to make progress on most variant puzzles.
Sudoku-Bench is intended to evaluate LLM reasoning models without tool use. Humans solve these puzzles by discovering a creative insight that leads to a "break-in," which means they can solve them in a "token-efficient" manner -- but only after finding that insight for the particular puzzle. We test whether LLMs can do the same, so tool use is disabled in all evaluations below.
Please see our technical report for an introduction to Sudoku variants and their utility in AI reasoning research.
| Model | Multi-Step 4x4 ASR | Multi-Step 4x4 ACP | Multi-Step 6x6 ASR | Multi-Step 6x6 ACP | Multi-Step 9x9 ASR | Multi-Step 9x9 ACP | Multi-Step Avg Solve Rate (All Puzzles) | Single-Shot 4x4 ASR | Single-Shot 6x6 ASR | Single-Shot 9x9 ASR | Single-Shot Avg Solve Rate (All Puzzles) |
|---|---|---|---|---|---|---|---|---|---|---|---|
(Note: A '-' indicates insufficient data to meet reporting thresholds due to cost limitations.)
Models are evaluated in one of two configurations, Multi-Step or Single-Shot, corresponding to the column groups in the table above. Performance is reported with two primary metrics, ASR and ACP, broken down by grid size, along with an average solve rate over all puzzles. The benchmark includes 100 puzzles across three grid sizes: 15 4x4, 15 6x6, and 70 9x9.
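To make the reporting concrete, the sketch below shows one way per-puzzle results could be aggregated into per-grid-size solve rates and correct-placement averages. The record fields (`size`, `solved`, `correct_placements`) are illustrative assumptions, not the benchmark's actual output format; the official scoring code lives in the GitHub repository.

```python
# Minimal sketch (not the official scoring code): aggregate hypothetical
# per-puzzle records into per-grid-size and overall metrics.
from collections import defaultdict

# Assumed record layout: grid size, whether the puzzle was fully solved,
# and how many correct digit placements the model made.
results = [
    {"size": "4x4", "solved": True,  "correct_placements": 16},
    {"size": "9x9", "solved": False, "correct_placements": 12},
    {"size": "9x9", "solved": False, "correct_placements": 0},
]

by_size = defaultdict(list)
for record in results:
    by_size[record["size"]].append(record)

for size, records in sorted(by_size.items()):
    solve_rate = sum(r["solved"] for r in records) / len(records)
    avg_placements = sum(r["correct_placements"] for r in records) / len(records)
    print(f"{size}: solve rate {solve_rate:.2f}, avg correct placements {avg_placements:.1f}")

overall = sum(r["solved"] for r in results) / len(results)
print(f"All puzzles: average solve rate {overall:.2f}")
```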
Is there a private test set?
Due to the significant investment human setters make in crafting each puzzle, we do not have a private test set. The current Sudoku-Bench represents a curated snapshot of existing puzzles created before 2025. We may consider a private test set in the future.
Dozens of high-quality Sudoku variants are published on the web daily, so any puzzle created after an LLM's training cutoff can be treated as out-of-domain if one wishes to test on it.
How is Sudoku-Bench different from other LLM reasoning benchmarks?
At its core, Sudoku-Bench is meant to measure the "aha" or "eureka" moment from creative problem-solving. Sudoku variants may be the densest source of eureka moments available, since puzzle authors explicitly design their puzzles to be seemingly unbreakable at first glance -- even to expert solvers who have years of experience -- but admit progress after a creative discovery is made.
Such insights are difficult to elicit with standard benchmarks, whose domains can be mastered with enough training data. Sudoku variants, by contrast, are constantly evolving. Each puzzle is unique -- either through a novel ruleset or by requiring a solving tactic never seen before -- making the domain of Sudoku variants more resistant to memorization than other benchmarks.
This memorization resistance is similar in spirit to the ARC-AGI benchmark, but one difference is worth noting: ARC-AGI puzzles ask the solver to infer the puzzle's constraints from a few examples, after which execution is often straightforward. In a Sudoku variant, all constraints are given explicitly as part of the puzzle, yet applying each rule in isolation typically yields no progress -- a creative process is required to see how the constraints interact in non-obvious ways to produce intermediate results and, ultimately, a break-in to the solution. This makes Sudoku variants more similar to ARC-AGI-2, which emphasizes compositional reasoning.
Is Sudoku-Bench only for LLMs?
Mostly, yes.
Puzzles from ARC-AGI or standard (non-variant) Sudoku puzzles have a universal tokenized representation that applies to all training and test samples. However, Sudoku variants are so varied that we need natural language to encode and represent each puzzle. Sudoku-Bench in its current form is best suited for benchmarking LLM reasoning models.
Sudoku variants are not suitable for certain non-LLM reasoning models (e.g., HRM or TRM) without an explicit LLM module of some kind.
We present text-only representations of all puzzles in Sudoku-Bench. However, the dataset is naturally applicable for VLMs as well, which may be required for certain puzzles whose visual elements are too complex to fit into text. We explicitly selected the 100 puzzles of Sudoku-Bench to be ones that admit a text-only representation.
Explore individual puzzles from the challenge_100 subset of the Sudoku-Bench dataset. Below we provide each puzzle and an example prompt for ease of use.
The full benchmark data is available on Hugging Face; see the Sudoku-Bench GitHub repo for an entry point and instructions on interacting with the full dataset.
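For convenience, here is a minimal sketch of loading the data with the Hugging Face `datasets` library. The repository id `SakanaAI/Sudoku-Bench` and the exact config/field layout are assumptions based on this page; consult the GitHub repo for the authoritative loading instructions.

```python
# Minimal sketch: load the challenge_100 puzzles from Hugging Face.
# NOTE: the dataset id and config name are assumptions based on this page;
# see the Sudoku-Bench GitHub repo for the authoritative loading code.
from datasets import load_dataset

ds = load_dataset("SakanaAI/Sudoku-Bench", "challenge_100")
print(ds)                      # inspect available splits and their sizes

first_split = next(iter(ds.values()))
print(first_split[0])          # one puzzle record; field names depend on the dataset schema
```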
For details on the evaluation methodology, data, and code, please refer to the Sudoku-Bench GitHub repository. Please also see our technical report.
For attribution in academic contexts, please cite the technical report:
Seely, J., Imajuku, Y., Zhao, T., Cetin, E., & Jones, L. (2025). Sudoku-Bench: Evaluating creative reasoning with Sudoku variants. arXiv preprint arXiv:2505.16135.
BibTeX citation
@misc{seely2025sudoku,
  title={Sudoku-Bench: Evaluating creative reasoning with Sudoku variants},
  author={Jeffrey Seely and Yuki Imajuku and Tianyu Zhao and Edoardo Cetin and Llion Jones},
  year={2025},
  eprint={2505.16135},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.16135},
}
We release our code for this project here.