In May 2025, Sakana AI released the Sudoku-Bench, a collection of handcrafted Sudoku puzzles designed to test the reasoning capabilities of LLMs, positioned as a grand challenge for AI reasoning. At the time of its release, reasoning models such as ChatGPT-o3 were unable to solve any of the classic 9x9 problems.
Since then, a new generation of models has been released, boasting impressive results in domains such as math, coding, and agentic tasks. But how do they fare on the Sudoku-Bench?
In this post, we are proud to present:

- GPT-5's new state-of-the-art results on the Sudoku-Bench challenge_100
- GPT-5 is also the first model able to solve a 9x9 modern Sudoku problem, showcasing its strong capabilities in spatial and logical reasoning

Sudoku, the beloved logic puzzle that was popularized in Japan in the 1980s and exploded in global popularity in the 2000s, presents a deceptively simple challenge: fill a 9×9 grid so that each row, column, and 3×3 box contains all digits 1-9. Alongside the popular classic puzzles, the Sudoku-Bench includes "Modern Sudokus": complex variants with unique constraints that can involve everything from following colored pathways to understanding abstract scenarios like guiding rats through teleporter mazes.

These modern variants present an extraordinary challenge for AI reasoning systems because, unlike games with fixed rules such as Chess or Go, each puzzle requires meta-reasoning to first understand an entirely new ruleset before attempting to solve it. While current AI models can often comprehend these novel rules and make progress through locally consistent steps, they frequently fail at maintaining global consistency over long reasoning chains, especially when encountering the creative "break-in points" that human experts use to elegantly unlock solutions.

This is precisely why Sakana AI developed the Sudoku-Bench: a carefully curated benchmark ranging from simple puzzles current models can solve to extraordinarily complex variants that push the boundaries of AI reasoning, featuring hand-crafted puzzles from Nikoli and thousands of hours of expert human reasoning data from Cracking The Cryptic.
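To make the classic ruleset concrete, here is a minimal Python check of a completed 9×9 grid. The list-of-lists grid encoding is our own illustrative choice, not a format prescribed by the benchmark:

```python
# A minimal validity check for the classic 9x9 ruleset: every row, column,
# and 3x3 box must contain the digits 1-9 exactly once.

def is_valid_solution(grid: list[list[int]]) -> bool:
    """Return True iff a completed 9x9 grid satisfies all Sudoku constraints."""
    full = set(range(1, 10))
    rows_ok = all(set(row) == full for row in grid)
    cols_ok = all({grid[r][c] for r in range(9)} == full for c in range(9))
    boxes_ok = all(
        {grid[r][c]
         for r in range(br, br + 3)
         for c in range(bc, bc + 3)} == full
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    )
    return rows_ok and cols_ok and boxes_ok
```

The modern variants layer additional constraints on top of this base check, which is exactly what makes each puzzle's ruleset novel.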
Here are the results for the relevant model releases:
| Model | Multi-Step 4x4 ASR | Multi-Step 4x4 ACP | Multi-Step 6x6 ASR | Multi-Step 6x6 ACP | Multi-Step 9x9 ASR | Multi-Step 9x9 ACP | Multi-Step All Puzzles Avg Solve Rate | Single-Shot 4x4 ASR | Single-Shot 6x6 ASR | Single-Shot 9x9 ASR | Single-Shot All Puzzles Avg Solve Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
(Note: A '-' indicates insufficient data to meet reporting thresholds due to cost limitations.)
Models are evaluated using one of two configurations:

- **Multi-step**: the model interacts with the puzzle over multiple turns, placing one or more digits per turn and receiving the updated board state after each move.
- **Single-shot**: the model must produce the complete solution in a single response.
The evaluation measures performance based on two primary metrics:

- **ASR (Average Solve Rate)**: the fraction of puzzles a model solves completely.
- **ACP (Average Correct Placements)**: the average number of correct digit placements a model makes before its first mistake, crediting partial progress in the multi-step setting.
The benchmark includes 100 puzzles across three grid sizes: 15 4x4, 15 6x6, and 70 9x9.
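For concreteness, here is a minimal sketch of how these two metrics might be computed from per-puzzle results. The record fields, and the assumption that ACP counts correct placements before the first mistake without further normalization, are our illustrative choices; see the official repository for the exact evaluation code:

```python
# Illustrative metric computation for Sudoku-Bench-style evaluation.
from dataclasses import dataclass

@dataclass
class PuzzleResult:
    solved: bool                  # did the model complete the puzzle correctly?
    correct_before_mistake: int   # correct placements before the first error
                                  # (meaningful in the multi-step setting)

def average_solve_rate(results: list[PuzzleResult]) -> float:
    """ASR: fraction of puzzles solved completely."""
    return sum(r.solved for r in results) / len(results)

def average_correct_placements(results: list[PuzzleResult]) -> float:
    """ACP: mean number of correct placements before the first mistake."""
    return sum(r.correct_before_mistake for r in results) / len(results)

# Example usage on toy data:
results = [PuzzleResult(True, 51), PuzzleResult(False, 12)]
print(average_solve_rate(results))          # 0.5
print(average_correct_placements(results))  # 31.5
```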
As of this post, GPT-5 takes the lead on the Sudoku-Bench, boasting an impressive 33% solve rate in both the multi-step and single-shot settings with high reasoning effort! This is double the solve rate of the previous leader, ChatGPT-o3-mini. Most impressively, GPT-5 is the first LLM to solve a 9x9 modern Sudoku problem, Theta. Despite these impressive results, the models still have shortcomings. In the later sections, we present example outputs from GPT-5 and the other models we experimented with, and explain why the Sudoku-Bench remains challenging for current models. Note that while querying GPT-5 via the API does not reveal its reasoning traces, the model can be prompted to summarize its insights. The examples presented below are therefore GPT-5's own summaries of its reasoning traces.
How do recent advances in open-source model training fare on the benchmark? This year, GRPO has gained significant attention for its efficiency and stellar results on math benchmarks, with DeepseekMath-7b[1] achieving superior performance over models 10 times larger. Given these promising results in mathematical reasoning, we applied GRPO to fine-tune Qwen2.5-7b-Instruct on vanilla Sudoku problems in both the single-shot and multi-step settings. However, unlike the impressive gains seen on traditional math problems, we found the results on the Sudoku-Bench to be lackluster, suggesting that reasoning skills that transfer well between mathematical domains may not readily extend to the spatial and logical demands of Sudoku.
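To make the setup concrete, below is a minimal sketch of the kind of per-completion reward one could hand to a GRPO trainer (e.g., TRL's GRPOTrainer) for single-shot Sudoku. The 81-character board encoding with '.' for blanks and the partial-credit weighting are illustrative assumptions, not our exact training configuration:

```python
# A sketch of a per-completion reward for single-shot Sudoku under GRPO.
# Assumes puzzles and solutions are 81-character strings ('.' = blank in the
# puzzle); this encoding is an illustrative choice, not a fixed format.

def sudoku_reward(completion: str, puzzle: str, solution: str) -> float:
    """Partial credit for correct placements, with a bonus for a full solve."""
    # Keep only digit/blank characters so formatting noise doesn't zero the reward.
    grid = [c for c in completion if c in "123456789."][:81]
    if len(grid) < 81:
        return 0.0  # malformed or truncated output gets no credit

    blanks = [i for i, c in enumerate(puzzle) if c == "."]
    correct = sum(1 for i in blanks if grid[i] == solution[i])
    frac = correct / len(blanks)
    # A full solve earns 1.0; otherwise scale partial credit down so the
    # policy is still pushed toward complete solutions.
    return 1.0 if correct == len(blanks) else 0.5 * frac
```

Dense partial credit matters here: GRPO's group-relative advantages need reward variance within each sampled group, and an all-or-nothing solve reward is almost always zero early in training.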
Apart from RL-finetuning, perhaps the most promising approach for enabling LLMs to think like humans is to train directly on human thought processes[2]. Just as golden reasoning traces and explicit thought chains improve RL learning, the Sudoku-Bench release includes video transcripts from Cracking the Cryptic, the popular puzzle YouTube channel, capturing expert solvers as they work through complex puzzles while articulating their reasoning and actions on the SudokuPad app. These transcripts offer a window into how world-class puzzle solvers approach each unique challenge, providing authentic human reasoning data that goes far beyond typical synthetic chain-of-thought examples. However, one key challenge emerged: the transcripts, often containing 30-60 minutes of detailed utterances per puzzle, are too long for models to process as context. To address this, we summarized the extensive commentary into high-level break-in insights that could serve as reasoning hints or chain-of-thought guidance (a sketch of this distillation step follows below). Our approach involved training Qwen2.5-3B-Instruct with a reward for reproducing CTC's thought processes during puzzle solving, then prompting the model to generate Sudoku solutions based solely on these internalized reasoning patterns. Critically, the actual answers were withheld during training, forcing the model to learn the underlying logical reasoning rather than memorize solutions, creating a more authentic test of whether models can capitalize on human-like thinking to solve the puzzles.
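As a rough illustration of the summarization step, here is a map-reduce-style sketch for distilling a long transcript into a handful of break-in hints. The model name, prompt wording, and chunk size are placeholder assumptions rather than our actual pipeline:

```python
# Distill an hour-long solver transcript into short "break-in" hints that fit
# in a model's context: summarize fixed-size chunks, then merge the summaries.
from openai import OpenAI

client = OpenAI()

def distill_break_in_hints(transcript: str, chunk_chars: int = 12_000) -> str:
    """Map-reduce summarization of a long solving transcript."""
    chunks = [transcript[i:i + chunk_chars]
              for i in range(0, len(transcript), chunk_chars)]
    partial = []
    for chunk in chunks:
        # Map step: summarize each chunk's key deductions.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable summarizer works here
            messages=[{
                "role": "user",
                "content": "Summarize the key logical deductions the solver "
                           "makes in this Sudoku commentary, as terse bullet "
                           "points:\n\n" + chunk,
            }],
        )
        partial.append(resp.choices[0].message.content)
    # Reduce step: merge chunk summaries into a few break-in insights.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Merge these notes into the 3-5 break-in insights that "
                       "unlock the puzzle:\n\n" + "\n\n".join(partial),
        }],
    )
    return resp.choices[0].message.content
```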
We have discussed our evaluation of GPT-5 and our experiments training smaller models with modern techniques on the Sudoku-Bench. Once again, we congratulate the OpenAI team on their breakthroughs on the Sudoku-Bench. Nevertheless, more research is needed to bridge the gap between human thinking and AI reasoning, as our experiments show that even advanced training methodologies like GRPO and thought cloning face fundamental limitations when applied to Sudoku. While GPT-5 demonstrated impressive mathematical reasoning and human-like strategic thinking on algebraically-constrained puzzles, it struggled significantly with puzzles that demand genuine spatial understanding. Our smaller-model experiments revealed that current fine-tuning approaches often lead to superficial pattern matching rather than genuine logical reasoning.

The Sudoku-Bench continues to expose critical gaps between computational problem-solving and authentic human-like reasoning, particularly in tasks that demand the seamless integration of mathematical logic, spatial awareness, and creative insight. As the field advances toward more sophisticated AI systems, we believe the Sudoku-Bench represents an invaluable testing ground for evaluating whether models can truly reason rather than merely compute. We encourage and invite all new foundation models to challenge themselves against this benchmark: the puzzles within represent some of the most demanding tests of logical reasoning available today, offering a clear path toward AI systems that can think with the flexibility, creativity, and systematic rigor that characterize expert human problem-solving.
For details on the evaluation methodology, data, and code, please refer to the Sudoku-Bench GitHub repository. Please also see our technical report.
[1]: Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300. https://arxiv.org/abs/2402.03300
[2]: Hu, S., & Clune, J. (2024). Thought Cloning: Learning to Think while Acting by Imitating Human Thinking. arXiv preprint arXiv:2306.00323. https://arxiv.org/abs/2306.00323
We would like to thank the team at OpenAI for confirming GPT-5's results on the Sudoku-Bench and for their advice on working with GPT-5.
For attribution in academic contexts, please cite the technical report:
Seely, J., Imajuku, Y., Zhao, T., Cetin, E., & Jones, L. (2025). Sudoku-Bench: Evaluating creative reasoning with Sudoku variants. arXiv preprint arXiv:2505.16135.
BibTeX citation:

```bibtex
@misc{seely2025sudoku,
  title={Sudoku-Bench: Evaluating creative reasoning with Sudoku variants},
  author={Jeffrey Seely and Yuki Imajuku and Tianyu Zhao and Edoardo Cetin and Llion Jones},
  year={2025},
  eprint={2505.16135},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.16135},
}
```
We release our code for this project here.