Neural networks (NNs) were originally inspired by biological brains, yet they remain significantly distinct from their biological counterparts. Brains demonstrate complex neural dynamics that evolve over time, but modern NNs intentionally abstract away such temporal dynamics in order to facilitate large-scale deep learning. For instance, the activation functions of standard NNs can be seen as an intentional abstraction of a neuron's firing rate, replacing the temporal dynamics of biological processes with a single, static value. Such simplifications, though enabling significant advancements in large-scale machine learning, discard the temporal dynamics that may be essential to flexible, general intelligence.
Over hundreds of millions of years, evolution has endowed biological brains with rich neural dynamics, including spike-timing-dependent plasticity (STDP) and the synchronization of neural activity, yet modern NNs set such dynamics aside.
Indeed, the notably high performance of modern AI across many fields might suggest that emulating neural dynamics is unwarranted. However, the gap between the highly flexible and general nature of human cognition and the current state of modern AI suggests that components are missing from our current models.
For these reasons, we argue that time should be a central component of artificial intelligence in order for it to eventually achieve levels of competency that rival or surpass human brains.
The frontier of artificial intelligence faces a critical juncture: moving beyond simple input-output mappings towards genuine reasoning capabilities. While scaling existing models has yielded remarkable advancements, the associated computational cost and data demands are unsustainable and raise questions about the long-term viability of this approach. For sequential data, longstanding recurrent architectures process inputs step-by-step according to the order inherent in the data, rather than along an internally generated timeline of thought.
The Continuous Thought Machine (CTM) is a neural network architecture that enables a novel approach to thinking about data. It departs from conventional feed-forward models by explicitly incorporating neural dynamics as the central component of its functionality. The video above gives a pictorial overview of the internal workings of the CTM. We give all technical details, including additional figures and verbose explanations, in our Technical Report. A GitHub repository is also available. We will provide links to relevant parts of the repository as we explain the model below.
Variable | Description |
---|---|
$\mathbf{z}^t$ | Post-activations at internal tick $t$, after neuron-level models have been used. |
$\theta_{\text{syn}}$ | Recurrent (synapse) model weights; U-NET-like architecture that connects neurons at a given internal tick, $t$. |
$\mathbf{a}^t$ | Pre-activations at internal tick $t$. |
$\mathbf{A}^t$ | History of the most recent pre-activations, maintained as a FIFO list so that it is always length $M$; inputs to the neuron-level models. |
$\theta_{\text{d}}$ | Weights of a single neuron-level model, $d$ of $D$; MLP architecture, unique weights per neuron. |
$\mathbf{Z}^t$ | History of all post-activations up to this internal tick, variable length; used as input for synchronization dot products. |
$\mathbf{S}^t$ | Synchronization matrix at internal tick $t$. In practice we use far fewer neurons than $D$ for separate $\mathbf{S}^t_{\text{out}}$ and $\mathbf{S}^t_{\text{action}}$ synchronization representations. |
$\mathbf{W}_{\text{out}}$, $\mathbf{W}_{\text{in}}$ | Linear weight matrices that project from $\mathbf{S}^t_{\text{out}}$ and $\mathbf{S}^t_{\text{action}}$ to predictions and attention queries, respectively. |
$\mathbf{o}^t$ | Cross attention output. |
The CTM consists of three main ideas: (1) an internal dimension over which neural dynamics can unfold, decoupled from the dimensions of the data; (2) neuron-level models, where each neuron has its own private parameters and processes a history of incoming signals; and (3) neural synchronization used directly as the representation with which the CTM observes data and makes predictions.
While data is undoubtedly crucial for any modeling, the CTM is designed around the idea of internal recurrence and synchronization, where the role of data is somewhat secondary to the internal process itself.
Input data is attended to and ingested at each internal tick based on the current synchronization, and predictions are produced in the same way.
We start by introducing the continuous internal dimension, $t \in \{1, \ldots, T\}$. Unlike conventional sequential models -- such as RNNs or Transformers -- that process inputs step-by-step according to the sequence inherent in the data (e.g., words in a sentence or frames in a video), the CTM operates along a self-generated timeline of internal thought steps. This internal unfolding allows the model to iteratively build and refine its representations, even when processing static or non-sequential data such as images or mazes. To conform with existing nomenclature in related works, we refer to steps along this dimension as internal ticks.
The CTM's internal dimension is that over which the dynamics of neural activity can unfold. We believe that such dynamics are likely a cornerstone of intelligent thought.
A recurrent multi-layer perceptron (MLP), structured in a U-NET fashion, acts as the synapse model, $\theta_{\text{syn}}$, connecting neurons at a given internal tick and producing pre-activations:

$$\mathbf{a}^t = f_{\theta_{\text{syn}}}\big(\text{concat}(\mathbf{z}^t, \mathbf{o}^t)\big) \in \mathbb{R}^{D},$$
where $\mathbf{o}^t$ is the attention output derived from input data. The most recent $M$ pre-activations are then collected into a pre-activation 'history':

$$\mathbf{A}^t = \big[\mathbf{a}^{t-M+1}, \ldots, \mathbf{a}^{t}\big] \in \mathbb{R}^{D \times M}.$$
$M$ effectively defines the length of the history of pre-activations that each neuron-level model works with. Each neuron, $d$, is then given its own privately parameterized MLP that produces what we consider post-activations:

$$z^{t+1}_{d} = g_{\theta_d}\big(\mathbf{A}^t_{d}\big),$$
where $\theta_d$ are the unique parameters for neuron $d$, and $z^{t+1}_d$ is a single unit in the vector $\mathbf{z}^{t+1}$ that contains all post-activations. $\mathbf{A}^t_d$ is an $M$-dimensional vector (a time series). The full set of neuron post-activations, $\mathbf{z}^{t+1}$, is then concatenated with the attention output and fed recurrently into $\theta_{\text{syn}}$ to produce pre-activations for the next step, $t+1$, in the unfolding thought process.
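To make the recurrence concrete, here is a minimal PyTorch sketch of privately parameterized neuron-level models operating on a FIFO pre-activation history, together with the recurrent update. The class and variable names (`NeuronLevelModels`, the plain `nn.Sequential` synapse, the hidden size `H`) are our own illustrative choices, standing in for the U-NET-style synapse model and the exact implementation in the repository.

```python
import torch
import torch.nn as nn

class NeuronLevelModels(nn.Module):
    """One private MLP per neuron (illustrative hidden size H), applied to that
    neuron's length-M history of pre-activations A^t_d to produce z^{t+1}_d."""
    def __init__(self, D: int, M: int, H: int = 16):
        super().__init__()
        # Unique weights for each of the D neurons.
        self.w1 = nn.Parameter(torch.randn(D, M, H) * M ** -0.5)
        self.b1 = nn.Parameter(torch.zeros(D, H))
        self.w2 = nn.Parameter(torch.randn(D, H, 1) * H ** -0.5)
        self.b2 = nn.Parameter(torch.zeros(D, 1))

    def forward(self, A: torch.Tensor) -> torch.Tensor:
        # A: (B, D, M) pre-activation histories, one row per neuron.
        h = torch.relu(torch.einsum("bdm,dmh->bdh", A, self.w1) + self.b1)
        z = torch.einsum("bdh,dho->bdo", h, self.w2) + self.b2
        return z.squeeze(-1)  # (B, D) post-activations z^{t+1}

# Illustrative recurrence with random data (B=2, D=64, M=8, attention width 32, T=5).
B, D, M, d_attn, T = 2, 64, 8, 32, 5
neurons = NeuronLevelModels(D, M)
# Plain MLP stand-in for the U-NET-style synapse model theta_syn.
synapse = nn.Sequential(nn.Linear(D + d_attn, 2 * D), nn.GELU(), nn.Linear(2 * D, D))

z = torch.zeros(B, D)                  # start state (learned in the real model)
A = torch.zeros(B, D, M)               # FIFO history of pre-activations
Z_hist = []                            # post-activation history Z^t
for t in range(T):
    o = torch.randn(B, d_attn)                          # stand-in for attention output o^t
    a = synapse(torch.cat([z, o], dim=-1))              # pre-activations a^t
    A = torch.cat([A[:, :, 1:], a.unsqueeze(-1)], -1)   # keep only the last M
    z = neurons(A)                                      # post-activations z^{t+1}
    Z_hist.append(z)
Z = torch.stack(Z_hist, dim=-1)        # (B, D, T), used for synchronization below
```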
How should the CTM interact with the outside world? Specifically, how should the CTM consume inputs and produce outputs? We introduced a timing dimension over which something akin to thought can unfold. We also want the CTM's relationship with data (its interaction, so to speak) to depend not on a snapshot of the state of neurons (at some internal tick $t$), but rather on the ongoing temporal dynamics of neuron activities. By way of solution, we turn again to natural brains for inspiration and find the concept of neural synchronization.
The length of $\mathbf{Z}^t$ is equal to the current internal tick, meaning that this dimension is not fixed and can be arbitrarily large. We define neural synchronization as the matrix yielded by the inner dot product between post-activation histories:

$$\mathbf{S}^t = \mathbf{Z}^t \cdot \big(\mathbf{Z}^t\big)^{\top} \in \mathbb{R}^{D \times D}.$$
Since this matrix scales as $O(D^2)$, it makes practical sense to subsample $(i, j)$ row-column pairs, each of which captures the synchronization between neurons $i$ and $j$. To do so we randomly select $D_{\text{out}}$ and $D_{\text{action}}$ pairs from $\mathbf{S}^t$, thus collecting two synchronization representations, $\mathbf{S}^t_{\text{out}}$ and $\mathbf{S}^t_{\text{action}}$. $\mathbf{S}^t_{\text{out}}$ can then be projected to an output space as:

$$\mathbf{y}^t = \mathbf{W}_{\text{out}} \cdot \mathbf{S}^t_{\text{out}}.$$
As the model width, $D$, grows, the synchronization representation grows with $\frac{D \times (D+1)}{2}$, offering opportunities for improved expressiveness without the need for more parameters to project a latent space to this size.
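As an illustrative calculation of this growth (the width $D = 2048$ is our own example, not a configuration from our experiments): such a CTM would have $\frac{2048 \times 2049}{2} = 2{,}098{,}176$ unique neuron pairings available to the synchronization representation, even though the post-activation vector itself has only 2048 units.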
$\mathbf{S}^t_{\text{action}}$ can be used to take actions in the world (e.g., via attention, as in our setup):

$$\mathbf{q}^t = \mathbf{W}_{\text{in}} \cdot \mathbf{S}^t_{\text{action}},$$
where $\mathbf{W}_{\text{in}}$ and $\mathbf{W}_{\text{out}}$ are learned weight matrices that project synchronization into vectors for observation (e.g., attention queries, $\mathbf{q}^t$) or outputs (e.g., logits, $\mathbf{y}^t$). Even though there are $\frac{D \times (D+1)}{2}$ unique pairings in $\mathbf{S}^t$, $D_{\text{out}}$ and $D_{\text{action}}$ can be orders of magnitude smaller than this. That said, the full synchronization matrix is a large representation that has high future potential.
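The sketch below (continuing the PyTorch example above, with our own variable names) shows one way the synchronization dot products could be computed for randomly chosen neuron pairs and then projected into outputs and attention queries; the $1/\sqrt{t}$ scaling is an illustrative stabilizer, not necessarily the exact choice made in the repository.

```python
import torch

def synchronization_features(Z, idx_i, idx_j):
    """Z: (B, D, t) post-activation history.  idx_i/idx_j: (P,) neuron indices
    defining P sampled (i, j) pairs.  Returns (B, P): the dot product over time
    between the two neurons' activation histories, scaled for stability."""
    S_pairs = (Z[:, idx_i, :] * Z[:, idx_j, :]).sum(dim=-1)   # (B, P)
    return S_pairs / Z.shape[-1] ** 0.5                       # illustrative 1/sqrt(t) scaling

B, D, t = 2, 64, 5
Z = torch.randn(B, D, t)                                      # e.g., the Z built in the loop above

# Randomly choose neuron pairs once at initialization and keep them fixed.
D_out, D_action = 16, 8
pairs_out = torch.randint(0, D, (2, D_out))
pairs_action = torch.randint(0, D, (2, D_action))

S_out = synchronization_features(Z, pairs_out[0], pairs_out[1])            # (B, D_out)
S_action = synchronization_features(Z, pairs_action[0], pairs_action[1])   # (B, D_action)

W_out = torch.nn.Linear(D_out, 10, bias=False)      # projects S^t_out to logits y^t
W_in = torch.nn.Linear(D_action, 32, bias=False)    # projects S^t_action to queries q^t
y_t, q_t = W_out(S_out), W_in(S_action)
```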
In most of our experiments we used standard cross attention to ingest data:

$$\mathbf{o}^t = \text{Attention}\big(Q = \mathbf{q}^t,\ KV = \text{FeatureExtractor}(\text{data})\big),$$
where a 'FeatureExtractor' model, e.g., a ResNet, first produces a feature representation of the data that supplies the keys and values for attention.
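Below is a hedged sketch of this ingestion step using `torch.nn.MultiheadAttention` with a torchvision ResNet-18 trunk as the feature extractor; the specific backbone, widths, and single-query setup are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torchvision

d_model, n_heads = 32, 4
backbone = torchvision.models.resnet18(weights=None)
trunk = nn.Sequential(*list(backbone.children())[:-2])    # feature map (B, 512, H', W')
to_kv = nn.Conv2d(512, d_model, kernel_size=1)            # match the attention width
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

images = torch.randn(2, 3, 64, 64)
feats = to_kv(trunk(images))                              # (B, d_model, H', W')
kv = feats.flatten(2).transpose(1, 2)                     # (B, H'*W', d_model) tokens as keys/values
q_t = torch.randn(2, 1, d_model)                          # stand-in for q^t = W_in @ S^t_action
o_t, attn_weights = attn(q_t, kv, kv)                     # o^t: (B, 1, d_model)
```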
The CTM produces outputs at each internal tick, $t$. A key question arises: how do we optimize the model across this internal temporal dimension? Let $\mathbf{y}^t \in \mathbb{R}^{C}$ be the prediction vector (e.g., class probabilities) at internal tick $t$, where $C$ is the number of classes. Let $y_{\text{true}}$ be the ground-truth target. We can compute a loss at each internal tick using a standard loss function, such as cross-entropy:

$$\mathcal{L}^t = \text{CrossEntropy}\big(\mathbf{y}^t, y_{\text{true}}\big),$$
and a corresponding certainty measure, $\mathcal{C}^t$. We compute certainty simply as 1 minus the normalised entropy of $\mathbf{y}^t$. We compute $\mathcal{L}^t$ and $\mathcal{C}^t$ for all $t \in \{1, \ldots, T\}$, yielding losses and certainties per internal tick, $\{\mathcal{L}^t\}_{t=1}^{T}$ and $\{\mathcal{C}^t\}_{t=1}^{T}$.
A natural question arises: how should we reduce $\{\mathcal{L}^t\}_{t=1}^{T}$ into a scalar loss for learning? Our loss function is designed to optimize CTM performance across the internal thought dimension. Instead of relying on a single step (e.g., the last step), which can incentivize the model to only output at that specific step, we dynamically aggregate information from two internal ticks: the point of minimum loss and the point of maximum certainty:

$$t_1 = \operatorname*{argmin}_{t} \mathcal{L}^t, \qquad t_2 = \operatorname*{argmax}_{t} \mathcal{C}^t.$$
This approach is advantageous because it means that the CTM can perform meaningful computations across multiple internal ticks, it naturally facilitates a curriculum effect, and it enables the CTM to tailor computation based on problem difficulty. The final loss is computed as:

$$\mathcal{L} = \frac{\mathcal{L}^{t_1} + \mathcal{L}^{t_2}}{2}.$$
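The following sketch shows how the per-tick losses and certainties could be computed and aggregated in PyTorch, assuming per-tick logits of shape (batch, classes, ticks); the function name and shapes are our own, and this is a simplified illustration of the loss described above rather than the repository's training code.

```python
import torch
import torch.nn.functional as F

def ctm_loss(logits_per_tick: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits_per_tick: (B, C, T) predictions at each internal tick.
    targets: (B,) class indices.  Aggregates the per-tick losses at the tick of
    minimum loss (t1) and the tick of maximum certainty (t2), per example."""
    B, C, T = logits_per_tick.shape
    losses, certainties = [], []
    for t in range(T):
        logits_t = logits_per_tick[:, :, t]
        losses.append(F.cross_entropy(logits_t, targets, reduction="none"))     # (B,)
        p = F.softmax(logits_t, dim=-1)
        entropy = -(p * p.clamp_min(1e-9).log()).sum(dim=-1)
        certainties.append(1.0 - entropy / torch.log(torch.tensor(float(C))))   # 1 - normalized entropy
    L = torch.stack(losses, dim=-1)          # (B, T)
    cert = torch.stack(certainties, dim=-1)  # (B, T)
    t1 = L.argmin(dim=-1)                    # tick of minimum loss
    t2 = cert.argmax(dim=-1)                 # tick of maximum certainty
    idx = torch.arange(B)
    return 0.5 * (L[idx, t1] + L[idx, t2]).mean()

# Example: 4 examples, 10 classes, 8 internal ticks.
logits = torch.randn(4, 10, 8, requires_grad=True)
loss = ctm_loss(logits, torch.randint(0, 10, (4,)))
loss.backward()
```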
Please take a look at our Technical Report for more information.
Specifically, it includes additional information on how we enable the CTM to learn short- versus long-term time dependencies.
This is a subset of results from our ImageNet experiments (see the Technical Report for more). Crucially, the CTM enables adaptive compute, where the number of internal ticks (how much thought the CTM puts into the problem) can be cut short. These figures show what can be expected in terms of accuracy when cutting thinking short. Only marginal gains are had past a certain point, but gains nonetheless.
Fig 4. shows where the CTM looks as it reasons about the data. We show the attention weights for all 16 heads and mark where the model is looking for each head (and on average at the top). The predictions are shown on the bottom left and the certainty over time on the bottom right. Fig 6. shows a visualization of neural activity as the CTM thinks about a single image: note the multi-scale structure and how activity seems to 'flow'.
We never set out to train a model that achieved some remarkable new state-of-the-art performance on ImageNet. AI researchers already expect high performance on ImageNet after over a decade of research that uses it. Instead, we wanted to show just how different and interesting the CTM's interaction with data can be. The videos on the left/above demonstrate the thought process the CTM undertakes and the figures show its benefits.
Let's contextualize just what's going on here: the CTM is looking around these images, all the while building up its prediction, all by using the synchronization of neural activity directly as a representation. The neural dynamics we showed earlier are actually examples of dynamics from a CTM observing ImageNet! The paths output by the CTM in the maze demo are akin to the class predictions made here.
Biological intelligence is still superior to AI in many cases, particularly when it comes to flexible and general problem solving.
The details on model hyper-parameters can be found in the Technical Report.
Solving mazes is a challenging task for machines, especially when the solution must be produced directly as a sequence of moves rather than recovered by explicit search.
We trained a CTM on a new setup, requiring it to directly predict a path (truncated for simplicity) from start to finish in the form of steps: Left, Right, Up, Down, or Wait. A small version of the resultant model can be explored in the interactive demo at the top of this page. We show a demonstration of a larger model here. Remarkably, the attention pattern is intuitive and follows the solution, all while using neural synchronization as a representation. It even generalizes beyond the truncated path! See the Technical Report.
Each video below shows how well the CTM generalizes to bigger and more complex mazes, while retaining its reasoning prowess. To generate these we used a CTM trained to solve a path up to length 100 on 39 x 39 mazes, but the mazes shown here are of size 99 x 99 and the full paths are roughly 6x as long.
Why run these experiments? We know that neural networks can be tailored to solve 2D mazes if we present the data in the "right" way. But, when presented in a fashion that requires a clear process through which the model must progress, existing methods fall short. Even current SoTA LLMs rely on tool use, which is impressive in its own right, but somewhat unsatisfying: an intelligent machine should be demonstrably intelligent, and humans don't require tools to solve these mazes.
We set out to show that the CTM has the capacity to learn when complex reasoning is required, unlike the most comparable baseline methods. We also show how the CTM generalizes to larger and more complex mazes, indicating that its internal reasoning is not merely memorization, but rather a more natural and correct way to solve the underlying maze problem. Importantly, we made no specific structural changes to the model compared to the CTM we trained for ImageNet; the only meaningful change was to the output: the solution is predicted over a 2D class space, with cross entropy applied at each step.
We chose our setup carefully: (1) we used no positional embedding for attention; and (2) we required that the models predict the routes directly as a string of classes (e.g., go left, left, right, up, etc.). By forgoing positional embedding the CTM must build an internal world model in order to query the data and navigate the maze. The fact that it does so in such a convincing fashion is remarkable.
We have some strong evidence that the CTM is capable of solving challenging problems, and it does so in intuitive and interesting ways. The fact that it can solve mazes by building an internal world model "on the fly" without any positional embedding opens up avenues for future research. For instance, we would like to see how the CTM finds its way around more complex environments (e.g., games or videos) without any explicit positional encodings.
The parity of a binary sequence, given by the sign of the product of its elements, can reasonably be predicted by an RNN when the data is fed sequentially - the model need only maintain an internal state, flipping a 'switch' whenever a negative number is encountered. When the entire sequence is provided at once, however, the task is significantly more challenging.
We trained CTMs to solve a variant of this parity task: the model is input with a 64-length binary vector, and must predict the cumulative parity at each of the 64 positions.
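As a concrete illustration of this setup (our own sketch, assuming the inputs are encoded as ±1 values), the cumulative parity targets can be generated as follows:

```python
import torch

def make_parity_batch(batch_size: int = 32, length: int = 64):
    """Inputs: random vectors with entries in {-1, +1}.  Targets: the cumulative
    parity at every position, as class 0 (positive) or 1 (negative)."""
    x = torch.randint(0, 2, (batch_size, length)) * 2 - 1   # values in {-1, +1}
    cumulative_product = torch.cumprod(x, dim=1)            # its sign tracks the parity so far
    y = (cumulative_product < 0).long()                     # 1 wherever the parity is negative
    return x.float(), y

x, y = make_parity_batch()
print(x.shape, y.shape)   # torch.Size([32, 64]) torch.Size([32, 64])
```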
We compare the accuracy of CTMs trained with different numbers of internal ticks to parameter matched LSTMs. We found that CTMs with over 75 internal ticks could reliably solve this task, with some runs achieving 100% accuracy. The LSTMs, on the other hand, struggled to learn with over 10 internal ticks, suggesting that LSTMs are not well suited to unfolding an internal thought dimension.
The left/above demonstration shows the solving process of the CTM: the movement of the attention weights, their argmax overlaid on the inputs, the model's predictions, the target, and the neuron activations. Notice how the attention moves backwards through the data and determines the solution after observing the entire input. Some attention heads display interpretable behavior; for example, the first attention head attends only to negative-parity positions.
We visualise the learned algorithms by plotting the accuracy (top) and attention weights (bottom) over the 75 internal ticks for each position in the 64-length sequence, at different points during training. One model (left) attends to the data in reverse order before predicting the cumulative parity at once; the other attends forward, predicting parity incrementally. Both achieve perfect accuracy.
The ability of the CTM to search through the data in reverse order suggests that the CTM is carrying out some form of planning, building up its understanding of the data before making a final decision -- the CTM is capable of forming and following a strategy.
To assess the CTM’s ability to memorise and recall information, we design a Question and Answering (Q&A) MNIST task. In this task, the model first observes a sequence of MNIST digits, followed by a series of interleaved index and operator embeddings that specify which digits should be recalled and which modular operation should be applied. Once all digits and index/operator embeddings have been presented, a zero-tensor flag signals the model to produce its final answer. An example is shown below.
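To complement the pictured example, here is a hedged sketch of how one such episode might be assembled; the number of digits, the operator set, and the modulus are illustrative assumptions, and integer stand-ins replace the actual MNIST images.

```python
import random

def make_qamnist_episode(num_digits: int = 4, num_ops: int = 2, modulus: int = 10):
    """Assemble one illustrative episode: digits, then interleaved index/operator
    queries, then a zero-flag requesting the modular result."""
    digits = [random.randint(0, 9) for _ in range(num_digits)]        # stand-ins for MNIST images
    indices = [random.randrange(num_digits) for _ in range(num_ops + 1)]
    operators = [random.choice(["+", "-"]) for _ in range(num_ops)]

    answer = digits[indices[0]]
    for op, idx in zip(operators, indices[1:]):
        delta = digits[idx] if op == "+" else -digits[idx]
        answer = (answer + delta) % modulus

    # Observation stream presented to the model, one element per timestep.
    stream = [("digit", d) for d in digits] + [("index", indices[0])]
    for op, idx in zip(operators, indices[1:]):
        stream += [("operator", op), ("index", idx)]
    stream.append(("answer_flag", 0))
    return stream, answer

stream, answer = make_qamnist_episode()
```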
In our experiments, the memory length of the CTMs is such that the MNIST digits always lie outside of the activation history window used by the neuron-level models. In this way, the CTM must organize its activations such that it can recall digits at later timesteps.
Our results show that, while the LSTM outperforms the CTM when only a single internal tick is used to process each input, the LSTM becomes more unstable when more internal ticks are used. The CTM, on the other hand, exhibits stronger performance with increasing internal ticks, achieving over 95% accuracy in the most challenging in-distribution task.
Furthermore, we highlight the ability of the CTM to recall digit values observed many timesteps in the past, arising purely from the organization and synchronization of neurons. This strong performance suggests that processing timing information through the synchronization of neuron activations may be a powerful mechanism for memorization and recall.
We examine the generalization capabilities of the CTM by measuring the accuracy of the model when input with more digits or index-operator embeddings than observed during training, depicted below, with the training regime marked in red. We find that both the CTM and the LSTM baseline can generalize to an increased number of operations. Empirically, we find that this generalization arises from the model’s approach to solving the task: each time a new index embedding is presented, the model computes and stores the result of the specified operation, regardless of whether the answer flag has been given. This enables it to continue processing a stream of index and operator embeddings without needing to wait for a final signal.
In this section we test the CTM on CIFAR-10, comparing it to human performance, a feed-forward baseline, and an LSTM baseline. The purpose of this experiment was to contextualize the performance of the CTM alongside a standard feed-forward baseline, an LSTM baseline that can also use internal ticks for reasoning, and humans. We used a restricted backbone to highlight the differences between models (details in the Technical Report).
We used two datasets of human labels for CIFAR-10; we call these CIFAR-10D and CIFAR-10H.
For the human calibration we used the probabilities provided in CIFAR-10H, which were computed from the guesses of multiple human annotators. We computed calibration (Fig 16b.) as we did for ImageNet: the predictive probability is the average probability for the chosen class over all internal ticks (for both the CTM and the LSTM). The CTM demonstrates the best calibration, even when compared to humans.
Fig 17. shows the neural activities for the CTM and the LSTM baseline. The CTM yields rich, diverse, and complex dynamics with multiple interesting features, including periodic behavior (there is no periodic driving function). The distinct difference between the CTM and LSTM neural activities is evidence that the two novel elements of the CTM (neuron-level models and synchronization as a representation) enable neural dynamics as a fundamental computational mechanism.
Fig 18. shows what happens when we vary the number of neurons (i.e., the model width) while keeping all else constant, including the training time. As with other models, a wider network could evidently benefit from a longer training time or different training hyper-parameters, hence the reduction in accuracy in Fig 18a. For Fig 18b. and Fig 18c. we set out to understand how unique the neuron-level models tend to be and how this relates to model width, as measured by the cosine similarity between the dynamics of different neurons. Fig 18b. shows that a wider model (i.e., more neurons) exhibits more diversity, not less. One might expect that with more neurons there is less 'space' for diversity, but we observed the opposite.
Fig 19. shows the relationship between predictions and the number of internal ticks used by the CTM. We trained several CTMs (again keeping all other variables constant). In Fig 19b. we plot, over the data, the distribution of the internal tick at which the CTM is most certain (i.e., $t_2$ in the loss function). This shows that the CTM uses a wide range of ticks to become most certain about the data it observes. For each setup (25, 50, and 100 internal ticks), there are two concentrated areas in the distributions, indicating that the CTM follows separate internal processes depending on the data.
For these experiments we trained a CTM to sort lists of 30 real numbers. The purpose of this experiment was twofold: (1) to understand if and when the CTM applies more or less compute in a controlled environment; and (2) to see if we can train the CTM to output a sequence in sequential order using the CTC loss. This CTM could sort a length-30 list of real numbers correctly approximately 80% of the time.
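For readers curious how CTC can supervise per-tick outputs against a sorted target sequence, here is a minimal PyTorch sketch; treating a 'wait' class as the CTC blank is our illustrative reading, and the shapes and tick count are arbitrary choices, not the repository's exact training setup.

```python
import torch
import torch.nn as nn

N, T_ticks, L = 4, 75, 30        # batch size, internal ticks, list length
num_classes = L + 1              # one class per sorted position plus a blank/"wait" class (index 0)

# Per-tick logits from the CTM output head, log-softmaxed as CTC expects: (T, N, C).
log_probs = torch.randn(T_ticks, N, num_classes, requires_grad=True).log_softmax(-1)

# Targets: for each list, the class sequence describing the sorted order (values 1..L here).
targets = torch.stack([torch.randperm(L) + 1 for _ in range(N)])     # (N, L)
input_lengths = torch.full((N,), T_ticks, dtype=torch.long)
target_lengths = torch.full((N,), L, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```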
We have shown that the CTM can process non-sequential data via a continuous thought dimension. Here, we extend the CTM to tasks involving interaction with an external environment, training CTMs with proximal policy optimization (PPO).
Although our results show that the CTM achieves performance comparable to the LSTM baseline, the central goal of this section is to provide evidence that the CTM can learn in a continuous environment.
The Continuous Thought Machine (CTM) represents a novel step towards bridging computational efficiency with biological plausibility in artificial intelligence. By moving beyond traditional pointwise activation functions to private neuron-level models, the CTM cultivates far richer neuron dynamics. Crucially, it leverages neural synchronization as a powerful and fundamentally new type of representation - distinct from the activation vectors prevalent since the early days of neural networks. This direct use of neuron dynamics as a first-class representational citizen allows the CTM to exhibit behaviors qualitatively different from contemporary models.
Our research demonstrates the tangible benefits of this approach. The CTM can dynamically build representations over time for tasks like image classification, form rich internal maps to attend to specific input data without positional embeddings, and naturally exhibit adaptive computation. Furthermore, it learns to synchronize neural dynamics to store and retrieve memories beyond its immediate activation history. This internal processing also lends itself to greater interpretability, as seen in its methodical solving of mazes and parity tasks.
Remarkably, the core CTM architecture remained largely consistent across a diverse range of challenging tasks, requiring only input/output module adjustments. This versatility and trainability were particularly evident in complex scenarios like maze navigation, where the CTM succeeded with minimal tuning while a traditional model like the LSTM still struggled even after significant tuning efforts.
This work underscores a vital, yet often underexplored, synergy between neuroscience and machine learning. While modern AI is ostensibly brain-inspired, the two fields often operate in surprising isolation. The CTM serves as a testament to the power of drawing inspiration from biological principles. By starting with such inspiration and iteratively following the emergent, interesting behaviors, we developed a model with unexpected capabilities, such as its surprisingly strong calibration in classification tasks, a feature that was not explicitly designed for.
It is crucial to note that our approach advocates for borrowing concepts from biology rather than insisting on strict, literal plausibility; real neurons may not access their activation history as modeled in the CTM, yet emergent phenomena like traveling waves still manifest. This nuanced balance between practicality and biological inspiration opens a landscape of new research directions, which may hold the key to unlocking capabilities currently missing in AI, potentially leading to systems that exhibit more human-like intelligence and address its current limitations.
When we initially asked, "why do this research?", we hoped the journey of the CTM would provide compelling answers. By embracing light biological inspiration and pursuing the novel behaviors observed, we have arrived at a model with emergent capabilities that exceeded our initial designs. We are committed to continuing this exploration, borrowing further concepts to discover what new and exciting behaviors will emerge, pushing the boundaries of what AI can achieve.
For attribution in academic contexts, please cite this work as
Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. (2025). Continuous Thought Machines. Sakana AI Technical Report.
BibTeX citation
@techreport{darlow2025ctm,
  author      = {Luke Darlow and Ciaran Regan and Sebastian Risi and Jeffrey Seely and Llion Jones},
  title       = {{Continuous Thought Machines}},
  institution = {Sakana AI},
  year        = {2025},
  month       = {April},
  note        = {Technical Report}
}
We release our code for this project here.
Please view the PDF version of the paper for the appendix, which contains additional details and experiments.