
Continuous Thought Machines

tl;dr
Neurons in biological brains use timing and synchronization as part of how they compute, a property that seems essential for the flexibility and adaptability of biological intelligence. Modern AI systems discard this fundamental property in favor of efficiency and simplicity. We found a way of bridging the gap between the power and scalability of modern AI and the biologically plausible paradigm in which neuron timing matters. The results have been surprising and encouraging.

Interactive demonstration

[Interactive maze-solving demo. Click to move the start/end points (toggle with 'move').]

Introduction

Neural networks (NNs) were originally inspired by biological brains, yet they remain significantly distinct from their biological counterparts. Brains demonstrate complex neural dynamics that evolve over time, but modern NNs intentionally abstract away such temporal dynamics in order to facilitate large-scale deep learning. For instance, the activation functions of standard NNs can be seen as an intentional abstraction of a neuron's firing rate, replacing the temporal dynamics of biological processes with a single, static value. Such simplifications, though enabling significant advancements in large-scale machine learning, have resulted in a departure from the fundamental principles that govern biological neural computation.

Over hundreds of millions of years, evolution has endowed biological brains with rich neural dynamics, including spike-timing-dependent plasticity (STDP) and neuronal oscillations. Emulating these mechanisms, particularly the temporal coding inherent in spike timing and synchrony, presents a significant challenge. Consequently, modern neural networks do not rely on temporal dynamics to perform computation, but rather prioritize simplicity and computational efficiency. This abstraction, while boosting performance on specific tasks, contributes to a recognized gap between the flexible, general nature of human cognition and current AI capabilities, suggesting fundamental components, potentially related to temporal processing, are missing from our current models.

Why do this research?

Indeed, the notably high performance of modern AI across many fields might suggest that emulating neural dynamics is unwarranted. However, the gap between the highly flexible and general nature of human cognition and the current state of modern AI suggests that components are missing from our current models.

For these reasons, we argue that time should be a central component of artificial intelligence in order for it to eventually achieve levels of competency that rival or surpass human brains. In this work, we therefore address the limitation imposed by overlooking neural activity as a central aspect of intelligence, and introduce the Continuous Thought Machine (CTM), a novel neural network architecture designed to explicitly incorporate neural timing as a foundational element. We outline our contributions in the sections that follow.

Reasoning models and recurrence

The frontier of artificial intelligence faces a critical juncture: moving beyond simple input-output mappings towards genuine reasoning capabilities. While scaling existing models has yielded remarkable advancements, the associated computational cost and data demands are unsustainable and raise questions about the long-term viability of this approach. For sequential data, longstanding recurrent architectures have largely been superseded by transformer-based approaches. Nevertheless, recurrence is re-emerging as a natural avenue for extending model complexity, because it enables iterative processing and the accumulation of information over time. Modern text generation models (sometimes referred to as 'reasoning models') use intermediate generations as a form of recurrence that enables additional compute at test time. Recently, other works have demonstrated the benefits of recurrently applying latent layers. While such methods bring us closer to the recurrent structure of biological brains, a fundamental gap nevertheless remains. We posit that recurrence, while essential, is merely one piece of the puzzle. The temporal dynamics unlocked by recurrence -- the precise timing and interplay of neural activity -- are equally crucial. The CTM differs from existing approaches in three ways: (1) a decoupled internal dimension enables sequential thought on any conceivable data modality; (2) private neuron-level models enable the consideration of precise neural timing; and (3) neural synchronization is used directly as the representation with which the CTM solves tasks.


Method

Fig 1. The Continuous Thought Machine: a single step in its internal recurrent process.
The CTM unfolds neural activity internally as it thinks about data. At each step (one of which is demonstrated above), a truncated history of pre-activations is collected and used by the neuron-level models (NLMs). The history of post-activations produced by all NLMs over time is kept and used to compute neuron-to-neuron synchronization over time. The result is a synchronization representation: a new, parameter-efficient, and evidently powerful representation that the CTM uses to observe (via attention) and predict.

The Continuous Thought Machine (CTM) is a neural network architecture that enables a novel approach to thinking about data. It departs from conventional feed-forward models by explicitly incorporating neural dynamics as the central component of its functionality. The video above gives a pictorial overview of the internal workings of the CTM. We give all technical details, including additional figures and verbose explanations, in our Technical Report. A GitHub repository is also available, and we provide links to relevant parts of the repository as we explain the model below.

CTM architecture
Fig 2. CTM architecture: The (1) synapse model (weights depicted as blue lines) models the cross-neuron interactions to produce pre-activations. For each neuron, a (2) history of pre-activations is kept, the most recent of which are used by the (3) neuron-level models (weights depicted as red lines) to produce (4) post-activations. A (5) history of post-activations is also kept and used to (6) compute a synchronization matrix. Neuron pairs are (7) selected from the synchronization matrix, yielding the (8) latent representations with which the CTM (9) produces outputs and modulates data through cross-attention. Modulated data (e.g., attention outputs) are (10) concatenated with post-activations for the next internal tick.
Variable descriptions:

$\mathbf{z}^t$: Post-activations at internal tick $t$, after the neuron-level models have been applied.
$\theta_{\text{syn}}$: Recurrent (synapse) model weights; a U-NET-like architecture that connects neurons at a given internal tick, $t$.
$\mathbf{a}^t$: Pre-activations at internal tick $t$.
$\mathbf{A}^t$: History of the most recent pre-activations, maintained as a FIFO list of fixed length $M$; the inputs to the neuron-level models.
$\theta_{\text{d}}$: Weights of a single neuron-level model, $d$ of $D$; an MLP architecture with unique weights per neuron.
$\mathbf{Z}^t$: History of all post-activations up to the current internal tick (variable length); used as input for the synchronization dot products.
$\mathbf{S}^t$: Synchronization matrix at internal tick $t$. In practice we use far fewer neurons than $D$ for the separate $\mathbf{S}^t_{\text{out}}$ and $\mathbf{S}^t_{\text{action}}$ synchronization representations.
$\mathbf{W}_{\text{out}}$, $\mathbf{W}_{\text{in}}$: Linear weight matrices that project from $\mathbf{S}^t_{\text{out}}$ and $\mathbf{S}^t_{\text{action}}$ to predictions and attention queries, respectively.
$\mathbf{o}^t$: Cross-attention output.

The CTM consists of three main ideas:

  1. The use of internal recurrence, enabling a dimension over which a concept analogous to thought can occur. The entire process visualised in the video above is a single tick; the interactive maze demo at the top of the page uses 75 ticks. This recurrence is completely decoupled from any data dimensions.
  2. Neuron-level models, which compute post-activations by applying private (i.e., per-neuron) MLPs to a history of incoming pre-activations.
  3. Synchronization as a representation, where the neural activity over time is tracked and used to compute how pairs of neurons synchronize with one another over time. This measure of synchronization is the representation with which the CTM takes action and makes predictions. Listing 3 in the Technical Report shows the logic for this, and Appendix K details how we use a recursive computation for efficiency.

But what about data?

While data is undoubtedly crucial for any modeling, the CTM is designed around the idea of internal recurrence and synchronization, where the role of data is somewhat secondary to the internal process itself.

Input data is attended to and ingested at each internal tick based on the current synchronization, and predictions are likewise produced from the current synchronization.

Fig 3. Neural Dynamics when thinking about ImageNet: Each subplot is the activity of a single neuron over time. It is the synchronization between these that forms the representation used by the CTM.

Internal ticks: the 'thought' dimension

We start by introducing the continuous internal dimension: \(t \in \{1, \ldots, T\}\). Unlike conventional sequential models -- such as RNNs or Transformers -- that process inputs step-by-step according to the sequence inherent in the data (e.g., words in a sentence or frames in a video), the CTM operates along a self-generated timeline of internal thought steps. This internal unfolding allows the model to iteratively build and refine its representations, even when processing static or non-sequential data such as images or mazes. To conform with existing nomenclature used in related works, we refer to these thought steps as 'internal ticks' from here on.

A dimension over which thought can unfold.

The CTM's internal dimension is that over which the dynamics of neural activity can unfold. We believe that such dynamics are likely a cornerstone of intelligent thought.

Recurrent weights: synapses

A recurrent multi-layer perceptron (MLP), structured in a U-NET fashion, acts as the synapse model for the CTM. At any internal tick \(t\), the synapse model produces what we consider pre-activations:

\[
\mathbf{a}^t = f_{\theta_{\text{syn}}}(\text{concat}(\mathbf{z}^t, \mathbf{o}^t)) \in \mathbb{R}^D,
\]

where \(\mathbf{o}^t\) is derived from input data (via attention, as described below). The \(M\) most recent pre-activations are then collected into a pre-activation 'history':

\[
\mathbf{A}^t = \begin{bmatrix} \mathbf{a}^{t-M+1} & \mathbf{a}^{t-M+2} & \cdots & \mathbf{a}^t \end{bmatrix} \in \mathbb{R}^{D \times M}.
\]
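As a concrete illustration, here is a minimal PyTorch sketch of this step. The class and variable names are ours, and a plain MLP stands in for the U-NET-style synapse model; consult the GitHub repository for the actual implementation.

```python
import torch
import torch.nn as nn

class SynapseSketch(nn.Module):
    """Toy stand-in for the synapse model f_theta_syn (the real model is U-NET-like)."""
    def __init__(self, d_neurons: int, d_attn: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_neurons + d_attn, hidden),
            nn.SiLU(),
            nn.Linear(hidden, d_neurons),
        )

    def forward(self, z_t: torch.Tensor, o_t: torch.Tensor) -> torch.Tensor:
        # Pre-activations a^t from the current post-activations z^t and attention output o^t.
        return self.net(torch.cat([z_t, o_t], dim=-1))

def update_pre_activation_history(A: torch.Tensor, a_t: torch.Tensor) -> torch.Tensor:
    """FIFO update of the history A^t: drop the oldest of the M columns, append a^t.

    A: (batch, D, M), a_t: (batch, D) -> returns (batch, D, M).
    """
    return torch.cat([A[:, :, 1:], a_t.unsqueeze(-1)], dim=-1)
```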

Neuron-level models

\(M\) effectively defines the length of the pre-activation history that each neuron-level model works with. Each neuron \(d \in \{1, \ldots, D\}\) is then given its own privately parameterized MLP that produces what we consider post-activations:

\[
\mathbf{z}_d^{t+1} = g_{\theta_d}(\mathbf{A}_d^t),
\]

where \(\theta_d\) are the unique parameters for neuron \(d\), and \(\mathbf{z}_d^{t+1}\) is a single unit in the vector that contains all post-activations. \(\mathbf{A}_d^t\) is an \(M\)-dimensional vector (a time series). The full set of neuron post-activations is then concatenated with the attention output and fed recurrently into \(f_{\theta_{\text{syn}}}\) to produce pre-activations for the next step, \(t+1\), in the unfolding thought process.
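A sketch of the neuron-level models in the same spirit: the per-neuron MLPs are batched into single weight tensors and applied with einsum so all \(D\) neurons are evaluated in parallel. The hidden width and activation are illustrative choices, not necessarily those of the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuronLevelModels(nn.Module):
    """One private MLP per neuron, applied to that neuron's length-M pre-activation history."""
    def __init__(self, d_neurons: int, memory_len: int, hidden: int = 16):
        super().__init__()
        # Per-neuron weights stored as batched tensors: (D, M, H) and (D, H, 1).
        self.w1 = nn.Parameter(torch.randn(d_neurons, memory_len, hidden) / memory_len ** 0.5)
        self.b1 = nn.Parameter(torch.zeros(d_neurons, hidden))
        self.w2 = nn.Parameter(torch.randn(d_neurons, hidden, 1) / hidden ** 0.5)
        self.b2 = nn.Parameter(torch.zeros(d_neurons, 1))

    def forward(self, A: torch.Tensor) -> torch.Tensor:
        # A: (batch, D, M) -> post-activations z^{t+1}: (batch, D).
        h = F.silu(torch.einsum('bdm,dmh->bdh', A, self.w1) + self.b1)
        z = torch.einsum('bdh,dho->bdo', h, self.w2) + self.b2
        return z.squeeze(-1)
```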

Synchronization as a representation: modulating data

How should the CTM interact with the outside world? Specifically, how should the CTM consume inputs and produce outputs? We introduced a timing dimension over which something akin to thought can unfold. We also want the CTM's relationship with data (its interaction, so to speak) to depend not on a snapshot of the state of neurons (at some tt), but rather on the ongoing temporal dynamics of neuron activities. By way of solution, we turn again to natural brains for inspiration and find the concept of neural synchronization both fitting and powerful. For synchronization we start by collecting the post-activations into a post-activation 'history':

\[
\mathbf{Z}^t = \begin{bmatrix} \mathbf{z}^{1} & \mathbf{z}^{2} & \cdots & \mathbf{z}^t \end{bmatrix} \in \mathbb{R}^{D \times t}.
\]

The length of \(\mathbf{Z}^t\) is equal to the current internal tick, meaning that this dimension is not fixed and can be arbitrarily large. We define neural synchronization as the matrix yielded by the inner product between post-activation histories:

\[
\mathbf{S}^t = \mathbf{Z}^t \cdot (\mathbf{Z}^t)^\intercal \in \mathbb{R}^{D \times D}.
\]

Since this matrix scales as \(O(D^2)\), it makes practical sense to subsample \((i,j)\) row-column pairs, each of which captures the synchronization between neurons \(i\) and \(j\). To do so we randomly select \(D_\text{out}\) and \(D_\text{action}\) \((i,j)\) pairs from \(\mathbf{S}^t\), thus collecting two synchronization representations, \(\mathbf{S}^t_\text{out} \in \mathbb{R}^{D_\text{out}}\) and \(\mathbf{S}^t_\text{action} \in \mathbb{R}^{D_\text{action}}\). \(\mathbf{S}^t_\text{out}\) can then be projected to an output space as:

\[
\mathbf{y}^t = \mathbf{W}_{\text{out}} \cdot \mathbf{S}^t_\text{out}.
\]
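In code, the subsampled synchronization and the output projection amount to a few lines. The sketch below is illustrative (the pair-selection scheme and sizes are our choices); Appendix K of the Technical Report describes a recursive computation that avoids storing the full history.

```python
import torch
import torch.nn as nn

def synchronization_pairs(Z: torch.Tensor, idx_i: torch.Tensor, idx_j: torch.Tensor) -> torch.Tensor:
    """Entries S^t_{ij} = <Z_i, Z_j> for the selected neuron pairs.

    Z: (batch, D, t) post-activation history; returns (batch, P) for P selected pairs.
    Equivalently, these values can be accumulated recursively as
    S_pairs += z^t[idx_i] * z^t[idx_j] at every tick, avoiding storage of Z.
    """
    return (Z[:, idx_i, :] * Z[:, idx_j, :]).sum(dim=-1)

D, D_out, n_classes = 1024, 512, 10                 # illustrative sizes
idx_i = torch.randint(0, D, (D_out,))
idx_j = torch.randint(0, D, (D_out,))
W_out = nn.Linear(D_out, n_classes, bias=False)     # the W_out projection above

Z = torch.randn(8, D, 20)                           # batch of 8 after 20 internal ticks
S_out = synchronization_pairs(Z, idx_i, idx_j)      # S^t_out
y_t = W_out(S_out)                                  # predictions y^t
```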

Synchronization enables a very large representation.

As the model width, D, grows, the synchronization representation grows as \(\frac{D \times (D+1)}{2}\), offering improved expressiveness without requiring additional parameters to project a latent space up to this size.
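To make this growth concrete, the number of unique neuron pairings for a few example widths (the widths themselves are illustrative):

\[
\begin{aligned}
D = 512 &: \quad \tfrac{D(D+1)}{2} = 131{,}328,\\
D = 1024 &: \quad \tfrac{D(D+1)}{2} = 524{,}800,\\
D = 2048 &: \quad \tfrac{D(D+1)}{2} = 2{,}098{,}176.
\end{aligned}
\]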

Modulating input data

\(\mathbf{S}^t_\text{action}\) can be used to take actions in the world (e.g., via attention, as in our setup):

\[
\mathbf{q}^t = \mathbf{W}_{\text{in}} \cdot \mathbf{S}^t_\text{action},
\]

where \(\mathbf{W}_{\text{out}}\) and \(\mathbf{W}_{\text{in}}\) are learned weight matrices that project synchronization into vectors for observation (e.g., attention queries, \(\mathbf{q}^t\)) or outputs (e.g., logits, \(\mathbf{y}^t\)). Even though there are \(D \times (D+1)/2\) unique pairings in \(\mathbf{S}^t\), \(D_\text{out}\) and \(D_\text{action}\) can be orders of magnitude smaller than this. That said, the full synchronization matrix is a large representation that has high future potential.

In most of our experiments we used standard cross-attention:

\[
\mathbf{o}^t = \text{Attention}(Q=\mathbf{q}^t,\ KV=\text{FeatureExtractor}(\text{data})),
\]

where a 'FeatureExtractor' model, e.g., a ResNet, is first used to build useful local features for the keys and values. \(\mathbf{o}^{t}\) is concatenated with \(\mathbf{z}^{t+1}\) for the next cycle of recurrence.
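A minimal sketch of this data-modulation step, using PyTorch's built-in multi-head attention. The single conv layer stands in for the ResNet feature extractor, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

d_sync_action, d_attn, n_heads = 512, 256, 8          # illustrative sizes

W_in = nn.Linear(d_sync_action, d_attn, bias=False)   # q^t = W_in . S^t_action
attention = nn.MultiheadAttention(d_attn, n_heads, batch_first=True)
feature_extractor = nn.Conv2d(3, d_attn, kernel_size=3, stride=2, padding=1)  # stand-in for a ResNet

images = torch.randn(8, 3, 64, 64)
features = feature_extractor(images)                   # (batch, d_attn, 32, 32)
kv = features.flatten(2).transpose(1, 2)               # (batch, tokens, d_attn) keys/values

S_action = torch.randn(8, d_sync_action)               # from the synchronization step
q_t = W_in(S_action).unsqueeze(1)                      # one query per internal tick
o_t, _ = attention(query=q_t, key=kv, value=kv)        # o^t, concatenated with z^{t+1} next tick
o_t = o_t.squeeze(1)
```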

Loss function: optimizing across internal ticks

The CTM produces outputs at each internal tick, \(t\). A key question arises: how do we optimize the model across this internal temporal dimension? Let \(\mathbf{y}^t \in \mathbb{R}^{C}\) be the prediction vector (e.g., class probabilities) at internal tick \(t\), where \(C\) is the number of classes, and let \(y_{\text{true}}\) be the ground-truth target. We can compute a loss at each internal tick using a standard loss function, such as cross-entropy:

\[
\mathcal{L}^t = \text{CrossEntropy}(\mathbf{y}^t, y_{\text{true}}),
\]

and a corresponding certainty measure, \(\mathcal{C}^t\). We compute certainty simply as 1 minus the normalized entropy. We compute \(\mathcal{L}^t\) and \(\mathcal{C}^t\) for all \(t \in \{1, \ldots, T\}\), yielding losses and certainties per internal tick, \(\mathcal{L} \in \mathbb{R}^{T}\) and \(\mathcal{C} \in \mathbb{R}^{T}\).
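For reference, certainty as 1 minus normalized entropy can be computed as in the sketch below; numerical details in the released code may differ.

```python
import math
import torch
import torch.nn.functional as F

def certainty(logits: torch.Tensor) -> torch.Tensor:
    """C^t = 1 - normalized entropy of the predictive distribution.

    logits: (batch, C). Entropy is divided by log(C) so certainty lies in [0, 1].
    """
    p = F.softmax(logits, dim=-1)
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)
    return 1.0 - entropy / math.log(logits.shape[-1])
```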

A natural question arises: how should we reduce \(\mathcal{L}\) to a scalar loss for learning? Our loss function is designed to optimize CTM performance across the internal thought dimension. Instead of relying on a single step (e.g., the last step), which can incentivize the model to only output at that specific step, we dynamically aggregate information from two internal ticks: \(t_1\), the point of minimum loss, and \(t_2\), the point of maximum certainty.

This approach is advantageous because it means that the CTM can perform meaningful computations across multiple internal ticks, naturally facilitates a curriculum effect, and enables the CTM to tailor computation based on problem difficulty. The final loss is computed as:

\[
L = \frac{\mathcal{L}^{t_1} + \mathcal{L}^{t_2}}{2}.
\]
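A sketch of this reduction over internal ticks; the per-sample index selection is our reading of the description above, and the Technical Report and code give the exact details.

```python
import torch

def ctm_loss(losses: torch.Tensor, certainties: torch.Tensor) -> torch.Tensor:
    """Aggregate per-tick losses at t1 (minimum loss) and t2 (maximum certainty).

    losses, certainties: (batch, T), one value per internal tick and sample.
    """
    t1 = losses.argmin(dim=1)                      # tick of minimum loss, per sample
    t2 = certainties.argmax(dim=1)                 # tick of maximum certainty, per sample
    rows = torch.arange(losses.shape[0])
    return 0.5 * (losses[rows, t1] + losses[rows, t2]).mean()
```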

More information in our Technical Report.

Please see the Technical Report for further details, including how we enable the CTM to learn short- versus long-range time dependencies.


Experiment: ImageNet

Demonstrations

Fig 4. Thinking about Images: Top left is the average attention weighting (of the 16 heads shown) when the CTM observes the image on the right. Class predictions are shown on the bottom left and the certainty is shown on the bottom right (green denotes a correct prediction). The small images at the bottom are buttons to load other examples, showing a diversity of certainties and correctness.

Results

Fig 5a. Top-5 Accuracies: using different mechanisms for predictions, the CTM achieves different levels of accuracy per internal tick (thought step). At about 15 ticks it makes sense to account for certainty.
Fig 5b. Calibration: often considered an important measure of how well a model fits the underlying data distribution, the CTM has remarkably good calibration.
Fig 5c. Certainty threshold=0.5: top-5 accuracy at this certainty threshold (black line, bottom right in the videos to the left).
Fig 5d. Certainty threshold=0.9: top-5 accuracy at this certainty threshold (black line, bottom right in the videos to the left).

This is a subset of results from our ImageNet experiments (see the Technical Report for more). Crucially, the CTM enables adaptive compute, where the internal steps (how much thought the CTM puts into the problem) can be cut short. These figures show what can be expected in terms of accuracy when cutting thinking short: only marginal gains are had past a certain point, but gains nonetheless.

Fig 4. shows where the CTM looks as it reasons about the data. We show the attention weights for all 16 heads and mark where the model is looking for each head (and, at the top, on average). The predictions are shown on the bottom left and certainty over time on the bottom right. Fig 6. shows a visualization of neural activity as the CTM thinks about a single image: note the multi-scale structure and how activity seems to 'flow'.

Fig 6. Neural activity: visualised in 2D using a UMAP projection. Each neuron is shown as an individual dot, scaling in size with absolute magnitude, and color with value (blue for negative, red for positive). We show similar visualizations inside later demonstrations.

Discussion

We never set out to train a model that achieved some remarkable new state-of-the-art performance on ImageNet. AI researchers already expect high performance on ImageNet after over a decade of research that uses it. Instead, we wanted to show just how different and interesting the CTM's interaction with data can be. The videos on the left/above demonstrate the thought process the CTM undertakes and the figures show its benefits.

Let's contextualize just what's going on here: the CTM is looking around these images, all the while building up its prediction, all by using the synchronization of neural activity directly as a representation. The neural dynamics we showed earlier are actually examples of dynamics from a CTM observing ImageNet! The paths output by the CTM in the maze demo are akin to the class predictions made here.

The missing ingredient: TIME

Biological intelligence is still superior to AI in many cases. Biological brains solve tasks very differently from conventional neural networks, which might explain why this is the case. It might be that biological intelligence pays heed to time in ways that modern AI simply does not. In this work, we aimed to develop a model that approaches problem-solving in a manner more aligned with biological brains, emphasizing the central role of the precise timing and interplay of neural dynamics. The interpretable and intuitive behavior in the video demonstrations is very exciting, as it suggests that the CTM is indeed leveraging time to its advantage in order to reason about data.

The details on model hyper-parameters can be found in the Technical Report.

Experiment: Solving 2D Mazes - doing it the hard way

The why and the how

Solving mazes is a challenging task for machines, where only the current bleeding-edge models perform well on fairly simple mazes. Even so, existing methods either require careful design of the data/objective (e.g., outputs are images instead of a solution) or extensive tool use (e.g., LLMs that perform well at this), indicating that the underlying intelligent reasoning required to solve a maze, step by step, is not evidenced by these approaches.

We trained a CTM on a new setup, requiring it to directly predict a path (truncated for simplicity) from start to finish in the form of steps: Left, Right, Up, Down, or Wait. A small version of the resultant model can be explored in the interactive demo at the top of this page. We show a demonstration of a larger model here. Remarkably, the attention pattern is intuitive and follows the solution, all while using neural synchronization as a representation. It even generalizes beyond the truncated path! See the Technical Report.

Demonstration

Fig 7. Thinking about mazes: each animation segment shows 75 internal ticks of the CTM when it is provided with the input maze. We show the route as it is constructed through the internal 'thought process', showing only the valid route (i.e., ignoring predictions through walls; see the associated toggle on the demo). The weights of 16 attention heads are shown at the bottom and their average is overlaid on the maze to show where the CTM is focusing. We 'teleport' the CTM to its resultant predicted location until it lands on the target, and then load a new maze.

Results

Fig 8a. Accuracy during training: versus the best baselines we could get working. The CTM, shown in pink, gets nearly perfect validation accuracy.
Fig 8b. Accuracy versus path length: the baselines are certainly learning, but the CTM far outperforms them for longer paths.

Generalization

Each video below shows how well the CTM generalizes to bigger and more complex mazes, while retaining its reasoning prowess. To generate these we used a CTM trained to solve a path up to length 100 on 39 x 39 mazes, but the mazes shown here are of size 99 x 99 and the full paths are roughly 6x as long.

Discussion

Why run these experiments? We know that neural networks can be tailored to solve 2D mazes if we present the data in the "right" way. But, when presented in a fashion that requires a clear process through which the model must progress, existing methods fall short. Even current SoTA LLMs rely on tool use, which is impressive in its own right, but somewhat unsatisfying: an intelligent machine should be demonstrably intelligent, and humans don't require tools to solve these mazes.

We set out to show that the CTM has the capacity to learn when complex reasoning is required, unlike the most comparable baseline methods. We also show how the CTM generalizes to larger and more complex mazes, indicating that its internal reasoning is not merely memorization, but rather a more natural and correct way to solve the underlying maze problem. Importantly, we made no specific structural changes to the model compared to the CTM we trained for ImageNet; the only meaningful structural change was to output the solution as a 2D class space, applying cross entropy for each step.

A World Model

We chose our setup carefully: (1) we used no positional embedding for attention; and (2) we required that the models predict the routes directly as a string of classes (e.g., go left, left, right, up, etc.). By forgoing positional embedding the CTM must build an internal world model in order to query the data and navigate the maze. The fact that it does so in such a convincing fashion is remarkable.

Where to go from here?

We have some strong evidence that the CTM is capable of solving challenging problems, and it does so in intuitive and interesting ways. The fact that it can solve mazes by building an internal world model "on the fly" without any positional embedding opens up avenues for future research. For instance, we would like to see how the CTM finds its way around more complex environments (e.g., games or videos) without any explicit positional encodings.

Experiment: Parity

Sequential data, non-sequentially

The parity of a binary sequence, given by the sign of the product of its elements, can reasonably be predicted by an RNN when the data is fed sequentially - the model need only maintain an internal state, flipping a 'switch' whenever a negative number is encountered. When the entire sequence is provided at once, however, the task is significantly more challenging.

We trained CTMs to solve a variant of this parity task: the model is given a 64-element binary vector and must predict the cumulative parity at each of the 64 positions.
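For clarity, a small sketch of how such a batch can be generated; the ±1 encoding of the binary inputs is an assumption made for illustration.

```python
import torch

def make_parity_batch(batch_size: int = 32, length: int = 64):
    """Inputs in {-1, +1}; the target at position i is the parity of the first i+1 elements."""
    x = torch.randint(0, 2, (batch_size, length)) * 2 - 1   # random {-1, +1} sequences
    cumulative = torch.cumprod(x, dim=1)                    # +1 (even parity) or -1 (odd parity)
    targets = (cumulative < 0).long()                       # class 1 where cumulative parity is negative
    return x.float(), targets
```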

Demonstration

Fig 9. Determining the cumulative parity of a sequence: shown are the movements of the attention weights from each of the 8 heads. Overlaid on the input sequences is the trajectory of the attention weight argmax. The larger sequences depict the model's predictions and targets.

Results

Fig 10a. Accuracy during training: versus the LSTMs, averaged over 3 training runs. The best model, trained with 75 internal ticks, achieves perfect accuracy in some runs.
Fig 10b. Accuracy versus thinking time: more internal ticks leads to higher accuracy.

We compare the accuracy of CTMs trained with different numbers of internal ticks to parameter matched LSTMs. We found that CTMs with over 75 internal ticks could reliably solve this task, with some runs achieving 100% accuracy. The LSTMs, on the other hand, struggled to learn with over 10 internal ticks, suggesting that LSTMs are not well suited to unfolding an internal thought dimension.

The left/above demonstration shows the solving process of the CTM: the movement of the attention weights, their argmax overlaid on the inputs, the model's predictions, the targets, and the neuron activations. Notice how the attention moves backwards through the data and determines the solution after observing the entire input. Some attention heads display interpretable behavior, such as the first attention head, which attends only to negative-parity positions (\(\blacksquare\)).

Learning sequential algorithms

We visualise the learned algorithms by plotting the accuracy (top) and attention weights (bottom) over the 75 internal ticks for each position in the 64-length sequence, at different points during training. One model (left) attends to the data in reverse order before predicting the cumulative parity at once; the other attends forward, predicting parity incrementally. Both achieve perfect accuracy.

The ability of the CTM to search through the data in reverse order suggests that it is carrying out some form of planning, building up its understanding of the data before making a final decision -- the CTM is capable of forming and following a strategy.

Fig 11a. 75-Internal Tick CTM 1: learns to attend to the data in reverse order, predicting the parity at the end of the reasoning process.
Fig 11b. 75-Internal Tick CTM 2: learns to attend from beginning to end, increasing its certainty with each prediction.

Experiment: Q&A MNIST

Memory via Synchronization

To assess the CTM’s ability to memorise and recall information, we design a Question and Answering (Q&A) MNIST task. In this task, the model first observes a sequence of MNIST digits, followed by a series of interleaved index and operator embeddings that specify which digits should be recalled and which modular operation should be applied. Once all digits and index/operator embeddings have been presented, a zero-tensor flag signals the model to produce its final answer. An example is shown below.

Fig 12. Q&A MNIST example: a typical sequence observed by the model.
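As a purely illustrative instance of such a query (the operator set and its encoding here are assumptions for the sketch, not the exact protocol from the report):

```python
# Illustrative Q&A MNIST query, assuming modular (mod-10) addition as the operator.
digits = [3, 7, 2, 9]                                # digits shown to the model, one per step
query = [("index", 0), ("op", "+"), ("index", 2)]    # recall digit 0, add digit 2
answer = (digits[0] + digits[2]) % 10                # (3 + 2) mod 10 = 5, produced after the zero-tensor flag
```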

In our experiments, the memory length of the CTMs is such that the MNIST digits always lie outside of the activation history window used by the neuron-level models. In this way, the CTM must organize its activations such that it can recall digits at later timesteps.

Demonstration

Fig 13. Observing digits and answering questions: the model is shown MNIST digits followed by index and operator embeddings that specify the modular operation shown at the top. Also shown are the attention weights over the digits and the model's predictions.

Results

Fig 14. Accuracy during training: for both CTMs and LSTMs trained with 1 internal tick per input and 10 internal ticks per input.

Our results show that, while the LSTM outperforms the CTM when only a single internal tick is used to process each input, the LSTM becomes more unstable when more internal ticks are used. The CTM, on the other hand, exhibits stronger performance with increasing internal ticks, achieving over 95% accuracy in the most challenging in-distribution task.

Furthermore, we highlight the ability of the CTM to recall digit values observed many timesteps in the past, arising purely from the organization and synchronization of neurons. This strong performance suggests that processing timing information through the synchronization of neuron activations may be a powerful mechanism for memorization and recall.

Generalization

We examine the generalization capabilities of the CTM by measuring the accuracy of the model when input with more digits or index-operator embeddings than observed during training, depicted below, with the training regime marked in red. We find that both the CTM and the LSTM baseline can generalize to an increased number of operations. Empirically, we find that this generalization arises from the model’s approach to solving the task: each time a new index embedding is presented, the model computes and stores the result of the specified operation, regardless of whether the answer flag has been given. This enables it to continue processing a stream of index and operator embeddings without needing to wait for a final signal.

Fig 15a. CTM, 1 internal tick
Fig 15b. LSTM, 1 internal tick
Fig 15c. CTM, 10 internal ticks
Fig 15d. LSTM, 10 internal ticks
Fig 15. Generalization: accuracy of the CTM and LSTM for different numbers of input digits and operations. The red line indicates the training regime. For the CTM, performance scales with the number of internal ticks, while the opposite is true for the LSTM.

Additional experiments

CTM versus humans

In this section we test the CTM on CIFAR-10, comparing it to human performance, a feed-forward baseline, and an LSTM baseline. The purpose of this experiment was to contextualize the performance of the CTM alongside a standard feed-forward baseline, an LSTM baseline that can also use internal ticks for reasoning, and humans. We used a restricted backbone to highlight the differences between models (details in the Technical Report).

We used two datasets of human labels for CIFAR-10: CIFAR-10D, so named owing to its calibration of difficulty levels, and CIFAR-10H, originally used to quantify human uncertainty. CIFAR-10D can be found here and CIFAR-10H can be found here.

Fig 16a. Accuracy curves during training: using parameter-matched models, the CTM generalizes best. One of the seeds had lower accuracy initially but, interestingly, recovered and outperformed all others.
Fig 16b. Calibration plots: for all models and humans. We show calibration at each step of thought for the CTM, taking the average probability up to that step for computing these. The CTM even has better calibration than the humans.
Fig 16c. CIFAR-10D difficulty plots: showing how the CTM performs best at predicting difficult classes, perhaps benefiting from additional "time to think".
Fig 16d. LSTM pseudo "reaction times": computed as (1 - the average certainty) over internal ticks, measured against real human reaction times from CIFAR-10H.
Fig 16e. CTM pseudo "reaction times": while not any 'better' than the LSTM, this shows an interesting pattern where the CTM reacts more 'quickly' to challenging data.

For the human calibration we used the probabilities provided in CIFAR-10H, which were computed from the guesses of multiple human annotators. We computed calibration (Fig 16b.) as we did for ImageNet: the predictive probability is the average probability for the chosen class over all internal ticks (for both the CTM and the LSTM). The CTM demonstrates the best calibration, even when compared to humans.
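A sketch of how the per-sample predictive probability is formed for these calibration plots, following the description above; how the chosen class is selected (here, the argmax of the tick-averaged distribution) is our assumption, and the binning into reliability curves is omitted.

```python
import torch
import torch.nn.functional as F

def predictive_probability(logits_per_tick: torch.Tensor) -> torch.Tensor:
    """Average probability of the chosen class across all internal ticks.

    logits_per_tick: (T, batch, C). Returns one probability per sample.
    """
    probs = F.softmax(logits_per_tick, dim=-1)        # (T, batch, C)
    mean_probs = probs.mean(dim=0)                    # (batch, C), averaged over ticks
    chosen = mean_probs.argmax(dim=-1)                # predicted class per sample (our assumption)
    return mean_probs.gather(1, chosen.unsqueeze(1)).squeeze(1)
```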

Fig 17. CTM (left) and LSTM (right) neural dynamics: over 50 internal ticks. We show dynamics from other data points in the background to show how diverse these can be for the CTM. The dot products between pairs of vectors like these (not necessarily exactly these ones) are the synchronization values, and that is the representation the CTM uses to predict classes.

Fig 17. shows the neural activities for the CTM and the LSTM baseline. The CTM yields rich, diverse, and complex dynamics with multiple interesting features, including periodic behavior (despite there being no periodic driving function). The distinct difference between the CTM and LSTM neural activities is evidence that the two novel elements of the CTM (neuron-level models and synchronization as a representation) enable neural dynamics as a fundamental computational mechanism.

CIFAR-100, ablation studies

Fig 18. shows what happens when we vary the number of neurons (i.e., the model width) while keeping all else constant, including the training time. As with other models, a wider network could evidently benefit from longer training or different training hyper-parameters, hence the reduction in accuracy in Fig 18a. For Fig 18b. and Fig 18c. we set out to understand how unique the neuron-level models tend to be, and how this relates to model width, as measured by the cosine similarity between the dynamics of different neurons. Fig 18b. shows that a wider model (i.e., more neurons) yields more diversity, not less. One might expect that with more neurons there is less 'space' for diversity, but we observed the opposite.

Fig 18a. Accuracy versus model width: when trained on CIFAR-100. Each model had equal training, indicating that the wider models could benefit from more training.
Fig 18b. Neuron similarity across data: averaged over all neurons, showing how a wider model yields more diverse neurons instead of more overlap (which might be expected).
Fig 18c. Neuron similarity across neurons: averaged over data, showing a slightly reduced similarity for wider models.

Fig 19. shows the relationship between predictions and the number of internal ticks used by the CTM. We trained several CTMs (again keeping all other variables constant). In Fig 19b. we plot, over the data, the distribution of the internal tick at which the CTM is most certain (i.e., \(t_2\) in the loss function). This shows that the CTM uses a wide range of steps to become most certain about the data it observes. For each setup (25, 50, and 100 internal ticks), there are two concentrated areas in the distribution, indicating that the CTM follows separate internal processes depending on the data.

Fig 19a. Accuracy versus internal ticks: suggesting that models with more internal ticks might benefit from longer training.
Fig 19b. Histogram of most certain internal ticks: for models trained using 25, 50, and 100 internal ticks. In each case there is a double 'hump' in the distribution, meaning that the CTM might be following two different internal processes depending on the data.

Sorting real numbers

For these experiments we trained a CTM to sort 30 real numbers drawn from \(\mathcal{N}(0, I_{30})\). The purpose of this experiment was twofold: (1) to understand if and when the CTM applies more or less compute in a controlled environment; and (2) to see whether we can train the CTM to output a sequence in sequential order using the CTC loss. This CTM could correctly sort a list of 30 real numbers approximately 80% of the time.
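A sketch of a CTC setup for sorting, using PyTorch's built-in CTC loss; reading the CTC blank as a 'wait' step and all sizes here are our illustrative choices, and the report gives the exact formulation.

```python
import torch
import torch.nn as nn

T_ticks, batch, list_len = 75, 8, 30
n_classes = list_len + 1                              # 30 position classes plus a CTC blank ('wait')

logits = torch.randn(T_ticks, batch, n_classes)       # one output per internal tick
log_probs = logits.log_softmax(dim=-1)                # CTCLoss expects log-probabilities, shape (T, N, C)

values = torch.randn(batch, list_len)                 # the real numbers to sort
targets = torch.argsort(values, dim=1) + 1            # sorted order as class indices 1..30 (0 is blank)
input_lengths = torch.full((batch,), T_ticks, dtype=torch.long)
target_lengths = torch.full((batch,), list_len, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                             # blank lets the model 'wait' between emissions
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```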

Fig 20a. Mean wait times per sequence index: measured as internal ticks, showing an interesting emergent behavior where the CTM first waits (i.e., does internal compute) before outputting consistently, and then waits again near the end.
Fig 20b. Wait times versus gap to previous item: showing the relationship between how much compute the CTM applies compared to the gap between sorted items.
Fig 20c. Generalizing beyond training distribution: showing sorting performance for different Gaussian distributions (it was trained using a Normal distribution).
Fig 20d. Sorting demonstration: showing the delta from mean of wait times for each item (plotted in sorted order, color denoting original order using a rainbow colormap). The CTM tends to require more compute when there is a larger gap between points.

Reinforcement Learning

We have shown that the CTM can process non-sequential data via a continuous thought dimension. Here, we extend the CTM to tasks involving interaction with an external environment, training CTMs with proximal policy optimization to solve a navigation task and partially observable variants of CartPole and Acrobot. In this setting, the CTM receives an observation, processes it using a fixed number of internal thought steps, and outputs the next action. The history of activations is continuous across environment steps, such that activations from past environment steps can affect the present decision-making process.

Fig 21a. CTM solving the MiniGrid Four Rooms task: evidencing that the CTM can leverage a continuous history of activations to interact with the world.
Fig 21b. Training curves: for this navigation task (episode length during training). Although the LSTM learns slightly faster, both solve the task and converge to the same average episode length.

Although our results show that the CTM achieves performance comparable to the LSTM baseline, the central goal of this section is to provide evidence that the CTM can learn in a continuous environment.


Conclusion

The Continuous Thought Machine (CTM) represents a novel step towards bridging computational efficiency with biological plausibility in artificial intelligence. By moving beyond traditional pointwise activation functions to private neuron-level models, the CTM cultivates far richer neuron dynamics. Crucially, it leverages neural synchronization as a powerful and fundamentally new type of representation - distinct from the activation vectors prevalent since the early days of neural networks. This direct use of neuron dynamics as a first-class representational citizen allows the CTM to exhibit behaviors qualitatively different from contemporary models.

Our research demonstrates the tangible benefits of this approach. The CTM can dynamically build representations over time for tasks like image classification, form rich internal maps to attend to specific input data without positional embeddings, and naturally exhibit adaptive computation. Furthermore, it learns to synchronize neural dynamics to store and retrieve memories beyond its immediate activation history. This internal processing also lends itself to greater interpretability, as seen in its methodical solving of mazes and parity tasks.

Remarkably, the core CTM architecture remained largely consistent across a diverse range of challenging tasks, requiring only input/output module adjustments. This versatility and trainability were particularly evident in complex scenarios like maze navigation, where the CTM succeeded with minimal tuning while a traditional model like the LSTM still struggled even after significant tuning efforts.

This work underscores a vital, yet often underexplored, synergy between neuroscience and machine learning. While modern AI is ostensibly brain-inspired, the two fields often operate in surprising isolation. The CTM serves as a testament to the power of drawing inspiration from biological principles. By starting with such inspiration and iteratively following the emergent, interesting behaviors, we developed a model with unexpected capabilities, such as its surprisingly strong calibration in classification tasks, a feature that was not explicitly designed for.

It is crucial to note that our approach advocates for borrowing concepts from biology rather than insisting on strict, literal plausibility; real neurons may not access their activation history as modeled in the CTM, yet emergent phenomena like traveling waves still manifest. This nuanced balance between practicality and biological inspiration opens a landscape of new research directions, which may hold the key to unlocking capabilities currently missing in AI, potentially leading to systems that exhibit more human-like intelligence and address its current limitations.

When we initially asked, "why do this research?", we hoped the journey of the CTM would provide compelling answers. By embracing light biological inspiration and pursuing the novel behaviors observed, we have arrived at a model with emergent capabilities that exceeded our initial designs. We are committed to continuing this exploration, borrowing further concepts to discover what new and exciting behaviors will emerge, pushing the boundaries of what AI can achieve.

Acknowledgements

Citation

For attribution in academic contexts, please cite this work as

Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, and Llion Jones. (2025). Continuous Thought Machines. Sakana AI Technical Report.

BibTeX citation

@techreport{darlow2025ctm,
  author    = {Luke Darlow and Ciaran Regan and Sebastian Risi and Jeffrey Seely and Llion Jones},
  title     = {{Continuous Thought Machines}},
  institution = {Sakana AI},
  year      = {2025},
  month     = {April},
  note      = {Technical Report}
}

Open Source Code

We release our code for this project here.

Appendix

Please view the PDF version of the paper for the appendix, which contains additional details and experiments.