
RePo: Language Models with Context Re-Positioning

tl;dr

Transformers read a prompt as one long, flat line of tokens, which can be lossy, especially for structured text. RePo adds a tiny learned module that assigns each token a real-valued position based on its semantics, preserving important relationships between tokens. We show that this yields gains on a variety of tasks.


Introduction

Large language models (LLMs) can do impressive things with in-context information—few-shot examples, retrieved passages, tool outputs, long instructions, even tables pasted as text. But inside the model, all of that arrives as a single flat sequence. The only “layout” signal it reliably gets is the token index: $0, 1, 2, \cdots, L-1$.

For humans, presentation matters. Cognitive Load Theory (CLT) states that our working memory is limited, and performance drops when we waste capacity on extraneous load—effort caused by clutter or poor organization—rather than on germane load, the effort that actually helps solve the task. Putting related items together and removing distractions can make the same problem dramatically easier.

We argue modern LLMs have a similar bottleneck: their positional structure is usually rigid—either a strict linear index (as in RoPE) or effectively a constant position (NoPE). This rigidity bakes in a locality bias and can make it harder to (i) ignore irrelevant context (“noise”), (ii) reason over structured inputs that were flattened into text, and (iii) use far-away but crucial information in long contexts.

RePo is a lightweight module that lets the model reshape the geometry of the context for attention without changing the autoregressive order. For each token, a small network $f_\phi$ predicts a real-valued position $z_i$ from the token’s hidden state. These learned positions are then plugged into a differentiable positional encoding function (e.g., RoPE), so attention can treat “semantically related” tokens as closer—even if they were far apart in the original prompt.

In our experiments, we continually pre-train OLMo-2 1B for 50B tokens and find consistent gains on noisy-context, structured-data, and long-context evaluations, while staying competitive on short general benchmarks.

Contributions (at a glance):

  1. RePo, a lightweight module that predicts a real-valued position for each token from its hidden state and plugs it into a differentiable positional encoding such as RoPE.
  2. Continual pre-training of OLMo-2 1B for 50B tokens, showing consistent gains on noisy-context, structured-data, and long-context evaluations while staying competitive on short general benchmarks.
  3. Analyses of attention mass and of the learned position patterns that connect the gains to the re-positioning behavior.

Background

Rigid positions in Transformers

In a standard Transformer, each token $x_i$ is mapped to an embedding and then passed through $K$ layers. Each layer applies self-attention, which depends on pairwise interactions between tokens. Positional encoding, typically injected as an embedding or a bias value, provides the model with a notion of “where” a token is in the sequence.

RoPE is a widely used positional encoding. In RoPE, position information is incorporated by rotating query and key vectors in the complex plane (or equivalently, by applying a block-diagonal rotation matrix). For token index $i$, RoPE produces a position-dependent rotation $R(i)$ that modifies queries and keys:

$$\mathbf{q}_i \leftarrow R(i)\mathbf{q}_i,\quad \mathbf{k}_i \leftarrow R(i)\mathbf{k}_i.$$

Because $i$ is a fixed integer index, RoPE hard-codes a rigid notion of distance, regardless of what the tokens mean.
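
To make the rotation concrete, here is a minimal PyTorch sketch of a RoPE-style rotation (an illustrative re-implementation, not the model's actual code; the interleaved dimension pairing and the base of 10000 are common conventions we assume here). Note that the angle depends only on the position value passed in, which for standard RoPE is simply the integer index $i$:

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate x (..., seq, dim) pair-wise by angles derived from `pos` (broadcastable to (..., seq))."""
    dim = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    angles = pos[..., None].float() * inv_freq                                # (..., seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin    # standard 2D rotation applied to each (even, odd) pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 8, 64)       # (batch, seq, head_dim) toy queries
idx = torch.arange(8)           # rigid integer positions 0 .. L-1
q_rot = rope_rotate(q, idx)     # R(i) q_i
```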

Why rigidity can hurt

This rigidity is not always a problem—natural language often has local dependencies. But in modern prompting and retrieval scenarios, rigidity can become a bottleneck:

  1. Noisy context: irrelevant tokens occupy positions just as “close” as relevant ones, so a fixed index offers no help in ignoring them.
  2. Structured data: rows, columns, and graph edges that belong together can end up far apart once the structure is flattened into text.
  3. Long contexts: crucial information may sit thousands of tokens away, and rigid distance penalties make such far-away tokens hard to attend to.

Several approaches relax positional assumptions (e.g., NoPE, hybrid NoPE/RoPE layers, and $p$-RoPE), but most still treat positions as either fixed indices or absent.

Methods

The key idea of RePo is simple: instead of using the input index as the position signal, we let the model assign each token a real-valued position based on its semantics. These learned positions are then used inside a differentiable positional encoding function (e.g., RoPE).

Position Representation

Given a sequence of inputs, we first project the $i$-th hidden state $\mathbf{h}_i^k$ at layer $k$ into a smaller position-representation vector with a small MLP $g_\phi^k$:

$$\mathbf{p}_i^k = g_\phi^k(\mathbf{h}_i^k).$$

Position Assignment

For each attention head $h$ in layer $k$, we produce a scalar position and use it in the positional encoding (e.g., RoPE):

$$z_i^{k,h} = w_\phi^{k,h}(\mathbf{p}_i^k).$$

We then replace the discrete index $i$ with this real value $z_i$ when applying positional encoding. For RoPE, that means using $R(z_i)$ instead of $R(i)$:

$$\mathbf{q}_i \leftarrow R(z_i)\mathbf{q}_i,\quad \mathbf{k}_i \leftarrow R(z_i)\mathbf{k}_i.$$

Because $z_i$ is continuous and learned, tokens that are semantically related can end up closer in positional space, even if their original indices are far apart.
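
For concreteness, below is a hedged PyTorch sketch of how $g_\phi^k$ and $w_\phi^{k,h}$ might be implemented (module names, the MLP depth, and the shapes are our assumptions, not the released implementation). It produces one real-valued position per token and per head, which then replaces the integer index inside $R(\cdot)$:

```python
import torch
import torch.nn as nn

class RePoPositions(nn.Module):
    """Sketch: map hidden states to a real-valued position per token and per head.

    g_phi (shared within a layer) projects the hidden state to a small
    position-representation vector p_i^k; w_phi (one linear map per head)
    turns it into a scalar position z_i^{k,h}. Illustrative, not the paper's code.
    """
    def __init__(self, hidden_size: int, pos_size: int, n_heads: int):
        super().__init__()
        self.g_phi = nn.Sequential(                 # shared position-representation MLP (assumed depth)
            nn.Linear(hidden_size, pos_size),
            nn.SiLU(),
            nn.Linear(pos_size, pos_size),
        )
        self.w_phi = nn.Linear(pos_size, n_heads)   # head-specific scalar position assignment

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """h: (batch, seq, hidden) -> z: (batch, n_heads, seq) real-valued positions."""
        p = self.g_phi(h)                           # (batch, seq, pos_size)
        z = self.w_phi(p)                           # (batch, seq, n_heads)
        return z.transpose(1, 2)                    # (batch, n_heads, seq)

# Toy usage at some layer k (dimensions are illustrative).
repo = RePoPositions(hidden_size=256, pos_size=32, n_heads=4)
h = torch.randn(1, 8, 256)                          # hidden states for an 8-token input
z = repo(h)                                         # (1, 4, 8) learned positions
# These continuous z values replace the integer index inside R(.) -- e.g. by passing
# pos=z (instead of pos=arange(L)) to the rope_rotate sketch from the Background section.
```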

RePo Layer Position

We can apply RePo from a certain depth onward (e.g., starting at layer 5) to avoid disrupting early token representations, and to keep the module stable.

Training & Efficiency

The RePo parameters are trained end-to-end with the backbone model. In practice, the compute overhead is less than 1%.
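
To see why the overhead is small, here is a back-of-envelope count of the extra parameters, under assumed dimensions (hidden size, layer and head counts, and the MLP depth are our guesses for a 1B-scale backbone, not values from the released config); note this estimates parameters rather than FLOPs:

```python
# Rough parameter-overhead estimate for RePo; all dimensions below are assumptions.
hidden, pos, heads, n_layers = 2048, 256, 16, 16    # assumed 1B-scale backbone dimensions
repo_layers = n_layers - 4                          # RePo applied from the 5th layer onward

g_phi = (hidden * pos + pos) + (pos * pos + pos)    # shared 2-layer MLP per layer (assumed depth)
w_phi = heads * (pos + 1)                           # one scalar output per attention head
extra = repo_layers * (g_phi + w_phi)

backbone = 1.2e9                                    # ~1B-parameter backbone
print(f"extra params: {extra / 1e6:.1f}M (~{100 * extra / backbone:.2f}% of the backbone)")
```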

Experiments

We evaluate whether learned context re-positioning helps most in the settings where a rigid token order tends to break down: (1) noisy context (lots of irrelevant tokens), (2) structured inputs that were flattened into text, and (3) long-context extrapolation beyond the 4K training length. Each subsection below includes the corresponding plot/table directly in the main text so you can skim the takeaway first, then dive into details.

Setup

Backbone and training

We use OLMo-2 (1B) as the backbone, whose performance is comparable to Qwen-2.5. We start from the stage-1 OLMo-2 checkpoint and continually pre-train on the stage-2 dataset for 50B tokens with a training context length of 4096. We keep the configuration and codebase identical to the official OLMo-2 release and train on 4 H100 GPUs.

RePo configuration

We apply RePo starting from the 5th layer, approximately 1/3 of the total number of layers. In each layer that uses RePo, we share the parameters for the position-representation transformation, while learning head-specific position assignment. The learned position-representation hidden size is 256 (i.e., 1/8 of the model hidden size).
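
A hypothetical configuration object mirroring this setup might look as follows (the dataclass and its field names are ours, not the released codebase's):

```python
from dataclasses import dataclass

@dataclass
class RePoConfig:
    start_layer: int = 5          # apply RePo from this layer onward (~1/3 of the depth)
    pos_size: int = 256           # position-representation size, 1/8 of the model hidden size
    share_g_phi: bool = True      # g_phi shared within each RePo layer
    head_specific_w: bool = True  # a separate w_phi per attention head
    train_ctx_len: int = 4096     # continual pre-training context length

cfg = RePoConfig()
```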

Baselines

We compare RePo against the backbone's default RoPE, NoPE, and the hybrid RoPE/NoPE layer configurations R2N1 and N2R1 reported in the tables below.

Evaluation suites

We use the allenai/olmes evaluation codebase and group our main tasks into three stress tests: (1) noisy context, using RULER tasks with injected irrelevant passages; (2) structured data, using NLGraph and HybridQA flattened into text; and (3) longer context, using RULER at 4K–16K and LongBench.

Noisy Context

All subtasks inject irrelevant information into the prompt. Here we show that, even within the 4K training context length, RePo outperforms RoPE by 11.04 points. This supports the intuition that re-positioning can reduce extraneous load and help attention focus on what matters.

Tab 1. Noisy-context robustness evaluation: accuracy on RULER tasks.

| Method | Avg.  | NIAH  | QA    | AGG   | VT    | Δ vs RoPE |
|--------|-------|-------|-------|-------|-------|-----------|
| RoPE   | 44.64 | 82.56 | 57.00 | 37.98 | 1.00  | 0.00      |
| NoPE   | 39.56 | 74.59 | 49.00 | 22.45 | 12.20 | -5.08     |
| R2N1   | 43.90 | 85.00 | 59.50 | 31.10 | 0.00  | -0.74     |
| N2R1   | 49.44 | 80.00 | 58.00 | 32.75 | 27.00 | +4.80     |
| RePo   | 55.68 | 88.25 | 61.00 | 35.05 | 38.40 | +11.04    |

Structured Data

When structured inputs are flattened into text, preserving relational structure becomes hard. RePo improves over vanilla RoPE by an average of 1.94 EM points. A notable nuance: NoPE performs best on the graph dataset, suggesting that the “locality” prior encoded by standard positional strategies may not match graph-structured inputs.

Tab 2. Structured-data evaluation: Exact Match (EM) scores.

| Method | NLGraph | HybridQA | Avg.  | Δ vs RoPE |
|--------|---------|----------|-------|-----------|
| RoPE   | 27.43   | 24.43    | 25.93 | 0.00      |
| NoPE   | 29.90   | 23.52    | 26.71 | +0.78     |
| R2N1   | 27.11   | 25.11    | 26.11 | +0.18     |
| N2R1   | 25.42   | 23.86    | 24.64 | -1.29     |
| RePo   | 29.03   | 26.70    | 27.87 | +1.94     |

Longer Context

Fig 1. Long-context evaluation on RULER: YaRN is used for all RoPE layers to extend the context. We observe consistent results on the more realistic benchmark LongBench in Table 3.

The advantage of RePo grows with length. In Figure 1, RePo already leads at 4K, and the gap widens at 8K and 16K—lengths never seen during training. We also evaluate on LongBench to reduce the risk that gains come only from “synthetic” noise setups; Table 3 shows RePo beats the other baselines by at least 5.48 points. Interestingly, R2N1 is the strongest baseline, consistent with prior findings.

Tab 3. Long-context evaluation on LongBench: average F1/ROUGE scores.

| Method | Single-Doc QA | Multi-Doc QA | Sum.  | Few-shot | LongBench |
|--------|---------------|--------------|-------|----------|-----------|
| RoPE   | 12.94         | 23.32        | 7.96  | 22.00    | 21.07     |
| NoPE   | 1.80          | 9.11         | 5.15  | 18.12    | 7.29      |
| R2N1   | 16.24         | 25.88        | 8.74  | 22.00    | 22.83     |
| N2R1   | 1.26          | 16.24        | 5.31  | 21.50    | 12.56     |
| RePo   | 15.24         | 30.86        | 12.53 | 31.50    | 28.31     |

Analyses

Why does RePo help? This section peeks inside the model to connect the performance gains to two questions:

  1. Where do the gains come from? (e.g., does attention move toward the truly relevant tokens?)
  2. What kinds of position patterns does RePo learn? (e.g., how “dense” or “non-linear” is the learned position space?)

The analyses below focus on attention behavior in needle-in-a-haystack settings and on statistics/patterns of the learned positions.

Attention Mass on Relevant Tokens

Because RePo re-organizes context based on intrinsic structure, we expect it to better capture long-range dependencies by making distant-but-relevant tokens closer in the model’s positional space. To test this, we analyze attention behavior across different position-assignment strategies on the needle-in-a-haystack (NIAH) task and quantitatively measure the attention mass, i.e., attention scores averaged across attention heads and layers, from generated tokens to three non-overlapping parts of the context (the needle, the query, and the remaining tokens), following prior work.

We conduct our analysis on the NIAH dataset provided by RULER, where the context follows the format:

$$\mathtt{Rest} \cdots \mathtt{Needle} \cdots \mathtt{Rest} \cdots \mathtt{Query}$$

As shown in Table 4, for needle tokens that are distant yet critical for generation, RePo allocates substantially more attention mass than both the linear (i.e., RoPE) and constant (i.e., NoPE) position-assignment strategies. Compared with RePo, the linear assignment also exhibits a stronger locality bias, directing more attention to the nearby query tokens. The constant assignment, which treats all positions uniformly, produces an attention pattern with much lower variance across the three parts. These findings explain the performance gains on tasks with noisy context and support our Cognitive Load Theory (CLT) motivation: re-positioning reduces extraneous load, freeing attention capacity for germane processing of the relevant context.

Tab 4. Average attention mass on NIAH tokens (×0.01).

| Method | Needle | Query | Rest  |
|--------|--------|-------|-------|
| RoPE   | 1.754  | 1.123 | 0.014 |
| NoPE   | 1.572  | 1.135 | 0.014 |
| RePo   | 2.013  | 1.046 | 0.015 |
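
As a rough illustration of how such an attention-mass measurement could be computed (the tensor shapes, span boundaries, and the per-token averaging are our assumptions, not the exact evaluation script):

```python
import torch

def attention_mass(attn, spans):
    """Per-token attention mass to named context regions (shapes and spans are illustrative).

    attn:  (layers, heads, gen_len, ctx_len) attention from generated tokens to the context.
    spans: name -> list of (start, end) ranges covering Needle / Query / Rest.
    """
    avg = attn.mean(dim=(0, 1, 2))                       # average over layers, heads, generated steps
    return {name: torch.cat([avg[s:e] for s, e in ranges]).mean().item()
            for name, ranges in spans.items()}

# Toy example with a 128-token context: needle at 60-70, query at the end, rest elsewhere.
attn = torch.rand(16, 16, 8, 128)
attn = attn / attn.sum(dim=-1, keepdim=True)             # normalize rows like attention weights
spans = {"needle": [(60, 70)], "query": [(118, 128)], "rest": [(0, 60), (70, 118)]}
print(attention_mass(attn, spans))
```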

Position Patterns Learned by RePo

To better understand the patterns learned by RePo, we analyze the characteristics of the assigned positions, first focusing on their ranges and then on their local patterns.

We first collect statistics on the distances between the maximum and minimum assigned positions for each attention head:

$$d^{k,h} = \max(\mathbf{z}^{k,h}) - \min(\mathbf{z}^{k,h}),$$

where $\mathbf{z}^{k,h} = \{z_1^{k,h}, z_2^{k,h}, \cdots, z_L^{k,h}\}$, $L$ is the number of tokens in the input $\mathbf{x}$, and $k$ and $h$ index the layer and attention head, respectively.
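
A minimal sketch of this statistic, assuming the learned positions for one input have been collected into a (layers, heads, tokens) tensor:

```python
import torch

def position_ranges(z):
    """d^{k,h} = max_i z_i^{k,h} - min_i z_i^{k,h} for every layer k and head h.

    z: (layers, heads, seq) learned positions for one input; returns (layers, heads) ranges.
    """
    return z.max(dim=-1).values - z.min(dim=-1).values

z = torch.randn(16, 16, 4096)              # toy positions for a 4K-token input (assumed shape)
d = position_ranges(z)                     # (16, 16); compare ranges across heads and layers
print(d.min().item(), d.max().item())
```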

We find that different heads learn different position ranges, suggesting that heads specialize to different granularities of context structure. Some heads learn relatively small ranges (local re-organization), while others learn large ranges (global re-organization), consistent with the idea that attention heads capture diverse relational patterns.

We also visualize local position patterns by plotting assigned positions across tokens for selected heads and layers. These plots reveal non-linear patterns such as plateaus (many tokens mapped to similar positions) and abrupt jumps (tokens mapped far apart), indicating that the learned position space is not a simple rescaling of the input index.

Fig 2. Distribution of position distances across attention heads. The distances are measured as the difference between the maximum and minimum learned positions within a context.
Fig 3. Example learned position patterns (new position vs. input token index) across selected layers/heads, illustrating non-linear structures such as clusters and jumps.

Performance on General Tasks

As shown in Table 5, alongside the noticeable gains in the previous experiments, RePo remains comparable to RoPE on a broad set of general benchmarks. This holds even though switching from linear position assignment to RePo introduces an inconsistency between the positional representations seen during the original pre-training and those used after the switch.

Tab 5. General benchmark performance.

| Benchmark suite | Metric | RoPE  | RePo  |
|-----------------|--------|-------|-------|
| ARC-C           | Acc    | 47.99 | 47.61 |
| ARC-E           | Acc    | 75.25 | 74.87 |
| BoolQ           | Acc    | 72.12 | 73.58 |
| CoQA            | F1     | 56.87 | 57.44 |
| DROP            | F1     | 37.90 | 38.17 |
| HellaSwag       | Acc    | 70.68 | 70.08 |
| MMLU-Pro        | Acc    | 13.77 | 13.52 |
| TriviaQA        | F1     | 54.98 | 54.56 |
| Average         |        | 53.70 | 53.73 |

Conclusion

We addressed the high extraneous cognitive load that rigid positional structures impose on Transformer-based LLMs, proposing RePo, a lightweight learned module that re-positions tokens based on their semantics. Across noisy-context, structured-data, and long-context evaluations, RePo consistently improves performance and learns interpretable position patterns that capture the structure and dependencies in a prompt.

Acknowledgements

Citation

@techreport{sakana2025repo,
  author    = {Huayang Li and Tianyu Zhao and Richard Sproat},
  title     = {{RePo: Language Models with Context Re-Positioning}},
  institution = {Sakana AI},
  year      = {2025},
  month     = {December},
  note      = {Technical Report}
}

Open Source Code

We release our code for this project here.

Appendix

Please view the PDF version of the paper for the appendix, which contains additional details and analyses.