
RePo: Language Models with Context Re-Positioning

tl;dr

Transformers read a prompt as one long, flat line of tokens, which can be lossy, especially for structured text. RePo adds a tiny learned module that assigns each token a real-valued position based on its semantics, preserving important relationships between tokens. We show that this yields gains on a variety of tasks.


Introduction

Large language models (LLMs) can do impressive things with in-context information—few-shot examples, retrieved passages, tool outputs, long instructions, even tables pasted as text. But inside the model, all of that arrives as a single flat sequence. The only “layout” signal it reliably gets is the token index: $0, 1, 2, \cdots, L-1$.

For humans, presentation matters. Cognitive Load Theory (CLT) states that our working memory is limited, and performance drops when we waste capacity on extraneous load—effort caused by clutter or poor organization—rather than on germane load, the effort that actually helps solve the task. Putting related items together and removing distractions can make the same problem dramatically easier.

We argue modern LLMs have a similar bottleneck: their positional structure is usually rigid—either a strict linear index (as in RoPE) or effectively a constant position (NoPE). This rigidity bakes in a locality bias and can make it harder to (i) ignore irrelevant context (“noise”), (ii) reason over structured inputs that were flattened into text, and (iii) use far-away but crucial information in long contexts.

RePo is a lightweight module that lets the model reshape the geometry of the context for attention without changing the autoregressive order. For each token, a small network $f_\phi$ predicts a real-valued position $z_i$ from the token’s hidden state. These learned positions are then plugged into a differentiable positional encoding function (e.g., RoPE), so attention can treat “semantically related” tokens as closer—even if they were far apart in the original prompt.

In our experiments, we continually pre-train OLMo-2 1B for 50B tokens and find consistent gains on noisy-context, structured-data, and long-context evaluations, while staying competitive on short general benchmarks.

Contributions (at a glance):

  1. RePo, a lightweight module that predicts a real-valued position for each token from its hidden state and plugs it into a differentiable positional encoding such as RoPE.
  2. Continual pre-training of OLMo-2 1B for 50B tokens, showing consistent gains on noisy-context, structured-data, and long-context evaluations while staying competitive on short general benchmarks.
  3. Analyses of attention mass and of the learned position patterns that connect the gains to the re-positioning behavior.

Background

Rigid positions in Transformers

In a standard Transformer, each token $x_i$ is mapped to an embedding and then passed through $K$ layers. Each layer applies self-attention, which depends on pairwise interactions between tokens. Positional encoding, typically injected as an embedding or a bias value, provides the model with a notion of “where” a token is in the sequence.

RoPE is a widely used positional encoding. In RoPE, position information is incorporated by rotating query and key vectors in the complex plane (or equivalently, by applying a block-diagonal rotation matrix). For token index $i$, RoPE produces a position-dependent rotation $R(i)$ that modifies queries and keys:

$$\mathbf{q}_i \leftarrow R(i)\mathbf{q}_i,\quad \mathbf{k}_i \leftarrow R(i)\mathbf{k}_i.$$

Because $i$ is a fixed integer index, RoPE hard-codes a rigid notion of distance, regardless of what the tokens mean.
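
To make the rotation concrete, here is a minimal PyTorch sketch of a RoPE-style rotation (an illustrative re-implementation, not the model's actual code; the interleaved dimension pairing and the base of 10000 are common conventions we assume here). Note that the angle depends only on the position value passed in, which for standard RoPE is simply the integer index $i$:

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate x (..., seq, dim) pair-wise by angles derived from `pos` (broadcastable to (..., seq))."""
    dim = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    angles = pos[..., None].float() * inv_freq                                # (..., seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin    # standard 2D rotation applied to each (even, odd) pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 8, 64)       # (batch, seq, head_dim) toy queries
idx = torch.arange(8)           # rigid integer positions 0 .. L-1
q_rot = rope_rotate(q, idx)     # R(i) q_i
```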

Why rigidity can hurt

This rigidity is not always a problem—natural language often has local dependencies. But in modern prompting and retrieval scenarios, rigidity can become a bottleneck:

  1. Noisy context: irrelevant tokens occupy positions just as “close” as relevant ones, so a fixed index offers no help in ignoring them.
  2. Structured data: rows, columns, and graph edges that belong together can end up far apart once the structure is flattened into text.
  3. Long contexts: crucial information may sit thousands of tokens away, and rigid distance penalties make such far-away tokens hard to attend to.

Several approaches relax positional assumptions (e.g., NoPE, hybrid NoPE/RoPE layers, and $p$-RoPE), but most still treat positions as either fixed indices or absent.

Methods

The key idea of RePo is simple: instead of using the input index as the position signal, we let the model assign each token a real-valued position based on its semantics. These learned positions are then used inside a differentiable positional encoding function (e.g., RoPE).

Position Representation

Given a sequence of inputs, we first project the $i$-th hidden state $\mathbf{h}_i^k$ at layer $k$ into a smaller position-representation vector with a small MLP $g_\phi^k$:

$$\mathbf{p}_i^k = g_\phi^k(\mathbf{h}_i^k).$$

Position Assignment

For each attention head $h$ in layer $k$, we produce a scalar position and use it in the positional encoding (e.g., RoPE):

$$z_i^{k,h} = w_\phi^{k,h}(\mathbf{p}_i^k).$$

We then replace the discrete index $i$ with this real value $z_i$ when applying positional encoding. For RoPE, that means using $R(z_i)$ instead of $R(i)$:

$$\mathbf{q}_i \leftarrow R(z_i)\mathbf{q}_i,\quad \mathbf{k}_i \leftarrow R(z_i)\mathbf{k}_i.$$

Because $z_i$ is continuous and learned, tokens that are semantically related can end up closer in positional space, even if their original indices are far apart.
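
For concreteness, below is a hedged PyTorch sketch of how $g_\phi^k$ and $w_\phi^{k,h}$ might be implemented (module names, the MLP depth, and the shapes are our assumptions, not the released implementation). It produces one real-valued position per token and per head, which then replaces the integer index inside $R(\cdot)$:

```python
import torch
import torch.nn as nn

class RePoPositions(nn.Module):
    """Sketch: map hidden states to a real-valued position per token and per head.

    g_phi (shared within a layer) projects the hidden state to a small
    position-representation vector p_i^k; w_phi (one linear map per head)
    turns it into a scalar position z_i^{k,h}. Illustrative, not the paper's code.
    """
    def __init__(self, hidden_size: int, pos_size: int, n_heads: int):
        super().__init__()
        self.g_phi = nn.Sequential(                 # shared position-representation MLP (assumed depth)
            nn.Linear(hidden_size, pos_size),
            nn.SiLU(),
            nn.Linear(pos_size, pos_size),
        )
        self.w_phi = nn.Linear(pos_size, n_heads)   # head-specific scalar position assignment

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """h: (batch, seq, hidden) -> z: (batch, n_heads, seq) real-valued positions."""
        p = self.g_phi(h)                           # (batch, seq, pos_size)
        z = self.w_phi(p)                           # (batch, seq, n_heads)
        return z.transpose(1, 2)                    # (batch, n_heads, seq)

# Toy usage at some layer k (dimensions are illustrative).
repo = RePoPositions(hidden_size=256, pos_size=32, n_heads=4)
h = torch.randn(1, 8, 256)                          # hidden states for an 8-token input
z = repo(h)                                         # (1, 4, 8) learned positions
# These continuous z values replace the integer index inside R(.) -- e.g. by passing
# pos=z (instead of pos=arange(L)) to the rope_rotate sketch from the Background section.
```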

RePo Layer Position

We can apply RePo from a certain depth onward (e.g., starting at layer 5) to avoid disrupting early token representations, and to keep the module stable.

Training & Efficiency

The RePo parameters are trained end-to-end with the backbone model. In practice, the compute overhead is less than 1%.
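
To see why the overhead is small, here is a back-of-envelope count of the extra parameters, under assumed dimensions (hidden size, layer and head counts, and the MLP depth are our guesses for a 1B-scale backbone, not values from the released config); note this estimates parameters rather than FLOPs:

```python
# Rough parameter-overhead estimate for RePo; all dimensions below are assumptions.
hidden, pos, heads, n_layers = 2048, 256, 16, 16    # assumed 1B-scale backbone dimensions
repo_layers = n_layers - 4                          # RePo applied from the 5th layer onward

g_phi = (hidden * pos + pos) + (pos * pos + pos)    # shared 2-layer MLP per layer (assumed depth)
w_phi = heads * (pos + 1)                           # one scalar output per attention head
extra = repo_layers * (g_phi + w_phi)

backbone = 1.2e9                                    # ~1B-parameter backbone
print(f"extra params: {extra / 1e6:.1f}M (~{100 * extra / backbone:.2f}% of the backbone)")
```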

Experiments

We evaluate whether learned context re-positioning helps most in the settings where a rigid token order tends to break down: (1) noisy context (lots of irrelevant tokens), (2) structured inputs that were flattened into text, and (3) long-context extrapolation beyond the 4K training length. Each subsection below includes the corresponding plot/table directly in the main text so you can skim the takeaway first, then dive into details.

Setup

Backbone and training

We use OLMo-2 (1B) as the backbone, whose performance is comparable to Qwen-2.5. We start from the stage-1 OLMo-2 checkpoint and continually pre-train on the stage-2 dataset for 50B tokens with a training context length of 4096. We keep the configuration and codebase identical to the official OLMo-2 release and train on 4 H100 GPUs.

RePo configuration

We apply RePo starting from the 5th layer, approximately 1/3 of the total number of layers. In each layer that uses RePo, we share the parameters for the position-representation transformation, while learning head-specific position assignment. The learned position-representation hidden size is 256 (i.e., 1/8 of the model hidden size).
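
A hypothetical configuration object mirroring this setup might look as follows (the dataclass and its field names are ours, not the released codebase's):

```python
from dataclasses import dataclass

@dataclass
class RePoConfig:
    start_layer: int = 5          # apply RePo from this layer onward (~1/3 of the depth)
    pos_size: int = 256           # position-representation size, 1/8 of the model hidden size
    share_g_phi: bool = True      # g_phi shared within each RePo layer
    head_specific_w: bool = True  # a separate w_phi per attention head
    train_ctx_len: int = 4096     # continual pre-training context length

cfg = RePoConfig()
```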

Baselines

We compare RePo against the backbone's default RoPE, NoPE, and the hybrid RoPE/NoPE layer configurations R2N1 and N2R1 reported in the tables below.

Evaluation suites

We use the allenai/olmes evaluation codebase and group our main tasks into three stress tests: (1) noisy context, using RULER tasks with injected irrelevant passages; (2) structured data, using NLGraph and HybridQA flattened into text; and (3) longer context, using RULER at 4K–16K and LongBench.

Noisy Context

All subtasks inject irrelevant information into the prompt. Here we show that, even within the 4K training context length, RePo outperforms RoPE by 11.04 points. This supports the intuition that re-positioning can reduce extraneous load and help attention focus on what matters.

Tab 1. Noisy-context robustness evaluation: accuracy on RULER tasks.

| Method | Avg.  | NIAH  | QA    | AGG   | VT    | Δ vs RoPE |
|--------|-------|-------|-------|-------|-------|-----------|
| RoPE   | 44.64 | 82.56 | 57.00 | 37.98 | 1.00  | 0.00      |
| NoPE   | 39.56 | 74.59 | 49.00 | 22.45 | 12.20 | -5.08     |
| R2N1   | 43.90 | 85.00 | 59.50 | 31.10 | 0.00  | -0.74     |
| N2R1   | 49.44 | 80.00 | 58.00 | 32.75 | 27.00 | +4.80     |
| RePo   | 55.68 | 88.25 | 61.00 | 35.05 | 38.40 | +11.04    |

Structured Data

When structured inputs are flattened into text, preserving relational structure becomes hard. RePo improves over vanilla RoPE by an average of 1.94 EM points. A notable nuance: NoPE performs best on the graph dataset, suggesting that the “locality” prior encoded by standard positional strategies may not match graph-structured inputs.

Tab 2. Structured-data evaluation: Exact Match (EM) scores.

| Method | NLGraph | HybridQA | Avg.  | Δ vs RoPE |
|--------|---------|----------|-------|-----------|
| RoPE   | 27.43   | 24.43    | 25.93 | 0.00      |
| NoPE   | 29.90   | 23.52    | 26.71 | +0.78     |
| R2N1   | 27.11   | 25.11    | 26.11 | +0.18     |
| N2R1   | 25.42   | 23.86    | 24.64 | -1.29     |
| RePo   | 29.03   | 26.70    | 27.87 | +1.94     |

Longer Context

Fig 1. Long-context evaluation on RULER: YaRN is used for all RoPE layers to extend the context. We observe consistent results on the more realistic benchmark LongBench in Table 3.

The advantage of RePo grows with length. In Figure 1, RePo already leads at 4K, and the gap widens at 8K and 16K—lengths never seen during training. We also evaluate on LongBench to reduce the risk that gains come only from “synthetic” noise setups; Table 3 shows RePo beats the other baselines by at least 5.48 points. Interestingly, R2N1 is the strongest baseline, consistent with prior findings.

Tab 3. Long-context evaluation on LongBench: average F1/ROUGE scores.

| Method | Single-Doc QA | Multi-Doc QA | Sum.  | Few-shot | LongBench |
|--------|---------------|--------------|-------|----------|-----------|
| RoPE   | 12.94         | 23.32        | 7.96  | 22.00    | 21.07     |
| NoPE   | 1.80          | 9.11         | 5.15  | 18.12    | 7.29      |
| R2N1   | 16.24         | 25.88        | 8.74  | 22.00    | 22.83     |
| N2R1   | 1.26          | 16.24        | 5.31  | 21.50    | 12.56     |
| RePo   | 15.24         | 30.86        | 12.53 | 31.50    | 28.31     |

Analyses

Why does RePo help? This section peeks inside the model to connect the performance gains to two questions:

  1. Where do the gains come from? (e.g., does attention move toward the truly relevant tokens?)
  2. What kinds of position patterns does RePo learn? (e.g., how “dense” or “non-linear” is the learned position space?)

The analyses below focus on attention behavior in needle-in-a-haystack settings and on statistics/patterns of the learned positions.

Attention Mass on Relevant Tokens

Because RePo re-organizes context based on intrinsic structure, we expect it to better capture long-range dependencies by making distant-but-relevant tokens closer in the model’s positional space. To test this, we analyze attention behavior across different position-assignment strategies on the needle-in-a-haystack (NIAH) task and quantitatively measure the attention mass, i.e., attention scores averaged across attention heads and layers, from generated tokens to three non-overlapping parts of the context (the needle, the query, and the remaining tokens), following prior work.

We conduct our analysis on the NIAH dataset provided by RULER, where the context follows the format:

$$\mathtt{Rest} \cdots \mathtt{Needle} \cdots \mathtt{Rest} \cdots \mathtt{Query}$$

As shown in Table 4, for needle tokens that are distant yet critical for generation, RePo allocates substantially more attention mass than both the linear (i.e., RoPE) and constant (i.e., NoPE) position-assignment strategies. Compared with RePo, the linear assignment also exhibits a stronger locality bias, directing more attention to the nearby query tokens. The constant assignment, which treats all positions uniformly, produces an attention pattern with much lower variance across the three parts. These findings explain the performance gains on tasks with noisy context and support our Cognitive Load Theory (CLT) motivation: re-positioning reduces extraneous load, freeing attention capacity for germane processing of the relevant context.

Tab 4. Average attention mass on NIAH tokens (×0.01).

| Method | Needle | Query | Rest  |
|--------|--------|-------|-------|
| RoPE   | 1.754  | 1.123 | 0.014 |
| NoPE   | 1.572  | 1.135 | 0.014 |
| RePo   | 2.013  | 1.046 | 0.015 |
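
As a rough illustration of how such an attention-mass measurement could be computed (the tensor shapes, span boundaries, and the per-token averaging are our assumptions, not the exact evaluation script):

```python
import torch

def attention_mass(attn, spans):
    """Per-token attention mass to named context regions (shapes and spans are illustrative).

    attn:  (layers, heads, gen_len, ctx_len) attention from generated tokens to the context.
    spans: name -> list of (start, end) ranges covering Needle / Query / Rest.
    """
    avg = attn.mean(dim=(0, 1, 2))                       # average over layers, heads, generated steps
    return {name: torch.cat([avg[s:e] for s, e in ranges]).mean().item()
            for name, ranges in spans.items()}

# Toy example with a 128-token context: needle at 60-70, query at the end, rest elsewhere.
attn = torch.rand(16, 16, 8, 128)
attn = attn / attn.sum(dim=-1, keepdim=True)             # normalize rows like attention weights
spans = {"needle": [(60, 70)], "query": [(118, 128)], "rest": [(0, 60), (70, 118)]}
print(attention_mass(attn, spans))
```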

Position Patterns Learned by RePo

To better understand the patterns learned by RePo, we analyze the characteristics of the assigned positions, first focusing on their ranges and then on their local patterns.

We first collect statistics on the distances between the maximum and minimum assigned positions for each attention head:

$$d^{k,h} = \max(\mathbf{z}^{k,h}) - \min(\mathbf{z}^{k,h}),$$

where $\mathbf{z}^{k,h} = \{z_1^{k,h}, z_2^{k,h}, \cdots, z_L^{k,h}\}$, $L$ is the number of tokens in the input $\mathbf{x}$, and $k$ and $h$ index the layer and attention head, respectively.
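
A minimal sketch of this statistic, assuming the learned positions for one input have been collected into a (layers, heads, tokens) tensor:

```python
import torch

def position_ranges(z):
    """d^{k,h} = max_i z_i^{k,h} - min_i z_i^{k,h} for every layer k and head h.

    z: (layers, heads, seq) learned positions for one input; returns (layers, heads) ranges.
    """
    return z.max(dim=-1).values - z.min(dim=-1).values

z = torch.randn(16, 16, 4096)              # toy positions for a 4K-token input (assumed shape)
d = position_ranges(z)                     # (16, 16); compare ranges across heads and layers
print(d.min().item(), d.max().item())
```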

We find that different heads learn different position ranges, suggesting that heads specialize to different granularities of context structure. Some heads learn relatively small ranges (local re-organization), while others learn large ranges (global re-organization), consistent with the idea that attention heads capture diverse relational patterns.

We also visualize local position patterns by plotting assigned positions across tokens for selected heads and layers. These plots reveal non-linear patterns such as plateaus (many tokens mapped to similar positions) and abrupt jumps (tokens mapped far apart), indicating that the learned position space is not a simple rescaling of the input index.

Fig 2. Distribution of position distances across attention heads. The distances are measured as the difference between the maximum and minimum learned positions within a context.
Fig 3. Example learned position patterns (new position vs. input token index) across selected layers/heads, illustrating non-linear structures such as clusters and jumps.

Performance on General Tasks

As shown in Table 5, alongside the noticeable gains in the previous experiments, RePo remains comparable to RoPE on a broad set of general benchmarks. This holds even though switching from linear position assignment to RePo introduces an inconsistency between the positional representations seen during the original pre-training and those used after the switch.

Tab 5. General benchmark performance.

| Benchmark suite | Metric | RoPE  | RePo  |
|-----------------|--------|-------|-------|
| ARC-C           | Acc    | 47.99 | 47.61 |
| ARC-E           | Acc    | 75.25 | 74.87 |
| BoolQ           | Acc    | 72.12 | 73.58 |
| CoQA            | F1     | 56.87 | 57.44 |
| DROP            | F1     | 37.90 | 38.17 |
| HellaSwag       | Acc    | 70.68 | 70.08 |
| MMLU-Pro        | Acc    | 13.77 | 13.52 |
| TriviaQA        | F1     | 54.98 | 54.56 |
| Average         |        | 53.70 | 53.73 |

Conclusion

We addressed the high extraneous cognitive load that rigid positional structures impose on Transformer-based LLMs, proposing RePo, a lightweight learned module that re-positions tokens based on their semantics. Across noisy-context, structured-data, and long-context evaluations, RePo consistently improves performance and learns interpretable position patterns that capture the structure and dependencies in a prompt.

Acknowledgements

Citation

@techreport{sakana2025repo,
  author    = {Huayang Li and Tianyu Zhao and Richard Sproat},
  title     = {{RePo: Language Models with Context Re-Positioning}},
  institution = {Sakana AI},
  year      = {2025},
  month     = {December},
  note      = {Technical Report}
}

Open Source Code

We release our code for this project here.

Appendix

Please view the PDF version of the paper for the appendix, which contains additional details and analyses.