Recent LLM agents have shown impressive capabilities on complex computer-use and long-horizon tasks. Yet they still struggle with long-term memory and adaptation--two cognitive capabilities whose absence limits LLMs today. Without long-term memory, users have to provide LLMs with relevant content at the start of every new session, creating friction, discontinuity, and longer time-to-response. Without adaptation, models do not learn from mistakes or user preferences across sessions, making each interaction as cumbersome as the first. Traditionally, both problems are tackled by "updating" the model.
1. LLM knowledge update (memory). When a user provides a long document, e.g., a policy, a report, or a private PDF, the standard solution is to put it in the context window. This works, but it means every new query re-reads the same document, paying the full latency and VRAM cost each time. Practical workarounds like KV-cache pre-filling help, but they do not eliminate the per-query overhead, and they break down entirely once the document exceeds the model's context window.
An alternative is context distillation: train the model to reproduce its own full-context behavior without the context, distilling the document directly into the weights. However, this requires a separate, expensive optimization run for every document.
2. LLM fine-tuning (adaptation). When practitioners want a model to consistently follow a new format, handle a specialized task, or adopt a particular style, the standard solution is fine-tuning. This also works, but it requires collecting or generating data, curating it carefully, and running an expensive training pipeline. Furthermore, iterating on the traditional fine-tuning pipeline requires repetitive data collection and running fine-tuning jobs, slowing down experimentation speed.
Both kinds of model updates, fine-tuning and context distillation, share a common bottleneck. Specifically, we want to move information inside the model, but the path to get it there is slow and expensive. In this post, we introduce two complementary research papers that take a different approach to model updates. Instead of naively updating the model during deployment, our methods pay the update costs up front by learning an "update generator" that can be reused cheaply at deployment time. The key procedure is to train a hypernetwork--a network whose output is the parameters of another network--to instantly and cheaply generate compact LoRA adapters. Once trained, this hypernetwork acts as our update generator that produces task-specific updates for the target LLM on the fly.
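To make the idea concrete, here is a minimal sketch (our illustration, not the papers' architecture) of a hypernetwork that maps an input embedding to the A and B matrices of a rank-8 LoRA update. All dimensions and layer sizes below are toy values chosen for readability.

```python
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    """Maps a document/task embedding to a rank-r LoRA (A, B) pair."""

    def __init__(self, emb_dim=64, in_dim=128, out_dim=128, rank=8):
        super().__init__()
        self.rank, self.in_dim, self.out_dim = rank, in_dim, out_dim
        self.body = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU())
        self.to_A = nn.Linear(256, rank * in_dim)   # generates A: (r, in)
        self.to_B = nn.Linear(256, out_dim * rank)  # generates B: (out, r)

    def forward(self, z):                           # z: (emb_dim,)
        h = self.body(z)
        A = self.to_A(h).view(self.rank, self.in_dim)
        B = self.to_B(h).view(self.out_dim, self.rank)
        return A, B

hyper = LoRAHyperNet()
z = torch.randn(64)       # embedding of a document or task description
A, B = hyper(z)           # one forward pass -> a LoRA adapter
delta_W = B @ A           # low-rank weight update, rank at most 8
print(delta_W.shape)      # torch.Size([128, 128])
```

Once such a generator is trained, producing a new adapter is just this single forward pass, which is the source of the sub-second deployment cost discussed below.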
In our papers, this concept is implemented as a two-phase update cost amortization workflow, with a clean separation of costs. First, at meta-training time (expensive, done once), we train a hypernetwork that learns how to generate effective LoRA updates from a given input. Then, at deployment time (cheap, done often), that input (a document or a task description) is fed to the hypernetwork, which returns a LoRA adapter in a single sub-second forward pass. Therefore, we can avoid the expensive per-task optimization pipeline entirely. While Doc-to-LoRA and Text-to-LoRA share this update cost amortization framework, they differ in what they learn and how they are used:
| | Doc-to-LoRA | Text-to-LoRA |
|---|---|---|
| Problem solved | Expensive context distillation for knowledge updates | Expensive fine-tuning pipeline for model adaptation |
| What the hypernetwork learns | Instantly adding new factual knowledge to an LLM | Instantly adapting an LLM for downstream tasks |
| Input to hypernetwork | Document (text or visual tokens via a VLM) | Natural-language task description |
| What LoRA stores | Factual knowledge from the document | Task-specific behavior or skill |
LoRA (Low-Rank Adaptation): a parameter-efficient fine-tuning method that freezes the base weights and learns a low-rank update ΔW = BA, typically a few million parameters per adapter.
Hypernetworks: networks whose outputs are the parameters of another network.
The lack of long-term memory in LLMs has real-world implications. For instance, every time a user asks about a document, the model re-reads it in full, paying the full latency and VRAM overhead. Alternatively, instead of keeping the document in the context window, we can distill it directly into the base model's weights as a LoRA adapter, acting as a long-term memory. The standard route for this is context distillation, but as we noted above, it requires per-document optimization with large memory requirements and takes a substantial amount of wall-clock time, which is not suitable for low-latency applications. Doc-to-LoRA asks whether a hypernetwork can meta-learn to perform that distillation step cheaply by mapping a document to a LoRA adapter in a single forward pass, with no per-document gradient updates. Unlike task adapters that usually encode task-specific behavior, LoRAs generated by Doc-to-LoRA act as factual storage. Once a document is internalized, the LLM can answer any number of questions without the document ever appearing in the context window again.
Doc-to-LoRA uses the teacher-student objective from context distillation, but amortizes the per-document training cost through meta-training. Instead of per-document optimization at deployment time, the hypernetwork learns to predict document-specific LoRA updates instantly. The training is conceptually very simple. We encode the document through a frozen LLM to get per-layer token activations. A Perceiver-based hypernetwork maps those activations to rank-8 LoRA matrices, trained to minimize the gap between teacher (full document context) and student (LoRA-adapted, no context) responses.
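The teacher-student objective can be sketched as a KL divergence between the two response distributions; this is a simplified illustration, and the papers' exact loss formulation may differ in its details.

```python
import torch
import torch.nn.functional as F

def context_distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the response tokens.

    teacher_logits: LM logits with the full document in context.
    student_logits: LM logits with the generated LoRA applied
                    and no document in the prompt.
    """
    t = F.log_softmax(teacher_logits, dim=-1)
    s = F.log_softmax(student_logits, dim=-1)
    # kl_div(input, target, log_target=True) computes KL(target || input)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

# Toy shapes: (batch, response_len, vocab)
teacher = torch.randn(2, 5, 100)
loss = context_distillation_loss(teacher, teacher.clone())
print(float(loss))  # identical distributions -> 0.0
```

During meta-training, the gradient of this loss flows through the generated LoRA into the hypernetwork itself, so the hypernetwork, not the base model, is what gets updated.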
Implementation-wise, we use 8 cross-attention blocks (~309M parameters), targeting the MLP layers of Gemma-2-2b-it. At inference, the hypernetwork can run in batched mode (all layers at once, faster) or iterative mode (one layer at a time, lower memory); both finish in under a second.
One key design choice involves using a chunking mechanism to handle long documents. A single fixed-rank adapter can become a bottleneck for very long contexts, so we partition documents into contiguous chunks and process each independently with the same hypernetwork. Each chunk produces a rank-r LoRA, which we compose by concatenating along the rank dimension, yielding an effective rank of r × K. This design scales with document length without changing the hypernetwork architecture.
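The rank-dimension concatenation described above has a convenient algebraic property: the composed adapter's update equals the sum of the per-chunk updates. A small sketch (toy dimensions of ours, not the production shapes):

```python
import torch

def compose_chunk_loras(As, Bs):
    """Concatenate K per-chunk rank-r LoRAs into one rank r*K adapter.

    As: list of K tensors of shape (r, in_dim)
    Bs: list of K tensors of shape (out_dim, r)
    """
    A = torch.cat(As, dim=0)   # (K*r, in_dim)
    B = torch.cat(Bs, dim=1)   # (out_dim, K*r)
    return A, B

# The composed update is the sum of the per-chunk updates:
# B @ A == sum_k Bs[k] @ As[k]
r, din, dout, K = 8, 32, 32, 4
As = [torch.randn(r, din) for _ in range(K)]
Bs = [torch.randn(dout, r) for _ in range(K)]
A, B = compose_chunk_loras(As, Bs)
assert torch.allclose(B @ A, sum(b @ a for b, a in zip(Bs, As)), atol=1e-4)
```

Because composition is just concatenation, adding another chunk never requires retraining or resizing the hypernetwork, only one more forward pass.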
This matters especially for documents longer than the model's context window. On a Needle-in-a-Haystack task, Doc-to-LoRA achieves near-perfect accuracy on contexts up to 32K tokens--despite training only on sequences up to 256 tokens. The chunking mechanism generalizes well beyond training length, handling long sequences effectively.

In this experiment, we aim to show that D2L (i) successfully induces knowledge internalization, enabling the base model to recall the implanted information without reading the raw context, (ii) effectively bypasses the inherent context-length limitations of the base language model, and (iii) reduces the computational requirements for inference, especially when the inputs are long. For illustrative purposes, we evaluate D2L on a synthetic needle-in-a-haystack (NIAH) information retrieval task. During D2L’s meta-training, we use input contexts ranging from 32 to 256 tokens in length. The training inputs are randomly chunked from 1 to 8 chunks with a minimum chunk size of 25 tokens.
During evaluation, the baseline has direct access to both the haystack and the query. For D2L, the base LLM never sees any part of the original context and is given only the query prompt. Doc-to-LoRA segments the haystack into 1,024-token chunks and composes them into a single adapter. Its accuracy stays near-perfect up to ~40K tokens--despite training with at most 8 chunks. With its 8K-token context window, the base model fails beyond 8K tokens, whereas Doc-to-LoRA retrieves information that is completely out of reach for direct in-context retrieval. The efficiency gain is also substantial: the base model needs 12+ GB of extra memory for a 128K-token haystack, while knowledge internalized with Doc-to-LoRA uses under 50 MB--constant regardless of document length. For real-world use, this is powerful: users can internalize private documents once, then chat without the memory overhead of keeping everything in context.
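The 12+ GB figure is consistent with a back-of-envelope KV-cache estimate, assuming Gemma-2-2b-like attention settings (26 layers, 4 KV heads, head dimension 256, fp16); the exact cost depends on the deployment.

```python
# KV-cache cost of keeping a 128K-token haystack in context:
# 2 tensors (K and V) per layer, per KV head, per token.
layers, kv_heads, head_dim, bytes_per_elem = 26, 4, 256, 2  # fp16
tokens = 128_000
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens
print(f"{kv_bytes / 1e9:.1f} GB")  # 13.6 GB
```

A rank-8 adapter over a handful of layers, by contrast, is a fixed few tens of megabytes no matter how long the source document is.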

Next, we evaluate Doc-to-LoRA on real-world reading comprehension benchmarks. In this blog post, we show SQuAD performance relative to the base model with full in-context access and other relevant baselines.
Doc-to-LoRA reaches 83.5% of the full-context upper bound--without any document information in the context window--using less than one second of update time. For comparison, oracle context distillation runs a full gradient-based optimization per document, paying exactly the per-document training cost that Doc-to-LoRA amortizes away.

Long-context QA tasks pose a real challenge for standard distillation due to memory and compute constraints. We note that test samples can go up to 32K tokens, far beyond Doc-to-LoRA's longest training example (2,344 tokens). Doc-to-LoRA can generalize beyond its training length thanks to the chunking mechanism. Doc-to-LoRA achieves 85% relative accuracy, again with sub-second update latency. Oracle CD scores higher at 90% but takes 40 seconds and requires more than 7 GB VRAM. The longer the document, the bigger this advantage. For full technical details, see our paper.
The main motivation for Doc-to-LoRA is adding new knowledge to LLMs cheaply, and factual information usually comes in textual form, e.g., static manuals or textbooks. However, we are not restricted to text. In this experiment, we test an extreme form of "internalization": can a text-only model answer questions about an image processed by a VLM, without the text model ever receiving anything beyond the internalized information? Specifically, we train another hypernetwork instance that uses a VLM (Gemma-3-4b-it) as the document encoder, and we never include any images during the hypernetwork's training phase. This experiment therefore tests Doc-to-LoRA's zero-shot ability to encode visual information into a LoRA.
We find that the answer can be "yes" to a surprising degree. On the 10-class Imagenette subset of ImageNet, the target text-only model reaches 75.03% accuracy purely through information stored in the generated LoRA adapter. This result is remarkable, as neither the hypernetwork nor the base model has seen any visual tokens during training.
More broadly, this result suggests that Doc-to-LoRA could be used as a general "Context-to-LoRA" mapping. An interesting research direction is to explore how hypernetworks can become a modality bridge that moves information extracted by one model into another model's parameters, à la model merging.

To make things more concrete, we include an interactive demo below that shows transcripts from real Doc-to-LoRA sessions. In each session, a document is internalized once into a generated adapter, and the model then answers questions about it with no raw context in the prompt. Toggle between No Context (the base model) and Internalized (the adapter active) to see the difference. Hover over highlighted spans to trace exactly which parts of the document each answer draws from.
While Doc-to-LoRA targets the long-term memory problem, Text-to-LoRA targets adaptation. Adapting LLMs via fine-tuning requires collecting or synthesizing data, curating it carefully, and running a training job. Each iteration through that loop is slow, manual, and expensive. Furthermore, the result is a single adapter tightly coupled to one specific dataset. Text-to-LoRA asks whether a hypernetwork can meta-learn the entire fine-tuning pipeline. Given only a natural-language description of a task, can it generate a useful LoRA adapter in a single forward pass? If so, then per-task adaptation cost disappears, replaced by a one-time upfront meta-training investment in the hypernetwork itself. During deployment, adaptation and task specialization become as simple as writing a description.
This training setup captures the same paradigm shift presented in the introduction. Instead of optimizing a new adapter for every task at deployment time, we train one generator that can predict adapters on demand. Concretely, a task description is first encoded into a task embedding and the hypernetwork outputs all target LoRA weights in one forward pass. In our paper, we explore two objectives. In reconstruction training, Text-to-LoRA learns to match existing task-specific LoRA adapters. In SFT training, Text-to-LoRA is trained end-to-end through downstream task loss. That is, generated adapters are applied to a frozen base model, and gradients update the hypernetwork directly without requiring intermediate target adapters.
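The two objectives can be sketched as follows; this is our simplified illustration (the toy shapes and the `alpha=16` LoRA scaling are our assumptions, not the papers' exact settings).

```python
import torch
import torch.nn.functional as F

# (1) Reconstruction: regress the generated adapter onto a library of
#     pre-trained task-specific LoRAs.
def reconstruction_loss(pred_A, pred_B, target_A, target_B):
    return F.mse_loss(pred_A, target_A) + F.mse_loss(pred_B, target_B)

# (2) SFT: apply the generated adapter to a frozen base weight and
#     backprop the downstream task loss through it into the hypernetwork.
def adapted_linear(x, W_frozen, A, B, alpha=16.0, rank=8):
    # y = x W^T + (alpha / r) * x A^T B^T, with only A and B generated.
    return x @ W_frozen.T + (alpha / rank) * (x @ A.T) @ B.T

x = torch.randn(4, 32)                          # toy batch of activations
W = torch.randn(16, 32)                         # frozen base weight
A, B = torch.randn(8, 32), torch.randn(16, 8)   # generated LoRA pair
print(adapted_linear(x, W, A, B).shape)         # torch.Size([4, 16])
```

In the SFT case, no target adapter ever exists: the task loss computed on the adapted model's outputs is the only supervision the hypernetwork receives.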
In the main setup, the base model is Mistral-7B-Instruct, LoRA targets q_proj and v_proj at rank 8 across all layers (about 3.4M adapter parameters), and training data comes from the SNI-derived Lots-of-LoRAs collection (479 training tasks after filtering). This constrained output space keeps updates compact and modular while still enabling single-forward-pass generation at deployment.
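The ~3.4M figure can be sanity-checked from standard Mistral-7B shapes (32 layers, hidden size 4096, and 8 KV heads of dimension 128, so v_proj maps 4096 to 1024); these shapes are our assumption about the configuration.

```python
# Rank-8 LoRA on q_proj and v_proj across all 32 layers.
rank, layers, hidden, v_out = 8, 32, 4096, 1024
q_params = rank * hidden + hidden * rank   # q_proj: A (r, 4096), B (4096, r)
v_params = rank * hidden + v_out * rank    # v_proj: A (r, 4096), B (1024, r)
total = layers * (q_params + v_params)
print(f"{total / 1e6:.2f}M")               # 3.41M adapter parameters
```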

Reconstruction training is essentially adapter-library compression. Instead of storing many task-specific adapters, we store one hypernetwork that can regenerate them from descriptions. It's a sanity check to confirm that Text-to-LoRA can learn to reproduce known weights. On seen tasks, reconstruction-trained Text-to-LoRA recovers most of the task-specific performance--sometimes even matching the original adapter. The issue is that reproducing known weights isn't the same as learning a robust description-to-LoRA mapping that generalizes to new tasks.
To test generalization, we move to end-to-end SFT training on a different set of tasks. The question becomes whether Text-to-LoRA can generate useful adapters for tasks it has never seen.


SFT training addresses the core question of whether we can generate useful adapters from unseen task descriptions. We train on 479 diverse tasks from the Lots-of-LoRAs dataset. This diversity teaches the hypernetwork a general mapping from natural-language task descriptions to corresponding model updates. Even without oracle adapters to guide training, SFT-trained Text-to-LoRA generates adapters that beat the base model and outperform baselines on held-out tasks.
We also observe a clear scaling trend: larger hypernetworks and more training data produce better, more generalizable adapter generators, confirming that the text-to-adapter mapping scales well. See our paper for full details on architecture, experiments, ablations, and analyses.
Doc-to-LoRA and Text-to-LoRA are built around a simple idea. Instead of treating model updates as slow and expensive training jobs, we can learn them as hypernetworks. The theme of our research in this direction is cost amortization, where we pay the meta-training cost upfront for one update generator that can produce many task- or document-specific LoRAs on demand, turning what used to be an engineering pipeline into a single forward pass.
Framing hypernetworks as update generators is powerful, and at first glance almost too good to be true for improving downstream performance on arbitrary tasks. Indeed, it is not a free lunch. Meta-training can be very expensive (days to weeks on multiple GPUs), and there are important trade-offs and limitations to consider. However, we are very excited about this research direction because it opens a new design space of instant, modular updates that can be generated and applied cheaply on demand.
Looking forward, we see update generators potentially being a foundational interface. By scaling both compute and data, one could train a foundation hypernetwork that unifies Doc-to-LoRA and Text-to-LoRA (and future modalities) so that the same system can ingest task descriptions, documents, or experiences and generate useful adapters for future interactions. We envision a shared “update API” for LLMs, where different sources of supervision are simply different inputs to the same generator, yielding modular, composable adapters.
Furthermore, instant update interfaces allow many new kinds of LLM memory architectures. For instance, instead of dumping all memory as external files, models could "nap", periodically consolidating recent interactions into generated weight updates.
@techreport{sakana2025doc-to-lora,
title = {{Doc-to-LoRA: Learning to Instantly Internalize Contexts}},
author = {Rujikorn Charakorn and Edoardo Cetin and Shinnosuke Uesaka and Robert Tjarko Lange},
institution = {Sakana AI},
year = {2026},
month = {February},
note = {Technical Report}
}
@inproceedings{charakorn2025texttolora,
title = {Text-to-Lo{RA}: Instant Transformer Adaption},
author = {Rujikorn Charakorn and Edoardo Cetin and Yujin Tang and Robert Tjarko Lange},
booktitle = {Forty-second International Conference on Machine Learning},
year = {2025},
url = {https://openreview.net/forum?id=zWskCdu3QA}
}