Recent LLM agents have shown impressive capabilities on complex computer-use and long-horizon tasks. Yet they still struggle with long-term memory and adaptation--two cognitive capabilities whose absence limits LLMs today. Without long-term memory, users have to provide LLMs with relevant content at the start of every new session, creating friction, discontinuity, and longer time-to-response. Without adaptation, models do not learn from mistakes or user preferences across sessions, making each interaction as cumbersome as the first. Traditionally, both problems are tackled by "updating" the model.
1. LLM knowledge update (memory). When a user provides a long document, e.g., a policy, a report, or a private PDF, the standard solution is to put it in the context window. This works, but it means every new query re-reads the same document, paying the full latency and VRAM cost each time. Practical workarounds like KV-cache pre-filling help, but they do not eliminate the per-query overhead, and they break down entirely once the document exceeds the model's context window.
An alternative is context distillation: train the model to reproduce its own full-context behavior without the context, distilling the document directly into the weights. However, this requires a separate, expensive optimization run for every document.
2. LLM fine-tuning (adaptation). When practitioners want a model to consistently follow a new format, handle a specialized task, or adopt a particular style, the standard solution is fine-tuning. This also works, but it requires collecting or generating data, curating it carefully, and running an expensive training pipeline. Furthermore, iterating on the traditional fine-tuning pipeline requires repetitive data collection and running fine-tuning jobs, slowing down experimentation speed.
Both kinds of model updates, fine-tuning and context distillation, share a common bottleneck. Specifically, we want to move information inside the model, but the path to get it there is slow and expensive. In this post, we introduce two complementary research papers that take a different approach to model updates. Instead of naively updating the model during deployment, our methods pay the update costs up front by learning an "update generator" that can be reused cheaply at deployment time. The key procedure is to train a hypernetwork--a network whose output is the parameters of another network--to instantly and cheaply generate compact LoRA adapters. Once trained, this hypernetwork acts as our update generator that produces task-specific updates for the target LLM on the fly.
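To make the idea concrete, here is a minimal sketch (our illustration, not the papers' architecture) of a hypernetwork that maps an input embedding to the A and B matrices of a rank-8 LoRA update. All dimensions and layer sizes below are toy values chosen for readability.

```python
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    """Maps a document/task embedding to a rank-r LoRA (A, B) pair."""

    def __init__(self, emb_dim=64, in_dim=128, out_dim=128, rank=8):
        super().__init__()
        self.rank, self.in_dim, self.out_dim = rank, in_dim, out_dim
        self.body = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU())
        self.to_A = nn.Linear(256, rank * in_dim)   # generates A: (r, in)
        self.to_B = nn.Linear(256, out_dim * rank)  # generates B: (out, r)

    def forward(self, z):                           # z: (emb_dim,)
        h = self.body(z)
        A = self.to_A(h).view(self.rank, self.in_dim)
        B = self.to_B(h).view(self.out_dim, self.rank)
        return A, B

hyper = LoRAHyperNet()
z = torch.randn(64)       # embedding of a document or task description
A, B = hyper(z)           # one forward pass -> a LoRA adapter
delta_W = B @ A           # low-rank weight update, rank at most 8
print(delta_W.shape)      # torch.Size([128, 128])
```

Once such a generator is trained, producing a new adapter is just this single forward pass, which is the source of the sub-second deployment cost discussed below.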
In our papers, this concept is implemented as a two-phase update cost amortization workflow, with a clean separation of costs. First, at meta-training time (expensive, done once), we train a hypernetwork that learns how to generate effective LoRA updates from a given input. Then, at deployment time (cheap, done often), that input (a document or a task description) is fed to the hypernetwork, which returns a LoRA adapter in a single sub-second forward pass. Therefore, we can avoid the expensive per-task optimization pipeline entirely. While Doc-to-LoRA and Text-to-LoRA share this update cost amortization framework, they differ in what they learn and how they are used:
| | Doc-to-LoRA | Text-to-LoRA |
|---|---|---|
| Problem solved | Expensive context distillation for knowledge updates | Expensive fine-tuning pipeline for model adaptation |
| What the hypernetwork learns | Instantly adding new factual knowledge to an LLM | Instantly adapting an LLM for downstream tasks |
| Input to hypernetwork | Document (text or visual tokens via a VLM) | Natural-language task description |
| What LoRA stores | Factual knowledge from the document | Task-specific behavior or skill |
LoRA (Low-Rank Adaptation): a parameter-efficient fine-tuning method that freezes the base weights and learns a low-rank update ΔW = BA, typically a few million parameters per adapter.
Hypernetworks: networks whose outputs are the parameters of another network.
The lack of long-term memory in LLMs has real-world implications. For instance, every time a user asks about a document, the model re-reads it in full, paying the full latency and VRAM overhead. Alternatively, instead of keeping the document in the context window, we can distill it directly into the base model's weights as a LoRA adapter, acting as a long-term memory. The standard route for this is context distillation, but as we noted above, it requires per-document optimization with large memory requirements and takes a substantial amount of wall-clock time, which is not suitable for low-latency applications. Doc-to-LoRA asks whether a hypernetwork can meta-learn to perform that distillation step cheaply by mapping a document to a LoRA adapter in a single forward pass, with no per-document gradient updates. Unlike task adapters that usually encode task-specific behavior, LoRAs generated by Doc-to-LoRA act as factual storage. Once a document is internalized, the LLM can answer any number of questions without the document ever appearing in the context window again.
Doc-to-LoRA uses the teacher-student objective from context distillation, but amortizes the per-document training cost through meta-training. Instead of per-document optimization at deployment time, the hypernetwork learns to predict document-specific LoRA updates instantly. The training is conceptually very simple. We encode the document through a frozen LLM to get per-layer token activations. A Perceiver-based hypernetwork maps those activations to rank-8 LoRA matrices, trained to minimize the gap between teacher (full document context) and student (LoRA-adapted, no context) responses.
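The teacher-student objective can be sketched as a KL divergence between the two response distributions; this is a simplified illustration, and the papers' exact loss formulation may differ in its details.

```python
import torch
import torch.nn.functional as F

def context_distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the response tokens.

    teacher_logits: LM logits with the full document in context.
    student_logits: LM logits with the generated LoRA applied
                    and no document in the prompt.
    """
    t = F.log_softmax(teacher_logits, dim=-1)
    s = F.log_softmax(student_logits, dim=-1)
    # kl_div(input, target, log_target=True) computes KL(target || input)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

# Toy shapes: (batch, response_len, vocab)
teacher = torch.randn(2, 5, 100)
loss = context_distillation_loss(teacher, teacher.clone())
print(float(loss))  # identical distributions -> 0.0
```

During meta-training, the gradient of this loss flows through the generated LoRA into the hypernetwork itself, so the hypernetwork, not the base model, is what gets updated.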
Implementation-wise, we use 8 cross-attention blocks (~309M parameters), targeting the MLP layers of Gemma-2-2b-it. At inference, the hypernetwork can run in batched mode (all layers at once, faster) or iterative mode (one layer at a time, lower memory); both finish in under a second.
One key design choice involves using a chunking mechanism to handle long documents. A single fixed-rank adapter can become a bottleneck for very long contexts, so we partition documents into contiguous chunks and process each independently with the same hypernetwork. Each chunk produces a rank-r LoRA, which we compose by concatenating along the rank dimension, yielding an effective rank of r × K. This design scales with document length without changing the hypernetwork architecture.
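The rank-dimension concatenation described above has a convenient algebraic property: the composed adapter's update equals the sum of the per-chunk updates. A small sketch (toy dimensions of ours, not the production shapes):

```python
import torch

def compose_chunk_loras(As, Bs):
    """Concatenate K per-chunk rank-r LoRAs into one rank r*K adapter.

    As: list of K tensors of shape (r, in_dim)
    Bs: list of K tensors of shape (out_dim, r)
    """
    A = torch.cat(As, dim=0)   # (K*r, in_dim)
    B = torch.cat(Bs, dim=1)   # (out_dim, K*r)
    return A, B

# The composed update is the sum of the per-chunk updates:
# B @ A == sum_k Bs[k] @ As[k]
r, din, dout, K = 8, 32, 32, 4
As = [torch.randn(r, din) for _ in range(K)]
Bs = [torch.randn(dout, r) for _ in range(K)]
A, B = compose_chunk_loras(As, Bs)
assert torch.allclose(B @ A, sum(b @ a for b, a in zip(Bs, As)), atol=1e-4)
```

Because composition is just concatenation, adding another chunk never requires retraining or resizing the hypernetwork, only one more forward pass.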
This matters especially for documents longer than the model's context window. On a Needle-in-a-Haystack task, Doc-to-LoRA achieves near-perfect accuracy on contexts up to 32K tokens--despite training only on sequences up to 256 tokens. The chunking mechanism generalizes well beyond training length, handling long sequences effectively.

In this experiment, we aim to show that D2L (i) successfully induces knowledge internalization, enabling the base model to recall the implanted information without reading the raw context, (ii) effectively bypasses the inherent context-length limitations of the base language model, and (iii) reduces the computational requirements for inference, especially when the inputs are long. For illustrative purposes, we evaluate D2L on a synthetic needle-in-a-haystack (NIAH) information retrieval task. During D2L’s meta-training, we use input contexts ranging from 32 to 256 tokens in length. The training inputs are randomly chunked from 1 to 8 chunks with a minimum chunk size of 25 tokens.
During evaluation, the baseline has direct access to both the haystack and the query. For D2L, the base LLM never sees any part of the original context and is given only the query prompt. Doc-to-LoRA segments the haystack into 1,024-token chunks and composes them into a single adapter. Its accuracy stays near-perfect up to ~40K tokens--despite training with at most 8 chunks. With its 8K-token context window, the base model fails beyond 8K tokens, whereas Doc-to-LoRA retrieves information that is completely out of reach for direct in-context retrieval. The efficiency gain is also substantial: the base model needs 12+ GB of extra memory for a 128K-token haystack, while knowledge internalized with Doc-to-LoRA uses under 50 MB--constant regardless of document length. For real-world use, this is powerful: users can internalize private documents once, then chat without the memory overhead of keeping everything in context.
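The 12+ GB figure is consistent with a back-of-envelope KV-cache estimate, assuming Gemma-2-2b-like attention settings (26 layers, 4 KV heads, head dimension 256, fp16); the exact cost depends on the deployment.

```python
# KV-cache cost of keeping a 128K-token haystack in context:
# 2 tensors (K and V) per layer, per KV head, per token.
layers, kv_heads, head_dim, bytes_per_elem = 26, 4, 256, 2  # fp16
tokens = 128_000
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens
print(f"{kv_bytes / 1e9:.1f} GB")  # 13.6 GB
```

A rank-8 adapter over a handful of layers, by contrast, is a fixed few tens of megabytes no matter how long the source document is.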

Next, we evaluate Doc-to-LoRA on real-world reading comprehension benchmarks. In this blog post, we show SQuAD performance relative to the base model with full in-context access and other relevant baselines.
Doc-to-LoRA reaches 83.5% of the full-context upper bound--without any document information in the context window--using less than one second of update time. For comparison, oracle context distillation runs a full gradient-based optimization per document, paying exactly the per-document training cost that Doc-to-LoRA amortizes away.

Long-context QA tasks pose a real challenge for standard distillation due to memory and compute constraints. We note that test samples can go up to 32K tokens, far beyond Doc-to-LoRA's longest training example (2,344 tokens). Doc-to-LoRA can generalize beyond its training length thanks to the chunking mechanism. Doc-to-LoRA achieves 85% relative accuracy, again with sub-second update latency. Oracle CD scores higher at 90% but takes 40 seconds and requires more than 7 GB VRAM. The longer the document, the bigger this advantage. For full technical details, see our paper.
The main motivation for Doc-to-LoRA is adding new knowledge to LLMs cheaply, and factual information usually comes in textual form, e.g., static manuals or textbooks. However, we are not restricted to text. In this experiment, we test an extreme form of "internalization": can a text-only model answer questions about an image processed by a VLM, without the text model ever receiving anything beyond the internalized information? Specifically, we train another hypernetwork instance that uses a VLM (Gemma-3-4b-it) as the document encoder, and we never include any images during the hypernetwork's training phase. This experiment therefore tests Doc-to-LoRA's zero-shot ability to encode visual information into a LoRA.
We find that the answer can be "yes" to a surprising degree. On the 10-class Imagenette subset of ImageNet, the target text-only model reaches 75.03% accuracy purely through information stored in the generated LoRA adapter. This result is remarkable, as neither the hypernetwork nor the base model has seen any visual tokens during training.
More broadly, this result suggests that Doc-to-LoRA could be used as a general "Context-to-LoRA" mapping. An interesting research direction is to explore how hypernetworks can become a modality bridge that moves information extracted by one model into another model's parameters, à la model merging.

To make things more concrete, we include an interactive demo below that shows transcripts from real Doc-to-LoRA sessions. In each session, a document is internalized once into a generated adapter, and the model then answers questions about it with no raw context in the prompt. Toggle between No Context (the base model) and Internalized (the adapter active) to see the difference. Hover over highlighted spans to trace exactly which parts of the document each answer draws from.
While Doc-to-LoRA targets the long-term memory problem, Text-to-LoRA targets adaptation. Adapting LLMs via fine-tuning requires collecting or synthesizing data, curating it carefully, and running a training job. Each iteration through that loop is slow, manual, and expensive. Furthermore, the result is a single adapter tightly coupled to one specific dataset. Text-to-LoRA asks whether a hypernetwork can meta-learn the entire fine-tuning pipeline. Given only a natural-language description of a task, can it generate a useful LoRA adapter in a single forward pass? If so, then per-task adaptation cost disappears, replaced by a one-time upfront meta-training investment in the hypernetwork itself. During deployment, adaptation and task specialization become as simple as writing a description.
This training setup captures the same paradigm shift presented in the introduction. Instead of optimizing a new adapter for every task at deployment time, we train one generator that can predict adapters on demand. Concretely, a task description is first encoded into a task embedding and the hypernetwork outputs all target LoRA weights in one forward pass. In our paper, we explore two objectives. In reconstruction training, Text-to-LoRA learns to match existing task-specific LoRA adapters. In SFT training, Text-to-LoRA is trained end-to-end through downstream task loss. That is, generated adapters are applied to a frozen base model, and gradients update the hypernetwork directly without requiring intermediate target adapters.
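The two objectives can be sketched as follows; this is our simplified illustration (the toy shapes and the `alpha=16` LoRA scaling are our assumptions, not the papers' exact settings).

```python
import torch
import torch.nn.functional as F

# (1) Reconstruction: regress the generated adapter onto a library of
#     pre-trained task-specific LoRAs.
def reconstruction_loss(pred_A, pred_B, target_A, target_B):
    return F.mse_loss(pred_A, target_A) + F.mse_loss(pred_B, target_B)

# (2) SFT: apply the generated adapter to a frozen base weight and
#     backprop the downstream task loss through it into the hypernetwork.
def adapted_linear(x, W_frozen, A, B, alpha=16.0, rank=8):
    # y = x W^T + (alpha / r) * x A^T B^T, with only A and B generated.
    return x @ W_frozen.T + (alpha / rank) * (x @ A.T) @ B.T

x = torch.randn(4, 32)                          # toy batch of activations
W = torch.randn(16, 32)                         # frozen base weight
A, B = torch.randn(8, 32), torch.randn(16, 8)   # generated LoRA pair
print(adapted_linear(x, W, A, B).shape)         # torch.Size([4, 16])
```

In the SFT case, no target adapter ever exists: the task loss computed on the adapted model's outputs is the only supervision the hypernetwork receives.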
In the main setup, the base model is Mistral-7B-Instruct, LoRA targets q_proj and v_proj at rank 8 across all layers (about 3.4M adapter parameters), and training data comes from the SNI-derived Lots-of-LoRAs collection (479 training tasks after filtering). This constrained output space keeps updates compact and modular while still enabling single-forward-pass generation at deployment.
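The ~3.4M figure can be sanity-checked from standard Mistral-7B shapes (32 layers, hidden size 4096, and 8 KV heads of dimension 128, so v_proj maps 4096 to 1024); these shapes are our assumption about the configuration.

```python
# Rank-8 LoRA on q_proj and v_proj across all 32 layers.
rank, layers, hidden, v_out = 8, 32, 4096, 1024
q_params = rank * hidden + hidden * rank   # q_proj: A (r, 4096), B (4096, r)
v_params = rank * hidden + v_out * rank    # v_proj: A (r, 4096), B (1024, r)
total = layers * (q_params + v_params)
print(f"{total / 1e6:.2f}M")               # 3.41M adapter parameters
```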

Reconstruction training is essentially adapter-library compression. Instead of storing many task-specific adapters, we store one hypernetwork that can regenerate them from descriptions. It's a sanity check to confirm that Text-to-LoRA can learn to reproduce known weights. On seen tasks, reconstruction-trained Text-to-LoRA recovers most of the task-specific performance--sometimes even matching the original adapter. The issue is that reproducing known weights isn't the same as learning a robust description-to-LoRA mapping that generalizes to new tasks.
To test generalization, we move to end-to-end SFT training on a different set of tasks. The question becomes whether Text-to-LoRA can generate useful adapters for tasks it has never seen.


SFT training addresses the core question of whether we can generate useful adapters from unseen task descriptions. We train on 479 diverse tasks from the Lots-of-LoRAs dataset. This diversity teaches the hypernetwork a general mapping from natural-language task descriptions to corresponding model updates. Even without oracle adapters to guide training, SFT-trained Text-to-LoRA generates adapters that beat the base model and outperform baselines on held-out tasks.
We also observe a clear scaling trend: larger hypernetworks and more training data produce better, more generalizable adapter generators, confirming that the text-to-adapter mapping scales well. See our paper for full details on architecture, experiments, ablations, and analyses.
Doc-to-LoRA and Text-to-LoRA are built around a simple idea. Instead of treating model updates as slow and expensive training jobs, we can learn them as hypernetworks. The theme of our research in this direction is cost amortization, where we pay the meta-training cost upfront for one update generator that can produce many task- or document-specific LoRAs on demand, turning what used to be an engineering pipeline into a single forward pass.
Framing hypernetworks as update generators is powerful, and at first glance almost too good to be true for improving downstream performance on arbitrary tasks. Indeed, it is not a free lunch. Meta-training can be very expensive (days to weeks on multiple GPUs), and there are important trade-offs and limitations to consider. However, we are very excited about this research direction because it opens a new design space of instant, modular updates that can be generated and applied cheaply on demand.
Looking forward, we see update generators potentially being a foundational interface. By scaling both compute and data, one could train a foundation hypernetwork that unifies Doc-to-LoRA and Text-to-LoRA (and future modalities) so that the same system can ingest task descriptions, documents, or experiences and generate useful adapters for future interactions. We envision a shared “update API” for LLMs, where different sources of supervision are simply different inputs to the same generator, yielding modular, composable adapters.
Furthermore, instant update interfaces allow many new kinds of LLM memory architectures. For instance, instead of dumping all memory as external files, models could "nap", periodically consolidating recent interactions into generated weight updates.
@techreport{sakana2025doc-to-lora,
title = {{Doc-to-LoRA: Learning to Instantly Internalize Contexts}},
author = {Rujikorn Charakorn and Edoardo Cetin and Shinnosuke Uesaka and Robert Tjarko Lange},
institution = {Sakana AI},
year = {2026},
month = {February},
note = {Technical Report}
}
@inproceedings{charakorn2025texttolora,
title = {Text-to-Lo{RA}: Instant Transformer Adaption},
author = {Rujikorn Charakorn and Edoardo Cetin and Yujin Tang and Robert Tjarko Lange},
booktitle = {Forty-second International Conference on Machine Learning},
year = {2025},
url = {https://openreview.net/forum?id=zWskCdu3QA}
}