This page requires JavaScript to display interactive content.

Sparser, Faster, Lighter Transformer Language Models

tl;dr

In collaboration with NVIDIA, we introduce new sparse data structures and GPU kernels to leverage unstructured sparsity for efficient inference and training of LLMs. This work will be presented at ICML 2026.

Inference and training speedups together with downstream accuracy per sparsity levels.

Summary

Modern large language models (LLMs) are powerful. They can write code, reason through complex problems, and synthesize vast amounts of technical knowledge. However, providing such capabilities at scale comes with tremendous real-world costs.

A large part of these costs comes from the feedforward layers, a ubiquitous component of LLMs since their inception. Yet, inside these layers, an interesting phenomenon can be observed: For any given token, only a small fraction of the hidden activations actually matter, with the rest effectively approximating zero and wasting computation. With ReLU and L1 regularization, this sparsity can be even made to exceed 95% with little to no impact on downstream performance .

So, can we leverage this sparsity to make LLMs faster and lighter?

The difficulties of achieving this objective lie in the hardware. Modern NVIDIA GPUs are extraordinarily good at homogeneous workloads, with specialized units dedicated to dense matrix multiplications. Traditional algorithms for efficiently leveraging unstructured sparsity clearly do not fit these assumptions, and introduce non-trivial constant overheads and bookkeeping that cancel out their theoretical savings.

This work, a collaboration between Sakana AI and NVIDIA, is about resolving this paradox:

To substantiate these gains, we study sparse LLMs at billion-parameter scales and show that mild L1 regularization can induce high levels of sparsity after training with negligible impact on downstream performance. By leveraging our kernels, these sparsity levels translate into over 20% speedups for both batched inference and training on H100 GPUs, while also cutting energy consumption and memory requirements.

Get ready for a deep dive into GPUs, LLMs, and sparsity!

Transformers and Feedforward Layers

At a high level, the "body" of modern transformers is composed of a repeated stack of just two components: an attention block (linear or quadratic), followed by a feedforward block. While attention lets tokens communicate with each other, the feedforward block processes each input token independently, performing multiple matrix multiplications with a set of large learned weight matrices.

Gated feedforward block. The input x branches into an up projection and a gated projection, recombines through elementwise multiplication, and is projected back to the embedding dimension.

Formally, a modern feedforward block takes as input a matrix xRM×Kx \in \mathbb{R}^{M \times K}, where MM is the total number of tokens (number of batched sequences ×\times sequence length), and KK is the token embedding size. The input xx is then multiplied by three different weight matrices to first expand this representation into a much larger hidden space of size NKN \gg K, apply a simple non-linearity, and then project it back down for the next block. Formally, this computation looks as follows:

hu=xWu,hg=σ(xWg),h=huhg,y=hWd.h_u = x W_u, \quad h_g = \sigma(x W_g), \quad h = h_u \odot h_g, \quad y = h W_d.

Here, Wu,WgRK×NW_u, W_g \in \mathbb{R}^{K \times N} are commonly referred to as the weight matrices for the "up" and "gate" projections, and WdRN×KW_d \in \mathbb{R}^{N \times K} as the weight matrix for the "down" projection. The non-linearity σ\sigma, such as SiLU or ReLU, is what introduces sparsity in the gate activations hgh_g, which is carried over to the hidden activations hh following the elementwise product \odot. As the hidden dimension NN is typically much larger than KK (often 4x or more), these layers can often be responsible for the majority of both parameters and FLOPs in modern LLMs.

In this overparameterized regime, prior work has shown that pretrained transformers with ReLU activations can already exhibit high levels of unstructured sparsity in their feedforward blocks , with less than 5-10% non-zero elements in their hidden activations hh . Other works on SiLU-based LLMs have shown that a substantial fraction of feedforward activations can also be set to zero with little or no additional training, either through training-free thresholding or lightweight finetuning . In this work, we study how different levels of sparsity affect both performance and efficiency by combining ReLU activations with a simple auxiliary L1 regularization loss on the hidden activations, weighted by a coefficient L1L_1 and averaged across all neurons:

L1×1MNm=1Mn=1Nh[m,n].L_1 \times \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \lvert h[m,n] \rvert.

GPUs, Memory, and Tiling

To understand the challenges of leveraging sparsity to accelerate matrix multiplication, we provide a brief primer on how modern GPUs execute programs and the hardware characteristics.

Whenever executing any code on an NVIDIA GPU using PyTorch or any other library, this launches a GPU kernel, a massively parallel program executed across many threads on the device. The kernel's computation is first distributed across a grid of many independent units of work called cooperative thread arrays (CTAs)A CTA is also commonly referred to as a thread block in several tutorials and books.. Each CTA is scheduled on a single Streaming Multiprocessor (SM)An SM is the GPU execution unit that schedules and runs CTAs, somewhat analogous to a CPU core but built for massive parallelism. and consists of many threads working together to carry out the kernel operations on a set portion of the input data. This design allows the same optimized kernel, written from the perspective of the threads in a single CTA, to be easily ported to different input sizes by simply scaling the number of CTAs in the grid. The threads in the CTA are not entirely independent of each other but are grouped into units of 32 called warps, which execute the same instructions at each clock cycle on different inputs. All these constraints help reduce the cost and chip-space dedicated to each execution unit and enable GPUs to achieve massive parallelism: thousands of threads are scheduled across many CTAs, all performing arithmetic and memory operations concurrently.

However, not all operations are equal. As theoretical FLOPs on modern NVIDIA devices have risen dramatically in recent years, the bottleneck of many common kernels is often memory bandwidth.

GPU memory hierarchy. Registers sit closest to the compute unit, shared memory is CTA-local, and global memory is shared across the whole GPU.

Modern GPUs have a hierarchical memory system with three distinct levels. At the bottom level, we have a global memory available to all threads running on the GPU and residing separately from the compute units on a stack of high-bandwidth memory (HBM). While global memory can be large (80GB for an H100 GPU), it comes with high latency on the order of ~500 cycles and limited per-device bandwidth. Surrounding the compute units themselves, on the second level, we have on-chip shared memoryShared memory and registers are built from SRAM, which is extremely fast but physically large and expensive per bit.. Shared memory is much faster, with latencies closer to ~10–20 cycles and is private to each CTA. At the top level, we have on-chip register memory which can be accessed in a single cycle and is private to each thread. Moving up the hierarchy, each level becomes over an order of magnitude faster, but correspondingly smaller due to the physical constraints of on-chip storage. For instance, while an H100 GPU provides 80GB of HBM, each CTA is limited to at most 228KB of shared memory, and each thread to only a few kilobytes of register storage.

Thus, designing efficient kernels requires minimizing accesses to global memory. This is directly reflected in the design of modern matrix multiplication kernels. Suppose we want to compute h=xWh = x W, where xRM×Kx \in \mathbb{R}^{M \times K} and WRK×NW \in \mathbb{R}^{K \times N}. Each output element in this operation can be seen as an independent dot product:

h[m,n]=k=1Kx[m,k]W[k,n].h[m, n] = \sum_{k=1}^{K} x[m, k] W[k, n].

An intuitive but suboptimal approach to implement this operation would distribute work across threads at the granularity of individual output elements. For instance, each thread could be assigned a single output element h[m,n]h[m, n], load the row mm of xx and column nn of WW from global memory, perform the dot product, and store the results. The main issue with this approach is that it requires two separate loads for x[m,k]x[m, k] and W[k,n]W[k, n] before each partial accumulation. As a result, the same values of xx and WW are redundantly reloaded across many different threads, leading to a large number of expensive global memory transactions that dominate execution time.

Modern GPU kernels avoid this by using tiling. Instead of each thread operating on individual output elements, computation is reorganized by having the threads in each CTA cooperatively compute a separate tile of the output of shape Tm×TnT_m \times T_n. With this approach, each CTA iterates over the reduction dimension KK in chunks of size TkT_k, and at each step all its threads cooperatively load a tile of xx of shape Tm×TkT_m \times T_k and a tile of WW of shape Tk×TnT_k \times T_n from global memory into shared memory.

Once both input tiles are loaded in fast shared memory, the threads in the CTA can reuse them to perform all necessary multiply-accumulate operations, updating a full Tm×TnT_m \times T_n tile of the output and storing it back at the end:

Algorithm 1. A sampled CTA tile illustrates tiled matrix multiplication. The CTA advances through the KK dimension, reading the relevant tiles of xx and WW, performing a tiled matmul, and increasing the running-sum accumulator. Finally, the output tile hh is committed to global memory.

This reorganization allows amortizing the cost of each global memory load over many arithmetic operations: each element of xx is reused across TnT_n output columns, while each element of WW is reused across TmT_m output rows. As a result, the ratio of computation to memory traffic or the kernel's arithmetic intensityArithmetic intensity can be calculated as FLOPs performed / bytes moved to and from global memory. increases dramatically, which can often shift the computation from being memory-bound to compute-bound.

The hardware of modern GPUs has increasingly specialized to support tiling. Many common GPUs have dedicated hardware units called Tensor Cores, with which warps can cooperatively issue matrix multiply-accumulate operations that execute asynchronously on whole tiles of data stored in shared memory, achieving extremely high throughput. On modern devices like the H100, this is further complemented by the Tensor Memory Accelerator (TMA), which enables asynchronous loading and storing tiles of data across global memory and shared memory and allows pipelining expensive data transfers with compute.

TwELL: a Sparse Format for Kernel Fusion

In this work, we introduce new sparse data formats designed specifically for integration with the feedforward blocks of LLMs during training and inference.

Traditional ELLPACK (ELL) sparse data format.
ELL. Visualization of the traditional ELLPACK (ELL) sparse data format with row-wide alignment.

A natural starting point for exploiting unstructured sparsity in the activations hRM×Nh \in \mathbb{R}^{M \times N} is the ELLPACK format (ELL) . Together with its variants, ELL is generally considered state-of-the-art for fast and efficient sparse matrix multiplication and has a long history of GPU implementations . A sparse matrix hh in the ELL format is stored using two padded matrices hvh_v and hIh_I of size M×NnzM\times N_{nz}. Here, the entries of hvh_v contain all the non-zero values from the corresponding row of hh, packed contiguously and right-padded, with the same entries from hIh_I containing their original column indices within hh. Modern variants of ELL also store the number of non-zero entries in each row in a separate vector hnzRMh_{nz}\in \mathbb{R}^{M}. This structure makes ELL very convenient to write kernels for sparse matrix multiplications like y=hWy = hW: to process an input row of hh, a CTA only needs to iterate across the packed non-zero values from hvh_v, while using hIh_I to load the corresponding rows of WW, and accumulate their contribution while skipping all zero entries entirely:

y[m,:]=j=1hnz[m]hv[m,j]W[hI[m,j],:].y[m, :] = \sum_{j=1}^{h_{nz}[m]} h_v[m, j] W[h_I[m, j], :].

As introduced in the previous Section, one recurring principle in designing optimized kernels on modern NVIDIA devices is to minimize global memory accesses. Thus, let us consider a gated feed-forward block where sparsity patterns are determined by the gate activations hg=xWgh_g = xW_g. In this setting, representing hgh_g with sparse formats such as ELL suffers from an immediate downside: constructing the format requires examining all elements in each row of hgh_g to identify, count, and align the non-zero values and their indices. Ideally, we would want to avoid launching a separate kernel to convert the activation, which would introduce additional global accesses and overheads by fusing the two operations. However, ELL's row-wise packing structure is fundamentally misaligned with the tiled execution of modern dense matmul kernels, preventing direct fusion without expensive synchronization across CTAs.

Our new TwELL (Tile-wise ELLPACK) sparse data format design specifically for fusion with matrix multiplication kernels.
TwELL. Our new TwELL (Tile-wise ELLPACK) sparse data format design specifically for fusion with matrix multiplication kernels.

To address these limitations, we introduce TwELL. Rather than packing data across whole rows, TwELL partitions the columns of hgh_g into horizontal tiles of size TT and stores the non-zero values and indices within each tile using a local ELL-style layout. This results in two packed matrices containing locally aligned values hvRM×N/Ch_v \in \mathbb{R}^{M\times N/C} and indices hIRM×N/Ch_I \in \mathbb{R}^{M\times N/C}, where CC is a compression factor chosen so that T/CT/C is larger than the maximum number of non-zeros in any tile to avoid storage overflow. Akin to modern ELL variants, in our implementation of TwELL, we also store an additional matrix with the number of non-zero elements hnzRM×N/Th_{nz}\in \mathbb{R}^{M\times \lceil N/T \rceil} with as many columns as total tiles. The key advantage of this design is that it inherently matches the computation structure of tiled matrix multiplication kernels. By choosing T=TnT = T_n, each CTA can directly construct the TwELL representation for its tile of hgh_g before the results are written back to global memory, eliminating the need for synchronization, extra kernel launches, and additional memory accesses.

Algorithm 2. A sampled CTA tile illustrates gate-projection with TwELL storage. The walkthrough computes one dense ReLU tile, packs its nonzero elements row by row into hvh_v, hIh_I, and hnzh_{nz} TwELL tiles, and then commits those TwELL outputs to global memory.

We implement two different kernels precisely fitting this design to perform fused dense-to-sparse matrix multiplication, which only differ in the storage structure of the TwELL activations. After obtaining and converting the output, the first kernel simply stores the tiles for hvh_v, hIh_I, and hnzh_{nz} within three separate output matrices. The second kernel instead packs the TwELL activations into a single compacted 32-bit matrix for better data locality, with the first entry of the output tile containing the number of non-zeros and each subsequent entry encoding the 16-bit non-zero value together with its corresponding 16-bit column index. Our kernels are written entirely in CUDA and combine other modern techniques such as persistency, pipelining, and advanced swizzling, making them even slightly faster than optimized dense-to-dense implementations in PyTorch and CuDNN.

Kernels to Efficiently Leverage LLM Sparsity

We leverage the sparse gate activations hgh_g in our optimized TwELL formats via a set of highly efficient CUDA kernels specialized for batched LLM inference and training.

For batched inference, we design a single CUDA kernel to perform the rest of the computation in the feedforward block, leveraging sparsity to efficiently fuse the up and down projections together. Each CTA in the kernel handles a separate row mm and loads the input values x[m,:]x[m, :] in register memory at the beginning of the computation. The fused matrix multiplications are executed by traversing over the sparse gate activations with two nested loops: the first one over the number of column tiles t=1,,N/Tt=1, \dots, \lceil N/T \rceil and the second one over the corresponding number of non-zeros in each tile c=1,,hnz[m,t]c=1, \dots, h_{nz}[m, t]. For each non-zero activation at index n=hI[m,c]n=h_I[m, c], the CTA collectively loads the nthn^{th} column of WuW_u and row of WdW_d to perform a dot product, followed by a scalar-vector product, and accumulates its results directly into the output y[m,:]y[m, :].

Algorithm 3. A sampled row CTA illustrates fused up-and-down projection from TwELL gate activations. The widget first shows which gate entries survive in TwELL, then advances step by step over the packed nonzeros: each surviving entry computes one selected hu[m,n]h_u[m,n] scalar and immediately uses it to accumulate the full output row y[m,:]y[m,:]. In the actual implementation, neither hgh_g nor huh_u is materialized.

This design avoids the materialization of the dense hidden activations hh and skips all unnecessary computation and global memory accesses, loading only the weight entries corresponding to active hidden dimensions. We structure our kernel by assigning a single warp of 32 threads to each CTA, maximizing concurrency across the grid. In turn, this provides improved cache locality together with fast data exchanges and accumulations via warp-shuffling instructions, allowing performance to scale naturally with increasing sparsity.

For training, we design a set of CUDA kernels to minimize the storage size of the sparse activations and lower the cost of memory traffic and gradient computation throughout the backward pass. To this end, our first kernel converts the gate activations hgh_g from TwELL to a lightweight hybrid representation by fusing together the locally aligned tiles in each row into a single globally aligned matrix with a tight cap on the maximum number of stored non-zero values per row hgs=(hvs,hIs,hnnzs)h^s_g=(h^s_v, h^s_I, h^s_{nnz}), while routing the small number of overflow cases into a dense backup matrix hgdh^d_g. Building on this representation, we implement additional CUDA kernels for optimized hybrid-to-dense and dense-to-hybrid matrix multiplications together with lightweight kernels for sparse gradient injection and transposition, scheduling different CTAs on CUDA Cores to handle hgsh^s_g or Tensor Cores to handle hgdh^d_g. We directly use these kernels to compute the remaining operations during the forward pass while storing the intermediate representations hh and huh_u in the same hybrid representations, and then recover the input and weight gradients x\nabla x, Wd\nabla W_d, Wu\nabla W_u, and Wg\nabla W_g entirely through efficient sparse and hybrid kernels, without any dense-to-dense matrix multiplication during backpropagation.

Algorithm 4. Training-time row compaction from TwELL into the hybrid representation. The widget processes one row at a time: it first sums the three TwELL tile counts, then either merges the row into one capped sparse hybrid row when the total fits within one tile, or reconstructs that row in a dense backup matrix when it overflows.

This design allows us to aggressively compress the sparse activations without becoming brittle to the highly non-uniform row-wise sparsity patterns observed during training. In contrast to inference, the training pipeline is organized around the full training step, where costs are dominated by backpropagation. In this setting, the hybrid representation purposefully foregoes the benefits of fused execution to effectively minimize memory traffic and recomputation, maximizing throughput across the entire backward pass.

Evaluating the Performance and Efficiency of Sparse LLMs

For our quantitative evaluation, we train LLMs at billion-parameter scales across different sparsity levels and model sizes and evaluate their performance and efficiency on powerful H100 GPUs.

Training cross-entropy loss plot.
Training curves. Cross-entropy loss during training across L1L_1 regularization strengths.
Bar plot of downstream task accuracy and activation sparsity across L1 regularization strengths.
Accuracy vs. sparsity. Downstream accuracy and activation sparsity across L1L_1 regularization strengths.

We start by evaluating the effect of L1 regularization on the performance and activations of a 1.5B parameter model after training, evaluating on seven popular downstream tasks . Consistent with prior results , we find the unregularized model is already inherently sparse without regularization, with fewer than 20% non-zero activations on average across layers, and that moderate L1 regularization can quickly push this fraction orders of magnitude lower without destabilizing training. Moreover, we find that up until L1=2×105L_1=2\times 10^{-5}, this increase in sparsity leaves average downstream task performance essentially unchanged, with only minor alteration of the final cross-entropy. Interestingly, we find that the sparsity patterns remain highly heterogeneous, with a small fraction of tokens still activating over 100 times more neurons than the average, suggesting that sparse models can adaptively reallocate capacity where it is most needed.

Bar plot of inference speedups and energy savings across L1 regularization strengths.
Inference efficiency. Inference speedups and energy savings from our sparse LLM inference kernel across L1L_1 regularization strengths.
Bar plot of training speedups and peak memory reduction across L1 regularization strengths.
Training efficiency. Training speedups and peak memory reduction from our sparse LLM training kernel across L1L_1 regularization strengths.

We contrast our performance results by analyzing the efficiency of our sparse 1.5B LLMs powered by our optimized kernels. First, we examine the average throughput and total energy savings recorded during batched inference with our fused kernels. Across all sparsity levels, we observe significant speedups ranging up to 30%, which are compounded by nearly 3% less GPU power draw above L1=3×105L_1=3\times 10^{-5}, resulting in even higher energy savings. Second, we examine the average throughput and peak memory reductions achieved during training with our specialized hybrid kernels. In line with batched inference, we record significant training speedups ranging up to 24% with sparser models, together with peak GPU memory reductions exceeding 24% even for the lowest considered sparsity level. Taken together, these results validate that our optimized kernels provide meaningful efficiency improvements across the full LLM lifecycle, turning sparsity into a potentially viable axis for future LLM design.

Model scale comparison table.
Scaling analysis. Results across model sizes and training budgets.

Additionally, we analyze how the performance and efficiency of sparse LLMs are affected by model size and training budget, while fixing the regularization level to L1=2×105L_1=2\times 10^{-5}. In line with our previous results, we find no significant performance degradation in downstream accuracy beyond random deviations across all examined model sizes and training budgets. Furthermore, we find that LLMs become increasingly effective at supporting sparsity at larger scales, resulting in a 38% lower non-zero activations when going from the 0.5B to the 2B model. As a result, this makes all the aforementioned throughput and memory benefits of our kernels grow, with our 2B sparse model processing tokens 20.5% faster during inference and fitting double the micro-batch size to train 21.9% faster.

Conclusion

In this collaboration between Sakana AI and NVIDIA, we leverage unstructured sparsity to lessen the computational burdens of modern LLMs. We introduce the TwELL data format and a set of optimized CUDA kernels tailored to fit the execution patterns of LLMs on modern GPUs. By training and evaluating sparse models at billion-parameter scales, we show that our kernels can translate the sparsity levels induced by mild L1 regularization into higher throughput, lower memory requirements, and reduced energy expenditure. Moreover, these benefits appear to grow with model scale, highlighting its potential relevance for scaling future model development. We refer to our technical report for additional details and results, demonstrating how these gains become even more pronounced on less capable NVIDIA hardware and analyzing activation patterns across different layers and input tokens to understand what triggers sparsity in different LLMs. We will release and open-source all of our kernels to facilitate future research and development into hardware-aware algorithms and efficient LLMs.

Citing our work

To cite our work or this blog post, you can use the following BibTeX entry:

@article{sakanaXnvidia2026sparser,
  title={Sparser, Faster, Lighter Transformer Language Models},
  author={Cetin, Edoardo and Peluchetti, Stefano and Castillo, Emilio and Naruse, Akira and Murakami, Mana and Jones, Llion},
  journal={arXiv preprint arXiv:2603.23198},
  year={2026}
}