Sparser, Faster, Lighter Transformer Language Models

tl;dr

In collaboration with NVIDIA, we introduce new sparse data structures and GPU kernels to leverage unstructured sparsity for efficient inference and training of LLMs. This work will be presented at ICML 2026.

Summary

Modern large language models (LLMs) are powerful. They can write code, reason through complex problems, and synthesize vast amounts of technical knowledge. However, providing such capabilities at scale comes with tremendous real-world costs.

A large part of these costs comes from the feedforward layers, a ubiquitous component of LLMs since their inception. Yet, inside these layers, an interesting phenomenon can be observed: For any given token, only a small fraction of the hidden activations actually matter, with the rest effectively approximating zero and wasting computation. With ReLU and L1 regularization, this sparsity can be even made to exceed 95% with little to no impact on downstream performance .

The difficulties of achieving this objective lie in the hardware. Modern NVIDIA GPUs are extraordinarily good at homogeneous workloads, with specialized units dedicated to dense matrix multiplications. Traditional algorithms for efficiently leveraging unstructured sparsity clearly do not fit these assumptions, and introduce non-trivial constant overheads and bookkeeping that cancel out their theoretical savings.

This work, a collaboration between Sakana AI and NVIDIA, is about resolving this paradox:

To substantiate these gains, we study sparse LLMs at billion-parameter scales and show that mild L1 regularization can induce high levels of sparsity after training with negligible impact on downstream performance. By leveraging our kernels, these sparsity levels translate into over 20% speedups for both batched inference and training on H100 GPUs, while also cutting energy consumption and memory requirements.

Transformers and Feedforward Layers

At a high level, the "body" of modern transformers is composed of a repeated stack of just two components: an attention block (linear or quadratic), followed by a feedforward block. While attention lets tokens communicate with each other, the feedforward block processes each input token independently, performing multiple matrix multiplications with a set of large learned weight matrices.

Formally, a modern feedforward block takes as input a matrix

x \in \mathbb{R}^{M \times K}

, where

M

is the total number of tokens (number of batched sequences

\times

sequence length), and

K

is the token embedding size. The input

x

is then multiplied by three different weight matrices to first expand this representation into a much larger hidden space of size

N \gg K

, apply a simple non-linearity, and then project it back down for the next block. Formally, this computation looks as follows:

h_u = x W_u, \quad h_g = \sigma(x W_g), \quad h = h_u \odot h_g, \quad y = h W_d.

Here,

W_u, W_g \in \mathbb{R}^{K \times N}

are commonly referred to as the weight matrices for the "up" and "gate" projections, and

W_d \in \mathbb{R}^{N \times K}

as the weight matrix for the "down" projection. The non-linearity

\sigma

, such as SiLU or ReLU, is what introduces sparsity in the gate activations

h_g

, which is carried over to the hidden activations

h

following the elementwise product

\odot

. As the hidden dimension

N

is typically much larger than

K

(often 4x or more), these layers can often be responsible for the majority of both parameters and FLOPs in modern LLMs.

In this overparameterized regime, prior work has shown that pretrained transformers with ReLU activations can already exhibit high levels of unstructured sparsity in their feedforward blocks , with less than 5-10% non-zero elements in their hidden activations

h

. Other works on SiLU-based LLMs have shown that a substantial fraction of feedforward activations can also be set to zero with little or no additional training, either through training-free thresholding or lightweight finetuning . In this work, we study how different levels of sparsity affect both performance and efficiency by combining ReLU activations with a simple auxiliary L1 regularization loss on the hidden activations, weighted by a coefficient

L_1

and averaged across all neurons:

GPUs, Memory, and Tiling

To understand the challenges of leveraging sparsity to accelerate matrix multiplication, we provide a brief primer on how modern GPUs execute programs and the hardware characteristics.

Whenever executing any code on an NVIDIA GPU using PyTorch or any other library, this launches a GPU kernel, a massively parallel program executed across many threads on the device. The kernel's computation is first distributed across a grid of many independent units of work called cooperative thread arrays (CTAs). Each CTA is scheduled on a single Streaming Multiprocessor (SM) and consists of many threads working together to carry out the kernel operations on a set portion of the input data. This design allows the same optimized kernel, written from the perspective of the threads in a single CTA, to be easily ported to different input sizes by simply scaling the number of CTAs in the grid. The threads in the CTA are not entirely independent of each other but are grouped into units of 32 called warps, which execute the same instructions at each clock cycle on different inputs. All these constraints help reduce the cost and chip-space dedicated to each execution unit and enable GPUs to achieve massive parallelism: thousands of threads are scheduled across many CTAs, all performing arithmetic and memory operations concurrently.

However, not all operations are equal. As theoretical FLOPs on modern NVIDIA devices have risen dramatically in recent years, the bottleneck of many common kernels is often memory bandwidth.

Modern GPUs have a hierarchical memory system with three distinct levels. At the bottom level, we have a global memory available to all threads running on the GPU and residing separately from the compute units on a stack of high-bandwidth memory (HBM). While global memory can be large (80GB for an H100 GPU), it comes with high latency on the order of ~500 cycles and limited per-device bandwidth. Surrounding the compute units themselves, on the second level, we have on-chip shared memory. Shared memory is much faster, with latencies closer to ~10–20 cycles and is private to each CTA. At the top level, we have on-chip register memory which can be accessed in a single cycle and is private to each thread. Moving up the hierarchy, each level becomes over an order of magnitude faster, but correspondingly smaller due to the physical constraints of on-chip storage. For instance, while an H100 GPU provides 80GB of HBM, each CTA is limited to at most 228KB of shared memory, and each thread to only a few kilobytes of register storage.

Thus, designing efficient kernels requires minimizing accesses to global memory. This is directly reflected in the design of modern matrix multiplication kernels. Suppose we want to compute

h = x W

, where

x \in \mathbb{R}^{M \times K}

and

W \in \mathbb{R}^{K \times N}

. Each output element in this operation can be seen as an independent dot product:

An intuitive but suboptimal approach to implement this operation would distribute work across threads at the granularity of individual output elements. For instance, each thread could be assigned a single output element

h[m, n]

, load the row

m

x

and column

n

W

from global memory, perform the dot product, and store the results. The main issue with this approach is that it requires two separate loads for

x[m, k]

and

W[k, n]

before each partial accumulation. As a result, the same values of

x

and

W

are redundantly reloaded across many different threads, leading to a large number of expensive global memory transactions that dominate execution time.

Modern GPU kernels avoid this by using tiling. Instead of each thread operating on individual output elements, computation is reorganized by having the threads in each CTA cooperatively compute a separate tile of the output of shape

T_m \times T_n

. With this approach, each CTA iterates over the reduction dimension

K

in chunks of size

T_k

, and at each step all its threads cooperatively load a tile of

x

of shape

T_m \times T_k

and a tile of

W

of shape

T_k \times T_n

from global memory into shared memory.

Once both input tiles are loaded in fast shared memory, the threads in the CTA can reuse them to perform all necessary multiply-accumulate operations, updating a full

T_m \times T_n

tile of the output and storing it back at the end:

This reorganization allows amortizing the cost of each global memory load over many arithmetic operations: each element of

x

is reused across

T_n

output columns, while each element of

W

is reused across

T_m

output rows. As a result, the ratio of computation to memory traffic or the kernel's arithmetic intensity increases dramatically, which can often shift the computation from being memory-bound to compute-bound.

The hardware of modern GPUs has increasingly specialized to support tiling. Many common GPUs have dedicated hardware units called Tensor Cores, with which warps can cooperatively issue matrix multiply-accumulate operations that execute asynchronously on whole tiles of data stored in shared memory, achieving extremely high throughput. On modern devices like the H100, this is further complemented by the Tensor Memory Accelerator (TMA), which enables asynchronous loading and storing tiles of data across global memory and shared memory and allows pipelining expensive data transfers with compute.

TwELL: a Sparse Format for Kernel Fusion

In this work, we introduce new sparse data formats designed specifically for integration with the feedforward blocks of LLMs during training and inference.

A natural starting point for exploiting unstructured sparsity in the activations

h \in \mathbb{R}^{M \times N}

is the ELLPACK format (ELL) . Together with its variants, ELL is generally considered state-of-the-art for fast and efficient sparse matrix multiplication and has a long history of GPU implementations . A sparse matrix

h

in the ELL format is stored using two padded matrices

h_v

and

h_I

of size

M\times N_{nz}

. Here, the entries of

h_v

contain all the non-zero values from the corresponding row of

h

, packed contiguously and right-padded, with the same entries from

h_I

containing their original column indices within

h

. Modern variants of ELL also store the number of non-zero entries in each row in a separate vector

h_{nz}\in \mathbb{R}^{M}

. This structure makes ELL very convenient to write kernels for sparse matrix multiplications like

y = hW

: to process an input row of

h

, a CTA only needs to iterate across the packed non-zero values from

h_v

, while using

h_I

to load the corresponding rows of

W

, and accumulate their contribution while skipping all zero entries entirely:

As introduced in the previous Section, one recurring principle in designing optimized kernels on modern NVIDIA devices is to minimize global memory accesses. Thus, let us consider a gated feed-forward block where sparsity patterns are determined by the gate activations

h_g = xW_g

. In this setting, representing

h_g

with sparse formats such as ELL suffers from an immediate downside: constructing the format requires examining all elements in each row of

h_g

to identify, count, and align the non-zero values and their indices. Ideally, we would want to avoid launching a separate kernel to convert the activation, which would introduce additional global accesses and overheads by fusing the two operations. However, ELL's row-wise packing structure is fundamentally misaligned with the tiled execution of modern dense matmul kernels, preventing direct fusion without expensive synchronization across CTAs.

To address these limitations, we introduce TwELL. Rather than packing data across whole rows, TwELL partitions the columns of

h_g

into horizontal tiles of size

T

and stores the non-zero values and indices within each tile using a local ELL-style layout. This results in two packed matrices containing locally aligned values

h_v \in \mathbb{R}^{M\times N/C}

and indices

h_I \in \mathbb{R}^{M\times N/C}

, where

C

is a compression factor chosen so that

T/C

is larger than the maximum number of non-zeros in any tile to avoid storage overflow. Akin to modern ELL variants, in our implementation of TwELL, we also store an additional matrix with the number of non-zero elements

h_{nz}\in \mathbb{R}^{M\times \lceil N/T \rceil}

with as many columns as total tiles. The key advantage of this design is that it inherently matches the computation structure of tiled matrix multiplication kernels. By choosing

T = T_n

, each CTA can directly construct the TwELL representation for its tile of

h_g

before the results are written back to global memory, eliminating the need for synchronization, extra kernel launches, and additional memory accesses.

We implement two different kernels precisely fitting this design to perform fused dense-to-sparse matrix multiplication, which only differ in the storage structure of the TwELL activations. After obtaining and converting the output, the first kernel simply stores the tiles for

h_v

h_I

, and

h_{nz}

within three separate output matrices. The second kernel instead packs the TwELL activations into a single compacted 32-bit matrix for better data locality, with the first entry of the output tile containing the number of non-zeros and each subsequent entry encoding the 16-bit non-zero value together with its corresponding 16-bit column index. Our kernels are written entirely in CUDA and combine other modern techniques such as persistency, pipelining, and advanced swizzling, making them even slightly faster than optimized dense-to-dense implementations in PyTorch and CuDNN.

Kernels to Efficiently Leverage LLM Sparsity

We leverage the sparse gate activations

h_g

in our optimized TwELL formats via a set of highly efficient CUDA kernels specialized for batched LLM inference and training.

For batched inference, we design a single CUDA kernel to perform the rest of the computation in the feedforward block, leveraging sparsity to efficiently fuse the up and down projections together. Each CTA in the kernel handles a separate row

m

and loads the input values

x[m, :]

in register memory at the beginning of the computation. The fused matrix multiplications are executed by traversing over the sparse gate activations with two nested loops: the first one over the number of column tiles

t=1, \dots, \lceil N/T \rceil

and the second one over the corresponding number of non-zeros in each tile

c=1, \dots, h_{nz}[m, t]

. For each non-zero activation at index

n=h_I[m, c]

, the CTA collectively loads the

n^{th}

column of

W_u

and row of

W_d

to perform a dot product, followed by a scalar-vector product, and accumulates its results directly into the output

y[m, :]

This design avoids the materialization of the dense hidden activations

h

and skips all unnecessary computation and global memory accesses, loading only the weight entries corresponding to active hidden dimensions. We structure our kernel by assigning a single warp of 32 threads to each CTA, maximizing concurrency across the grid. In turn, this provides improved cache locality together with fast data exchanges and accumulations via warp-shuffling instructions, allowing performance to scale naturally with increasing sparsity.

For training, we design a set of CUDA kernels to minimize the storage size of the sparse activations and lower the cost of memory traffic and gradient computation throughout the backward pass. To this end, our first kernel converts the gate activations

h_g

from TwELL to a lightweight hybrid representation by fusing together the locally aligned tiles in each row into a single globally aligned matrix with a tight cap on the maximum number of stored non-zero values per row

h^s_g=(h^s_v, h^s_I, h^s_{nnz})

, while routing the small number of overflow cases into a dense backup matrix

h^d_g

. Building on this representation, we implement additional CUDA kernels for optimized hybrid-to-dense and dense-to-hybrid matrix multiplications together with lightweight kernels for sparse gradient injection and transposition, scheduling different CTAs on CUDA Cores to handle

h^s_g

or Tensor Cores to handle

h^d_g

. We directly use these kernels to compute the remaining operations during the forward pass while storing the intermediate representations

h

and

h_u

in the same hybrid representations, and then recover the input and weight gradients

\nabla x

\nabla W_d

\nabla W_u

, and

\nabla W_g

entirely through efficient sparse and hybrid kernels, without any dense-to-dense matrix multiplication during backpropagation.

This design allows us to aggressively compress the sparse activations without becoming brittle to the highly non-uniform row-wise sparsity patterns observed during training. In contrast to inference, the training pipeline is organized around the full training step, where costs are dominated by backpropagation. In this setting, the hybrid representation purposefully foregoes the benefits of fused execution to effectively minimize memory traffic and recomputation, maximizing throughput across the entire backward pass.

Evaluating the Performance and Efficiency of Sparse LLMs

For our quantitative evaluation, we train LLMs at billion-parameter scales across different sparsity levels and model sizes and evaluate their performance and efficiency on powerful H100 GPUs.

Training cross-entropy loss plot. — **Training curves.** Cross-entropy loss during training across $L_1$ regularization strengths.

Bar plot of downstream task accuracy and activation sparsity across L1 regularization strengths. — **Accuracy vs. sparsity.** Downstream accuracy and activation sparsity across $L_1$ regularization strengths.

We start by evaluating the effect of L1 regularization on the performance and activations of a 1.5B parameter model after training, evaluating on seven popular downstream tasks . Consistent with prior results , we find the unregularized model is already inherently sparse without regularization, with fewer than 20% non-zero activations on average across layers, and that moderate L1 regularization can quickly push this fraction orders of magnitude lower without destabilizing training. Moreover, we find that up until

L_1=2\times 10^{-5}

, this increase in sparsity leaves average downstream task performance essentially unchanged, with only minor alteration of the final cross-entropy. Interestingly, we find that the sparsity patterns remain highly heterogeneous, with a small fraction of tokens still activating over 100 times more neurons than the average, suggesting that sparse models can adaptively reallocate capacity where it is most needed.

Bar plot of inference speedups and energy savings across L1 regularization strengths. — **Inference efficiency.** Inference speedups and energy savings from our sparse LLM inference kernel across $L_1$ regularization strengths.

Bar plot of training speedups and peak memory reduction across L1 regularization strengths. — **Training efficiency.** Training speedups and peak memory reduction from our sparse LLM training kernel across $L_1$ regularization strengths.

We contrast our performance results by analyzing the efficiency of our sparse 1.5B LLMs powered by our optimized kernels. First, we examine the average throughput and total energy savings recorded during batched inference with our fused kernels. Across all sparsity levels, we observe significant speedups ranging up to 30%, which are compounded by nearly 3% less GPU power draw above

L_1=3\times 10^{-5}

, resulting in even higher energy savings. Second, we examine the average throughput and peak memory reductions achieved during training with our specialized hybrid kernels. In line with batched inference, we record significant training speedups ranging up to 24% with sparser models, together with peak GPU memory reductions exceeding 24% even for the lowest considered sparsity level. Taken together, these results validate that our optimized kernels provide meaningful efficiency improvements across the full LLM lifecycle, turning sparsity into a potentially viable axis for future LLM design.

Additionally, we analyze how the performance and efficiency of sparse LLMs are affected by model size and training budget, while fixing the regularization level to

L_1=2\times 10^{-5}

. In line with our previous results, we find no significant performance degradation in downstream accuracy beyond random deviations across all examined model sizes and training budgets. Furthermore, we find that LLMs become increasingly effective at supporting sparsity at larger scales, resulting in a 38% lower non-zero activations when going from the 0.5B to the 2B model. As a result, this makes all the aforementioned throughput and memory benefits of our kernels grow, with our 2B sparse model processing tokens 20.5% faster during inference and fitting double the micro-batch size to train 21.9% faster.

Conclusion

In this collaboration between Sakana AI and NVIDIA, we leverage unstructured sparsity to lessen the computational burdens of modern LLMs. We introduce the TwELL data format and a set of optimized CUDA kernels tailored to fit the execution patterns of LLMs on modern GPUs. By training and evaluating sparse models at billion-parameter scales, we show that our kernels can translate the sparsity levels induced by mild L1 regularization into higher throughput, lower memory requirements, and reduced energy expenditure. Moreover, these benefits appear to grow with model scale, highlighting its potential relevance for scaling future model development. We refer to our technical report for additional details and results, demonstrating how these gains become even more pronounced on less capable NVIDIA hardware and analyzing activation patterns across different layers and input tokens to understand what triggers sparsity in different LLMs. We will release and open-source all of our kernels to facilitate future research and development into hardware-aware algorithms and efficient LLMs.

Sparser, Faster, Lighter Transformer Language Models

Summary

Transformers and Feedforward Layers

GPUs, Memory, and Tiling

TwELL: a Sparse Format for Kernel Fusion

Kernels to Efficiently Leverage LLM Sparsity

Evaluating the Performance and Efficiency of Sparse LLMs

Conclusion

Citing our work