# Hyena and Beyond · Hazy Research

*by* Phil Tadros

## New Models for Ultra-Long Sequences

The search for architectures supporting extremely long sequences continues! There have been some exciting developments on long sequence models and alternatives to Transformers. Anthropic released an API for a model supporting 100k context length, and Magic announced a model with 5 million context length (notably **not a Transformer**!). On the open-source front, the RWKV team collected their insights into a paper, and MosaicML got a Transformer to 65k context length with ALiBi positional encodings. We have also been hard at work, and would like to use this opportunity to share some of our research and views on efficient architectures for long sequences based on signal processing.

We will start with a short definition of Hyena; then, via a historical note on efficient attention variants, we will attempt to outline a set of essential design principles for long sequence layers.

## Part 1: What is Hyena?

The Hyena operator is a learnable nonlinear sequence processor. It can be used as a general component in the construction of deep sequence models, for example to mix information across the *space* (width) or *time* (sequence length) dimensions of the inputs.

The exact form of Hyena depends on its order. Here is the order $2$ case:

$$ y = x_2 \odot \big( h \ast (x_1 \odot v) \big), $$

where $v$, $x_1$, $x_2$ are projections of the input, $\odot$ denotes elementwise (gating) multiplication, and $h$ is a learnable long convolution filter.

The core idea is to repeatedly apply *fast linear operators* (see 4.8 in Golub and Van Loan), i.e. operators that can be evaluated in subquadratic time, to an input sequence $u \in \mathbb{R}^{L}$.

The general order $n$ case follows directly:

$$ z^1 = v, \qquad z^{i+1} = x^i \odot (h^i \ast z^i), \quad i = 1, \dots, n, \qquad y = z^{n+1}. $$

In Hyena and its predecessor H3, we argue that the above is a sufficient building block for large-scale language and vision models that rival Transformers in quality, while reducing the computational complexity to $\tilde{\mathcal{O}}(L)$.
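As a concrete illustration, here is a minimal NumPy sketch of the order-$2$ composition for a single channel: gate, FFT-based long convolution, gate. All names are ours, and the scalar "projections" and explicit filter stand in for Hyena's learned short-convolution projections and implicitly parametrized filters.

```python
import numpy as np

def causal_fft_conv(h, v):
    """Causal convolution of filter h with signal v via FFT, in O(L log L)."""
    L = v.shape[0]
    n = 2 * L  # zero-pad to avoid circular wrap-around
    out = np.fft.irfft(np.fft.rfft(h, n) * np.fft.rfft(v, n), n)
    return out[:L]

def hyena_order2(u, wv, wx1, wx2, h):
    """Toy order-2 Hyena on a single-channel sequence u of shape [L].

    v, x1, x2 are (here scalar) projections of the input; h is a long
    convolution filter. The real operator learns short-conv projections
    and an implicit h; this only illustrates the composition pattern.
    """
    v, x1, x2 = wv * u, wx1 * u, wx2 * u       # toy elementwise "projections"
    return x2 * causal_fft_conv(h, x1 * v)     # gate -> long conv -> gate
```

The FFT route is what makes the long convolution a *fast* linear operator: the same result as the direct sum, in subquadratic time.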

A few months after the initial release, we have made progress on dissecting these learning primitives into their essential components. Before diving deeper, let's take a step back. How did we get here?

## Our Goal: Attention

Attention is the fundamental operation of the Transformer architecture, which has driven significant progress in machine learning in recent years. Attention has some great properties: it can losslessly propagate information between any two entries in the sequence, regardless of their distance (global memory), and it can extract information from any single element (precision). It is in some sense the best of both worlds, having both good memory and perfect precision. However, it simply does not scale to ultra-long sequences, due to its quadratic computational complexity $O(L^2)$.

## Dense Attention

The family of self-attention methods can be defined as

$$ y_t = \sum_{t'=1}^{L} \frac{\phi(q_t, k_{t'})}{\sum_{t''=1}^{L} \phi(q_t, k_{t''})}\, v_{t'}. $$

With $\phi(a, b) = e^{ab}$, we recover standard softmax attention.

### Linear Attention for Linear Scaling

Many efficient alternatives addressing the quadratic scaling of attention have been proposed; we refer the reader to this excellent survey.

One family of methods employs a simple low-rank factorization trick, $\phi(q, k) = \sigma(q)\,\psi(k)$, which decouples the sums over the sequence from the output index:

$$ y_t = \frac{\sigma(q_t) \sum_{t'} \psi(k_{t'})\, v_{t'}}{\sigma(q_t) \sum_{t''} \psi(k_{t''})}. $$

That's it!
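To make the factorization concrete, here is a minimal single-channel sketch (causal variant; function names are ours): the quadratic definition, and the linear-time form in which the sums are shared across output indices via running (prefix) sums.

```python
import numpy as np

def dense_attention(q, k, v, phi):
    """O(L^2) reference: y_t = sum_s phi(q_t,k_s) v_s / sum_s phi(q_t,k_s),
    restricted here to the causal case (s <= t)."""
    L = len(q)
    y = np.empty(L)
    for t in range(L):
        w = np.array([phi(q[t], k[s]) for s in range(t + 1)])
        y[t] = (w * v[:t + 1]).sum() / w.sum()
    return y

def linear_attention(q, k, v, sigma, psi):
    """O(L): with phi(q,k) = sigma(q)*psi(k), the sums over past positions
    no longer depend on q_t, so prefix sums of psi(k)*v and psi(k) suffice."""
    num = np.cumsum(psi(k) * v)
    den = np.cumsum(psi(k))
    return sigma(q) * num / (sigma(q) * den)
```

With $\phi(a, b) = e^{a} e^{b}$ (a separable stand-in for the softmax kernel), the two functions agree to machine precision.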

### Dissecting Linear Attention

We can dissect a linear attention layer into three steps: reduction, normalization, and gating:

At its core, linear attention applies simple transformations to the input sequence: a (normalized) weighted combination, followed by elementwise gating. It uses two projections of the input, $q$ and $k$, to parametrize the two operations.

While these approaches partially address the global memory property, they do so via a constrained parametrization ($\psi$). Moreover, linear attention struggles to extract information with precision (e.g. composing an output from a single input far back in the past).

### The AFT Variant

For example, the *Attention-Free Transformer* (AFT) flavor of linear attention proposes $\psi(k_{t'}) = e^{k_{t'}}$, yielding

$$ y_t = \sigma(q_t) \odot \frac{\sum_{t'} e^{k_{t'}} \odot v_{t'}}{\sum_{t'} e^{k_{t'}}}, $$

where $\sigma$ is an elementwise nonlinear activation function. Note how the softmax is independent of the output index $t$, yielding overall linear time complexity.

The authors of AFT observed that additional tweaks to $\psi$ could further improve results; in particular, they propose introducing extra parameters into the exponential functions: $\psi(k_{t'})_t = e^{k_{t'} + w_{t,t'}}$.

Despite these modifications, AFT still lags dense attention in quality. What can we do next?

### RWKV for Precision

The RWKV team noticed that AFT and similar linear attention approaches cannot match quadratic attention on language modeling at scale. They proposed improvements to both the projections and the parametrization of $\psi$. Let $\mu, w, d$ denote additional learnable parameters.

RWKV makes the following key modifications to AFT:

- Incorporation of the previous element $u_{t-1}$ into the projections.
- Restructuring of the $\psi$ parametrization towards exponential decay (intuitively: it turns out to be a good idea to forget the past at some rate, and an extensive literature on recurrent networks supports this).

With the new parametrization, the mixing operation can be shown to be equivalent to a linear *state-space model* (SSM) with a single state (per channel!). The convolutional form of RWKV is readily obtained as

$$ y_t = \sigma(q_t) \odot \frac{\big(h \ast (\xi(k) \odot v)\big)_t}{\big(h \ast \xi(k)\big)_t}, \qquad h_t = e^{-w t}, $$

where we have highlighted the role of $\xi(k)_{t'} = e^{k_{t'}}$.
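Assuming the exponential-decay filter $h_t = e^{-wt}$, the equivalence between a one-state recurrence and the convolutional form of the reduction can be checked numerically. The sketch below (scalar channel, names ours) is purely illustrative and is not RWKV's actual implementation:

```python
import numpy as np

def decay_mix_recurrent(k, v, w):
    """AFT-style reduction with exponential decay, computed as a one-state
    linear recurrence per channel (RWKV-style sketch)."""
    a, b = 0.0, 0.0
    decay = np.exp(-w)
    num, den = [], []
    for t in range(len(k)):
        a = decay * a + np.exp(k[t]) * v[t]   # running numerator state
        b = decay * b + np.exp(k[t])          # running denominator state
        num.append(a); den.append(b)
    return np.array(num) / np.array(den)

def decay_mix_conv(k, v, w):
    """The same reduction, written as a causal convolution with h_t = e^{-wt}."""
    L = len(k)
    h = np.exp(-w * np.arange(L))             # exponential-decay filter
    ek = np.exp(k)
    num = np.array([(h[:t + 1][::-1] * ek[:t + 1] * v[:t + 1]).sum() for t in range(L)])
    den = np.array([(h[:t + 1][::-1] * ek[:t + 1]).sum() for t in range(L)])
    return num / den
```

The recurrent form is what allows constant-memory generation: only the two scalars $a, b$ per channel need to be carried forward.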

### Projections are often overlooked

But wait! The new projection is an extremely short convolution, with the elements of filters such as $h_q$ determined by the new parameters.

These modified projections allow the model to perform local comparisons and to implement induction heads.

## Building New Layers from Simple Design Principles

We have broken down linear attention-like approaches into four key elements: projections, reduction, normalization, and gating. We are now ready to link back to Hyena! Let us discuss the elements that define Hyena and, more broadly, “safari” models.

### Sifting for local changes

A key missing piece in earlier efficient attention variants was a way to detect local (high-frequency) changes in the sequence. This is what the simple modification of the RWKV projections addresses, and it can be generalized in various ways. Interestingly, a similar idea was proposed in the influential work on in-context learning via induction heads (the “smeared key” variant), as a way to enable even one-layer transformers to form induction heads.

In Hyena, we took the first step in generalizing the sifting operators in each projection, using short convolutional filters, which excel at filtering the input to extract local features: they can e.g. implement finite differences to estimate derivatives, compute local averages, and take local differences. It can be verified that without the modified projections, any attention-free long sequence model performs worse on any task requiring in-context learning.
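As a small illustration of what such projections can compute, here is a causal short convolution in NumPy (a sketch, not the Hyena implementation); different two-tap filters realize a token shift, a finite difference, and a local average:

```python
import numpy as np

def short_conv(u, h):
    """Causal convolution of a sequence u with a short filter h:
    y_t = sum_j h[j] * u[t-j], with zeros before the sequence start."""
    L, F = len(u), len(h)
    up = np.concatenate([np.zeros(F - 1), u])      # left zero-padding
    return np.array([(h[::-1] * up[t:t + F]).sum() for t in range(L)])

u = np.array([1., 4., 9., 16., 25.])
shift = short_conv(u, np.array([0., 1.]))    # copy previous token: [0, 1, 4, 9, 16]
diff  = short_conv(u, np.array([1., -1.]))   # finite difference:   [1, 3, 5, 7, 9]
avg   = short_conv(u, np.array([0.5, 0.5]))  # local average of neighbors
```

The finite-difference filter is exactly the kind of local comparison that helps a layer detect high-frequency changes in the input.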

### Memory over long sequences

The second core element is the reduction operator, which takes a history of values and aggregates them into an output. Luckily, there is no need to reinvent the wheel by tweaking the specific parametrization (the choices of $\psi$ in linear attention variants)! We can leverage an entire line of research on layers for long sequences, starting with seminal work on deep state-space models and follow-ups, which studies exactly this question. Notably, these methods generalize and improve the expressivity of previous approaches that use $w$ as reduction parameters. If the parametrization can be written in recurrent form, the entire model can generate sequences at constant memory cost, sidestepping the need to cache intermediate results.

Of course, there are trade-offs involved in different parametrizations, and in our experience these concern how quality scales with sequence length. In recent work, we attempt to quantify how much memory each reduction operator encodes, by measuring the minimal state dimension of the recurrence that approximates the convolution filter.

We found, for example, that the implicit Hyena parametrization leads to larger recurrences than the H3 parametrization leveraging diagonal state spaces, providing some insight into our scaling results.

This is exciting because (a) it enables compression of long convolution layers into compact recurrences (smaller memory footprint, higher throughput, coming soon!), and (b) it provides a quantitative path towards further improving the parametrization for memory over ultra-long sequences.

We are also happy to highlight some recent results from our S5 friends: they replaced the current implicit parametrization of long convolutions in Hyena with an expressive variant of state-space models, and achieved promising results. This is a very active area of research, and we fully expect to see improvements to the memory module that will further boost attention-free models!

### Architectural considerations

So far, we have discussed the internals of the $D=1$ (single-channel) case. A few additional architectural considerations arise when scaling up:

- Instead of having each operator act independently on each channel (space), we can form *heads*, inspired by multi-head attention.
- A Transformer block is defined as attention followed by a small MLP. It turns out we do not need MLPs in Hyena models, provided we account for the lost layers and floating point operations (FLOPs) when removing them. One option is e.g. to replace each MLP with another Hyena mixer, or to introduce additional FLOPs via heads.

### Training efficiency

In addition to understanding the core machine learning ideas behind Hyena, it is important to characterize its efficiency when mapped to modern hardware. Despite the improved $\tilde{O}(L)$ asymptotic complexity, careful implementation of the underlying fast convolution algorithms is required to realize wall-clock speedups over highly optimized attention kernels.

### Monarch projections?

Another key question is how the parameter reduction incurred by swapping dense matrices for sparse structured matrices in the projections, beyond the efficiency gains themselves, impacts model quality. This is exciting, because it points towards fully subquadratic architectures, in both width and sequence length. Here we provide some preliminary results and guidance for future work extending Hyena.

We consider the associative recall synthetic task, in which sequences consist of key-value pairs (think of a dictionary). The keys and values are all single-character digits and letters, and the mapping varies from sequence to sequence. The model is tasked with providing the value for a key, given the sequence.

Intuitively, increasing the vocabulary size $V$, i.e. the number of unique keys and values, correlates with increased task difficulty. Note that in the example above, $V = 10$.
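For reference, a toy generator for this kind of associative recall sequence might look like the following (a sketch under our own conventions, not the exact data pipeline used in the experiments):

```python
import random
import string

def associative_recall_example(vocab_size, num_pairs, seed=0):
    """Generate one associative-recall sequence: key-value pairs followed by
    a query key; the target is the value bound to that key in this sequence.
    Tokens are single-character digits and letters, as in the task described."""
    rng = random.Random(seed)
    tokens = (string.digits + string.ascii_lowercase)[:vocab_size]
    keys = rng.sample(tokens, num_pairs)            # distinct keys
    mapping = {k: rng.choice(tokens) for k in keys} # per-sequence binding
    seq = [tok for k in keys for tok in (k, mapping[k])]
    query = rng.choice(keys)
    return seq + [query], mapping[query]
```

Because the key-value binding is resampled for every sequence, the model must read it off from context rather than memorize it in its weights.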

| Vocabulary | Dense (2.2M params) | Sparse structured (1.8M params) |
|---|---|---|
| 10 | 100 | 100 |
| 50 | 100 | 80 |
| 100 | 20 | 10 |

We observe that the gap in quality between dense and sparse projections widens with task difficulty, and we are excited for future work to investigate this!

### Inference efficiency

Inference workloads are extremely important, especially when it comes to large language models. Indeed, a lot of effort from the open-source community has gone into optimizing generation workloads so that they require less memory, produce outputs faster, and can run on smaller devices. We study these questions in recent work that looks at distilling Hyena operators and other long convolutions into a recurrent form, using ideas from rational function approximation and model order reduction.

We are also able to take existing state-space approaches, which already have a recurrent form, and use our methods to prune redundant state dimensions, increasing throughput and reducing memory cost.

## What's next?

This is only a snapshot of what we have been doing. We keep scaling up model sizes and sequence lengths. In other recent work, we pretrain Hyenas at up to $1$ million sequence length on genomics (at a “character” level), outperforming Transformers and efficient Transformers on downstream tasks with much smaller models. Beyond these efforts, we are exploring various other parametrizations, much longer sequences, and further character-level training. We continue to explore questions related to hardware efficiency and new applications of longer context lengths.

As always, we are most excited about finding deeper connections between our methods and classical signal processing and structured linear algebra.

### Appendix: Hyena as (weak) time variance

Consider a simple one-dimensional, discrete, weakly time-varying linear system, in which only the input and output coefficients vary in time:

$$ x_{t+1} = a\, x_t + b_t u_t, \qquad y_t = c_t x_t. $$

The closed-form (input-to-output) solution reads (see this post, or the classical book “Linear System Theory and Design” for a step-by-step reference in the time-invariant case):

$$ y_t = c_t \sum_{t'=0}^{t-1} a^{\,t-1-t'}\, b_{t'}\, u_{t'}. $$

What happened? If you look at the above, you might see something familiar: we ended up with the gate, long convolution, gate decomposition of order-2 Hyena operators. The main differences are of course due to the parametrization of each module in this input-output map. By using implicit parametrizations (for the projections and for the long convolutions), Hyena generalizes the above system.
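Under the assumed system above ($x_{t+1} = a x_t + b_t u_t$, $y_t = c_t x_t$, with $x_0 = 0$), this gate, convolution, gate reading can be verified numerically with a short sketch (names ours):

```python
import numpy as np

def wtv_recurrence(a, b, c, u):
    """Step the weakly time-varying system x_{t+1} = a x_t + b_t u_t, y_t = c_t x_t."""
    x, y = 0.0, []
    for t in range(len(u)):
        y.append(c[t] * x)
        x = a * x + b[t] * u[t]
    return np.array(y)

def wtv_gate_conv_gate(a, b, c, u):
    """Closed form y_t = c_t * sum_{s<t} a^{t-1-s} b_s u_s:
    gate (b), causal convolution with h_j = a^j, gate (c)."""
    L = len(u)
    g = b * u                                              # first gate
    h = a ** np.arange(L)                                  # "long convolution" filter
    conv = np.array([(h[:t][::-1] * g[:t]).sum() for t in range(L)])
    return c * conv                                        # second gate
```

The two paths agree exactly, which is the sense in which the order-2 composition is the input-output map of such a system.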

With longer recurrences (more gates and long convolutions) we are, in essence, using chains of these systems. And there is something fundamental about operators that can be decomposed as chains of diagonal and circulant matrices.

## Acknowledgments

Thanks to all the readers who provided feedback on this post and release: Avanika Narayan, Hermann Kumbong, Michael Zhang, Sabri Eyuboglu, Eric Nguyen, Krista Opsahl-Ong, David W. Romero.