# Hyena and Beyond · Hazy Research

*by* Phil Tadros

## New Models for Ultra-Long Sequences

The search for architectures supporting extremely long sequences continues! There have been some exciting developments on long sequence models and alternatives to Transformers. Anthropic released an API for a model supporting 100k context length, and Magic announced a model with 5 million context length (notably **not a Transformer**!). On the open-source front, the RWKV team collected their insights into a paper, and MosaicML got a Transformer to 65k context length with ALiBi positional encodings. We have also been hard at work, and would like to use this opportunity to share some of our research and views on efficient architectures for long sequences based on signal processing.

We will start with a short definition of Hyena; then, via a historical note on efficient attention variants, we will attempt to outline a set of essential design principles for long sequence layers.

## Part 1: What is Hyena?

The Hyena operator is a learnable nonlinear sequence processor. It can be used as a general component in the construction of deep sequence models, for example to mix information across the *space* (width) or *time* (sequence length) dimensions of the inputs.

The exact form of Hyena depends on its order. Here is the order $2$ case:

$$ y = x_2 \odot \big( h \ast (x_1 \odot v) \big), $$

where $v$, $x_1$, $x_2$ are projections of the input, $\odot$ denotes elementwise (gating) multiplication, and $h$ is a learnable long convolution filter.

The core idea is to repeatedly apply *fast linear operators* (see 4.8 in Golub and Van Loan), i.e. operators that can be evaluated in subquadratic time, to an input sequence $u \in \mathbb{R}^{L}$.

The general order $n$ case follows directly:

$$ z^1 = v, \qquad z^{i+1} = x^i \odot (h^i \ast z^i), \quad i = 1, \dots, n, \qquad y = z^{n+1}. $$

In Hyena and its predecessor H3, we argue that the above is a sufficient building block for large-scale language and vision models that rival Transformers in quality, while reducing the computational complexity to $\tilde{\mathcal{O}}(L)$.
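As a concrete illustration, here is a minimal NumPy sketch of the order-$2$ composition for a single channel: gate, FFT-based long convolution, gate. All names are ours, and the scalar "projections" and explicit filter stand in for Hyena's learned short-convolution projections and implicitly parametrized filters.

```python
import numpy as np

def causal_fft_conv(h, v):
    """Causal convolution of filter h with signal v via FFT, in O(L log L)."""
    L = v.shape[0]
    n = 2 * L  # zero-pad to avoid circular wrap-around
    out = np.fft.irfft(np.fft.rfft(h, n) * np.fft.rfft(v, n), n)
    return out[:L]

def hyena_order2(u, wv, wx1, wx2, h):
    """Toy order-2 Hyena on a single-channel sequence u of shape [L].

    v, x1, x2 are (here scalar) projections of the input; h is a long
    convolution filter. The real operator learns short-conv projections
    and an implicit h; this only illustrates the composition pattern.
    """
    v, x1, x2 = wv * u, wx1 * u, wx2 * u       # toy elementwise "projections"
    return x2 * causal_fft_conv(h, x1 * v)     # gate -> long conv -> gate
```

The FFT route is what makes the long convolution a *fast* linear operator: the same result as the direct sum, in subquadratic time.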

A few months after the initial release, we have made progress on dissecting these learning primitives into their essential components. Before diving deeper, let's take a step back. How did we get here?

## Our Goal: Attention

Attention is the fundamental operation of the Transformer architecture, which has driven significant progress in machine learning in recent years. Attention has some great properties: it can losslessly propagate information between any two entries in the sequence, regardless of their distance (global memory), and it can extract information from any single element (precision). It is in some sense the best of both worlds, having both good memory and perfect precision. However, it simply does not scale to ultra-long sequences, due to its quadratic computational complexity $O(L^2)$.

## Dense Attention

The family of self-attention methods can be defined as

$$ y_t = \sum_{t'=1}^{L} \frac{\phi(q_t, k_{t'})}{\sum_{t''=1}^{L} \phi(q_t, k_{t''})}\, v_{t'}. $$

With $\phi(a, b) = e^{ab}$, we recover standard softmax attention.

### Linear Attention for Linear Scaling

Many efficient alternatives addressing the quadratic scaling of attention have been proposed; we refer the reader to this excellent survey.

One family of methods employs a simple low-rank factorization trick, $\phi(q, k) = \sigma(q)\,\psi(k)$, which decouples the sums over the sequence from the output index:

$$ y_t = \frac{\sigma(q_t) \sum_{t'} \psi(k_{t'})\, v_{t'}}{\sigma(q_t) \sum_{t''} \psi(k_{t''})}. $$

That's it!
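To make the factorization concrete, here is a minimal single-channel sketch (causal variant; function names are ours): the quadratic definition, and the linear-time form in which the sums are shared across output indices via running (prefix) sums.

```python
import numpy as np

def dense_attention(q, k, v, phi):
    """O(L^2) reference: y_t = sum_s phi(q_t,k_s) v_s / sum_s phi(q_t,k_s),
    restricted here to the causal case (s <= t)."""
    L = len(q)
    y = np.empty(L)
    for t in range(L):
        w = np.array([phi(q[t], k[s]) for s in range(t + 1)])
        y[t] = (w * v[:t + 1]).sum() / w.sum()
    return y

def linear_attention(q, k, v, sigma, psi):
    """O(L): with phi(q,k) = sigma(q)*psi(k), the sums over past positions
    no longer depend on q_t, so prefix sums of psi(k)*v and psi(k) suffice."""
    num = np.cumsum(psi(k) * v)
    den = np.cumsum(psi(k))
    return sigma(q) * num / (sigma(q) * den)
```

With $\phi(a, b) = e^{a} e^{b}$ (a separable stand-in for the softmax kernel), the two functions agree to machine precision.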

### Dissecting Linear Attention

We can dissect a linear attention layer into three steps: reduction, normalization, and gating:

At its core, linear attention applies simple transformations to the input sequence: a (normalized) weighted combination, followed by elementwise gating. It uses two projections of the input, $q$ and $k$, to parametrize the two operations.

While these approaches partially address the global memory property, they do so via a constrained parametrization ($\psi$). Moreover, linear attention struggles to extract information with precision (e.g. composing an output from a single input far back in the past).

### The AFT Variant

For example, the *Attention-Free Transformer* (AFT) flavor of linear attention proposes $\psi(k_{t'}) = e^{k_{t'}}$, yielding

$$ y_t = \sigma(q_t) \odot \frac{\sum_{t'} e^{k_{t'}} \odot v_{t'}}{\sum_{t'} e^{k_{t'}}}, $$

where $\sigma$ is an elementwise nonlinear activation function. Note how the softmax is independent of the output index $t$, yielding overall linear time complexity.

The authors of AFT observed that additional tweaks to $\psi$ could further improve results; in particular, they propose introducing extra parameters into the exponential functions: $\psi(k_{t'})_t = e^{k_{t'} + w_{t,t'}}$.

Despite these modifications, AFT still lags dense attention in quality. What can we do next?

### RWKV for Precision

The RWKV team noticed that AFT and similar linear attention approaches cannot match quadratic attention on language modeling at scale. They proposed improvements to both the projections and the parametrization of $\psi$. Let $\mu, w, d$ denote additional learnable parameters.

RWKV makes the following key modifications to AFT:

- Incorporation of the previous element $u_{t-1}$ into the projections.
- Restructuring of the $\psi$ parametrization towards exponential decay (intuitively: it turns out to be a good idea to forget the past at some rate, and an extensive literature on recurrent networks supports this).

With the new parametrization, the mixing operation can be shown to be equivalent to a linear *state-space model* (SSM) with a single state (per channel!). The convolutional form of RWKV is readily obtained as

$$ y_t = \sigma(q_t) \odot \frac{\big(h \ast (\xi(k) \odot v)\big)_t}{\big(h \ast \xi(k)\big)_t}, \qquad h_t = e^{-w t}, $$

where we have highlighted the role of $\xi(k)_{t'} = e^{k_{t'}}$.
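Assuming the exponential-decay filter $h_t = e^{-wt}$, the equivalence between a one-state recurrence and the convolutional form of the reduction can be checked numerically. The sketch below (scalar channel, names ours) is purely illustrative and is not RWKV's actual implementation:

```python
import numpy as np

def decay_mix_recurrent(k, v, w):
    """AFT-style reduction with exponential decay, computed as a one-state
    linear recurrence per channel (RWKV-style sketch)."""
    a, b = 0.0, 0.0
    decay = np.exp(-w)
    num, den = [], []
    for t in range(len(k)):
        a = decay * a + np.exp(k[t]) * v[t]   # running numerator state
        b = decay * b + np.exp(k[t])          # running denominator state
        num.append(a); den.append(b)
    return np.array(num) / np.array(den)

def decay_mix_conv(k, v, w):
    """The same reduction, written as a causal convolution with h_t = e^{-wt}."""
    L = len(k)
    h = np.exp(-w * np.arange(L))             # exponential-decay filter
    ek = np.exp(k)
    num = np.array([(h[:t + 1][::-1] * ek[:t + 1] * v[:t + 1]).sum() for t in range(L)])
    den = np.array([(h[:t + 1][::-1] * ek[:t + 1]).sum() for t in range(L)])
    return num / den
```

The recurrent form is what allows constant-memory generation: only the two scalars $a, b$ per channel need to be carried forward.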

### Projections are often overlooked

But wait! The new projection is an extremely short convolution, with the elements of filters such as $h_q$ determined by the new parameters.

These modified projections allow the model to perform local comparisons and to implement induction heads.

## Building New Layers from Simple Design Principles

We have broken down linear attention-like approaches into four key elements: projections, reduction, normalization, and gating. We are now ready to link back to Hyena! Let us discuss the elements that define Hyena and, more broadly, “safari” models.

### Sifting for local changes

A key missing piece in earlier efficient attention variants was a way to detect local (high-frequency) changes in the sequence. This is what the simple modification of the RWKV projections addresses, and it can be generalized in various ways. Interestingly, a similar idea was proposed in the influential work on in-context learning via induction heads (the “smeared key” variant), as a way to enable even one-layer transformers to form induction heads.

In Hyena, we took the first step in generalizing the sifting operators in each projection, using short convolutional filters, which excel at filtering the input to extract local features: they can e.g. implement finite differences to estimate derivatives, compute local averages, and take local differences. It can be verified that without the modified projections, any attention-free long sequence model performs worse on any task requiring in-context learning.
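As a small illustration of what such projections can compute, here is a causal short convolution in NumPy (a sketch, not the Hyena implementation); different two-tap filters realize a token shift, a finite difference, and a local average:

```python
import numpy as np

def short_conv(u, h):
    """Causal convolution of a sequence u with a short filter h:
    y_t = sum_j h[j] * u[t-j], with zeros before the sequence start."""
    L, F = len(u), len(h)
    up = np.concatenate([np.zeros(F - 1), u])      # left zero-padding
    return np.array([(h[::-1] * up[t:t + F]).sum() for t in range(L)])

u = np.array([1., 4., 9., 16., 25.])
shift = short_conv(u, np.array([0., 1.]))    # copy previous token: [0, 1, 4, 9, 16]
diff  = short_conv(u, np.array([1., -1.]))   # finite difference:   [1, 3, 5, 7, 9]
avg   = short_conv(u, np.array([0.5, 0.5]))  # local average of neighbors
```

The finite-difference filter is exactly the kind of local comparison that helps a layer detect high-frequency changes in the input.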

### Memory over long sequences

The second core element is the reduction operator, which takes a history of values and aggregates them into an output. Luckily, there is no need to reinvent the wheel by tweaking the specific parametrization (the choices of $\psi$ in linear attention variants)! We can leverage an entire line of research on layers for long sequences, starting with seminal work on deep state-space models and follow-ups, which studies exactly this question. Notably, these methods generalize and improve the expressivity of previous approaches that use $w$ as reduction parameters. If the parametrization can be written in recurrent form, the entire model can generate sequences at constant memory cost, sidestepping the need to cache intermediate results.

Of course, there are trade-offs involved in different parametrizations, and in our experience these concern how quality scales with sequence length. In recent work, we attempt to quantify how much memory each reduction operator encodes, by measuring the minimal state dimension of the recurrence that approximates the convolution filter.

We found, for example, that the implicit Hyena parametrization leads to larger recurrences than the H3 parametrization leveraging diagonal state spaces, providing some insight into our scaling results.

This is exciting because (a) it enables compression of long convolution layers into compact recurrences (smaller memory footprint, higher throughput, coming soon!), and (b) it provides a quantitative path towards further improving the parametrization for memory over ultra-long sequences.

We are also happy to highlight some recent results from our S5 friends: they replaced the current implicit parametrization of long convolutions in Hyena with an expressive variant of state-space models, and achieved promising results. This is a very active area of research, and we fully expect to see improvements to the memory module that will further boost attention-free models!

### Architectural considerations

So far, we have discussed the internals of the $D=1$ (single-channel) case. A few additional architectural considerations arise when scaling up:

- Instead of having each operator act independently on each channel (space), we can form *heads*, inspired by multi-head attention.
- A Transformer block is defined as attention followed by a small MLP. It turns out we do not need MLPs in Hyena models, provided we account for the lost layers and floating point operations (FLOPs) when removing them. One option is e.g. to replace each MLP with another Hyena mixer, or to introduce additional FLOPs via heads.

### Training efficiency

In addition to understanding the core machine learning ideas behind Hyena, it is important to characterize its efficiency when mapped to modern hardware. Despite the improved $\tilde{O}(L)$ asymptotic complexity, careful implementation of the underlying fast convolution algorithms is required to realize wall-clock speedups over highly optimized attention kernels.

### Monarch projections?

Another key question is how the parameter reduction incurred by swapping dense matrices for sparse structured matrices in the projections, beyond the efficiency gains themselves, impacts model quality. This is exciting, because it points towards fully subquadratic architectures, in both width and sequence length. Here we provide some preliminary results and guidance for future work extending Hyena.

We consider the associative recall synthetic task, in which sequences consist of key-value pairs (think of a dictionary). The keys and values are all single-character digits and letters, and the mapping varies from sequence to sequence. The model is tasked with providing the value for a key, given the sequence.

Intuitively, increasing the vocabulary size $V$, i.e. the number of unique keys and values, correlates with increased task difficulty. Note that in the example above, $V = 10$.
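For reference, a toy generator for this kind of associative recall sequence might look like the following (a sketch under our own conventions, not the exact data pipeline used in the experiments):

```python
import random
import string

def associative_recall_example(vocab_size, num_pairs, seed=0):
    """Generate one associative-recall sequence: key-value pairs followed by
    a query key; the target is the value bound to that key in this sequence.
    Tokens are single-character digits and letters, as in the task described."""
    rng = random.Random(seed)
    tokens = (string.digits + string.ascii_lowercase)[:vocab_size]
    keys = rng.sample(tokens, num_pairs)            # distinct keys
    mapping = {k: rng.choice(tokens) for k in keys} # per-sequence binding
    seq = [tok for k in keys for tok in (k, mapping[k])]
    query = rng.choice(keys)
    return seq + [query], mapping[query]
```

Because the key-value binding is resampled for every sequence, the model must read it off from context rather than memorize it in its weights.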

| Vocabulary | Dense (2.2M params) | Sparse structured (1.8M params) |
|---|---|---|
| 10 | 100 | 100 |
| 50 | 100 | 80 |
| 100 | 20 | 10 |

We observe that the gap in quality between dense and sparse projections widens with task difficulty, and we are excited for future work to investigate this!

### Inference efficiency

Inference workloads are extremely important, especially when it comes to large language models. Indeed, a lot of effort from the open-source community has gone into optimizing generation workloads so that they require less memory, produce outputs faster, and can run on smaller devices. We study these questions in recent work that looks at distilling Hyena operators and other long convolutions into a recurrent form, using ideas from rational function approximation and model order reduction.

We are also able to take existing state-space approaches, which already have a recurrent form, and use our methods to prune redundant state dimensions, increasing throughput and reducing memory cost.

## What's next?

This is only a snapshot of what we have been doing. We keep scaling up model sizes and sequence lengths. In other recent work, we pretrain Hyenas at up to $1$ million sequence length on genomics (at a “character” level), outperforming Transformers and efficient Transformers on downstream tasks with much smaller models. Beyond these efforts, we are exploring various other parametrizations, much longer sequences, and further character-level training. We continue to explore questions related to hardware efficiency and new applications of longer context lengths.

As always, we are most excited about finding deeper connections between our methods and classical signal processing and structured linear algebra.

### Appendix: Hyena as (weak) time variance

Consider a simple one-dimensional, discrete, weakly time-varying linear system, in which only the input and output coefficients vary in time:

$$ x_{t+1} = a\, x_t + b_t u_t, \qquad y_t = c_t x_t. $$

The closed-form (input-to-output) solution reads (see this post, or the classical book “Linear System Theory and Design” for a step-by-step reference in the time-invariant case):

$$ y_t = c_t \sum_{t'=0}^{t-1} a^{\,t-1-t'}\, b_{t'}\, u_{t'}. $$

What happened? If you look at the above, you might see something familiar: we ended up with the gate, long convolution, gate decomposition of order-2 Hyena operators. The main differences are of course due to the parametrization of each module in this input-output map. By using implicit parametrizations (for the projections and for the long convolutions), Hyena generalizes the above system.
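Under the assumed system above ($x_{t+1} = a x_t + b_t u_t$, $y_t = c_t x_t$, with $x_0 = 0$), this gate, convolution, gate reading can be verified numerically with a short sketch (names ours):

```python
import numpy as np

def wtv_recurrence(a, b, c, u):
    """Step the weakly time-varying system x_{t+1} = a x_t + b_t u_t, y_t = c_t x_t."""
    x, y = 0.0, []
    for t in range(len(u)):
        y.append(c[t] * x)
        x = a * x + b[t] * u[t]
    return np.array(y)

def wtv_gate_conv_gate(a, b, c, u):
    """Closed form y_t = c_t * sum_{s<t} a^{t-1-s} b_s u_s:
    gate (b), causal convolution with h_j = a^j, gate (c)."""
    L = len(u)
    g = b * u                                              # first gate
    h = a ** np.arange(L)                                  # "long convolution" filter
    conv = np.array([(h[:t][::-1] * g[:t]).sum() for t in range(L)])
    return c * conv                                        # second gate
```

The two paths agree exactly, which is the sense in which the order-2 composition is the input-output map of such a system.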

With longer recurrences (more gates and long convolutions) we are, in essence, using chains of these systems. And there is something fundamental about operators that can be decomposed as chains of diagonal and circulant matrices.

## Acknowledgments

Thanks to all the readers who provided feedback on this post and release: Avanika Narayan, Hermann Kumbong, Michael Zhang, Sabri Eyuboglu, Eric Nguyen, Krista Opsahl-Ong, David W. Romero.