StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers


One of the focus areas at Together Research is new architectures for long context, improved training, and inference performance over the Transformer architecture. Spinning out of a research program from our team and academic collaborators, with roots in signal processing-inspired sequence models, we're excited to introduce the StripedHyena models. This release includes StripedHyena-Hessian-7B (SH 7B), a base model, and StripedHyena-Nous-7B (SH-N 7B), a chat model. StripedHyena builds on the many lessons learned over the past year on designing efficient sequence modeling architectures: H3, Hyena, HyenaDNA, and Monarch Mixer.
Play with the models on Together Playground!

- StripedHyena is the first alternative model competitive with the best open-source Transformers in short- and long-context evaluations. The same base model achieves comparable performance to Llama-2, Yi and Mistral 7B on OpenLLM leaderboard tasks, while outperforming them on long-context summarization.
- StripedHyena is faster and more memory efficient for long-sequence training, fine-tuning, and generation. Besides attention, one core computational primitive for efficient inference is a state-space model (SSM) layer, building on pioneering work such as S4 (Gu et al.), which enables the conversion of convolutional layers into recurrences. Using our latest research on fast kernels for gated convolutions (FlashFFTConv) and on efficient Hyena inference, StripedHyena is >30%, >50%, and >100% faster in end-to-end training on sequences of length 32k, 64k and 128k respectively, compared to an optimized Transformer baseline using FlashAttention v2 and custom kernels. StripedHyena caches for autoregressive generation are >50% smaller than those of an equivalently-sized Transformer using grouped-query attention.
- StripedHyena is designed using our latest research on scaling laws of efficient architectures. Specifically, StripedHyena is a hybrid of attention and gated convolutions arranged in Hyena operators (see the sketch after this list). Via a compute-optimal scaling protocol, we identify several ways to improve on baseline scaling laws for Transformers (Chinchilla) at the architecture level, such as hybridization. With these techniques, we are able to obtain higher quality models than Transformers at each training compute budget, with additional benefits at inference time.
- StripedHyena is optimized using a set of new model grafting techniques, enabling us to change the model architecture during training. StripedHyena was obtained by grafting architectural components of Transformers and Hyena, and trained on a mix of the RedPajama dataset augmented with longer-context data.
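As a rough illustration of the striping idea referenced above, the sketch below alternates attention blocks with gated-convolution (Hyena-style) blocks. The block constructors and the 1:1 striping ratio are placeholders for illustration, not the released model's actual layout or hyperparameters.

```python
import torch.nn as nn

class StripedStack(nn.Module):
    """Illustrative striped hybrid: alternate attention and gated-convolution blocks.

    `attn_block_fn` and `hyena_block_fn` are hypothetical constructors standing in for
    real attention and Hyena operator implementations; the alternation period is an
    assumption for this sketch, not the released model's configuration."""

    def __init__(self, depth, d_model, attn_block_fn, hyena_block_fn, stripe_period=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            attn_block_fn(d_model) if i % stripe_period == 0 else hyena_block_fn(d_model)
            for i in range(depth)
        )

    def forward(self, x):            # x: (batch, seq_len, d_model)
        for block in self.blocks:
            x = x + block(x)         # residual connection; normalization omitted for brevity
        return x
```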
We look forward to further pushing the boundaries of model architectures for fast training and inference, allowing us to improve on current scaling laws and to obtain higher quality base models at each compute budget.
A single architecture for short and longer context tasks
For the past year, the Together AI team and our collaborators at Hazy Research have been working on the design of new sequence models for language and other domains. We have been especially excited by architectures that replace attention as the primary operator responsible for mixing sequences of embeddings, substituting alternatives that are computationally cheaper. We have developed several models (H3, Hyena) that replace attention with implicit gated convolutions and gated SSMs, and trained some of the first alternative architectures rivaling Transformers on language.
Evaluation
We evaluate StripedHyena on a suite of benchmarks to establish performance on short-context tasks, as well as to probe its ability to process long prompts.
First, we look at perplexity scaling on a subset of books from Project Gutenberg. We compute the perplexity on the last 2048 tokens of each sample, and repeat the experiment while including increasingly longer prefixes in the prompt to StripedHyena. We observe different behaviors depending on the structure of the sample: perplexity either saturates at 32k (the input length seen during the last stages of training), or keeps decreasing past 32k, suggesting that the model is able to incorporate some information from the longer prompt.
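A minimal sketch of this style of evaluation, assuming a generic causal language model that maps a (batch, seq) tensor of token ids to (batch, seq, vocab) logits; the function name and interface are placeholders, not the actual evaluation harness:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def last_window_perplexity(model, token_ids, window=2048, prefix_len=1):
    """Perplexity over the last `window` tokens of a sample, conditioned on
    `prefix_len` preceding tokens from the same document."""
    assert prefix_len >= 1, "need at least one context token before the scored window"
    ids = token_ids[-(prefix_len + window):].unsqueeze(0)   # (1, prefix_len + window)
    logits = model(ids)                                     # (1, prefix_len + window, vocab)
    preds = logits[:, -window - 1:-1, :]                    # positions predicting the last window
    targets = ids[:, -window:]
    nll = F.cross_entropy(preds.reshape(-1, preds.size(-1)), targets.reshape(-1))
    return math.exp(nll.item())

# Sweep increasingly longer prefixes (e.g. up to ~126k tokens before the scored window)
# to check whether perplexity saturates around 32k context or keeps improving past it.
```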

Quality on longer context makes StripedHyena an efficient baseline generalist model, competitive with Mistral 7B on summarization and longer context tasks:
| Benchmark (shot) | SH 7B | Mistral 7B |
|---|---|---|
| GovReport, F1 (0) | 27.9 | 17.5 |
| NarrativeQA, F1 (0) | 25.8 | 24.7 |
| Qasper, F1 (0) | 28.8 | 30.3 |
| Average | 27.5 | 24.2 |

Benchmarking StripedHyena-Hessian-7B (SH 7B) and Mistral 7B on zero-shot, long-context tasks from ZeroScrolls.
StripedHyena is the first alternative architecture competitive with strong Transformer base models of the same size or larger, at scale. On short-context tasks, including OpenLLM leaderboard tasks, StripedHyena outperforms Llama-2 7B, Yi 7B and the strongest Transformer alternatives such as RWKV 14B:

With our collaborators at Nous Research, we are excited to also release StripedHyena-Nous-7B (SH-N 7B), a chat model, built with new fine-tuning recipes tailored to the StripedHyena architecture.
Understanding the architecture design space: Some ways to improve scaling
Hybridization
Early in our scaling experiments, we noticed a consistent trend: given a compute budget, architectures built out of mixtures of different key layers always outperform homogeneous architectures. These observations echo findings described in various papers, including H3 and MEGA (among others), and connect to the even earlier hybrid global-local attention designs of GPT-3 (from Sparse Transformers) and GPT-Neo.
To understand this phenomenon, as well as the improvement in scaling coefficients (for example, the expected loss reduction per floating point operation, or FLOP, in the budget) for a class of architectures, we carried out a detailed compute-optimal scaling analysis. We found hybrids composed of multi-head attention, gated MLPs and gated convolutions to outperform strong Transformer architectures such as Llama across compute budgets, and identified optimal ways to mix these components, in both ordering and quantity.
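For reference, the Transformer baseline being improved upon can be written in the standard Chinchilla parametric form; the expression below is the generic template, not the fitted coefficients for hybrid architectures:

$$
L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6ND,
$$

where $N$ is the parameter count, $D$ the number of training tokens, and the compute-optimal allocation minimizes $L$ under a fixed FLOP budget $C$. Improving scaling at the architecture level means reaching lower loss than this fit at the same $C$.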


With our academic partners, we have been developing theory and synthetic tasks to understand how and why this occurs. We have identified a variety of regimes where layers specialize to particular sub-tasks of sequence modeling, providing valuable signals for further architecture optimization. This continues our general line of work on mechanistic design, which involves small-scale, synthetic tasks carefully constructed to stress-test architecture capabilities.
Multi-head gated convolutions
Another way to improve over Transformer scaling rates is to introduce additional computation in the form of multiple heads in gated convolutions. This design, inspired by linear attention, has been validated in architectures such as H3 and MultiHyena, and is provably more efficient at encoding associative recall circuits.
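A minimal sketch of the idea, under stated assumptions: `long_conv` is a placeholder for a real long-convolution operator (for example, an FFT-based implementation), and the head structure below loosely mirrors the (q, k, v) layout of linear attention rather than the exact StripedHyena operator:

```python
import torch.nn as nn

class MultiHeadGatedConv(nn.Module):
    """Illustrative multi-head gated convolution: project into per-head (q, k, v)
    branches, apply a long convolution along the sequence to the (k * v) branch,
    then gate elementwise with q."""

    def __init__(self, d_model, n_heads, long_conv):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.in_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.long_conv = long_conv        # placeholder: (B, H, L, D_h) -> (B, H, L, D_h)

    def forward(self, x):                 # x: (B, L, D)
        B, L, _ = x.shape
        q, k, v = self.in_proj(x).chunk(3, dim=-1)
        split = lambda t: t.view(B, L, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = map(split, (q, k, v))   # each (B, H, L, D_h)
        y = q * self.long_conv(k * v)     # per-head gating of the convolved branch
        y = y.transpose(1, 2).reshape(B, L, -1)
        return self.out_proj(y)
```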

These are two of many architectural design techniques that can result in improved scaling over Transformers. With StripedHyena, we focus specifically on the synergy between attention and gated convolutions.
A shift in the computational footprint of language models: Cheaper fine-tuning, faster inference
Optimizing models built on a different architecture requires rethinking the computational trade-offs of training, fine-tuning, and inference. For these new architectures, a different set of computational bottlenecks emerges, and we are proud to have developed the key technologies enabling these models as well, such as FlashFFTConv and its predecessors.
For training and fine-tuning workloads on long sequences, SH 7B is always faster than optimized Transformers (>10%, >20% and >50% faster end-to-end than FlashAttention v2 on sequences of lengths 32k, 64k and 128k, at batch size 1), with the speedup growing on longer sequences and with larger batches. StripedHyena models are optimal candidates for fine-tuning on long-context tasks: for tasks at length 128k, SH 7B can fine-tune on more than twice as many tokens as a Transformer given the same budget.

These improvements are driven by the different asymptotic scaling and computational profile of the layers in the hybrid architecture. We look forward to even faster models with refined mixing ratios and layers.
Reducing memory for inference
One additional advantage of SH 7B is a >50% reduced memory footprint during autoregressive generation, compared to a Transformer (both with grouped-query attention). In Transformers, the key and value entries of each layer are cached during the prefilling phase to avoid recomputation and speed up incremental decoding. Gated-convolution layers introduce significantly more degrees of freedom during inference, as there are multiple ways to represent and optimize the computation. We explore these trade-offs in our recent research on distillation and representation of convolutions in Hyena. These new techniques were applied in SH 7B to further reduce the memory footprint by identifying and pruning redundant states.
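For intuition, here is a back-of-the-envelope comparison of the two caching regimes. The layer counts, head counts, and state sizes below are illustrative placeholders, not the actual SH 7B or baseline configurations; note also that in a hybrid model only the attention layers carry a growing KV cache, so the overall saving depends on the striping ratio.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Key and value tensors per attention layer, growing linearly with sequence length.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def fixed_state_bytes(n_layers, d_model, state_size, bytes_per_elem=2):
    # One constant-size recurrent state per channel per layer, independent of sequence length.
    return n_layers * d_model * state_size * bytes_per_elem

# Illustrative numbers only (fp16, batch size 1):
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)
state = fixed_state_bytes(n_layers=32, d_model=4096, state_size=8)
print(f"GQA KV cache at 32k: {kv / 2**20:.0f} MiB; fixed recurrent states: {state / 2**20:.0f} MiB")
```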

From signal processing to language models
The flexibility of StripedHyena is a direct consequence of the existence of multiple equivalent representations and parametrizations of linear systems: convolutional, modal, canonical. Each form is best suited to a particular training and inference workload.
Borrowing terminology from classical signal processing, SH convolutional filters can be broadly treated as either finite impulse response (FIR) or infinite impulse response (IIR). FIR caching for a convolution is analogous to standard key-value (KV) caching in attention, particularly the sliding-window variant, and grows in memory footprint (up to a maximum value).
IIR filters, on the other hand, can be applied by caching a constant-size state, which is then updated during decoding via a recurrent step. The IIR representation can itself be customized. Fixed-size states, in contrast to KV caches, open up several key optimizations at the inference stack level, streamlining techniques such as continuous batching and speculative decoding.
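As a toy illustration of the IIR view (a diagonal state-space parametrization; shapes and naming are assumptions for this sketch, not StripedHyena internals): a long convolution whose filter is $h_t = C A^t B$ can be applied during decoding by carrying only a constant-size state and running one recurrent update per token.

```python
import torch

def ssm_decode_step(A_diag, B, C, state, u_t):
    """One recurrent decoding step of a toy diagonal SSM for a single channel.

    state: (d_state,) constant-size cache, the IIR analogue of a growing KV cache.
    u_t:   scalar input at the current position.
    """
    state = A_diag * state + B * u_t        # x_{t+1} = A x_t + B u_t
    y_t = (C * state).sum()                 # y_t = C x_t
    return y_t, state

def materialize_filter(A_diag, B, C, length):
    """The same system in its FIR / convolutional view: h_t = sum_i C_i * A_i^t * B_i."""
    t = torch.arange(length)
    return (C * B * A_diag.unsqueeze(-1).pow(t).T).sum(-1)   # (length,)
```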
What's ahead
Our main goal with the StripedHyena models is to push the frontier of architecture design beyond Transformers. StripedHyena only scratches the surface of what is possible with careful architecture design via mechanistic design and scaling laws, and with ideas such as hybridization, implicit convolutions and state caching. We hope these models inspire the open source community to explore exciting new builds with diverse architectures.
In future versions we will explore:
- Larger models with longer context.
- Multi-modal support.
- Further performance optimizations.
- Integration of StripedHyena into retrieval pipelines for full utilization of longer context.
Acknowledgments
This work would not have been possible without our collaborators at HazyResearch, Hessian.AI, Nous Research, MILA, HuggingFace, and DFKI.
We are grateful to the Hessian.AI Innovation Lab (funded by the Hessian Ministry for Digital Strategy and Innovation) and the hessian.AISC Service Center (funded by the Federal Ministry of Education and Research (BMBF)) for the collaboration and joint use of their AI supercomputer forty-two. Special thanks also go to the German Center for Artificial Intelligence (DFKI).