# The Secret Sauce behind 100K context window in LLMs: all methods in a single place | by Galina Alperovich | May 2023


tl;dr: methods to speed up training and inference of LLMs so they can use a large context window of up to 100K input tokens: ALiBi positional embedding, Sparse Attention, FlashAttention, Multi-Query attention, Conditional computation, and 80GB A100 GPUs.

Recently, there have been several announcements about new Large Language Models (LLMs) that can consume an extremely large context window, such as **65K tokens** (MPT-7B-StoryWriter-65k+ by MosaicML) or **even 100K tokens** (Introducing 100K Context Windows by Anthropic). In the PaLM-2 technical report, Google does not reveal the context size but mentions that they "*significantly increase the context length of the model*."

For comparison, the current GPT-4 model can work with a context length of **32K input tokens**, and most open-source LLMs have a context length of **2K tokens**.

That’s impressive, since having such a large context length means **the prompt can literally be the size of a book**. The Great Gatsby is 72K tokens, 210 pages, and 6 hours of reading at a 1.7 min/page pace. So the model can scan and keep this amount of "custom" information to process queries!

I was trying to wrap my head around how that is technically possible, so in this blog post, I collect scattered pieces of information (this thread was the first clue) and cover the following:


- Why **context length** matters and why it can be a game changer
- What the **main limitations** are in the original Transformer architecture when working with large context lengths
- The **computational complexity** of the Transformer architecture
- What **optimization techniques** currently exist to speed up the Transformer and increase the context length up to 100K

Here and later, we use "context length," "context window," and "the number of input tokens" interchangeably, denoting them as **n**.

The blog post is a bit long, so here is a summary of the main points and tricks:

- The **1st problem** is the **quadratic** time and space complexity of attention layer computations w.r.t. the number of input tokens **n**.
- When the embedding size **d > n**, the **2nd problem** is the **quadratic** time complexity of the linear layers w.r.t. the embedding size **d**.
- The **3rd problem** is the **Positional Sinusoidal Embedding** used in the original architecture.
- In the Transformer architecture, the shapes of the **learnable** matrix weights **are agnostic to the number of input tokens n**.
- So, a Transformer trained on a 2K context length **can consume tokens of any length**, even 100K. But the model won't produce meaningful results on 100K tokens during inference if it wasn't trained on that length.
- Training the vanilla Transformer on a huge corpus and only on a large context length is **unfeasibly expensive** due to the quadratic complexity w.r.t. **n** and **d**. LLaMA on a 2K context length was estimated to cost ~$3M to train. Thus, LLaMA on 100K would cost ~**$150M**.
- One option is to train the model on a 2K token context and then fine-tune it on longer contexts (for example, 65K). But **it won't work** with the original Transformer because of the **Positional Sinusoidal Encoding**.
- **[Trick #1]** To address this, remove the Positional Sinusoidal Encoding and use ALiBi, a simple and elegant positional embedding that doesn't hurt accuracy. **Then you can train on 2K and fine-tune on 100K.**
- **[Trick #2]** You don't need to calculate attention scores between all tokens. Some tokens are more important than others, so Sparse Attention can be used. **It will speed up both training and inference.**
- **[Trick #3]** FlashAttention efficiently implements the attention layer for GPU. It uses tiling and avoids materializing large intermediate matrices **(n, n)** that don't fit into GPU SRAM. **It will speed up both training and inference.**
- **[Trick #4]** Multi-Query attention instead of Multi-Head attention. That means you share weights across all heads when linearly projecting K and V. It **dramatically speeds up incremental inference.**
- **[Trick #5]** **Conditional computation** avoids applying all model parameters to all tokens in the input sequence. CoLT5 applies heavy computations only to the most important tokens and processes the rest of the tokens with a lighter version of the layers. **It will speed up both training and inference.**
- **[Trick #6]** To fit a large context, you need a lot of GPU RAM, so people use 80GB A100 GPUs.

To sum up, the more you **speed up training and inference**, **the larger the context length you can use**.

Let's now discuss all these points in more detail.

Context length is one of the important limitations of LLMs. And increasing it to 100K already is an incredible achievement (I wonder how this statement will look in a year).

One of the important use cases where people want to apply LLMs is "dropping a large pile of custom data into an LLM" (documents related to the company or a particular problem, various heterogeneous texts, etc.) and asking questions about this particular data, not some abstract data from the internet that the LLM saw during training.

To overcome this limitation today, people do various things:

- Trying summarization techniques and sophisticated chained prompts
- Maintaining vector databases to keep embeddings for custom documents and then "searching" across them by some similarity metric
- Fine-tuning the LLM with custom data when possible (not all commercial LLMs allow that, and it's not an obvious task for open-source LLMs)
- Developing custom smaller LLMs for this particular data (again, not an obvious task)

Having a large context length allows an already powerful LLM (that saw the whole internet) to look at your context and data and interact with you on a completely different level, **with higher personalization**. And all of this **without changing the model's weights**, doing your "training" on the fly, "in memory." And overall, a large context window brings more accuracy, fluency, and creativity to the model.

One analogy here might be computer RAM, where the operating system keeps the real-time context of all your applications. With a substantial context length, an LLM can be like a "reasoning computer," keeping a lot of user context.

It's important to note that in the Transformer architecture, **the shapes of all learnable matrix weights do not depend on the number of input tokens n**. All trainable parameters (embedding lookup, projection layers, softmax layer, and attention layers) do not depend on the input length and must handle variable-length inputs. **It's great that we have this out-of-the-box property of the architecture.**

That means that if you trained a Transformer model with a context length of 2K, you could run inference on token sequences of any size. The only problem is that the model won't produce meaningful results on 100K tokens during inference if it wasn't trained on a 100K context length. In this case, the training data distribution will be far from the distribution during inference, so the model will fail as any machine learning model would in this setup.

One solution for training a large context length Transformer is to train it **in two stages**: train the base model on a 2K token context length and then continue training (fine-tuning) on longer contexts (for example, 65K or 100K). That's precisely what MosaicML did. But the problem is that **it won't work with the original Transformer architecture**, so you need to use some tricks (see Trick #1 later in the post).

## Recap on Multi-Head Attention

Challenges of a large context length are related to the computational complexity of the Transformer architecture. To discuss the complexity, let's first recap how the attention layer works.

- **Q** — queries, **K** — keys, and **V** — values; notation from the paper, referring to information retrieval, where you insert a "query" into the system and search for the closest "key"
- **n** — the number of input tokens
- **d** — the text embedding size
- **h** — the number of attention heads
- **k** — the linear projection size for Q and K
- **v** — the linear projection size for V

**Multi-Head Attention:**

- We have a lookup Embedding layer that, for a given token, returns a vector of size **(1, d)**. Thus, for a sequence of **n** tokens, we get the text embeddings matrix **X** of size **(n, d)**. Then we sum it with the Positional Sinusoidal Embedding.
- The Multi-Head Attention layer aims to calculate a **new embedding** for this sequence of tokens that can be considered as the original text encoding X, **but weighted** (1) by the relative importance between tokens with regard to the context and (2) by the relative positions of tokens.
- We process this embedding matrix X **(n, d)** in parallel with **h** attention layers (heads). To get **Q**, **K**, and **V** for all attention heads, you linearly project X to **k**, **k**, and **v** dimensions, respectively. You do it by multiplying X by **h** matrices of shape **(d, k)**, **(d, k)**, and **(d, v)**. You can think about it as multiplying **(n, d)** by **(h, d, k)**, **(h, d, k)**, and **(h, d, v)**.
- The attention heads return **h** attention output matrices of size **(n, v)**. Then we concatenate the pieces from all heads into **(n, h*v)** and linearly project the result for the next steps.

**Scaled Dot-Product Attention:**

Now, let's zoom in on **one attention head**.

- **Q**, **K**, and **V** are 3 linear projections of **X** of size **(n, k)**, **(n, k)**, and **(n, v)**, obtained by multiplying by learnable weights that are **separate for each head**.
- We get the attention scores by calculating the distance (dot product) between **Q** and **K** (transposed). You multiply the matrix **(n, k)** by **(k, n)** and get the matrix **(n, n)**. Then we multiply it by the mask matrix to zero out some of the tokens (required in the decoder). Then we scale it and apply softmax so values range from 0 to 1. This way, we get a matrix of shape **(n, n)** with **n_ij** — a relative attention score from 0 to 1 between the **i**-th and **j**-th tokens that shows how "close" these tokens are in this particular context of length **n**.
- Then we multiply this attention score matrix **(n, n)** by the "values" **V** of size **(n, v)** to get the text embedding weighted by these relative attention scores.

Let's look at this piece of code from the Multi-Query attention paper. It shows how Multi-Head Attention is calculated with batching, and the shapes are clear at every step. It also includes the masking multiplication used during decoding.

## The complexity of the Transformer & context size

The complexity of multiplying two matrices (a,b)*(b,c) is **O(a*b*c)**.

We assume that **k*h = O(d)** for simplicity, and we will use this to derive the complexity of the attention layer.

The complexity of the attention layer consists of two parts:

- Linear projections to get Q, K, and V: multiplication of the embedding matrix of size **(n, d)** by **h** learnable matrices **(d, k)**, **(d, k)**, and **(d, v)**. Thus, the complexity is ~**O(nd²)**.
- Multiplication of **Q** by the transposed **K**, and then multiplication by **V**: **(n,k) * (k,n) = (n,n)** and **(n,n) * (n,v) = (n,v)**. The complexity is ~**O(n²d)**.

So, the complexity of the attention layer is **O(n²d + nd²)**, where **n** is the context length (number of input tokens) and **d** is the embedding size. From here, we see that the complexity of the attention layer computation is **quadratic w.r.t. the number of input tokens n** and **quadratic w.r.t. the embedding size d**.

The term **O(nd²)** is important when d > n (for example, in LLaMA, n=2K and d=4K).

The term **O(n²d)** is important when n > d (for example, when training MosaicML with n=65K and d=4K).

Just to remind you how bad quadratic growth is:

2 000² = 4 000 000, 100 000² = 10 000 000 000.

Let me give you an example of how this quadratic complexity influences the price of model training. The estimated cost of training LLaMA was **~$3M**, and it has 65B parameters, a 2K context length, and a 4K embedding size. The estimated time is mostly GPU training time. If we increase the context length from 2K to 100K (50x), the training time **will increase ~50x** as well (we need fewer iterations because the context is larger, but each one takes longer). So, training LLaMA on a 100K context would cost around ~**$150M**.

A bit of detail on this calculation:

For a number of tokens equal to **n**, the complexity of the attention is **O(n²d + nd²)**, and it takes **M** iterations to train. If we increase the context length from **n** → **p*n**, it will require **M/p** iterations, since the context length became larger (let's assume for simplicity this scaling is linear; it might be an overestimation or underestimation depending on the task). Now we have two equations:

(1) Complexity for **n** ~ **M * (n²d + nd²)**

(2) Complexity for **p*n** ~ **M/p * ((p*n)²d + (p*n)d²)**

After a series of simplifications and divisions, the ratio (2)/(1) ~ **(d + p*n)/(d + n)**.

If **d << n**, increasing **n** by a factor of **p** will lead to ~**p** times longer training.

If **d ~ n**, increasing **n** by a factor of **p** will lead to ~**p/2** times longer training.
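A quick numeric sanity check of this ratio (toy Python; the concrete sizes are arbitrary, chosen only to exercise the two regimes):

```python
def compute_ratio(n, d, p):
    """Ratio of total training compute when the context grows n -> p*n
    (iterations scale M -> M/p, per-iteration attention cost is n²d + nd²)."""
    cost_small = n**2 * d + n * d**2
    cost_large = (p * n)**2 * d + (p * n) * d**2
    return (cost_large / p) / cost_small

# d << n: the ratio approaches p (the full quadratic penalty)
print(compute_ratio(n=100_000, d=100, p=50))
# d ~ n: the ratio approaches (p + 1)/2, i.e. roughly p/2
print(compute_ratio(n=4_000, d=4_000, p=50))  # 25.5
```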

## Difference between training and inference stages in the Transformer

The last thing to discuss before digging into the optimization techniques is the difference in computation during training and inference.

During training, you run things in parallel, while for text generation during inference, you need to do it sequentially because the next token depends on previous ones. The straightforward way to implement inference is to calculate the attention scores **incrementally** and **cache** previous results for future tokens.

This distinction calls for different approaches to speeding up training and inference. That is why some tricks below optimize both stages, but some optimize only inference.

Now, let's talk about how researchers overcame all these challenges and were able to train an LLM with a large context length.

## [Trick #1] Better positional encoding — ALiBi

One solution for training a large context length Transformer is to **train it in two stages**: train the base model on a 2K token context length and then fine-tune on longer contexts (for example, 65K). But earlier, we said it wouldn't work with the original Transformer architecture. Why?

Because of the **Positional Sinusoidal Encoding**, which has **no "extrapolation" ability**. In the ALiBi [4] paper, the authors showed that the Positional Sinusoidal Encoding is not robust to the extension of the context window during inference. After a few more tokens, the performance starts degrading. So, the lack of **"extrapolation" ability** basically means you can't use larger context lengths during inference/fine-tuning than during training. The term "extrapolation" and a comparison of various positional encodings are described in [4].

In the original Transformer paper, the Positional Sinusoidal Embedding was summed with the token embeddings at the bottom of the architecture to add information about the order of words. If you want to learn how the Positional Sinusoidal Embedding is calculated, I recommend this fun video, where it is explained intuitively and in good detail.

**So, the first trick is to remove the Positional Sinusoidal Embedding** and replace it with another position embedding — Attention with Linear Biases (ALiBi).

It is **applied in the attention head** (not at the bottom of the network), and it biases the query-key attention scores with a penalty that is **proportional to their distance** (before softmax).

**This trick speeds up training**.
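A minimal NumPy sketch of the ALiBi bias for one head (causal setting; the geometric head slopes `2^(-8i/h)` follow the paper, while the toy sizes are arbitrary):

```python
import numpy as np

def alibi_bias(n, slope):
    """ALiBi bias for one head: a penalty proportional to the query-key
    distance, added to the (n, n) attention scores before softmax."""
    pos = np.arange(n)
    distance = pos[None, :] - pos[:, None]   # j - i; <= 0 for causal (past) keys
    return slope * distance                  # linear penalty, 0 on the diagonal

h = 8                                             # number of attention heads
slopes = 2.0 ** (-8.0 * np.arange(1, h + 1) / h)  # geometric slopes: 1/2, 1/4, ..., 1/256
bias = alibi_bias(n=5, slope=slopes[0])
print(bias)  # row i, column j holds slope * (j - i)
```

During training and inference, this bias is simply added to the `Q @ K.T` scores; distant tokens get a larger penalty, and nothing in the bias depends on a maximum trained length, which is what gives the extrapolation ability.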

## [Trick #2] Sparse Attention

Not all tokens in a context of size 100K are relevant to each other. One way to reduce the number of computations is to **consider only some tokens when calculating the attention scores**. The goal of adding sparsity is to **make the computation linear in n, not quadratic**. There are several approaches to selecting the connections between tokens, and there is an excellent illustration of this in the Google blog post:

For example, Sliding Window Attention (also called Local) employs a fixed-size attention window surrounding each token. In this attention pattern, given a fixed window size **w**, each token attends to **w/2** tokens on each side. The computational complexity of this pattern is **O(n*w)**, which scales linearly with the input sequence length **n**. To make it efficient, **w** must be small compared with **n**. The trick is that the attention information "flows" across the whole context window through nearby tokens, approximating the full graph.
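The sliding-window pattern can be sketched as a boolean mask (toy sizes; a real implementation would compute only the banded entries instead of materializing an (n, n) mask):

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean (n, n) mask: True where token i may attend to token j,
    i.e. |i - j| <= w // 2 (a band of width ~w around the diagonal)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

mask = sliding_window_mask(n=10, w=4)
# Only ~n*w of the n*n score entries are needed: linear in n for fixed w
print(mask.sum(), "of", mask.size, "entries computed")
```

Applying this mask before softmax (setting masked scores to a large negative value) gives local attention; stacking several such layers lets information propagate across the whole sequence.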

The BigBird attention scheme combines global, local, and random mechanisms. In the paper, the authors showed the important observation that **there is an inherent tension between how few similarity scores one computes and the flow of information between different nodes** (i.e., the ability of one token to influence another).

**This trick speeds up both training and inference**.

## [Trick #3] FlashAttention — efficient implementation of the attention layer for GPU

Several computational operations in the attention layer are repeated over and over again:

- **S** = Q*K
- **P** = softmax(S)
- **O** = P*V

Remember the notation for the **P**, **S**, and **O** results; we will use it later. The FlashAttention authors "fused" these operations: they implemented an attention layer algorithm that utilizes GPU memory efficiently and calculates the exact attention.

For a GPU to perform an operation, the input data must be present in the "fast" memory named SRAM. The data is copied from the "slow" HBM memory to SRAM and returned back to HBM once the computation is over. **SRAM is much faster than HBM but much smaller in size** (20MB vs 40GB in the A100 40GB GPU).

So, **accessing the HBM is an expensive operation**.

The main problem in the attention layer w.r.t. GPU memory usage is the "intermediate" multiplication results, **P** and **S**, which are large **(n, n)** matrices. We need to save them to HBM and read them again between the attention operations. **Moving P and S between HBM and SRAM back and forth is the bottleneck**, which the authors solved in the paper.

The main idea behind the FlashAttention algorithm is to **split the input Q, K, and V matrices into blocks**, load these blocks from HBM to SRAM, and then compute the attention output with respect to those blocks. This procedure is called *tiling*.

The "*matrix multiplication*" operation is already optimized for GPU. You might think of the FlashAttention algorithm as implementing the "*attention layer*" operation optimized for GPU. The authors "fused" the operations of several multiplications and softmax with tiling and optimized HBM access.
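The tiling idea can be sketched in NumPy with an "online" softmax over blocks of K/V. This is only a sketch of the algorithm's core trick, not the real kernel: the actual FlashAttention also tiles Q, runs on-GPU, and manages SRAM explicitly.

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])       # materializes the full (n, n) matrix
    P = np.exp(S - S.max(-1, keepdims=True))
    P /= P.sum(-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    """Process K/V in blocks with a running (online) softmax, so the full
    (n, n) score matrix is never materialized."""
    n, k = Q.shape
    scale = 1.0 / np.sqrt(k)
    m = np.full(n, -np.inf)                  # running row-wise max of scores
    l = np.zeros(n)                          # running softmax denominator
    O = np.zeros((n, V.shape[-1]))           # running (unnormalized) output
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        Sb = Q @ Kb.T * scale                # only an (n, block) tile of scores
        m_new = np.maximum(m, Sb.max(-1))
        correction = np.exp(m - m_new)       # rescale earlier partial sums
        Pb = np.exp(Sb - m_new[:, None])
        l = l * correction + Pb.sum(-1)
        O = O * correction[:, None] + Pb @ Vb
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

The blockwise result is exact, not an approximation — the running max and denominator re-normalize earlier blocks as new ones arrive, which is what lets the kernel keep each tile in SRAM.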

There is a good overview of the FlashAttention paper.

As of PyTorch 2.0, flash-attention is built in. There is also the FlashAttention implementation in the Triton language by the authors.

**This trick speeds up both training and inference**.

## [Trick #4] Multi-Query attention (MQA)

The original Multi-Head Attention (MHA) has a separate linear layer for the K and V matrices in every head.

During inference, the keys and values of previous tokens in the **decoder** are cached to prevent re-computing them, so **GPU memory usage grows with each generated token**.

Multi-Query attention (MQA) is an optimization that suggests sharing weights across all attention heads when linearly projecting K and V, so we only need to keep 2 matrices of size **(n, k)** and **(n, v)**. A big model can have up to 96 heads (such as GPT-3), which means **using MQA can save 96x on the memory consumption of the key/value decoder cache.**

This optimization is especially helpful when generating long texts, for example, when you have a large context length and ask for a long, meaningful analysis or summarization.

The main advantage of this approach is the significant **speedup of the incremental attention score calculation during inference**. Training speed stays mostly the same. For example, PaLM is using it.
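The difference in KV-cache size can be sketched in NumPy (toy sizes picked for illustration; a real model is vastly larger):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, k = 16, 64, 8, 8    # tokens, embedding size, heads, per-head K dim
X = rng.normal(size=(n, d))  # decoder token embeddings

# Multi-Head: each head has its own K projection -> cache h matrices of (n, k)
Wk_mha = rng.normal(size=(h, d, k))
K_mha = np.einsum('nd,hdk->hnk', X, Wk_mha)   # (h, n, k) cached keys

# Multi-Query: one K projection shared by all heads -> cache a single (n, k)
Wk_mqa = rng.normal(size=(d, k))
K_mqa = X @ Wk_mqa                            # (n, k), reused by every head

print(K_mha.size // K_mqa.size)  # h times smaller key cache -> prints 8
```

The same factor-of-h saving applies to the values cache, and the saving compounds with every generated token, since the cache grows with sequence length.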

## [Trick #5] Conditional computation

When **d > n**, the bottleneck in speed is not the attention layer **but the feedforward and projection layers**. A common approach to reducing the FLOPs is employing some form of **conditional computation** that avoids applying all model parameters to all tokens in the input sequence.

In the Sparse Attention section, we discussed that some tokens are more important than others. Following the same intuition, in the CoLT5 paper, the authors separated all feedforward and attention computations into two branches: **heavy** and **light**. The light layers are applied to all tokens, and the heavy ones only to the important tokens.

"The light and heavy feedforward branches differ only in their hidden dimension, with the light branch having a smaller hidden dimension than the standard T5 feedforward layer and the heavy branch larger."

This approach has been shown to outperform both the speed and accuracy of the existing LongT5 model for very long sequences, up to 64K input tokens.
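The light/heavy feedforward routing can be sketched like this (a toy NumPy illustration; CoLT5 *learns* its routing scores, whereas the embedding-norm scoring here is just a stand-in, and all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_light, d_heavy, top_k = 12, 16, 8, 64, 3
X = rng.normal(size=(n, d))                    # token representations

W1_l, W2_l = rng.normal(size=(d, d_light)), rng.normal(size=(d_light, d))
W1_h, W2_h = rng.normal(size=(d, d_heavy)), rng.normal(size=(d_heavy, d))

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2          # simple ReLU feedforward

# Light branch: applied to every token (small hidden dim -> cheap)
out = ffn(X, W1_l, W2_l)

# Heavy branch: applied only to the top_k "important" tokens
# (toy stand-in scoring; CoLT5 learns this routing)
scores = np.linalg.norm(X, axis=-1)
important = np.argsort(scores)[-top_k:]
out[important] += ffn(X[important], W1_h, W2_h)
print(out.shape, len(important))  # (12, 16) 3
```

The heavy branch's cost scales with `top_k`, not with `n`, which is what makes the approach attractive for very long sequences.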

## [Trick #6] Large RAM GPUs

It's not a trick but a necessity. To fit a large context, you need a lot of RAM on the GPU, so people use 80GB A100 GPUs.

Wow, that's a lot. I didn't expect to end up with such a long blog post 😀

I hope it was helpful! I learned a lot, and I hope you did too, and now we can guess how these Large Language Models with billions of parameters were trained on unprecedented context windows of 65-100K tokens.

It's inspiring to see how different smart people tackle the same problem from different sides, optimize here and there, and come up with cool ideas. All of these lead to a meaningful and elegant solution.

I like what one researcher said about training LLMs with a large context: "*No secret sauce, just well-vetted research*."

[1] Introducing 100K Context Windows by Anthropic

[2] MPT-7B by MosaicML

[3] PaLM-2 Technical Report by Google

[4] ALiBi: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

[5] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

[6] Multi-Query attention: Fast Transformer Decoding: One Write-Head is All You Need

[8] Attention is All You Need

[9] Video on Positional Sinusoidal Embedding

[10] Overview of the FlashAttention paper

[11] Sliding Window Attention

[12] Constructing Transformers For Longer Sequences with Sparse Attention Methods

[13] FlashAttention implementation in Triton language

[14] How to Accelerate HuggingFace Throughput by 193% with Triton and ClearML

[15] ClearML Serving

[16] Analyzing the Pros and Cons of NVIDIA Triton Inference Server vs. Other Inference Engines

[17] COLT5: Faster Long-Range Transformers with Conditional Computation

[18] LongT5: Efficient Text-To-Text Transformer for Long Sequences

[19] PaLM

[20] BigBird attention mechanism