Unbelievable! Run 70B LLM Inference on a Single 4GB GPU with This NEW Method | by Gavin Li | Nov, 2023

Large language models require huge amounts of GPU memory. Is it possible to run inference on a single GPU? If so, what is the minimum GPU memory required?

The 70B large language model has a parameter size of 130GB. Simply loading the model onto GPUs requires two A100 GPUs with 80GB of memory each.

During inference, the entire input sequence also needs to be loaded into memory for the complex "attention" calculations. The memory requirement of the attention mechanism scales quadratically with the input length. On top of the 130GB of model weights, even more memory is needed.

So what techniques can save so much memory and enable inference on a single 4GB GPU?

Note that the memory optimization techniques here do not require any model compression such as quantization, distillation, or pruning, which would sacrifice model performance.

Today we will explain the key techniques for extreme memory optimization of large models.

At the end of the article we also share the open source library that achieves this with a few lines of code!

01

Layer-wise Inference

The most critical technique is layer-wise inference. This is essentially the basic divide-and-conquer approach from computer science.

Let's first look at the architecture of large language models. Today's large language models all adopt the multi-head self-attention structure proposed in Google's paper "Attention Is All You Need". This is what people later call the Transformer architecture.

The large language model first has an embedding projection layer. After that come 80 completely identical transformer layers. Finally, there is a normalization and fully connected layer to predict the token ID probabilities.

During inference, layers are executed sequentially. The output of the previous layer is the input to the next. Only one layer executes at a time.

Therefore, it is completely unnecessary to keep all layers in GPU memory. We can load whichever layer is needed from disk when executing that layer, do all the calculations, and then completely free the memory afterwards.

This way, the GPU memory required per layer is only about the parameter size of one transformer layer, 1/80 of the full model, around 1.6GB.
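In pseudocode, the loop looks roughly like the following. This is a minimal sketch, not AirLLM's actual implementation: load_layer is a hypothetical helper that reads one layer's weights from disk, and real transformer layers also take attention masks and position information.

import gc
import torch

def layerwise_forward(hidden_states, num_layers, load_layer):
    # Keep only one layer's weights (~1.6GB for the 70B model) on the GPU at a time.
    for i in range(num_layers):
        layer = load_layer(i).to("cuda")          # read this layer's weights from disk
        with torch.no_grad():
            hidden_states = layer(hidden_states)  # output becomes the next layer's input
        del layer                                 # completely free the layer afterwards
        gc.collect()
        torch.cuda.empty_cache()
    return hidden_states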

In addition, some output caches are also stored in GPU memory, the largest being the KV cache that avoids repeated computation.

A simple calculation: for the 70B model (with weights and cache stored in fp16, 2 bytes per value), this KV cache size is about:

2 * input_length * num_layers * num_heads * vector_dim * 2 bytes

With an input length of 100, this cache = 2 * 100 * 80 * 8 * 128 * 2 ≈ 31MB of GPU memory.
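The same arithmetic as a quick check in Python. The 8 here is Llama 2 70B's number of key/value heads under grouped-query attention and 128 is the per-head dimension; fp16 storage is an assumption consistent with the 130GB weight size above.

input_length    = 100
num_layers      = 80    # transformer layers
num_kv_heads    = 8     # Llama 2 70B key/value heads (grouped-query attention)
head_dim        = 128
bytes_per_value = 2     # fp16

kv_cache_bytes = 2 * input_length * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"{kv_cache_bytes / 1024**2:.1f} MB")   # ~31 MB for keys and values together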

According to our monitoring, the entire inference process uses less than 4GB of GPU memory!

02

Single Layer Optimization — Flash Attention

Flash attention is perhaps one of the most important and critical optimizations in the development of large language models today.

All the various large language models use essentially the same underlying code, with flash attention being the biggest improvement.

The idea behind the flash attention optimization is not entirely novel, though; we have to mention another paper, "Self-attention Does Not Need O(n²) Memory".

Originally, self-attention requires O(n²) memory (n being the sequence length).

That paper proposes that we don't actually need to keep the O(n²) intermediate results. We can compute them sequentially, continuously update a single intermediate result, and discard everything else. This reduces the memory complexity to O(log n).

Flash attention is similar in essence, with a slightly higher memory complexity of O(n), but it deeply optimizes CUDA memory access to achieve multi-fold speedups for both inference and training.

As the figure shows, self-attention originally computes and stores O(n²) intermediate results. Flash attention splits the computation into many small blocks, computing block by block and reducing the memory needed to the size of one block.
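A minimal single-head sketch of the blockwise idea in plain PyTorch (no masking, and none of the fused CUDA kernels that make flash attention fast). It keeps only one block of scores plus a running softmax state instead of the full n × n matrix:

import torch

def blockwise_attention(q, k, v, block_size=128):
    # Process keys/values in blocks, keeping a running row-wise max and
    # normalizer (the "online softmax") instead of the full n x n score matrix.
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]                 # one block of keys
        vb = v[start:start + block_size]                 # one block of values
        scores = (q @ kb.T) * scale                      # (n, block) -- never n x n
        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        correction = torch.exp(row_max - new_max)        # rescale what was accumulated so far
        p = torch.exp(scores - new_max)
        out = out * correction + p @ vb
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max
    return out / row_sum

# matches ordinary full-matrix attention up to floating point error
q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
print(torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4))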

03

Model File Sharding

The original model file is usually sharded into multiple chunks, typically 10GB each.

Our execution proceeds layer by layer, and each layer is only 1.6GB. If we loaded from the original 10GB shards, every layer execution would require reloading an entire 10GB file while only using 1.6GB of it.

This wastes a lot of memory on loading and a lot of disk reads. Disk reading speed is actually the slowest bottleneck in the whole inference process, so we want to minimize it as much as possible.

Therefore, we first pre-process the original HuggingFace model file and shard it by layer.

For storage we use safetensors (https://github.com/huggingface/safetensors).

Safetensors ensures the on-disk format and in-memory format match closely, and uses memory mapping for loading to maximize speed.
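A hedged sketch of what this pre-processing could look like, assuming the original checkpoint is already in safetensors shards and uses Llama-style parameter names ("model.layers.<i>..."); AirLLM's actual splitting code may differ:

import os
from collections import defaultdict
from safetensors.torch import safe_open, save_file

def split_checkpoint_by_layer(model_dir, out_dir):
    # Regroup a sharded safetensors checkpoint into one file per transformer
    # layer, so inference can later read exactly one layer from disk at a time.
    os.makedirs(out_dir, exist_ok=True)
    grouped = defaultdict(dict)
    for fname in sorted(os.listdir(model_dir)):
        if not fname.endswith(".safetensors"):
            continue
        with safe_open(os.path.join(model_dir, fname), framework="pt") as f:
            for name in f.keys():
                if name.startswith("model.layers."):
                    key = "layer_" + name.split(".")[2]   # e.g. "layer_42"
                else:
                    key = "base"   # embeddings, final norm, lm_head
                grouped[key][name] = f.get_tensor(name)
    # Note: for simplicity this keeps the whole model in CPU RAM before saving;
    # a production version would flush each layer to disk as soon as it is complete.
    for key, tensors in grouped.items():
        save_file(tensors, os.path.join(out_dir, key + ".safetensors"))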

04

Meta Device

In the implementation we use the meta device feature provided by HuggingFace Accelerate (https://huggingface.co/docs/accelerate/usage_guides/big_modeling).

The meta device is a virtual device designed specifically for running extremely large models. When you load a model via the meta device, no model data is actually read in; only the code is loaded. Memory usage is 0.

You can dynamically transfer parts of the model from the meta device to a real device like the CPU or GPU during execution. Only then is it actually loaded into memory.

Using init_empty_weights() allows the model to be loaded via the meta device.


from accelerate import init_empty_weights
with init_empty_weights():
    my_model = ModelClass(...)
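Combined with the per-layer shards, one layer's weights can then be materialized on the GPU right before that layer runs. A sketch using Accelerate's set_module_tensor_to_device; the shard path here is illustrative, not AirLLM's actual layout:

from accelerate import init_empty_weights
from accelerate.utils import set_module_tensor_to_device
from safetensors.torch import load_file
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("garage-bAInd/Platypus2-70B-instruct")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)   # all weights on "meta", 0 bytes used

# Right before a layer runs, materialize just that layer's weights on the GPU
# from a per-layer safetensors shard (path is illustrative).
layer_weights = load_file("shards/layer_0.safetensors")
for name, tensor in layer_weights.items():
    set_module_tensor_to_device(model, name, "cuda", value=tensor)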

05

Open Source Library

We open sourced all of the code as AirLLM. It allows you to achieve this with a few lines of code.


It can be found in the Anima GitHub repository: https://github.com/lyogavin/Anima/tree/main/air_llm.

Usage is very simple. First install the package:

pip install airllm

Then layer-wise inference can be performed just like with a normal Transformers model:


from airllm import AirLLMLlama2

MAX_LENGTH = 128
# could use a Hugging Face model repo id:
model = AirLLMLlama2("garage-bAInd/Platypus2-70B-instruct")

# or use the model's local path...
#model = AirLLMLlama2("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'What is the capital of United States?',
]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)

We have tested this code on a 16GB Nvidia T4 GPU. The entire inference process uses less than 4GB of GPU memory.

Note that lower-end GPUs like the T4 will be quite slow for inference and are not well suited to interactive scenarios like chatbots. They are better suited to offline data analytics such as RAG, PDF analysis, etc.

Currently only Llama 2 based models are supported. Leave a comment if you need support for other models!

06

Can 70B Training Fit on a Single GPU?

While inference can be optimized with layering, can training work the same way on a single GPU?

Inference only needs the output of the previous layer when executing the next transformer layer, so layer-wise execution with limited data is feasible.

Training requires more data. The training process first runs the forward pass to get the output of every layer and tensor, then runs backpropagation to compute the gradient of every tensor.

Gradient calculation needs the saved results of the earlier forward layers, so layer-wise execution alone does not reduce memory.

There are other techniques, such as gradient checkpointing, that can achieve a similar effect.
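For illustration only (this is standard PyTorch, not part of AirLLM): gradient checkpointing drops a block's activations during the forward pass and recomputes them during backpropagation, trading compute for memory.

import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    # Activations inside the wrapped block are not stored during the forward
    # pass; they are recomputed when gradients are needed.
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        return checkpoint(self.block, x, use_reentrant=False)

HuggingFace Transformers models expose the same behavior through model.gradient_checkpointing_enable().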

If you are interested in how gradient checkpointing can significantly reduce training memory requirements, leave a comment!

07

Our code borrows heavily from SIMJEG's implementation on Kaggle: https://www.kaggle.com/code/simjeg/platypus2-70b-with-wikipedia-rag/notebook. Shout out to the awesome Kaggle community for their contributions!

We will continue open sourcing the latest and most effective new methods and advances in AI, contributing to the open source community. Please follow us.
