# Transformer Math 101 | EleutherAI Blog

*by* Quentin Anthony, Stella Biderman, and Hailey Schoelkopf

A lot of basic, important information about transformer language models can be computed quite simply. Unfortunately, the equations for this are not widely known in the NLP community. The purpose of this document is to collect these equations along with related knowledge about where they come from and why they matter.

**Note:** This post is primarily concerned with training costs, which are dominated by VRAM considerations. For an analogous discussion of inference costs with a focus on latency, check out this excellent blog post by Kipply.

The basic equation giving the cost to train a transformer model is given by:

$$

C\approx\tau T = 6PD

$$

where:

- $C$ is the compute required to train the transformer model, in total floating point operations
- $C=C_{\text{forward}}+C_{\text{backward}}$
- $C_{\text{forward}}\approx 2PD$
- $C_{\text{backward}}\approx 4PD$
- $\tau$ is the aggregate throughput of your hardware setup ($\tau=(\text{No. GPUs}) \times (\text{Actual FLOPs}/\text{GPU})$), in FLOPs
- $T$ is the time spent training the model, in seconds
- $P$ is the number of parameters in the transformer model
- $D$ is the dataset size, in tokens

These equations are proposed and experimentally validated in OpenAI’s scaling laws paper and DeepMind’s scaling laws paper. Please see each paper for more information.
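
As a quick sanity check, the cost equation can be turned into a back-of-the-envelope training-time estimator. This is a minimal sketch; the model size, token count, GPU count, and per-GPU throughput below are hypothetical example values:

```python
def training_days(n_params: float, n_tokens: float,
                  n_gpus: int, actual_flops_per_gpu: float) -> float:
    """Wall-clock training time in days from C ~= 6PD and C = tau * T."""
    compute = 6 * n_params * n_tokens            # C = 6PD, in total FLOPs
    throughput = n_gpus * actual_flops_per_gpu   # tau, in FLOP/s
    return compute / throughput / 86400          # seconds -> days

# Hypothetical example: a 6.7B-param model on 300B tokens with 128 GPUs
# at an assumed 120 TFLOP/s each takes roughly nine days.
days = training_days(6.7e9, 300e9, 128, 120e12)
```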

It’s worth taking an aside to discuss the units of $C$. $C$ is a measure of total compute, but can be measured in many units, such as:

- FLOP-seconds, which is in units of $[\frac{\text{Floating Point Operations}}{\text{Second}}] \times [\text{Seconds}]$
- GPU-hours, which is in units of $[\text{No. GPUs}]\times[\text{Hours}]$
- Scaling laws papers tend to report values in PetaFLOP-days, or $10^{15}\times 24\times 3600$ total floating point operations
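
These units are straightforward to convert between. A small sketch of the conversions (the 120 TFLOP/s figure in the example is just an assumed Actual FLOPs rate, not a measured one):

```python
PFLOP_DAY = 1e15 * 24 * 3600  # one PetaFLOP-day, in total floating point operations

def flops_to_pflop_days(total_flops: float) -> float:
    """Express a compute budget in PetaFLOP-days."""
    return total_flops / PFLOP_DAY

def flops_to_gpu_hours(total_flops: float, actual_flops_per_gpu: float) -> float:
    """GPU-hours implied by a compute budget at a given Actual FLOPs rate."""
    return total_flops / actual_flops_per_gpu / 3600

# At an assumed 120 TFLOP/s Actual FLOPs rate, one PetaFLOP-day of compute
# corresponds to 200 GPU-hours.
hours = flops_to_gpu_hours(PFLOP_DAY, 120e12)
```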

One useful distinction to keep in mind is the concept of $\text{Actual FLOPs}$. While GPU accelerator whitepapers usually advertise their theoretical FLOPs, these are never met in practice (especially in a distributed setting!). Some commonly reported values of $\text{Actual FLOPs}$ in a distributed training setting are listed below in the Computing Costs section.

Note that we use the throughput-time version of the cost equation, as used in this excellent blog post on LLM training costs.

## Parameter vs Dataset Tradeoffs

Although strictly speaking you can train a transformer for as many tokens as you like, the number of tokens trained on can highly impact both the computing costs and the final model performance, making striking the right balance important.

**Let’s start with the elephant in the room: “compute optimal” language models.** Often called “Chinchilla scaling laws” after the model series in the paper that gave rise to current beliefs about the number of parameters, a compute optimal language model has a **number of parameters** and a **dataset size** that satisfy the approximation $D=20P$. This is optimal in one very specific sense: in a resource regime where using 1,000 GPUs for 1 hour and 1 GPU for 1,000 hours cost you the same amount, if your goal is to maximize performance while minimizing the cost in GPU-hours to train a model, you should use the above equation.

**We do not recommend training an LLM for less than 200B tokens.** Although this is “chinchilla optimal” for many models, the resulting models are typically quite poor. For almost all applications, we recommend determining what inference cost is acceptable for your use case and training the largest model you can to stay under that inference cost for as many tokens as you can.
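
For reference, the $D=20P$ approximation can be computed directly. A trivial sketch (the 70B parameter count is just an illustrative example):

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Compute-optimal dataset size in tokens under the D = 20P approximation."""
    return 20 * n_params

# e.g. a hypothetical 70B-parameter model would want ~1.4T training tokens
# to be "chinchilla optimal".
tokens = chinchilla_optimal_tokens(70e9)
```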

## Engineering Takeaways for Compute Costs

Computing costs for transformers are typically listed in GPU-hours or FLOP-seconds.

- GPT-NeoX achieves 150 TFLOP/s/A100 with normal attention and 180 TFLOP/s/A100 with Flash Attention. This is in line with other highly optimized libraries at scale; for example, Megatron-DS reports between 137 and 163 TFLOP/s/A100.
- As a general rule of thumb, you should always be able to achieve approximately 120 TFLOP/s/A100. If you are seeing below 115 TFLOP/s/A100 there is probably something wrong with your model or hardware configuration.
- With high-quality interconnect such as InfiniBand, you can achieve linear or sublinear scaling across the data parallel dimension (i.e. increasing the data parallel degree should increase the overall throughput nearly linearly). Shown below is a plot from testing the GPT-NeoX library on Oak Ridge National Lab’s Summit supercomputer. Note that V100s are on the x-axis, while most of the numerical examples in this post are for A100s.

Transformers are typically described in terms of their *size in parameters*. However, when determining what models can fit on a given set of computing resources, you need to know **how much space in bytes** the model will take up. This can tell you how large a model will fit on your local GPU for inference, or how large a model you can train across your cluster with some amount of total accelerator memory.

## Inference

### Model Weights

Most transformers are trained in **mixed precision**, either fp16 + fp32 or bf16 + fp32. This cuts down on the amount of memory required to train the models, and also the amount of memory required to run inference. We can cast language models from fp32 to fp16 or even int8 without suffering a substantial performance hit. These numbers refer to the size *in bits* a single parameter requires. Since there are 8 bits in a Byte, we divide this number by 8 to see how many Bytes each parameter requires:

- In int8, $\text{memory}_{\text{model}}=(1 \text{ byte} /\text{param})\cdot (\text{No. params})$
- In fp16 and bf16, $\text{memory}_{\text{model}}=(2 \text{ bytes} /\text{param})\cdot (\text{No. params})$
- In fp32, $\text{memory}_{\text{model}}=(4 \text{ bytes} /\text{param})\cdot (\text{No. params})$

There is also a small amount of additional overhead, which is typically irrelevant to determining the largest model that will fit on your GPU. In our experience this overhead is ≤ 20%.

### Total Inference Memory

In addition to the memory needed to store the model weights, there is also a small amount of additional overhead during the actual forward pass. In our experience this overhead is ≤ 20% and is typically irrelevant to determining the largest model that will fit on your GPU.

In total, a good heuristic answer for “will this model fit for inference” is:

$\text{Total Memory}_{\text{Inference}}\approx(1.2) \times \text{Model Memory}$
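
As a sketch, this heuristic can be wrapped into a small helper. The 13B parameter count in the example is hypothetical, and the 20% overhead is the rule of thumb from this post, not a measured value:

```python
def inference_memory_gb(n_params: float, bytes_per_param: int,
                        overhead: float = 1.2) -> float:
    """Heuristic inference footprint in GB: model weights plus ~20% overhead."""
    return overhead * n_params * bytes_per_param / 1e9

# Hypothetical 13B-param model: fp16 weights take ~26 GB, so with the 1.2x
# heuristic it needs roughly 31 GB; in int8, roughly 16 GB.
fp16_gb = inference_memory_gb(13e9, 2)
int8_gb = inference_memory_gb(13e9, 1)
```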

We won’t investigate the sources of this overhead in this blog post and will leave it to other posts or places for now, instead focusing on memory for model training in the rest of this post. If you’re interested in learning more about the calculations required for inference, check out this fantastic blog post covering inference in depth. Now, on to training!

## Training

In addition to the model parameters, training requires the storage of optimizer states and gradients in device memory. This is why asking “how much memory do I need to fit model X” immediately leads to the answer “this depends on training or inference.” Training always requires more memory than inference, often very much more!

### Model Parameters

First off, models can be trained in pure fp32 or fp16:

- Pure fp32, $\text{memory}_{\text{model}}=(4 \text{ bytes} /\text{param})\cdot (\text{No. params})$
- Pure fp16, $\text{memory}_{\text{model}}=(2 \text{ bytes} /\text{param})\cdot (\text{No. params})$

In addition to the common model weight datatypes discussed in Inference, training introduces **mixed-precision** training such as AMP. This technique seeks to maximize the throughput of GPU tensor cores while maintaining convergence. The modern DL training landscape frequently uses mixed-precision training because: 1) fp32 training is stable, but has a high memory overhead and doesn’t exploit NVIDIA GPU tensor cores, and 2) fp16 training is unstable and difficult to converge. For more information on mixed-precision training, we recommend reading this notebook by tunib-ai. Note that mixed-precision training requires an fp16/bf16 and fp32 version of the model to be stored in memory, requiring:

- Mixed-precision (fp16/bf16 and fp32), $\text{memory}_{\text{model}}=(2 \text{ bytes} /\text{param})\cdot (\text{No. params})$

plus an additional $(4\text{ bytes/param}) \cdot (\text{No. params})$ copy of the model **that we count within our optimizer states**.

### Optimizer States

Adam is magic, but it’s highly memory inefficient. In addition to requiring you to have a copy of the model parameters and the gradient parameters, you also need to keep an additional three copies of the gradient parameters. Therefore,

- For vanilla AdamW, $\text{memory}_{\text{optimizer}}=(12 \text{ bytes}/\text{param})\cdot (\text{No. params})$
    - fp32 copy of parameters: 4 bytes/param
    - Momentum: 4 bytes/param
    - Variance: 4 bytes/param
- For 8-bit optimizers like bitsandbytes, $\text{memory}_{\text{optimizer}}=(6 \text{ bytes} /\text{param})\cdot (\text{No. params})$
    - fp32 copy of parameters: 4 bytes/param
    - Momentum: 1 byte/param
    - Variance: 1 byte/param
- For SGD-like optimizers with momentum, $\text{memory}_{\text{optimizer}}=(8 \text{ bytes} /\text{param})\cdot (\text{No. params})$
    - fp32 copy of parameters: 4 bytes/param
    - Momentum: 4 bytes/param
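
A minimal sketch tabulating these per-parameter costs (the dictionary keys here are just labels for the three cases above, not a real library API):

```python
# Per-parameter optimizer-state costs from the list above, in bytes.
OPTIMIZER_BYTES_PER_PARAM = {
    "adamw": 4 + 4 + 4,        # fp32 params + momentum + variance
    "adamw_8bit": 4 + 1 + 1,   # fp32 params + 8-bit momentum + variance
    "sgd_momentum": 4 + 4,     # fp32 params + momentum
}

def optimizer_memory_gb(n_params: float, optimizer: str) -> float:
    """Optimizer-state memory in GB for a given parameter count."""
    return OPTIMIZER_BYTES_PER_PARAM[optimizer] * n_params / 1e9
```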

### Gradients

Gradients can be stored in fp32 or fp16 (note that the gradient datatype often matches the model datatype; it is therefore stored in fp16 for fp16 mixed-precision training), so their contribution to memory overhead is given by:

- In fp32, $\text{memory}_{\text{gradients}}=(4 \text{ bytes} /\text{param})\cdot (\text{No. params})$
- In fp16, $\text{memory}_{\text{gradients}}=(2 \text{ bytes} /\text{param})\cdot (\text{No. params})$

### Activations and Batch Size

Modern GPUs are typically bottlenecked by memory, not FLOPs, for LLM training. Therefore activation recomputation/checkpointing is an extremely popular method of trading reduced memory costs for extra compute costs. Activation recomputation/checkpointing works by recomputing activations of certain layers instead of storing them in GPU memory. The reduction in memory depends on how selective we are when deciding which layers to clear, but Megatron’s selective recomputation scheme is depicted in the figure below:

Here the dashed red line indicates the memory capacity of an A100-80GB GPU, and “present work” indicates the memory requirements after applying selective activation recomputation. See Reducing Activation Recomputation in Large Transformer Models for further details and the derivation of the equations below.

The basic equation giving the memory required to store activations for a transformer model is given by:

$$

\begin{align*}\text{memory}^{\text{No Recomputation}}_{\text{activations}}=sbhL(10+\frac{24}{t}+5\frac{a \cdot s}{h\cdot t}) \text{ bytes}\end{align*}

$$

$$

\begin{align*}\text{memory}^{\text{Selective Recomputation}}_{\text{activations}}=sbhL(10+\frac{24}{t}) \text{ bytes}\end{align*}

$$

$$

\begin{align*}\text{memory}^{\text{Full Recomputation}}_{\text{activations}}=2 \cdot sbhL \text{ bytes}\end{align*}

$$

where:

- $s$ is the sequence length, in tokens
- $b$ is the batch size per GPU
- $h$ is the dimension of the hidden size within each transformer layer
- $L$ is the number of layers in the transformer model
- $a$ is the number of attention heads in the transformer model
- $t$ is the degree of tensor parallelism being used (1 if not)
- We assume no sequence parallelism is being used
- We assume that activations are stored in fp16
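
The three activation-memory equations above can be sketched as follows (assuming fp16 activations and no sequence parallelism, as above; the GPT-3-like configuration in the example is hypothetical):

```python
def activation_memory_bytes(s, b, h, L, a, t=1, mode="none"):
    """Per-GPU activation memory in bytes for fp16 activations,
    following the three recomputation equations above."""
    if mode == "none":        # no recomputation
        return s * b * h * L * (10 + 24 / t + 5 * a * s / (h * t))
    if mode == "selective":   # selective recomputation
        return s * b * h * L * (10 + 24 / t)
    if mode == "full":        # full recomputation
        return 2 * s * b * h * L
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical GPT-3-like config: s=2048, b=1, h=12288, L=96, a=96, t=8.
selective_gb = activation_memory_bytes(2048, 1, 12288, 96, 96, t=8,
                                       mode="selective") / 1e9
```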

The additional recomputation necessary also depends on the selectivity of the method, but it’s bounded above by a full additional forward pass. Hence the updated cost of the forward pass is given by:

$$

2PD\leq C_{\text{forward}}\leq 4PD

$$

### Total Training Memory

Therefore, a good heuristic answer for “will this model fit for training” is:

$$

\begin{align*}\text{Total Memory}_{\text{Training}} = \text{Model Memory}+\text{Optimizer Memory}+\text{Activation Memory}+\text{Gradient Memory}\end{align*}

$$
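
A minimal sketch of this heuristic, assuming mixed-precision weights (2 bytes/param), fp16 gradients (2 bytes/param), and vanilla AdamW states (12 bytes/param) as defaults:

```python
def training_memory_gb(n_params, activation_gb,
                       model_bytes=2, grad_bytes=2, opt_bytes=12):
    """Unsharded training footprint in GB: weights + gradients + optimizer
    states (per-parameter byte costs) plus activation memory."""
    return (model_bytes + grad_bytes + opt_bytes) * n_params / 1e9 + activation_gb

# Hypothetical 1B-param model with 10 GB of activations:
# (2 + 2 + 12) bytes/param * 1e9 params = 16 GB, plus 10 GB of activations.
total = training_memory_gb(1e9, 10.0)
```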

## Distributed Training

### Sharded Optimizers

The massive memory overhead of optimizers is the primary motivation for sharded optimizers such as ZeRO and FSDP. Such sharding strategies reduce the optimizer overhead by a factor of $\text{No. GPUs}$, which is why a given model configuration may fit at large scale but OOM at small scales. If you’re looking to calculate the memory overhead required by training with a sharded optimizer, you will need to include the equations below. For some sample calculations of sharded optimization, see the following figure from the ZeRO paper (note that $P_{os}$, $P_{os+g}$, and $P_{os+g+p}$ are commonly denoted ZeRO-1, ZeRO-2, and ZeRO-3, respectively; ZeRO-0 commonly means “ZeRO disabled”):

In the language of this blog post (assuming mixed-precision and the Adam optimizer):

$$

\begin{align*}\text{Total Memory}_{\text{Training}}\approx\text{Model Memory}+\frac{\text{Optimizer Memory}}{(\text{No. GPUs})}+\text{Activation Memory}+\text{Gradient Memory}\end{align*}

$$

$$

\begin{align*}\text{Total Memory}_{\text{Training}}\approx\text{Model Memory}+\text{Activation Memory}+\frac{\text{Optimizer Memory}+\text{Gradient Memory}}{(\text{No. GPUs})}\end{align*}

$$

$$

\begin{align*}\text{Total Memory}_{\text{Training}}\approx \text{Activation Memory}+\frac{\text{Model Memory}+\text{Optimizer Memory}+\text{Gradient Memory}}{(\text{No. GPUs})} + \text{(ZeRO-3 Live Params)}\end{align*}

$$

Here $(\text{DP Degree})$ is just $(\text{No. GPUs})$ unless pipeline and/or tensor parallelism are applied. See Sharded Optimizers + 3D Parallelism for details.

Note that ZeRO-3 introduces a set of live parameters. This is because ZeRO-3 introduces a set of config options (*stage3_max_live_parameters*, *stage3_max_reuse_distance*, *stage3_prefetch_bucket_size*, *stage3_param_persistence_threshold*) that control how many parameters are within GPU memory at a time (larger values take more memory but require less communication). Such parameters can have a significant effect on total GPU memory.
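
The three ZeRO equations above can be sketched as one helper. Live parameters and communication buffers are deliberately left out, so this is a lower bound rather than an exact figure:

```python
def zero_training_memory_gb(model_gb, optimizer_gb, gradient_gb,
                            activation_gb, n_gpus, stage=1):
    """Approximate per-GPU training memory (GB) under ZeRO stages 1-3,
    mirroring the three equations above."""
    if stage == 1:   # shard optimizer states only
        return model_gb + gradient_gb + activation_gb + optimizer_gb / n_gpus
    if stage == 2:   # shard optimizer states and gradients
        return model_gb + activation_gb + (optimizer_gb + gradient_gb) / n_gpus
    if stage == 3:   # shard everything (live params not modeled here)
        return activation_gb + (model_gb + optimizer_gb + gradient_gb) / n_gpus
    raise ValueError(f"unknown ZeRO stage: {stage}")
```

Note how the savings grow with the number of GPUs, which is why a configuration can fit on a large cluster yet OOM on a few GPUs.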

Note that ZeRO can also partition activations over data parallel ranks via **ZeRO-R**. This can also bring the $\text{memory}_{\text{activations}}$ above the tensor parallelism degree $t$. For more details, read the associated ZeRO paper and config options (note that in GPT-NeoX, this is the `partition_activations` flag). If you are training a huge model, you may want to trade some memory overhead for additional communication cost, as activations become a bottleneck. As an example of using ZeRO-R along with ZeRO-1:

$$

\begin{align*}\text{Total Memory}_{\text{Training}}\approx\text{Model Memory}+\frac{\text{Optimizer Memory}}{(\text{No. GPUs})}+\frac{\text{Activation Memory}}{\text{(Tensor-Parallel-Size)}}+\text{Gradient Memory}\end{align*}

$$

### 3D Parallelism

Parallelism for LLMs comes in 3 primary forms:

**Data parallelism:** Split the data among (possibly model-parallel) replicas of the model

**Pipeline or Tensor/Model parallelism:** These parallelism schemes split the parameters of the model across GPUs. Such schemes require significant communication overhead, but their memory reduction is approximately:

$$

\begin{align*}\text{memory}^{\text{w/ parallelism}}_{\text{model}}\approx\frac{\text{Model Memory}}{\text{(Pipe-Parallel-Size)}\times\text{(Tensor-Parallel-Size)}}\end{align*}

$$

$$

\begin{align*}\text{memory}^{\text{w/ parallelism}}_{\text{gradients}}\approx\frac{\text{Gradient Memory}}{\text{(Pipe-Parallel-Size)}}\end{align*}

$$

Note that this equation is approximate due to the facts that (1) pipeline parallelism doesn’t reduce the memory footprint of activations, (2) pipeline parallelism requires that all GPUs store the activations for all in-flight micro-batches, which becomes significant for large models, and (3) GPUs need to temporarily store the additional communication buffers required by parallelism schemes.

### Sharded Optimizers + 3D Parallelism

When ZeRO is combined with tensor and/or pipeline parallelism, the resulting parallelism strategy forms a mesh like the following:

As an important aside, the DP degree is vital for use in calculating the global batch size of training. The data-parallel degree depends on the number of complete model replicas:

$$

\begin{align*}\text{DP Degree} = \frac{\text{No. GPUs}}{\text{(Pipe-Parallel-Size)}\times\text{(Tensor-Parallel-Size)}}\end{align*}

$$
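
A sketch of this bookkeeping (the gradient-accumulation factor in the global batch size is an assumption beyond the equation above, included because most training frameworks use one):

```python
def dp_degree(n_gpus: int, pp_size: int, tp_size: int) -> int:
    """Number of complete model replicas in a 3D-parallel layout."""
    assert n_gpus % (pp_size * tp_size) == 0, "GPUs must divide evenly into replicas"
    return n_gpus // (pp_size * tp_size)

def global_batch_size(micro_batch: int, grad_accum_steps: int,
                      n_gpus: int, pp_size: int, tp_size: int) -> int:
    """Global batch size = micro-batch x grad-accumulation x DP degree."""
    return micro_batch * grad_accum_steps * dp_degree(n_gpus, pp_size, tp_size)

# Hypothetical example: 96 GPUs with pipe-parallel 4 and tensor-parallel 8
# give 3 model replicas.
replicas = dp_degree(96, 4, 8)
```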

While pipeline parallelism and tensor parallelism are compatible with all stages of ZeRO (e.g. ZeRO-3 with tensor parallelism would lead to us first slicing the tensors, then applying ZeRO-3 within each tensor-parallel unit), only ZeRO-1 tends to perform well in conjunction with tensor and/or pipeline parallelism. This is due to the conflicting parallelism strategies for gradients (pipeline parallelism and ZeRO-2 both split gradients; tensor parallelism and ZeRO-3 both split model parameters), which leads to a significant communication overhead.

Putting everything together for a typical 3D-parallel ZeRO-1 run with activation partitioning:

$$

\begin{align*}\text{Total Memory}_{\text{Training}}\approx\frac{\text{Model Memory}}{\text{(Pipe-Parallel-Size)}\times\text{(Tensor-Parallel-Size)}}+\frac{\text{Optimizer Memory}}{(\text{No. GPUs})}+\frac{\text{Activation Memory}}{\text{(Tensor-Parallel-Size)}}+\frac{\text{Gradient Memory}}{\text{(Pipe-Parallel-Size)}}\end{align*}

$$
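
This combined heuristic can be sketched as follows (the memory figures in the example are arbitrary illustrative values in GB):

```python
def memory_3d_zero1_gb(model_gb, optimizer_gb, activation_gb, gradient_gb,
                       n_gpus, pp_size, tp_size):
    """Per-GPU memory (GB) for a 3D-parallel ZeRO-1 run with activation
    partitioning, mirroring the combined equation above."""
    return (model_gb / (pp_size * tp_size)    # weights split across PP x TP
            + optimizer_gb / n_gpus           # ZeRO-1 shards optimizer states
            + activation_gb / tp_size         # ZeRO-R partitions activations
            + gradient_gb / pp_size)          # gradients split across pipeline

# Arbitrary illustrative figures: 160 GB of weights, 960 GB of optimizer
# states, 64 GB of activations, 160 GB of gradients on 64 GPUs (PP=4, TP=8).
per_gpu = memory_3d_zero1_gb(160, 960, 64, 160, 64, 4, 8)
```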

EleutherAI engineers frequently use heuristics like the above to plan efficient model training and to debug distributed runs. We hope to provide some clarity on these often-overlooked implementation details, and would love to hear your feedback at contact@eleuther.ai if you’d like to discuss or think we’ve missed anything!

To cite this blog post, please use:

```
@misc{transformer-math-eleutherai,
    title = {Transformer Math 101},
    author = {Anthony, Quentin and Biderman, Stella and Schoelkopf, Hailey},
    howpublished = \url{blog.eleuther.ai/},
    year = {2023}
}
```