hackerllama – The Random Transformer
In this blog post, we'll do an end-to-end example of the math within a transformer model. The goal is to get a good understanding of how the model works. To make this manageable, we'll do lots of simplification. As we'll be doing quite a bit of the math by hand, we'll reduce the scale of the model. For example, rather than using embeddings of 512 values, we'll use embeddings of 4 values. This will make the math easier to follow! We'll use random vectors and matrices, but you can use your own values if you want to follow along.
As you'll see, the math is not that complicated. The complexity comes from the number of steps and the number of parameters. I recommend reading The Illustrated Transformer blog before reading this blog post (or reading them in parallel). It's a great blog post that explains the transformer model in a very intuitive (and illustrative!) way, and I don't intend to explain what is already explained there. My goal is to explain the "how" of the transformer model, not the "what". If you want to dive even deeper, check out the well-known original paper: Attention Is All You Need.
Prerequisites
A basic understanding of linear algebra is required – we'll mostly do simple matrix multiplications, so no need to be an expert. Apart from that, a basic understanding of Machine Learning and Deep Learning will be useful.
What is covered here?
- An end-to-end example of the math within a transformer model during inference
- An explanation of attention mechanisms
- An explanation of residual connections and layer normalization
- Some code to scale it up!
Without further ado, let's get started! The original transformer model has two parts: encoder and decoder. Our goal will be to use this model as a translation tool! We'll first focus on the encoder part.
Encoder
The whole goal of the encoder is to generate a rich embedding representation of the input text. This embedding will capture semantic information about the input, and will then be passed to the decoder to generate the output text. The encoder is composed of a stack of N layers. Before we jump into the layers, we need to see how to pass the words (or tokens) into the model.
Embeddings are a somewhat overused term. We'll first create an embedding that will be the input to the encoder. The encoder also outputs an embedding (also called hidden states sometimes). The decoder will also receive an embedding! 😅 The whole point of an embedding is to represent a token as a vector.
1. Embedding the text
Let's say that we want to translate "Hello World" from English to Spanish. The first step is to turn each input token into a vector using an embedding algorithm. This is a learned encoding. Usually we use a big vector size such as 512, but let's use 4 for our example so we can keep the math manageable.
Hello -> [1,2,3,4] World -> [2,3,4,5]
This allows us to represent our input as a single matrix
\[
E = \begin{bmatrix}
1 & 2 & 3 & 4 \\
2 & 3 & 4 & 5
\end{bmatrix}
\]
Although we could manage the two embeddings as separate vectors, it's easier to manage them as a single matrix. This is because we'll be doing matrix multiplications as we move forward!
2. Positional encoding
The embedding above has no information about the position of the word in the sentence, so we need to feed in some positional information. The way we do this is by adding a positional encoding to the embedding. There are different choices for how to obtain these – we could use a learned embedding or a fixed vector. The original paper uses a fixed vector, as they see almost no difference between the two approaches (see section 3.5 of the original paper). We'll use a fixed vector as well. Sine and cosine functions have a wave-like pattern, and they repeat over time. By using these functions, each position in the sentence gets a unique yet consistent pattern of numbers. These are the functions they use in the paper (section 3.5):
\[
PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\]
\[
PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\]
The idea is to interpolate between sine and cosine for each value in the embedding (even indices will use sine, odd indices will use cosine). Let's calculate them for our example!
For "Hello"
- i = 0 (even): PE(0,0) = sin(0 / 10000^(0 / 4)) = sin(0) = 0
- i = 1 (odd): PE(0,1) = cos(0 / 10000^(2*1 / 4)) = cos(0) = 1
- i = 2 (even): PE(0,2) = sin(0 / 10000^(2*2 / 4)) = sin(0) = 0
- i = 3 (odd): PE(0,3) = cos(0 / 10000^(2*3 / 4)) = cos(0) = 1
For “World”
- i = 0 (even): PE(1,0) = sin(1 / 10000^(0 / 4)) = sin(1 / 10000^0) = sin(1) ≈ 0.84
- i = 1 (odd): PE(1,1) = cos(1 / 10000^(2*1 / 4)) = cos(1 / 10000^0.5) ≈ cos(0.01) ≈ 0.99
- i = 2 (even): PE(1,2) = sin(1 / 10000^(2*2 / 4)) = sin(1 / 10000^1) ≈ 0
- i = 3 (odd): PE(1,3) = cos(1 / 10000^(2*3 / 4)) = cos(1 / 10000^1.5) ≈ 1
So concluding
- "Hello" -> [0, 1, 0, 1]
- "World" -> [0.84, 0.99, 0, 1]
Note that these encodings have the same dimension as the original embedding.
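If you prefer code, here's a small sketch (not in the original post) that mirrors the calculation convention used above; pos is the token position and d_model is the embedding size:
import numpy as np

def positional_encoding(pos, d_model=4):
    # Even indices use sine, odd indices use cosine, as in the hand calculation above
    pe = np.zeros(d_model)
    for i in range(d_model):
        angle = pos / 10000 ** (2 * i / d_model)
        pe[i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return pe

positional_encoding(0)  # the encoding for "Hello" (position 0)
positional_encoding(1)  # the encoding for "World" (position 1)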
3. Add positional encoding and embedding
We now add the positional encoding to the embedding. This is done by adding the two vectors together.
"Hello" = [1,2,3,4] + [0, 1, 0, 1] = [1, 3, 3, 5] "World" = [2,3,4,5] + [0.84, 0.99, 0, 1] = [2.84, 3.99, 4, 6]
So our new matrix, which will be the input to the encoder, is:
\[
E = \begin{bmatrix}
1 & 3 & 3 & 5 \\
2.84 & 3.99 & 4 & 6
\end{bmatrix}
\]
If you look at the original paper's image, what we just did is the bottom-left part of the image (the embedding + positional encoding).
4. Self-attention
4.1 Matrices Definition
We'll now introduce the concept of multi-head attention. Attention is a mechanism that allows the model to focus on certain parts of the input. Multi-head attention is a way to allow the model to jointly attend to information from different representation subspaces. This is done by using multiple attention heads. Each attention head will have its own K, V, and Q matrices.
Let's use 2 attention heads for our example. We'll use random values for these matrices. Each matrix will be a 4×3 matrix. With this, each matrix will transform the four-dimensional embeddings into three-dimensional keys, values, and queries. This reduces the dimensionality of the attention mechanism, which helps manage the computational complexity. Note that using too small an attention dimension will hurt the performance of the model. Let's use the following values (just random values):
For the first head
\[
WK1 = \begin{bmatrix}
1 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 1 \\
0 & 1 & 0
\end{bmatrix}, \quad
WV1 = \begin{bmatrix}
0 & 1 & 1 \\
1 & 0 & 0 \\
1 & 0 & 1 \\
0 & 1 & 0
\end{bmatrix}, \quad
WQ1 = \begin{bmatrix}
0 & 0 & 0 \\
1 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0
\end{bmatrix}
\]
For the second head
\[
WK2 = \begin{bmatrix}
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0 \\
0 & 1 & 0
\end{bmatrix}, \quad
WV2 = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 1 \\
0 & 0 & 1 \\
1 & 0 & 0
\end{bmatrix}, \quad
WQ2 = \begin{bmatrix}
1 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 0 \\
0 & 1 & 1
\end{bmatrix}
\]
4.2 Keys, queries, and values calculation
We now need to multiply our input embeddings by the weight matrices to obtain the keys, queries, and values.
Key calculation
\[
\begin{align*}
E \times WK1 &= \begin{bmatrix}
1 & 3 & 3 & 5 \\
2.84 & 3.99 & 4 & 6
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 1 \\
0 & 1 & 0
\end{bmatrix} \\
&= \begin{bmatrix}
(1 \times 1) + (3 \times 0) + (3 \times 1) + (5 \times 0) & (1 \times 0) + (3 \times 1) + (3 \times 0) + (5 \times 1) & (1 \times 1) + (3 \times 0) + (3 \times 1) + (5 \times 0) \\
(2.84 \times 1) + (3.99 \times 0) + (4 \times 1) + (6 \times 0) & (2.84 \times 0) + (3.99 \times 1) + (4 \times 0) + (6 \times 1) & (2.84 \times 1) + (3.99 \times 0) + (4 \times 1) + (6 \times 0)
\end{bmatrix} \\
&= \begin{bmatrix}
4 & 8 & 4 \\
6.84 & 9.99 & 6.84
\end{bmatrix}
\end{align*}
\]
OK, I actually don't want to do the math by hand for all of these – it gets a bit repetitive, plus it breaks the site. So let's cheat and use NumPy to do the calculations for us.
We first define the matrices
import numpy as np
WK1 = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 0]])
WV1 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 1], [0, 1, 0]])
WQ1 = np.array([[0, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
WK2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0], [0, 1, 0]])
WV2 = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]])
WQ2 = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]])
And let’s confirm that I didn’t make any mistakes in the calculations above.
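The original notebook cells aren't reproduced here, so here's a minimal reconstruction: we define E (the embedding plus positional encoding from step 3) and multiply it by WK1.
E = np.array([[1, 3, 3, 5], [2.84, 3.99, 4, 6]])
E @ WK1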
array([[4. , 8. , 4. ],
[6.84, 9.99, 6.84]])
Phew! Let’s now get the values and queries
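These are again just E multiplied by the corresponding weight matrices (a reconstruction of the omitted cells; the outputs are shown below):
E @ WV1  # values
E @ WQ1  # queries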
Value calculations
array([[6. , 6. , 4. ],
[7.99, 8.84, 6.84]])
Query calculations
array([[8. , 3. , 3. ],
[9.99, 3.99, 4. ]])
Let's skip the second head for now and focus on the first head's final score. We'll come back to the second head later.
4.3 Attention calculation
Calculating the attention score requires a couple of steps:
- Calculate the dot product of the query with each key
- Divide the result by the square root of the dimension of the key vector
- Apply a softmax function to obtain the attention weights
- Multiply each value vector by the attention weights
4.3.1 Dot product of query with each key
The score for “Hello” requires calculating the dot product of q1 with each key vector (k1 and k2)
\[
\begin{align*}
q1 \cdot k1 &= \begin{bmatrix} 8 & 3 & 3 \end{bmatrix} \cdot \begin{bmatrix} 4 \\ 8 \\ 4 \end{bmatrix} \\
&= 8 \cdot 4 + 3 \cdot 8 + 3 \cdot 4 \\
&= 68
\end{align*}
\]
In matrix world, that would be Q1 multiplied by the transpose of K1
\[
\begin{align*}
Q1 \times K1^\top &= \begin{bmatrix} 8 & 3 & 3 \\ 9.99 & 3.99 & 4 \end{bmatrix} \times \begin{bmatrix} 4 & 6.84 \\ 8 & 9.99 \\ 4 & 6.84 \end{bmatrix} \\
&= \begin{bmatrix}
8 \cdot 4 + 3 \cdot 8 + 3 \cdot 4 & 8 \cdot 6.84 + 3 \cdot 9.99 + 3 \cdot 6.84 \\
9.99 \cdot 4 + 3.99 \cdot 8 + 4 \cdot 4 & 9.99 \cdot 6.84 + 3.99 \cdot 9.99 + 4 \cdot 6.84
\end{bmatrix} \\
&= \begin{bmatrix}
68 & 105.21 \\
87.88 & 135.5517
\end{bmatrix}
\end{align*}
\]
I'm prone to making mistakes, so let's confirm with Python once again.
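Reconstructing the omitted check (Q1 and K1 are just E multiplied by the corresponding weight matrices):
Q1 = E @ WQ1
K1 = E @ WK1
Q1 @ K1.T  # roughly [[68, 105.21], [87.88, 135.55]], matching the hand calculation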
4.3.2 Divide by square root of dimension of key vector
We then divide the scores by the square root of the dimension (d) of the keys (3 in this case, but 64 in the original paper). Why? For large values of d, the dot product grows too large (we're adding up the products of a bunch of numbers, after all, leading to high values). And large values are bad! We'll discuss this more soon.
4.3.3 Apply softmax function
We then apply a softmax to normalize the scores so they are all positive and add up to 1.
Softmax is a function that takes a vector of values and returns a vector of values between 0 and 1, where the sum of the values is 1. It’s a nice way of obtaining probabilities. It’s defined as follows:
\[
\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}
\]
Don’t be intimidated by the formula – it’s actually quite simple. Let’s say we have the following vector:
\[
x = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}
\]
The softmax of this vector would be:
\[
\text{softmax}(x) = \begin{bmatrix} \frac{e^1}{e^1 + e^2 + e^3} & \frac{e^2}{e^1 + e^2 + e^3} & \frac{e^3}{e^1 + e^2 + e^3} \end{bmatrix} = \begin{bmatrix} 0.09 & 0.24 & 0.67 \end{bmatrix}
\]
As you can see, the values are all positive and add up to 1.
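As a quick sanity check of that arithmetic (a hypothetical cell, not in the original post):
x = np.array([1, 2, 3])
np.exp(x) / np.sum(np.exp(x))  # array([0.09003057, 0.24472847, 0.66524096])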
4.3.4 Multiply value matrix by attention weights
We then multiply the attention weights by the value matrix.
Let’s combine 4.3.1, 4.3.2, 4.3.3, and 4.3.4 into a single formula using matrices (this is from section 3.2.1 of the original paper):
\[
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V
\]
Yes, that’s it! All the math we just did can easily be encapsulated in the attention formula above! Let’s now translate this to code!
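The original code cell isn't shown here, but it presumably looked something like this, with the softmax applied row-wise and the scores scaled by the square root of the key dimension (3):
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)

def attention(x, WQ, WK, WV):
    K = x @ WK
    V = x @ WV
    Q = x @ WQ
    scores = Q @ K.T / np.sqrt(3)  # 3 is the dimension of the keys
    return softmax(scores) @ V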
We confirm we got the same values as above. Let's cheat and use this to obtain the attention scores for the second attention head:
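Reconstructing the omitted call:
attention(E, WQ2, WK2, WV2)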
array([[8.84, 3.99, 7.99],
[8.84, 3.99, 7.99]])
If you’re wondering how come the attention is the same for the two embeddings, it’s because the softmax is taking our scores to 0 and 1. See this:
array([[1.10613872e-14, 1.00000000e+00],
[4.95934510e-20, 1.00000000e+00]])
This is due to bad initialization of the matrices and small vector sizes. Large differences in the scores before applying softmax will just be amplified with softmax, leading to one value being close to 1 and others close to 0. In practice, our initial embedding matrices’ values were maybe too high, leading to high values for the keys, values, and queries, which just grew larger as we multiplied them.
Remember when we were dividing by the square root of the dimension of the keys? This is why we do that. If we don't do that, the values of the dot product will be too large, leading to large values after the softmax. In this case, though, it seems it wasn't enough given our small values! As a short-term hack, we can scale down the values by a larger amount than the square root of 3. Let's redefine the attention function, scaling down by 30. This is not a good long-term solution, but it will help us get different values for the attention scores. We'll get back to a better solution later.
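Something along these lines (a sketch of the omitted cell; the only change is the divisor):
def attention(x, WQ, WK, WV):
    K = x @ WK
    V = x @ WV
    Q = x @ WQ
    scores = Q @ K.T / 30  # short-term hack: scale down much more than sqrt(3)
    return softmax(scores) @ V

attention(E, WQ1, WK1, WV1)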
array([[7.54348784, 8.20276657, 6.20276657],
[7.65266185, 8.35857269, 6.35857269]])
4.3.5 Heads’ attention output
The next layer of the encoder will expect a single matrix, not two. The first step will be to concatenate the two heads’ outputs (section 3.2.2 of the original paper)
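Reconstructing the omitted cell, the concatenation is just np.concatenate along the last axis:
attentions = np.concatenate(
    [attention(E, WQ1, WK1, WV1), attention(E, WQ2, WK2, WV2)], axis=1
)
attentions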
array([[7.54348784, 8.20276657, 6.20276657, 8.45589591, 3.85610456,
7.72085664],
[7.65266185, 8.35857269, 6.35857269, 8.63740591, 3.91937741,
7.84804146]])
We finally multiply this concatenated matrix by a weight matrix to obtain the final output of the attention layer. This weight matrix is also learned! The dimension of the matrix ensures we go back to the same dimension as the embedding (4 in our case).
# Just some random values
W = np.array(
[
[0.79445237, 0.1081456, 0.27411536, 0.78394531],
[0.29081936, -0.36187258, -0.32312791, -0.48530339],
[-0.36702934, -0.76471963, -0.88058366, -1.73713022],
[-0.02305587, -0.64315981, -0.68306653, -1.25393866],
[0.29077448, -0.04121674, 0.01509932, 0.13149906],
[0.57451867, -0.08895355, 0.02190485, 0.24535932],
]
)
Z = attentions @ W
Z
array([[ 11.46394285, -13.18016471, -11.59340253, -17.04387829],
[ 11.62608573, -13.47454936, -11.87126395, -17.4926367 ]])
The image from The Illustrated Transformer encapsulates all of this in a single picture.
5. Feed-forward layer
5.1 Basic feed-forward layer
After the self-attention layer, the encoder has a feed-forward neural network (FFN). This is a simple network with two linear transformations and a ReLU activation in between. The Illustrated Transformer blog post doesn't dive into it, so let me briefly explain a bit more. The goal of the FFN is to process and transform the representation produced by the attention mechanism. The flow is usually as follows (see section 3.3 of the original paper):
- First linear layer: this usually expands the dimensionality of the input. For example, if the input dimension is 512, the output dimension might be 2048. This is done to allow the model to learn more complex functions. In our simple example with a dimension of 4, we'll expand to 8.
- ReLU activation: This is a non-linear activation function. It's a simple function that returns 0 if the input is negative, and the input if it's positive. This allows the model to learn non-linear functions. The math is as follows:
\[
\text{ReLU}(x) = \max(0, x)
\]
- Second linear layer: This is the opposite of the first linear layer. It reduces the dimensionality back to the original dimension. In our example, we'll reduce from 8 to 4.
We can represent all of this as follows
\[
\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
\]
Just as a reminder, the input for this layer is the Z we calculated in the self-attention section above. Here are the values again:
\[
Z = \begin{bmatrix}
11.46394281 & -13.18016469 & -11.59340253 & -17.04387833 \\
11.62608569 & -13.47454934 & -11.87126395 & -17.49263674
\end{bmatrix}
\]
Let's now define some random values for the weight matrices and bias vectors. I'll do it with code, but you can do it by hand if you feel patient!
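Roughly like this (the exact random values aren't shown in the post):
W1 = np.random.randn(4, 8)
b1 = np.random.randn(8)
W2 = np.random.randn(8, 4)
b2 = np.random.randn(4)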
And now let’s write the forward pass function
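Here's a sketch of that function; the same feed_forward appears again in the encoder block below, and relu is just an element-wise maximum with 0:
def relu(x):
    return np.maximum(0, x)

def feed_forward(Z, W1, b1, W2, b2):
    return relu(Z.dot(W1) + b1).dot(W2) + b2

feed_forward(Z, W1, b1, W2, b2)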
5.2 Encapsulating everything: The Random Encoder
Let’s now write some code to have the multi-head attention and the feed-forward, all together in the encoder block.
The code optimizes for understanding and educational purposes, not for performance! Don’t judge too hard!
d_embedding = 4
d_key = d_value = d_query = 3
d_feed_forward = 8
n_attention_heads = 2
def attention(x, WQ, WK, WV):
K = x @ WK
V = x @ WV
Q = x @ WQ
scores = Q @ K.T
scores = scores / np.sqrt(d_key)
scores = softmax(scores)
scores = scores @ V
return scores
def multi_head_attention(x, WQs, WKs, WVs):
attentions = np.concatenate(
[attention(x, WQ, WK, WV) for WQ, WK, WV in zip(WQs, WKs, WVs)], axis=1
)
W = np.random.randn(n_attention_heads * d_value, d_embedding)
return attentions @ W
def feed_forward(Z, W1, b1, W2, b2):
return relu(Z.dot(W1) + b1).dot(W2) + b2
def encoder_block(x, WQs, WKs, WVs, W1, b1, W2, b2):
Z = multi_head_attention(x, WQs, WKs, WVs)
Z = feed_forward(Z, W1, b1, W2, b2)
return Z
def random_encoder_block(x):
WQs = [
np.random.randn(d_embedding, d_query) for _ in range(n_attention_heads)
]
WKs = [
np.random.randn(d_embedding, d_key) for _ in range(n_attention_heads)
]
WVs = [
np.random.randn(d_embedding, d_value) for _ in range(n_attention_heads)
]
W1 = np.random.randn(d_embedding, d_feed_forward)
b1 = np.random.randn(d_feed_forward)
W2 = np.random.randn(d_feed_forward, d_embedding)
b2 = np.random.randn(d_embedding)
return encoder_block(x, WQs, WKs, WVs, W1, b1, W2, b2)
Recall that our input is the matrix E which has the positional encoding and the embedding.
array([[1. , 3. , 3. , 5. ],
[2.84, 3.99, 4. , 6. ]])
Let's now pass this to our random_encoder_block function:
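That is, something like:
random_encoder_block(E)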
array([[ -71.76537515, -131.43316885, 13.2938131 , -4.26831998],
[ -72.04253781, -131.84091347, 13.3385937 , -4.32872015]])
Nice! This was just one encoder block. The original paper uses 6 encoders. The output of one encoder goes to the next, and so on:
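A sketch of that stacking, assuming a small helper that chains six random encoder blocks (the generate function near the end of the post calls it encoder):
def encoder(x, n=6):
    for _ in range(n):
        x = random_encoder_block(x)
    return x

encoder(E)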
/tmp/ipykernel_11906/1045810361.py:2: RuntimeWarning: overflow encountered in exp
return np.exp(x)/np.sum(np.exp(x),axis=1, keepdims=True)
/tmp/ipykernel_11906/1045810361.py:2: RuntimeWarning: invalid value encountered in divide
return np.exp(x)/np.sum(np.exp(x),axis=1, keepdims=True)
array([[nan, nan, nan, nan],
[nan, nan, nan, nan]])
5.3 Residual and Layer Normalization
Uh oh! We’re getting NaNs! It seems our values are too high, and when being passed to the next encoder, they end up being too high and exploding! This is called gradient explosion. Without any kind of normalization, small changes in the input of early layers end up being amplified in later layers. This is a common problem in deep neural networks. There are two common techniques to mitigate this problem: residual connections and layer normalization (section 3.1 of the paper, barely mentioned).
- Residual connections: Residual connections simply add the input of the layer to its output. For example, we add the initial embedding to the output of the attention. Residual connections mitigate the vanishing gradient problem. The intuition is that if the gradient is too small, we can just add the input to the output and the gradient will be larger. The math is very simple:
\[
\text{Residual}(x) = x + \text{Layer}(x)
\]
That’s it! We’ll do this to the output of the attention and the output of the feed-forward layer.
- Layer normalization: Layer normalization is a technique to normalize the inputs of a layer. It normalizes across the embedding dimension. The intuition is that we want to normalize the inputs of a layer so that they have a mean of 0 and a standard deviation of 1. This helps with the gradient flow. The math does not look so simple at a first glance.
\[
\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \times \gamma + \beta
\]
Let’s explain each parameter:
- \(\mu\) is the mean of the embedding
- \(\sigma\) is the standard deviation of the embedding
- \(\epsilon\) is a small number to avoid division by zero. In case the standard deviation is 0, this small epsilon saves the day!
- \(\gamma\) and \(\beta\) are learned parameters that control scaling and shifting steps.
Unlike batch normalization (no worries if you don’t know what it is), layer normalization normalizes across the embedding dimension – that means that each embedding will not be affected by other samples in the batch. The intuition is that we want to normalize the inputs of a layer so that they have a mean of 0 and a standard deviation of 1.
Why do we add the learnable parameters \(\gamma\) and \(\beta\)? The reason is that we don't want to lose the representational power of the layer. If we just normalize the inputs, we might lose some information. By adding the learnable parameters, we can learn to scale and shift the normalized values.
Combining the equations, the equation for the whole encoder could look like this
\[
Z(x) = \text{LayerNorm}(x + \text{Attention}(x))
\]
\[
\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
\]
\[
\text{Encoder}(x) = \text{LayerNorm}(Z(x) + \text{FFN}(Z(x)))
\]
Let’s try with our example! Let’s go with E and Z values from before
\[
\begin{align*}
E + \text{Attention}(E) &= \begin{bmatrix}
1.0 & 3.0 & 3.0 & 5.0 \\
2.84 & 3.99 & 4.0 & 6.0
\end{bmatrix} + \begin{bmatrix}
11.46394281 & -13.18016469 & -11.59340253 & -17.04387833 \\
11.62608569 & -13.47454934 & -11.87126395 & -17.49263674
\end{bmatrix} \\
&= \begin{bmatrix}
12.46394281 & -10.18016469 & -8.59340253 & -12.04387833 \\
14.46608569 & -9.48454934 & -7.87126395 & -11.49263674
\end{bmatrix}
\end{align*}
\]
Let's now calculate the layer normalization. We can divide it into three steps:
- Compute mean and variance for each embedding.
- Normalize by subtracting the mean of its row and dividing by the square root of its row variance (plus a small number to avoid division by zero).
- Scale and shift by multiplying by gamma and adding beta.
5.3.1 Mean and variance
For the first embedding
\[
\begin{align*}
\mu_1 &= \frac{12.46394281 - 10.18016469 - 8.59340253 - 12.04387833}{4} = -4.58837568 \\
\sigma^2 &= \frac{\sum (x_i - \mu)^2}{N} \\
&= \frac{(12.46394281 - (-4.588375685))^2 + \ldots + (-12.04387833 - (-4.588375685))^2}{4} \\
&= \frac{393.67443005013}{4} \\
&= 98.418607512533 \\
\sigma &= \sqrt{98.418607512533} \\
&= 9.9206152789297
\end{align*}
\]
We can do the same for the second embedding. We’ll skip the calculations but you get the hang of it.
\[
\begin{align*}
\mu_2 &= -3.59559109 \\
\sigma_2 &= 10.50653018
\end{align*}
\]
Let’s confirm with Python
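Reconstructing the check (Z is the attention output from section 4.3.5, so E + Z is the matrix we just wrote out):
(E + Z).mean(axis=-1, keepdims=True)
(E + Z).std(axis=-1, keepdims=True)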
Amazing! Let’s now normalize
5.3.2 Normalize
For normalization, for each value in the embedding, we subtract the mean and divide by the standard deviation. Epsilon is a very small value, such as 0.00001. We'll assume \(\gamma = 1\) and \(\beta = 0\); it simplifies things.
\[
\begin{align*}
\text{normalized}_1 &= \frac{12.46394281 - (-4.58837568)}{\sqrt{98.418607512533 + \epsilon}} \\
&= \frac{17.05231849}{9.9206152789297} \\
&= 1.718 \\
\text{normalized}_2 &= \frac{-10.18016469 - (-4.58837568)}{\sqrt{98.418607512533 + \epsilon}} \\
&= \frac{-5.59178901}{9.9206152789297} \\
&= -0.564 \\
\text{normalized}_3 &= \frac{-8.59340253 - (-4.58837568)}{\sqrt{98.418607512533 + \epsilon}} \\
&= \frac{-4.00502685}{9.9206152789297} \\
&= -0.404 \\
\text{normalized}_4 &= \frac{-12.04387833 - (-4.58837568)}{\sqrt{98.418607512533 + \epsilon}} \\
&= \frac{-7.45550265}{9.9206152789297} \\
&= -0.752
\end{align*}
\]
We'll skip the calculations by hand for the second embedding. Let's confirm with code! Let's re-define our encoder_block function with this change:
def layer_norm(x, epsilon=1e-6):
mean = x.mean(axis=-1, keepdims=True)
std = x.std(axis=-1, keepdims=True)
return (x - mean) / (std + epsilon)
def encoder_block(x, WQs, WKs, WVs, W1, b1, W2, b2):
Z = multi_head_attention(x, WQs, WKs, WVs)
Z = layer_norm(Z + x)
output = feed_forward(Z, W1, b1, W2, b2)
return layer_norm(output + Z)
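Applying it to the residual sum from before reproduces the hand calculation:
layer_norm(E + Z)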
array([[ 1.71887693, -0.56365339, -0.40370747, -0.75151608],
[ 1.71909039, -0.56050453, -0.40695381, -0.75163205]])
It works! Let's retry passing the embedding through the six encoders.
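Reusing the encoder helper sketched above:
encoder(E)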
array([[-0.335849 , -1.44504571, 1.21698183, 0.56391289],
[-0.33583947, -1.44504861, 1.21698606, 0.56390202]])
Amazing! These values make sense and we don't get NaNs! The idea of the stack of encoders is that they output a continuous representation, z, that captures the meaning of the input sequence. This representation is then passed to the decoder, which will generate an output sequence of symbols, one element at a time.
Before diving into the decoder, here’s an image from Jay’s amazing blog post:
You should be able to explain each component on the left side! Quite impressive, right? Let's now move to the decoder.
Decoder
Most of what we learned for encoders will be used in the decoder as well! The decoder has two attention layers: a (masked) self-attention over the tokens generated so far, and an encoder-decoder attention. The decoder also has a feed-forward layer. Let's go through each of these.
The decoder block receives two inputs: the output of the encoder and the generated output sequence. The output of the encoder is the representation of the input sequence. During inference, the generated output sequence starts with a special start-of-sequence token (SOS). During training, the target output sequence is the actual output sequence, shifted by one position. This will be clearer soon!
Given the embedding generated by the encoder and the SOS token, the decoder will then generate the next token of the sequence, e.g. "hola". The decoder is autoregressive, which means that it takes the previously generated tokens and uses them to generate the next one.
- Iteration 1: Input is SOS, output is “hola”
- Iteration 2: Input is SOS + “hola”, output is “mundo”
- Iteration 3: Input is SOS + “hola” + “mundo”, output is EOS
Here, SOS is the start-of-sequence token and EOS is the end-of-sequence token. The decoder will stop when it generates the EOS token. It generates one token at a time. Note that all iterations use the embedding generated by the encoder.
This autoregressive design makes the decoder slow. The encoder is able to generate its embedding in a single forward pass while the decoder needs to do many forward passes. This is one of the reasons why architectures that only use the encoder (such as BERT or sentence similarity models) are much faster than decoder-only architectures (such as GPT-2).
Let’s dive into each step! Just as the encoder, the decoder is composed of a stack of decoder blocks. The decoder block is a bit more complex than the encoder block. The general structure is:
- (Masked) Self-attention layer
- Residual connection and layer normalization
- Encoder-decoder attention layer
- Residual connection and layer normalization
- Feed-forward layer
- Residual connection and layer normalization
We're already familiar with all the math from 1, 2, 4, 5, and 6. If you look at the right side of the image below, you'll see that these are all blocks you already know:
1. Embedding the text
The first step of the decoder is to embed the input tokens. The input token is SOS, so we'll embed it. We'll use the same embedding dimension as the encoder. Let's assume the embedding vector is the following:
\[
E = \begin{bmatrix}
1 & 0 & 0 & 0
\end{bmatrix}
\]
2. Positional encoding
We'll now add the positional encoding to the embedding, just as we did for the encoder. Given it's the same position as "Hello", we'll have the same positional encoding as before:
- i = 0 (even): PE(0,0) = sin(0 / 10000^(0 / 4)) = sin(0) = 0
- i = 1 (odd): PE(0,1) = cos(0 / 10000^(2*1 / 4)) = cos(0) = 1
- i = 2 (even): PE(0,2) = sin(0 / 10000^(2*2 / 4)) = sin(0) = 0
- i = 3 (odd): PE(0,3) = cos(0 / 10000^(2*3 / 4)) = cos(0) = 1
3. Add positional encoding and embedding
Adding the positional encoding to the embedding is done by adding the two vectors together:
\[
E = \begin{bmatrix}
1 & 1 & 0 & 1
\end{bmatrix}
\]
4. Self-attention
The first step within the decoder block is the self-attention mechanism. Fortunately, we have some code for this and can just use it!
d_embedding = 4
n_attention_heads = 2
E = np.array([[1, 1, 0, 1]])
WQs = [np.random.randn(d_embedding, d_query) for _ in range(n_attention_heads)]
WKs = [np.random.randn(d_embedding, d_key) for _ in range(n_attention_heads)]
WVs = [np.random.randn(d_embedding, d_value) for _ in range(n_attention_heads)]
Z_self_attention = multi_head_attention(E, WQs, WKs, WVs)
Z_self_attention
array([[ 2.19334924, 10.61851198, -4.50089666, -2.76366551]])
Things are quite simple for inference. For training, things are a bit tricky. During training, we use unlabeled data: just a bunch of text data, frequently scraped from the web. While the encoder's goal is to capture all information of the input, the decoder's goal is to predict the most likely next token. This means that the decoder can only use the tokens that have been generated so far (it cannot cheat and see the next tokens).
Because of this, we use masked self-attention: we mask the tokens that have not been generated yet. This is done by setting the attention scores to -inf. This is done in the original paper (section 3.2.3.1). We’ll skip this for now, but it’s important to keep in mind that the decoder is a bit more complex during training.
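Just to make the idea concrete, here's a minimal sketch (not from the original post) of how such a mask could be applied: positions that should not be attended to get a score of -inf before the softmax, so they end up with zero weight.
def masked_attention(x, WQ, WK, WV):
    K = x @ WK
    V = x @ WV
    Q = x @ WQ
    scores = Q @ K.T / np.sqrt(d_key)
    mask = np.triu(np.ones(scores.shape), k=1)  # 1s above the diagonal mark future positions
    scores = np.where(mask == 1, -np.inf, scores)  # hide them from the softmax
    return softmax(scores) @ V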
5. Residual connection and layer normalization
Nothing magical here, we just add the input to the output of the self-attention and apply layer normalization. We’ll use the same code as before.
6. Encoder-decoder attention
This part is the new one! If you were wondering where the encoder-generated embeddings come in, this is their moment to shine!
Let’s assume the output of the encoder is the following matrix
\[
\begin{bmatrix}
-1.5 & 1.0 & -0.8 & 1.5 \\
1.0 & -1.0 & -0.5 & 1.0
\end{bmatrix}
\]
In the self-attention mechanism, we calculate the queries, keys, and values from the input embedding.
In the encoder-decoder attention, we calculate the queries from the previous decoder layer and the keys and values from the encoder output! All the math is the same as before; the only difference is what embedding to use for the queries. Let’s look at some code
def encoder_decoder_attention(encoder_output, attention_input, WQ, WK, WV):
# The next three lines are the key difference!
K = encoder_output @ WK # Note that now we pass the previous encoder output!
V = encoder_output @ WV # Note that now we pass the previous encoder output!
Q = attention_input @ WQ # Same as self-attention
# This stays the same
scores = Q @ K.T
scores = scores / np.sqrt(d_key)
scores = softmax(scores)
scores = scores @ V
return scores
def multi_head_encoder_decoder_attention(
encoder_output, attention_input, WQs, WKs, WVs
):
# Note that now we pass the previous encoder output!
attentions = np.concatenate(
[
encoder_decoder_attention(
encoder_output, attention_input, WQ, WK, WV
)
for WQ, WK, WV in zip(WQs, WKs, WVs)
],
axis=1,
)
W = np.random.randn(n_attention_heads * d_value, d_embedding)
return attentions @ W
WQs = [np.random.randn(d_embedding, d_query) for _ in range(n_attention_heads)]
WKs = [np.random.randn(d_embedding, d_key) for _ in range(n_attention_heads)]
WVs = [np.random.randn(d_embedding, d_value) for _ in range(n_attention_heads)]
encoder_output = np.array([[-1.5, 1.0, -0.8, 1.5], [1.0, -1.0, -0.5, 1.0]])
Z_encoder_decoder = multi_head_encoder_decoder_attention(
encoder_output, Z_self_attention, WQs, WKs, WVs
)
Z_encoder_decoder
array([[ 1.57651431, 4.92489307, -0.08644448, -0.46776051]])
This worked! You might be asking “why do we do this?”. The reason is that we want the decoder to focus on the relevant parts of the input text (e.g., “hello world”). The encoder-decoder attention allows each position in the decoder to attend over all positions in the input sequence. This is very helpful for tasks such as translation, where the decoder needs to focus on the relevant parts of the input sequence. The decoder will learn to focus on the relevant parts of the input sequence by learning to generate the correct output tokens. This is a very powerful mechanism!
7. Residual connection and layer normalization
Same as before!
8. Feed-forward layer
Once again, same as before! I’ll also do the residual connection and layer normalization after it.
9. Encapsulating everything: The Random Decoder
Let’s write the code for a single decoder block. The main change is that we now have an additional attention mechanism.
d_embedding = 4
d_key = d_value = d_query = 3
d_feed_forward = 8
n_attention_heads = 2
encoder_output = np.array([[-1.5, 1.0, -0.8, 1.5], [1.0, -1.0, -0.5, 1.0]])
def decoder_block(
x,
encoder_output,
WQs_self_attention, WKs_self_attention, WVs_self_attention,
WQs_ed_attention, WKs_ed_attention, WVs_ed_attention,
W1, b1, W2, b2,
):
# Same as before
Z = multi_head_attention(
x, WQs_self_attention, WKs_self_attention, WVs_self_attention
)
Z = layer_norm(Z + x)
# The next three lines are the key difference!
Z_encoder_decoder = multi_head_encoder_decoder_attention(
encoder_output, Z, WQs_ed_attention, WKs_ed_attention, WVs_ed_attention
)
Z_encoder_decoder = layer_norm(Z_encoder_decoder + Z)
# Same as before
output = feed_forward(Z_encoder_decoder, W1, b1, W2, b2)
return layer_norm(output + Z_encoder_decoder)
def random_decoder_block(x, encoder_output):
# Just a bunch of random initializations
WQs_self_attention = [
np.random.randn(d_embedding, d_query) for _ in range(n_attention_heads)
]
WKs_self_attention = [
np.random.randn(d_embedding, d_key) for _ in range(n_attention_heads)
]
WVs_self_attention = [
np.random.randn(d_embedding, d_value) for _ in range(n_attention_heads)
]
WQs_ed_attention = [
np.random.randn(d_embedding, d_query) for _ in range(n_attention_heads)
]
WKs_ed_attention = [
np.random.randn(d_embedding, d_key) for _ in range(n_attention_heads)
]
WVs_ed_attention = [
np.random.randn(d_embedding, d_value) for _ in range(n_attention_heads)
]
W1 = np.random.randn(d_embedding, d_feed_forward)
b1 = np.random.randn(d_feed_forward)
W2 = np.random.randn(d_feed_forward, d_embedding)
b2 = np.random.randn(d_embedding)
return decoder_block(
x, encoder_output,
WQs_self_attention, WKs_self_attention, WVs_self_attention,
WQs_ed_attention, WKs_ed_attention, WVs_ed_attention,
W1, b1, W2, b2,
)
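The generation code below also calls a decoder function. Analogous to the encoder stack, it can be sketched as a chain of random decoder blocks:
def decoder(x, encoder_output, n=6):
    for _ in range(n):
        x = random_decoder_block(x, encoder_output)
    return x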
Generating the output sequence
We have all the building blocks! Let’s now generate the output sequence.
- We have the encoder, which takes the input sequence and generates its rich representation. It’s composed of a stack of encoder blocks.
- We have the decoder, which takes the encoder output and generated tokens, and generates the output sequence. It’s composed of a stack of decoder blocks.
How do we go from the decoder’s output to a word? We need to add a final linear layer and a softmax layer on top of the decoder. The whole algorithm looks like this:
- The encoder receives the input sequence and generates a representation of it.
- The decoder begins with the SOS token and the encoder output. It generates the next token of the output sequence.
- We then apply a linear layer to generate the logits.
- We then apply a softmax layer to generate the probabilities.
- The decoder uses the encoder output and the previously generated token to generate the next token of the output sequence.
- We repeat steps 2-5 until we generate the EOS token.
This is mentioned in the section 3.4 of the paper.
1. Linear layer
The linear layer is a simple linear transformation. It takes the decoder's output and transforms it into a vector of size vocab_size. This is the size of the vocabulary. For example, if we have a vocabulary of 10000 words, the linear layer will transform the decoder's output into a vector of size 10000. This vector will contain a score for each word being the next word in the sequence. For simplicity, let's go with a vocabulary of 10 words and assume the first decoder output is a very simple vector: [1, 0, 1, 0]. We'll use a random weight matrix of size decoder_output_size x vocab_size and a random bias vector of size vocab_size.
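A sketch of that step (the exact random values aren't shown; the linear helper below is presumably what the generate function at the end of the post uses):
decoder_output = np.array([[1, 0, 1, 0]])
W_linear = np.random.randn(d_embedding, 10)  # decoder_output_size x vocab_size
b_linear = np.random.randn(10)

def linear(x, W, b):
    return np.dot(x, W) + b

logits = linear(decoder_output, W_linear, b_linear)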
2. Softmax
These are called logits but they are not easily interpretable. We need to apply a softmax function to obtain the probabilities.
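Using the softmax we defined earlier:
probs = softmax(logits)
probs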
array([[0.01602618, 0.06261303, 0.38162024, 0.03087794, 0.0102383 ,
0.00446011, 0.01777314, 0.00068275, 0.46780959, 0.00789871]])
This is giving us probabilities! Let's assume the vocabulary is the following:
\[
\text{vocab} = \begin{bmatrix}
\text{hello} & \text{mundo} & \text{world} & \text{how} & \text{?} & \text{EOS} & \text{SOS} & \text{a} & \text{hola} & \text{c}
\end{bmatrix}
\]
The above tells us that the probabilities are
- hello: 0.01602618
- mundo: 0.06261303
- world: 0.38162024
- how: 0.03087794
- ?: 0.0102383
- EOS: 0.00446011
- SOS: 0.01777314
- a: 0.00068275
- hola: 0.46780959
- c: 0.00789871
From these, the most likely next token is "hola". Always picking the most likely token is called greedy decoding. This is not always the best approach, as it might lead to suboptimal results, but we won't dive into generation techniques at the moment. If you want to learn more about them, check out this amazing blog post.
3. The Random Encoder-Decoder Transformer
Let's write the entire code for this! Let's define a dictionary that maps the words to their initial embeddings. Note that this is also learned during training, but we'll use random values for now.
vocabulary = [
"hello",
"mundo",
"world",
"how",
"?",
"EOS",
"SOS",
"a",
"hola",
"c",
]
embedding_reps = np.random.randn(10, 1, 4)
vocabulary_embeddings = {
word: embedding_reps[i] for i, word in enumerate(vocabulary)
}
vocabulary_embeddings
{'hello': array([[-1.19489531, -1.08007463, 1.41277762, 0.72054139]]),
'mundo': array([[-0.70265064, -0.58361306, -1.7710761 , 0.87478862]]),
'world': array([[ 0.52480342, 2.03519246, -0.45100608, -1.92472193]]),
'how': array([[-1.14693176, -1.55761929, 1.09607545, -0.21673596]]),
'?': array([[-0.23689522, -1.12496841, -0.03733462, -0.23477603]]),
'EOS': array([[ 0.5180958 , -0.39844119, 0.30004136, 0.03881324]]),
'SOS': array([[ 2.00439161, 2.19477149, -0.84901634, -0.89269937]]),
'a': array([[ 1.63558337, -1.2556952 , 1.65365362, 0.87639945]]),
'hola': array([[-0.5805717 , -0.93861149, 1.06847734, -0.34408367]]),
'c': array([[-2.79741142, 0.70521986, -0.44929098, -1.66167776]])}
And now let's write our random generate method that generates tokens autoregressively.
def generate(input_sequence, max_iters=10):
# We first encode the inputs into embeddings
# This skips the positional encoding step for simplicity
embedded_inputs = [
vocabulary_embeddings[token][0] for token in input_sequence
]
print("Embedding representation (encoder input)", embedded_inputs)
# We then generate an embedding representation
encoder_output = encoder(embedded_inputs)
print("Embedding generated by encoder (encoder output)", encoder_output)
# We initialize the decoder output with the embedding of the start token
sequence = vocabulary_embeddings["SOS"]
output = "SOS"
# Random matrices for the linear layer
W_linear = np.random.randn(d_embedding, len(vocabulary))
b_linear = np.random.randn(len(vocabulary))
# We limit number of decoding steps to avoid too long sequences without EOS
for i in range(max_iters):
# Decoder step
decoder_output = decoder(sequence, encoder_output)
logits = linear(decoder_output, W_linear, b_linear)
probs = softmax(logits)
# We get the most likely next token
next_token = vocabulary[np.argmax(probs)]
sequence = vocabulary_embeddings[next_token]
output += " " + next_token
print(
"Iteration", i,
"next token", next_token,
"with probability of", np.max(probs),
)
# If the next token is the end token, we return the sequence
if next_token == "EOS":
return output
return output
Let’s run this now!
Embedding representation (encoder input) [array([-1.19489531, -1.08007463, 1.41277762, 0.72054139]), array([ 0.52480342, 2.03519246, -0.45100608, -1.92472193])]
Embedding generated by encoder (encoder output) [[-0.15606365 0.90444064 0.82531037 -1.57368737]
[-0.15606217 0.90443936 0.82531082 -1.57368802]]
Iteration 0 next token how with probability of 0.6265258176587956
Iteration 1 next token a with probability of 0.42708031743571
Iteration 2 next token c with probability of 0.44288777368698484
Ok, so we got the tokens “how”, “a”, and “c”. This is not a good translation, but it’s expected! We only used random weights!
I suggest you to look again in detail at the whole encoder-decoder architecture from the original paper:
Conclusions
I hope that was fun and informational! We covered a lot of ground. Wait…was that it? And the answer is, mostly, yes! New transformer architectures add lots of tricks, but the core of the transformer is what we just covered. Depending on what task you want to solve, you can also use only the encoder or the decoder. For example, for understanding-heavy tasks such as classification, you can use the encoder stack with a linear layer on top. For generation-heavy tasks such as translation, you can use the encoder and decoder stacks. And finally, for free generation, as in ChatGPT or Mistral, you can use only the decoder stack.
Of course, we also did lots of simplifications. Let’s briefly check which were the numbers in the original transformer paper:
- Embedding dimension: 512 (4 in our example)
- Number of encoders: 6 (6 in our example)
- Number of decoders: 6 (6 in our example)
- Feed-forward dimension: 2048 (8 in our example)
- Number of attention heads: 8 (2 in our example)
- Attention dimension: 64 (3 in our example)
We just covered lots of topics, but it's quite interesting that we can achieve impressive results by scaling up this math and doing smart training. We didn't cover training in this blog post as the goal was to understand the math when using an existing model, but I hope this provided strong foundations for jumping into the training part. I hope you enjoyed this blog post!
Exercises
Here are some exercises to practice your understanding of the transformer.
- What is the purpose of the positional encoding?
- How do self-attention and encoder-decoder attention differ?
- What would happen if our attention dimension was too small? What about if it was too large?
- Briefly describe the structure of a feed-forward layer.
- Why is the decoder slower than the encoder?
- What is the purpose of the residual connections and layer normalization?
- How do we go from the decoder output to probabilities?
- Why is picking the most likely next token every single time problematic?