Gradient Descent into Insanity – Building an LLM from scratch


2024-02-15 14:01:52


The LLM from scratch tech tree

Before we can move on to building modern features like Rotary Positional Encodings, we first need to figure out how to differentiate with a computer. The backpropagation algorithm that underpins the entire field of Deep Learning requires the ability to differentiate the outputs of neural networks with respect to (wrt) their inputs. In this post, we'll go from nothing to an (admittedly very limited) automatic differentiation library that can differentiate arbitrary functions of scalar values.

This one algorithm will form the core of our deep learning library that, eventually, will include everything we need to train a language model.

Creating a tensor

We can't do any differentiation if we don't have any numbers to differentiate. We'll need some extra functionality that isn't available in standard float types, so we'll have to create our own. Let's call it a Tensor.
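What follows is a minimal sketch of what such a class might look like (illustrative only, not the exact Tricycle implementation). Each operation returns a new Tensor that remembers which tensors produced it and how to differentiate the result with respect to each of them; those records (.args and .local_derivatives) are what the rest of this post relies on.

```python
# A minimal sketch of a scalar Tensor (illustrative, not the real Tricycle code).
# Every operation returns a new Tensor that records its inputs in .args and, in
# .local_derivatives, one function per input that returns the derivative of the
# output with respect to that input.


class Tensor:
    def __init__(self, value, args=(), local_derivatives=()):
        self.value = value
        self.args = args
        self.local_derivatives = local_derivatives

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Tensor(
            self.value + other.value,
            args=(self, other),
            local_derivatives=(lambda: 1.0, lambda: 1.0),
        )

    def __sub__(self, other):
        # d(a - b)/da = 1 and d(a - b)/db = -1
        return Tensor(
            self.value - other.value,
            args=(self, other),
            local_derivatives=(lambda: 1.0, lambda: -1.0),
        )

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Tensor(
            self.value * other.value,
            args=(self, other),
            local_derivatives=(lambda: other.value, lambda: self.value),
        )

    def __repr__(self):
        return f"Tensor({self.value})"
```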

  1. Turn the equation into a graph
  2. Label each edge with the appropriate derivative
  3. Find every path from the output to the input variable you care about
  4. Follow each path and multiply the derivatives you pass through
  5. Add together the results for each path

Now that we have an algorithm in pictures and words, let’s turn it into code.

The Algorithm™

Surprisingly, we have actually already converted our functions into graphs. If you recall, when we generate a tensor from an operation, we record the inputs to the operation in the output tensor (in .args). We also stored the functions to calculate derivatives for each of the inputs in .local_derivatives which means that we know both the destination and derivative for every edge that points to a given node. This means that we’ve already completed steps 1 and 2.
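To make that concrete, here is what that bookkeeping looks like with the sketch Tensor class from earlier (illustrative, not Tricycle's actual internals):

```python
# Build a one-operation graph with the sketch Tensor class from earlier.
x = Tensor(3.0)
y = Tensor(4.0)
z = x * y

print(z.value)  # 12.0
print(z.args)   # (Tensor(3.0), Tensor(4.0)) -> step 1: the edges pointing into z
print([f() for f in z.local_derivatives])  # [4.0, 3.0] -> step 2: dz/dx = y, dz/dy = x
```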

The next challenge is to find all paths from the tensor we want to differentiate to the input tensors that created it. Because none of our operations are self-referential (outputs are never fed back in as inputs), and all of our edges have a direction, our graph of operations is a directed acyclic graph, or DAG. Because the graph has no cycles, we can find all paths to every parameter pretty easily with a Breadth First Search (or a Depth First Search, but BFS makes some optimisations easier, as we'll see in part 2).
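Here is a sketch of how that search might look, building on the Tensor sketch above. This is illustrative code under those assumptions, not Tricycle's actual implementation (which, as mentioned, gets optimised in part 2).

```python
from collections import defaultdict, deque


def differentiate(output):
    """A sketch of steps 3-5: follow every path from `output` back to the
    input tensors, multiplying the edge derivatives along the way, and sum
    the results per input. Illustrative only, not the Tricycle implementation."""
    gradients = defaultdict(float)

    # Each queue entry is a tensor plus the product of the derivatives on the
    # path taken to reach it from `output`.
    queue = deque([(output, 1.0)])
    while queue:
        tensor, path_derivative = queue.popleft()

        if not tensor.args:
            # A leaf (input) tensor: the path ends here, so add this path's
            # contribution to the total derivative.
            gradients[tensor] += path_derivative
            continue

        # Follow every outgoing edge, multiplying in its local derivative.
        for arg, local_derivative in zip(tensor.args, tensor.local_derivatives):
            queue.append((arg, path_derivative * local_derivative()))

    return gradients
```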

To try it out, let's recreate that giant graph we made earlier. We can do this by first calculating \(L\) from the inputs:
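The excerpt doesn't show the expression for \(L\) itself, but the derivative quoted below, \(2m(c + mx - y)\), is consistent with a squared-error style loss \(L = (mx + c - y)^2\) with \(x = 3\), \(m = 2\), \(c = 4\) and \(y = 1\), so that is what this sketch assumes.

```python
# Assumption (not shown explicitly in this excerpt): L = (m*x + c - y) ** 2
# with x = 3, m = 2, c = 4 and y = 1, built from the sketch Tensor class above.
x = Tensor(3.0)
m = Tensor(2.0)
c = Tensor(4.0)
y = Tensor(1.0)

error = m * x + c - y  # 2*3 + 4 - 1 = 9
L = error * error      # squared error: 81.0
print(L.value)         # 81.0
```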

According to Wolfram Alpha, the derivative of \(L\) wrt \(x\) is: \[\frac{\partial L}{\partial x} = 2m (c + mx - y)\] Plugging the values for our tensors in, we get \[2 \times 2 \times (4 + (2 \times 3) - 1) = 36\]
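Running the differentiate sketch from earlier over the graph built above gives the same number (again, illustrative output under the assumptions already stated):

```python
gradients = differentiate(L)
print(gradients[x])  # 36.0, matching the hand calculation
print(gradients[m])  # 54.0, i.e. 2x(c + mx - y) under the assumed L
```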

According to Wolfram Alpha, the derivative of this expression is: \[\frac{d f(x)}{dx} = -38 + 102x - 33x^2 + 8x^3 + 30x^4\]

If we plug 2 into this equation, the answer is apparently 578 (again, thanks to Wolfram Alpha).

Let's try it with our algorithm.

All of this code is part of Tricycle, which is the name of the deep learning framework we're building.
