Building a deep learning rig | part-1

03 February, 2024
I just got my hands on a mining rig with 3 RTX 3090 Founders Edition cards for the modest sum of 1.7k euros.
My plan is to turn it into a deep learning rig, to finetune and serve LLMs, play with torch distributed and some MoE, as well as do a bit of independent research.
The parts in detail:
- RTX 3090 (x3)
- Ryzen 5 1600
- B450 Steel Legend
- 4 GB RAM (lol)
- Cooler Master 750W Silver (x2)
Bit of an insane deal: counting only the cards, it comes to about 567 euros (= 1700/3) per card. In 2024 the price of a second-hand 3090 is around 600-700. Plus I got all the other parts as spares, some of which I'd need to replace anyway. Fun fact: two years ago this rig would probably have cost something like 7k, even with the spare parts.
Deep learning and inter-card bandwidth.
The current rig is a mining rig. The three 3090s are connected to the mobo via PCIe 1x extenders. That is completely inefficient for deep learning.
PCIe lanes are the bridge between the different parts of the computer so that they can communicate: a typical gaming PC usually has 24 lanes, and the GPU usually uses 16 of them. Using 1x means dividing the normal bandwidth by 16. Apparently that doesn't matter for crypto mining, probably because in crypto the GPUs are used to compute the "proof of work", which is basically a brute-force algorithm: the bandwidth doesn't matter, everything stays within the card. But deep learning models take data in (by batches), so there is a lot of communication between the CPU that pre-processes the data and the GPU. Thus PCIe lanes matter.
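To get a feel for what those lane counts mean, here is a minimal sketch (assuming PyTorch with a CUDA build; the transfer size is arbitrary) that times a pinned host-to-device copy. As a rough order of magnitude, PCIe 3.0 gives about 1 GB/s per lane, so roughly 1 GB/s at x1, 4 GB/s at x4 and 16 GB/s at x16.

```python
import time
import torch

def h2d_bandwidth(size_mb: int = 1024, repeats: int = 10) -> float:
    """Time pinned host -> device copies and return the bandwidth in GB/s."""
    x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, pin_memory=True)
    y = torch.empty_like(x, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        y.copy_(x, non_blocking=True)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return (size_mb / 1024) * repeats / elapsed

print(f"host -> GPU: {h2d_bandwidth():.1f} GB/s")
# Rough expectations: ~1 GB/s on a PCIe 3.0 x1 mining riser,
# ~4 GB/s on x4, ~16 GB/s on x16.
```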
Looking at the bible of consumer-grade deep learning, we can see that x4 PCIe lanes "should be enough":
"Running GPUs on 4x lanes is fine, especially if you only have 2 GPUs. For a 4 GPU setup, I would prefer 8x lanes per GPU, but running them at 4x lanes will probably only decrease performance by around 5-10% if you parallelize across all 4 GPUs."
The CPU on the rig supports up to 24 PCIe lanes and the mobo supports bifurcation, i.e. you can split the main x16 slot into 4 x4 PCIe ones. That means I could plug my four cards into my main mobo using a PCIe riser like this one, which I found recommended by this excellent deep learning rig blog post.
It would mean that I only need to add 150 more euros (actually 200 with the shipping cost) to get my deep learning rig ready. It would have been the cheapest deep learning rig in history.
The alternative is to go with a CPU that has many more PCIe lanes (the Ryzen 5 1600 has only 16) as well as a mobo with at least 3 GPU slots. Problem is, even the high-end Ryzen 9 or Intel i9 have only 24 PCIe lanes… So I would have to go with an AMD Epyc or Threadripper, which are not cheap.
In an ideal world the first option would work out.
Are x4 lanes really okay for deep learning with LLMs?
This might depend on the type of GPU parallelism I want to use.
DDP
A few years ago GPU parallelism was mostly about DDP: distributed data parallelism. The model is replicated on every GPU device, the data is split per GPU, and each GPU does a normal forward-backward pass on its data and computes the gradients. Then the GPUs share their gradients via an all-reduce communication (using NCCL) and each GPU updates its own weights.
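Here is a minimal DDP sketch (a toy model; the script name and sizes are placeholders) showing where that all-reduce happens:

```python
# Launched with e.g.:  torchrun --nproc_per_node=3 ddp_demo.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                   # NCCL handles the GPU all-reduce
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)    # replica of the model on this GPU
    model = DDP(model, device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 1024, device=f"cuda:{rank}")  # each rank sees its own data shard
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                               # gradients are all-reduced across GPUs here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```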
FSDP
Large language models are, as their name suggests, larger than their non-generative counterparts. GPT-3 is 175B parameters, and some models even go up to the trillion scale, though usually with some sparse setup (Mixture of Experts), so not really relevant for our calculation.
These days a good, large LLM like Llama 2 is around 70B.
It means that even in int8 precision the model weights are still 70 GB. A 3090 only has 24 GB, so the model doesn't even fit on one card, let alone for training.
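A quick back-of-the-envelope check of the weight memory (weights only, ignoring activations, gradients and optimizer state):

```python
# Memory needed just to hold the weights of a 70B-parameter model.
params = 70e9
bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype:>9}: {params * nbytes / 1e9:5.0f} GB")

# fp32: 280 GB, fp16/bf16: 140 GB, int8: 70 GB, int4: 35 GB,
# versus 3 x 24 GB = 72 GB of total VRAM on the rig.
```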
In this case we need to split the model into chunks. There are several ways to do that:
- Pipeline parallelism: the model is split into chunks and each GPU holds a part of the layers. Communication between GPUs happens during the forward and backward passes every time activations need to move to the next chunk. Let's say we split the model over 4 GPUs.
During the forward pass you need to use the send operation 3 times because you have 4 chunks; each send transfers an input activation (see the first sketch after this list).
- Tensor parallelism: tensor parallelism splits the weights of each layer across the GPUs. If pipeline parallelism splits the model horizontally, tensor parallelism splits it vertically. The communication scheme is slightly more complex and I don't fully get it, but basically at each layer you need a combination of all-gather and all-reduce operations (see the second sketch after this list).
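First sketch: a naive pipeline-parallel forward pass over 4 ranks using torch.distributed send/recv. The stage module and the (32, 1024) activation shape are made up for illustration, and a real pipeline would also interleave micro-batches to keep the GPUs busy.

```python
from typing import Optional
import torch
import torch.distributed as dist

def pipeline_forward(stage: torch.nn.Module, rank: int, world_size: int,
                     batch: Optional[torch.Tensor]) -> Optional[torch.Tensor]:
    if rank == 0:
        act = stage(batch)                   # first chunk consumes the input batch
    else:
        act = torch.empty(32, 1024)          # buffer for the previous chunk's output
        dist.recv(act, src=rank - 1)
        act = stage(act)
    if rank < world_size - 1:
        dist.send(act, dst=rank + 1)         # 3 sends in total when there are 4 chunks
        return None
    return act                               # only the last rank holds the final output
```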
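Second sketch: one way tensor parallelism can be wired up, a column-parallel linear layer (shapes and the even sharding are assumptions, not a full Megatron-style implementation):

```python
import torch
import torch.distributed as dist

def column_parallel_linear(x: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    # x: (batch, d_in), replicated on every rank.
    # w_shard: this rank's (d_in, d_out / world_size) column slice of the weight.
    partial = x @ w_shard                                  # partial output columns
    shards = [torch.empty_like(partial) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, partial)                       # collect every rank's slice
    return torch.cat(shards, dim=-1)                       # full (batch, d_out) activation

# A row-parallel layer would split d_in instead and finish with dist.all_reduce
# to sum the partial products.
```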
At large scale these two techniques are used alongside data parallelism; this is called 3D parallelism. Check out this blog post for more info.
In my use case only one of these two techniques will be used, alongside DDP.
The conclusion is that such parallelism needs more inter-GPU communication than pure DDP. So while x4 PCIe lanes per GPU might be fine for pure DDP, it might be a big bottleneck for finetuning a 70B model, and even worse for local inference, which is memory-bound.
Furthermore I want to play with Mixture of Experts, which are sparse models, a.k.a. not all weights are used during each forward pass. Each "expert" is hosted on a different GPU and a router dispatches each token to an expert. This of course means even more communication than plain DDP.
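A rough sketch of what that routing looks like with one expert per GPU (top-1 routing; it also assumes every rank exchanges the same number of tokens, which a real implementation handles with expert capacity and padding):

```python
import torch
import torch.distributed as dist

def moe_layer(x: torch.Tensor, router: torch.nn.Linear,
              expert: torch.nn.Module) -> torch.Tensor:
    # x: (tokens, d_model); rank i hosts expert i.
    world_size = dist.get_world_size()
    expert_idx = router(x).argmax(dim=-1)                  # pick one expert per token

    # Group the local tokens by destination expert / rank.
    send_chunks = [x[expert_idx == e] for e in range(world_size)]
    recv_chunks = [torch.empty_like(c) for c in send_chunks]
    dist.all_to_all(recv_chunks, send_chunks)              # ship tokens to their expert's GPU

    out = expert(torch.cat(recv_chunks))                   # local expert processes its tokens
    # A second all_to_all would send the outputs back to the tokens' original ranks.
    return out
```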
Conclusion
So I'm a bit puzzled: using the x4 PCIe lanes should work, but I will be limited for anything that isn't DDP, like finetuning LLMs or MoE.
I'll look into the Threadripper route; if it is cheap enough it's probably the best solution, especially if I plan to add a 4th GPU later.