
Notes on training BERT from scratch on an 8GB consumer GPU

2023-06-01 16:15:15

I trained a BERT model (Devlin et al, 2019) from scratch on my desktop PC (which has an Nvidia 3060 Ti 8GB GPU). The model architecture, tokenizer, and trainer all came from Hugging Face libraries; my contribution was mainly setting up the code, preparing the data (~20GB of uncompressed text), and leaving my computer running. (And making sure it was working correctly, with good GPU utilization.)

  • The code is available as a Jupyter notebook, here.
  • The data is available as a Hugging Face dataset, here.
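
The core of the setup is roughly the following. This is a minimal sketch rather than the notebook's exact code: the data files, sequence length, batch size, and step count are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (
    BertConfig, BertForMaskedLM, BertTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Standard BERT-base architecture (~110M parameters), randomly initialised.
config = BertConfig()
model = BertForMaskedLM(config)

# Off-the-shelf WordPiece tokenizer (the vocabulary could also be trained from the corpus).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Placeholder for the ~20GB text corpus.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator applies the usual 15% masked-language-modelling corruption.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-from-scratch",
    per_device_train_batch_size=128,  # chosen to fill the 8GB of GPU memory
    fp16=True,                        # mixed precision helps on a 3060 Ti
    max_steps=1_000_000,              # illustrative; the real run was time-bounded
    logging_steps=500,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```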

Training large language models is generally associated with GPU or TPU clusters rather than desktop PCs, and the following plot illustrates the difference between the compute resources I used to train this model and the resources used to train the original BERT-base model.

Plot comparing compute resources and model performance on GLUE-dev.

Although both BERT-base and this model were trained for a similar amount of time, BERT-base saw ~30x more tokens of text (BERT-base saw ~40 epochs of its training data, whereas this model saw only a single epoch of its training data).

The GLUE dev-set score is shown in the plot above, to give an idea of how well the model performs at natural language tasks.
Fine-tuning on GLUE took ~12 hours in total (on top of the 4 days / ~100 hours of pretraining).
The following table shows the GLUE-dev results in more detail:

| Model | MNLI (m/mm) | SST-2 | STSB | RTE | QNLI | QQP | MRPC | CoLA | Average |
|-------|-------------|-------|------|-----|------|-----|------|------|---------|
| This model | 79.3/80.1 | 89.1 | 61.9 | 55.9 | 86.3 | 86.4 | 74.8 | 41.0 | 72.7 |
| BERT-Base* | 83.2/83.4 | 91.9 | 86.7 | 59.2 | 90.6 | 87.7 | 89.3 | 56.5 | 80.9 |

*BERT-Base refers to a fully trained BERT model; the results are taken from Cramming (Geiping et al, 2022).

While BERT-Base performed better at every task, the results for “this model” would have been very good (possibly SOTA for a few tasks) in early 2018.
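
For reference, fine-tuning on a single GLUE task with the same Hugging Face tooling looks roughly like this. It is a sketch only: MRPC is picked as an example task, "bert-from-scratch" stands in for the pretrained checkpoint, and the hyperparameters are typical defaults rather than the values I used.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, BertForSequenceClassification,
    DataCollatorWithPadding, Trainer, TrainingArguments,
)

raw = load_dataset("glue", "mrpc")  # sentence-pair paraphrase classification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

encoded = raw.map(tokenize, batched=True)

# Load the pretrained checkpoint and attach a fresh classification head.
model = BertForSequenceClassification.from_pretrained("bert-from-scratch", num_labels=2)

args = TrainingArguments(
    output_dir="glue-mrpc",
    learning_rate=2e-5,               # typical BERT fine-tuning value
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
).train()
```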

No hyperparameter tuning was performed.
No special tricks were used to improve the training.
The optimizer and learning rate schedule were guided by Cramming (Geiping et al, 2022), as sketched below,
but the model architecture changes and other methods from Cramming were not used.
I did a couple of smaller training runs first (~1-12 hours).
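
The Cramming-style schedule is essentially a one-cycle shape: linear warmup to a peak, then linear decay back to zero over the fixed training budget. Below is a rough sketch with a plain PyTorch optimizer; the peak learning rate, warmup fraction, and step count are placeholder values, not the ones from my run.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)  # stand-in; in practice this is the BERT model

total_steps = 100_000  # illustrative; set from the planned training budget
warmup_frac = 0.5      # placeholder; peak halfway through the run
peak_lr = 1e-3         # illustrative peak learning rate

optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=0.01)

def one_cycle(step: int) -> float:
    """Linear warmup to the peak, then linear decay back to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda=one_cycle)
# During training, call optimizer.step() then scheduler.step() once per batch;
# the pair can also be handed to the Hugging Face Trainer via its `optimizers` argument.
```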

I was able to monitor training remotely, using Weights & Biases.
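
One way to wire that up with the Trainer-based setup sketched above is via its built-in W&B reporting; the project name here is made up.

```python
import os
from transformers import TrainingArguments

os.environ["WANDB_PROJECT"] = "bert-from-scratch"  # hypothetical project name

args = TrainingArguments(
    output_dir="bert-from-scratch",
    report_to="wandb",   # stream training metrics to Weights & Biases (wandb also records system/GPU stats)
    logging_steps=500,
)
```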

This endeavor was inspired by Cramming (Geiping et al, 2022),
a paper on how to train well-performing BERT models on modest compute resources (in only 24 hours).

Plots from the 100-hour training run

The pre-training loss.


The learning rate schedule, recommended by Cramming ([Geiping et al, 2022](https://arxiv.org/abs/2212.14034)).

GPU utilization was around 98%.

GPU memory usage was around 98%; this was achieved by adjusting the batch size.

GPU temperature stayed between 76 and 80 degrees Celsius, with higher temperatures on hotter days.

