A hands-on guide to train LLaMA with RLHF

2023-04-06 14:06:20

Models such as ChatGPT, GPT-4, and Claude are powerful language models that have been fine-tuned using a method called Reinforcement Learning from Human Feedback (RLHF) to be better aligned with how we expect them to behave and would like to use them.

In this blog post, we show all the steps involved in training a LLaMA model to answer questions on Stack Exchange with RLHF through a combination of:

  • Supervised Fine-tuning (SFT)
  • Reward / preference modeling (RM)
  • Reinforcement Learning from Human Feedback (RLHF)

From InstructGPT paper: Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” arXiv preprint arXiv:2203.02155 (2022).

By combining these approaches, we are releasing the StackLLaMA model. This model is available on the 🤗 Hub (see Meta’s LLaMA release for the original LLaMA model) and the entire training pipeline is available as part of the Hugging Face TRL library. To give you a taste of what the model can do, try out the demo below!

The LLaMA model

When doing RLHF, it is important to start with a capable model: the RLHF step is only a fine-tuning step to align the model with how we want to interact with it and how we expect it to respond. Therefore, we choose to use the recently introduced and performant LLaMA models. The LLaMA models are the latest large language models developed by Meta AI. They come in sizes ranging from 7B to 65B parameters and were trained on between 1T and 1.4T tokens, making them very capable. We use the 7B model as the base for all the following steps!
To access the model, use the form from Meta AI.

Stack Exchange dataset

Gathering human feedback is a complex and expensive endeavor. In order to bootstrap the process for this example while still building a useful model, we make use of the StackExchange dataset. The dataset includes questions and their corresponding answers from the StackExchange platform (including StackOverflow for code and many other topics). It is attractive for this use case because the answers come together with the number of upvotes and a label for the accepted answer.

We follow the approach described in Askell et al. 2021 and assign each answer a score:

score = log2 (1 + upvotes) rounded to the nearest integer, plus 1 if the questioner accepted the answer (we assign a score of −1 if the number of upvotes is negative).
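As a minimal sketch of this scoring rule, the function below maps an answer's upvote count and accepted flag to a score. The function name is hypothetical, and it assumes that an answer with negative upvotes gets −1 regardless of whether it was accepted, which the rule above leaves open.

```python
import math


def answer_score(upvotes: int, accepted: bool) -> int:
    """Score an answer from its upvotes and accepted flag."""
    if upvotes < 0:
        # Assumption: negative upvotes always map to -1, even if accepted.
        return -1
    score = round(math.log2(1 + upvotes))
    if accepted:
        score += 1
    return score


print(answer_score(0, False))   # 0
print(answer_score(7, True))    # log2(8) = 3, plus 1 for accepted -> 4
print(answer_score(-3, False))  # -1
```

The logarithm compresses large vote counts, so an answer with 7 upvotes and one with 1,000 upvotes differ by only a few score points.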

For the reward model, we will always need two answers per question to compare, as we’ll see later. Some questions have dozens of answers, leading to many possible pairs. We sample at most ten answer pairs per question to limit the number of data points per question. Finally, we cleaned up formatting by converting HTML to Markdown to make the model’s outputs more readable. You can find the dataset as well as the processing notebook here.
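The pair sampling could look roughly like the sketch below. It is not the processing notebook's code: the function name, the fixed seed, and the choice to skip pairs with equal scores (where no preference can be inferred) are assumptions made for illustration.

```python
import itertools
import random


def sample_answer_pairs(answers, max_pairs=10, seed=0):
    """answers: list of (text, score). Returns (preferred, rejected) pairs."""
    rng = random.Random(seed)
    pairs = []
    for (text_a, score_a), (text_b, score_b) in itertools.combinations(answers, 2):
        if score_a == score_b:
            continue  # assumption: a tie carries no preference signal
        if score_a > score_b:
            pairs.append((text_a, text_b))
        else:
            pairs.append((text_b, text_a))
    # Cap the number of pairs so popular questions do not dominate the data.
    if len(pairs) > max_pairs:
        pairs = rng.sample(pairs, max_pairs)
    return pairs


answers = [("use a dict", 3), ("use a list", 1), ("use a loop", 1)]
print(sample_answer_pairs(answers))
```

A question with six answers of distinct scores yields 15 candidate pairs, so the cap of ten removes five of them at random.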

Efficient training strategies

Even training the smallest LLaMA model requires an enormous amount of memory. Some quick math: in bf16, every parameter uses 2 bytes (in fp32 4 bytes) in addition to 8 bytes used, e.g., in the Adam optimizer (see the performance docs in Transformers for more info). So a 7B parameter model would use (2+8)*7B=70GB just to fit in memory and would likely need more when you compute intermediate values such as attention scores. So you couldn’t train the model even on a single 80GB A100 like that. You can use some tricks, like more efficient optimizers or half-precision training, to squeeze a bit more into memory, but you’ll run out sooner or later.
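The arithmetic above can be written as a small helper. This is a back-of-the-envelope sketch only: the function name is hypothetical, 1 GB is taken as 10^9 bytes, and activations and other intermediate values are ignored.

```python
def training_memory_gb(n_params_billion, bytes_per_weight=2, optimizer_bytes_per_param=8):
    """Rough lower bound in GB: weights plus optimizer state, no activations."""
    return n_params_billion * (bytes_per_weight + optimizer_bytes_per_param)


print(training_memory_gb(7))                      # bf16 weights + Adam state: 70
print(training_memory_gb(7, bytes_per_weight=4))  # fp32 weights + Adam state: 84
```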

Another option is to use Parameter-Efficient Fine-Tuning (PEFT) techniques, such as the peft library, which can perform Low-Rank Adaptation (LoRA) on a model loaded in 8-bit.

Low-Rank Adaptation of linear layers: extra parameters (in orange) are added next to the frozen layer (in blue), and the resulting encoded hidden states are added together with the hidden states of the frozen layer.

Loading the model in 8-bit reduces the memory footprint drastically since you only need one byte per parameter for the weights (e.g. 7B LLaMA is 7GB in memory). Instead of training the original weights directly, LoRA adds small adapter layers on top of some specific layers (usually the attention layers); thus, the number of trainable parameters is drastically reduced.
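To make the adapter idea concrete, here is a toy pure-Python sketch of a LoRA-style forward pass: the frozen weight `W` is left untouched and a low-rank product `B (A x)` is added to its output. The helper names and toy sizes are illustrative, the `scale` factor follows the LoRA paper's alpha/r scaling, and none of this is the peft implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Frozen layer output plus the low-rank adapter output: W x + scale * B (A x)."""
    frozen = matvec(W, x)
    adapter = matvec(B, matvec(A, x))
    return [f + scale * a for f, a in zip(frozen, adapter)]


def count_params(matrix):
    return sum(len(row) for row in matrix)


# Toy sizes: a frozen 4x4 layer with a rank-1 adapter.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # frozen
A = [[0.5, 0.0, 0.0, 0.0]]        # 1x4, trainable
B = [[0.0], [0.0], [0.0], [0.0]]  # 4x1, trainable, starts at zero
x = [2.0, 1.0, 0.0, -1.0]

print(lora_forward(W, A, B, x))  # identical to W x while B is still zero
print(count_params(A) + count_params(B), "trainable vs", count_params(W), "frozen")
```

At realistic sizes the saving is much larger than in this toy case: a rank-16 adapter on a 4096×4096 layer has 2 × 16 × 4096 = 131,072 trainable parameters next to about 16.8 million frozen ones.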

In this scenario, a rule of thumb is to allocate ~1.2-1.4GB per billion parameters (depending on the batch size and sequence length) to fit the entire fine-tuning setup. As detailed in the attached blog post above, this enables fine-tuning larger models (up to 50-60B scale models on a NVIDIA A100 80GB) at low cost.

These techniques have enabled fine-tuning large models on consumer devices and Google Colab. Notable demos are fine-tuning facebook/opt-6.7b (13GB in float16), and openai/whisper-large on Google Colab (15GB GPU RAM). To learn more about using peft, refer to our github repo or the previous blog post on training 20b parameter models on consumer hardware.

Now we can fit very large models into a single GPU, but the training might still be very slow. The simplest strategy in this scenario is data parallelism: we replicate the same training setup into separate GPUs and pass different batches to each GPU. With this, you can parallelize the forward/backward passes of the model and scale with the number of GPUs.
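Conceptually, data parallelism comes down to two operations, sketched below in plain Python: each replica takes a disjoint slice of the batches, and the gradients are averaged so every replica applies the same update. The function names are hypothetical, and the averaging step is an assumption based on how distributed data-parallel training commonly works; real frameworks do this with collective communication across GPUs.

```python
def shard_batches(batches, rank, world_size):
    """Give replica `rank` every `world_size`-th batch, so shards do not overlap."""
    return batches[rank::world_size]


def average_gradients(per_replica_grads):
    """Average gradients element-wise across replicas."""
    n = len(per_replica_grads)
    return [sum(values) / n for values in zip(*per_replica_grads)]


batches = list(range(8))
print(shard_batches(batches, rank=1, world_size=4))   # [1, 5]
print(average_gradients([[1.0, 2.0], [3.0, 4.0]]))    # [2.0, 3.0]
```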


We use either the transformers.Trainer or accelerate, which both support data parallelism without any code changes, by simply passing arguments when calling the scripts with torchrun or accelerate launch. The following runs a training script with 8 GPUs on a single machine with accelerate and torchrun, respectively.

accelerate launch --multi_gpu --num_machines 1  --num_processes 8
torchrun --nnodes 1  --nproc_per_node 8

Supervised fine-tuning

Before we start training reward models and tuning our model with RL, it helps if the model is already good in the domain we are interested in. In our case, we want it to answer questions, while for other use cases, we might want it to follow instructions, in which case instruction tuning is a great idea. The easiest way to achieve this is by continuing to train the language model with the language modeling objective on texts from the domain or task. The StackExchange dataset is enormous (over 10 million instructions), so we can easily train the language model on a subset of it.

There is nothing special about fine-tuning the model before doing RLHF: it is just the causal language modeling objective from pretraining that we apply here. To use the data efficiently, we use a technique called packing: instead of having one text per sample in the batch and then padding to either the longest text or the maximal context of the model, we concatenate a lot of texts with an EOS token in between and cut chunks of the context size to fill the batch without any padding.


With this approach the training is much more efficient, as each token that is passed through the model is also trained, in contrast to padding tokens, which are usually masked from the loss. If you don’t have much data and are more concerned about occasionally cutting off some tokens that overflow the context, you can also use a classical data loader.
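A minimal sketch of packing on already-tokenized texts is shown below. The function name is hypothetical and this is not the ConstantLengthDataset code; it also assumes that a trailing remainder shorter than the context size is simply dropped.

```python
def pack_examples(tokenized_texts, eos_token_id, context_size):
    """Concatenate token lists with EOS in between and cut fixed-size chunks."""
    stream = []
    for tokens in tokenized_texts:
        stream.extend(tokens)
        stream.append(eos_token_id)
    # Assumption: an incomplete final chunk is discarded rather than padded.
    n_chunks = len(stream) // context_size
    return [stream[i * context_size:(i + 1) * context_size] for i in range(n_chunks)]


texts = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
print(pack_examples(texts, eos_token_id=0, context_size=4))
# [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```

Note how the second chunk ends with the first token of the third text: packed chunks can start or end in the middle of an example, which is the trade-off for having no padding.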

The packing is handled by the ConstantLengthDataset and we can then use the Trainer after loading the model with peft. First, we load the model in int8, prepare it for training, and then add the LoRA adapters.

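from accelerate import Accelerator
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import AutoModelForCausalLM

model_path = "path/to/llama-7b"  # placeholder: location of the converted LLaMA weights
# Load the model in 8-bit on the GPU assigned to this process.
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_8bit=True, device_map={"": Accelerator().local_process_index}
)
model = prepare_model_for_int8_training(model)

# Add the LoRA adapters; the remaining LoRA hyperparameters are left at library defaults here.
lora_config = LoraConfig(task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)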

We train the model for a few thousand steps with the causal language modeling objective and save the model. Since we will tune the model again with different objectives, we merge the adapter weights with the original model weights.
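Merging is possible because the adapter is itself a linear map: the low-rank product can be folded into the base weight matrix once, after which no separate adapter is needed. The toy sketch below shows the idea with hypothetical helper names and a `scale` factor standing in for the LoRA scaling; it is not the peft merge routine.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(W, A, B, scale=1.0):
    """Fold the adapter into the base weights: W + scale * (B @ A)."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weights (2x2)
A = [[1.0, 2.0]]              # rank-1 adapter, shape 1x2
B = [[0.5], [0.0]]            # rank-1 adapter, shape 2x1
print(merge_lora(W, A, B))    # [[1.5, 1.0], [0.0, 1.0]]
```

The merged matrix has the same shape as the original, so the next training stage can treat it as an ordinary base model and attach fresh adapters.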

Disclaimer: due to LLaMA’s license, we release only the adapter weights for this and the model checkpoints in the following sections. You can apply for access to the base model’s weights by filling out Meta AI’s form and then converting them to the 🤗 Transformers format by running this script. Note that you’ll also need to install 🤗 Transformers from source until v4.28 is released.

Now that we have fine-tuned the model for the task, we are ready to train a reward model.

Reward modeling and human preferences

In principle, we could fine-tune the model using RLHF directly with the human annotations. However, this would require us to send some samples to humans for rating after each optimization iteration. This is expensive and slow due to the number of training samples needed for convergence and the inherent latency of human reading and annotator speed.

A trick that works well instead of direct feedback is training a reward model on human annotations collected before the RL loop. The goal of the reward model is to imitate how a human would rate a text. There are several possible strategies to build a reward model: the most straightforward way would be to predict the annotation (e.g. a rating score or a binary value for “good”/“bad”). In practice, what works better is to predict the ranking of two examples, where the reward model is presented with two candidates $(y_k, y_j)$ for a given prompt $x$ and has to predict which one would be rated higher by human annotators.

This can be translated into the following loss function:

$$\operatorname{loss}(\theta) = -E_{(x, y_j, y_k) \sim D}\left[\log\left(\sigma\left(r_\theta(x, y_j) - r_\theta(x, y_k)\right)\right)\right]$$

where $r$ is the model’s score and $y_j$ is the preferred candidate.

With the StackExchange dataset, we can infer which of the two answers was preferred by the users based on the score. With that information and the loss defined above, we can then modify the transformers.Trainer by adding a custom loss function.

import torch.nn as nn
from transformers import Trainer

class RewardTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Score the preferred (j) and the rejected (k) answer with the same model.
        rewards_j = model(input_ids=inputs["input_ids_j"], attention_mask=inputs["attention_mask_j"])[0]
        rewards_k = model(input_ids=inputs["input_ids_k"], attention_mask=inputs["attention_mask_k"])[0]
        # Pairwise ranking loss: -log(sigmoid(r_j - r_k)), averaged over the batch.
        loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
        if return_outputs:
            return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
        return loss
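To get a feel for the numbers this loss produces, here is the same pairwise objective in plain Python for a small batch of scalar scores. The function name is hypothetical; the expression is rewritten in a numerically stable form but is mathematically the same −log σ(r_j − r_k).

```python
import math


def pairwise_loss(rewards_j, rewards_k):
    """Mean of -log(sigmoid(r_j - r_k)) over (preferred, rejected) score pairs."""
    losses = []
    for r_j, r_k in zip(rewards_j, rewards_k):
        diff = r_j - r_k
        # -log(sigmoid(d)) == log(1 + exp(-d)), computed without overflow.
        losses.append(max(-diff, 0.0) + math.log1p(math.exp(-abs(diff))))
    return sum(losses) / len(losses)


print(pairwise_loss([0.0], [0.0]))  # log(2): the model cannot tell the pair apart
print(pairwise_loss([2.0], [0.0]))  # small: preferred answer scored higher
print(pairwise_loss([0.0], [2.0]))  # large: preferred answer scored lower
```

Only the difference between the two scores matters, so the reward model is free to choose any absolute scale.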

We use a subset of 100,000 pairs of candidates and evaluate on a held-out set of 50,000. With a modest training batch size of 4, we train the LLaMA model using the LoRA peft adapter for a single epoch using the Adam optimizer with BF16 precision. Our LoRA configuration is:

peft_config = LoraConfig(

The training is logged via Weights & Biases and took a few hours on 8 A100 GPUs using the 🤗 research cluster, and the model achieves a final accuracy of 67%. Although this sounds like a low score, the task is also very hard, even for human annotators.
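For a pairwise reward model, accuracy is presumably the fraction of held-out pairs in which the preferred answer receives the higher score; under that assumption, random guessing would land at about 50%. A minimal sketch with a hypothetical function name:

```python
def pairwise_accuracy(rewards_j, rewards_k):
    """Fraction of pairs where the preferred answer (j) outscores the rejected one (k)."""
    correct = sum(1 for r_j, r_k in zip(rewards_j, rewards_k) if r_j > r_k)
    return correct / len(rewards_j)


# Three pairs, two ranked correctly.
print(pairwise_accuracy([1.2, 0.3, -0.5], [0.4, 0.9, -1.0]))
```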

As detailed in the next section, the resulting adapter can be merged into the frozen model and saved for further downstream use.

Reinforcement Learning from Human Feedback

With the fine-tuned language model and the reward model at hand, we are now ready to run the RL loop. It follows roughly three steps:

  1. Generate responses from prompts
  2. Rate the responses with the reward model
  3. Run a reinforcement learning policy-optimization step with the ratings


The Query and Response prompts are templated as follows before being tokenized and passed to the model:

Question: <Query>

Answer: <Response>

The same template was used for SFT, RM and RLHF stages.
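In code, the template amounts to a single format string. The helper below is a sketch with a hypothetical name; it assumes the blank line between the two parts is part of the template, and that the response is left empty when the text is used as a generation prompt.

```python
def format_prompt(query: str, response: str = "") -> str:
    """Render a query and an optional response with the shared template."""
    return f"Question: {query}\n\nAnswer: {response}"


print(format_prompt("How do I reverse a list in Python?"))
print(format_prompt("How do I reverse a list in Python?", "Use my_list[::-1]."))
```

Using one template across all stages keeps the reward model and the policy looking at text in the same shape they were trained on.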

A common issue with training the language model with RL is that the model can learn to exploit the reward model by generating complete gibberish, which causes the reward model to assign high rewards. To balance this, we add a penalty to the reward: we keep a reference of the model that we don’t train and compare the new model’s generation to the reference one by computing the KL-divergence:

$$R(x, y) = r(x, y) - \beta \operatorname{KL}(x, y)$$

where $r$ is the reward from the reward model and $\operatorname{KL}(x, y)$ is the KL-divergence between the current policy and the reference model.
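The penalized reward can be sketched as below. The function name and the value of `beta` are illustrative, and the KL term is estimated here as the summed log-probability ratio of the sampled tokens under the policy and the reference model, which is an assumption about the estimator rather than a description of the trl internals.

```python
def penalized_reward(reward, logprobs_policy, logprobs_ref, beta=0.2):
    """R = r - beta * KL, with KL estimated from per-token log-probabilities."""
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return reward - beta * kl_estimate


# The policy assigns higher probability to its own tokens than the reference does,
# so the KL estimate is positive and the reward is reduced.
print(penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1))
```

A larger `beta` keeps the policy closer to the reference model at the cost of slower reward improvement.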

Once more, we utilize peft for memory-efficient training, which offers an extra advantage in the RLHF context. Here, the reference model and policy share the same base, the SFT model, which we load in 8-bit and freeze during training. We exclusively optimize the policy’s LoRA weights using PPO while sharing the base model’s weights.

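import torch
from tqdm import tqdm

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    question_tensors = batch["input_ids"]

    # Sample responses from the policy; generation_kwargs stands for the sampling
    # settings defined elsewhere in the script (not shown here).
    response_tensors = ppo_trainer.generate(question_tensors, return_prompt=False, **generation_kwargs)
    batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

    # Score each question + response with the reward model.
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[0]["score"] - script_args.reward_baseline) for output in pipe_outputs]

    # Run a PPO step and log the statistics.
    stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)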

We train for 20 hours on 3×8 A100-80GB GPUs, using the 🤗 research cluster, but you can also get decent results much quicker (e.g. after ~20h on 8 A100 GPUs). All the training statistics of the training run are available on Weights & Biases.

Per batch reward at each step during training. The model’s performance plateaus after around 1000 steps.

So what can the model do after training? Let’s have a look!

llama prompt

Although we shouldn’t trust its advice on LLaMA matters just yet, the answer looks coherent and even provides a Google link. Let’s have a look at some of the training challenges next.

Challenges, instabilities and workarounds

Training LLMs with RL is not always plain sailing. The model we demo today is the result of many experiments, failed runs and hyper-parameter sweeps. Even then, the model is far from perfect. Here we will share a few of the observations and headaches we encountered on the way to making this example.

Higher reward means better performance, right?

Wow this run must be great, look at that sweet, sweet, reward!

In general in RL, you want to achieve the highest reward. In RLHF we use a Reward Model, which is imperfect and, given the chance, the PPO algorithm will exploit these imperfections. This can manifest itself as sudden increases in reward; however, when we look at the text generations from the policy, they mostly contain repetitions of the string ```, as the reward model learned that Stack Exchange answers containing blocks of code usually rank higher than ones without it. Fortunately this issue was observed fairly rarely, and in general the KL penalty should counteract such exploits.

KL is always a positive value, isn’t it?

As we previously mentioned, a KL penalty term is used in order to push the model’s outputs to remain close to that of the base policy. In general, KL divergence measures the distance between two distributions and is always a positive quantity. However, in trl we use an estimate of the KL which in expectation is equal to the real KL divergence.

$$\operatorname{KL}_{pen}(x, y) = \log\left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)$$

Clearly, when a token is sampled from the policy which has a lower probability than the SFT model, this will lead to a negative KL penalty, but on average it will be positive, otherwise you wouldn’t be properly sampling from the policy. However, some generation strategies can force some tokens to be generated, or some tokens can be suppressed. For example, when generating in batches, finished sequences are padded, and when setting a minimum length the EOS token is suppressed. The model can assign very high or low probabilities to those tokens, which leads to negative KL. As the PPO algorithm optimizes for reward, it will chase after these negative penalties, leading to instabilities.
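The sign behaviour described above can be checked on a toy next-token distribution. The probabilities and function names below are made up for illustration: the log-ratio of a single token can be negative, its expectation under the policy is the true KL and therefore non-negative, and tokens that are forced rather than sampled break that guarantee.

```python
import math


def log_ratio(p_policy, p_ref, token):
    """Per-token KL estimate: log(pi_RL(token) / pi_SFT(token))."""
    return math.log(p_policy[token] / p_ref[token])


def expected_kl(p_policy, p_ref):
    """Expectation of the log-ratio when tokens are sampled from the policy."""
    return sum(p * math.log(p / p_ref[t]) for t, p in p_policy.items() if p > 0)


policy = {"a": 0.7, "b": 0.2, "eos": 0.1}
reference = {"a": 0.5, "b": 0.2, "eos": 0.3}

print(log_ratio(policy, reference, "a"))    # positive: the policy favours "a"
print(log_ratio(policy, reference, "eos"))  # negative: the policy disfavours "eos"
print(expected_kl(policy, reference))       # non-negative when sampling properly

# If generation forces "eos" at every step instead of sampling it,
# the average estimate over those steps is simply the negative per-token value.
forced = [log_ratio(policy, reference, "eos")] * 5
print(sum(forced) / len(forced))
```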

Negative KL

One needs to be careful when generating the responses, and we suggest always using a simple sampling strategy first before resorting to more sophisticated generation methods.

Ongoing issues

There are still a number of issues that we need to better understand and resolve. For example, there are occasionally spikes in the loss, which can lead to further instabilities.

Loss spikes

As we identify and resolve these issues, we will upstream the changes to trl, to ensure the community can benefit.


In this post, we went through the entire training cycle for RLHF, starting with preparing a dataset with human annotations, adapting the language model to the domain, training a reward model, and finally training a model with RL.

By using peft, anyone can run our example on a single GPU! If training is too slow, you can use data parallelism with no code changes and scale training by adding more GPUs.

For a real use case, this is just the first step! Once you have a trained model, you must evaluate it and compare it against other models to see how good it is. This can be done by ranking generations of different model versions, similar to how we built the reward dataset.

Once you add the evaluation step, the fun begins: you can start iterating on your dataset and model training setup to see if there are ways to improve the model. You could add other datasets to the mix or apply better filters to the existing one. Alternatively, you could try different model sizes and architectures for the reward model or train for longer.

We are actively improving TRL to make all steps involved in RLHF more accessible and are excited to see the things people build with it! Check out the issues on GitHub if you’re interested in contributing.


@misc {beeching2023stackllama,
    author       = { Edward Beeching and
                     Younes Belkada and
                     Kashif Rasul and
                     Lewis Tunstall and
                     Leandro von Werra and
                     Nazneen Rajani and
                     Nathan Lambert
                   },
    title        = { StackLLaMA: An RL Fine-tuned LLaMA Model for Stack Exchange Question and Answering },
    year         = 2023,
    url          = { },
    doi          = { 10.57967/hf/0513 },
    publisher    = { Hugging Face Blog }
}


We thank Philipp Schmid for sharing his wonderful demo of streaming text generation upon which our demo was based. We also thank Omar Sanseviero and Louis Castricato for giving valuable and detailed feedback on the draft of the blog post.
