
How To Finetune GPT-Like Large Language Models on a Custom Dataset

2023-05-25 05:06:38

Takeaways

Learn how to fine-tune large language models (LLMs) on a custom dataset. We will be using Lit-Parrot, a nanoGPT-based implementation of the GPT-NeoX model that supports StableLM, Pythia, and RedPajama-INCITE model weights.

The AI community's effort has led to the development of many high-quality open-source LLMs, including but not limited to Open LLaMA, StableLM, and Pythia. You can fine-tune these models on a custom instruction dataset to adapt them to your specific task, such as training a chatbot to answer financial questions.

Lightning AI recently launched Lit-Parrot, the second LLM implementation in the Lit-* series. The goal of the Lit-* series is to provide the AI/ML community with a clean, robust, and optimized implementation of large language models, with pretraining and fine-tuning support using LoRA and Adapter.

We will guide you through the process step by step, from installation to model download and data preparation to fine-tuning. If you have already completed a step or are confident about it, feel free to skip it.

Installing Lit-Parrot

The Lit-Parrot repository is available in the Lightning AI GitHub organization here. To get started, clone the repository and install its dependencies.


git clone https://github.com/Lightning-AI/lit-parrot
cd lit-parrot

We are using FlashAttention, a fast and memory-efficient implementation of attention, which is only available in PyTorch Nightly 2.1 at the time of writing this article.


# for cuda
pip install --index-url https://download.pytorch.org/whl/nightly/cu118 --pre 'torch>=2.1.0dev'

# for cpu
pip install --index-url https://download.pytorch.org/whl/nightly/cpu --pre 'torch>=2.1.0dev'

Finally, install the dependencies using pip install -r requirements.txt.
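If you want to make sure the nightly build is the one Python actually picks up, a quick optional sanity check (just an illustrative snippet, not part of the repository) is:


# Optional sanity check: confirm a PyTorch 2.1 nightly build is installed.
import torch

print(torch.__version__)  # a nightly build prints something like "2.1.0.dev20230525+cu118"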

Downloading the model weights

In order to use the model or fine-tune it, we need pre-trained weights. Thanks to the efforts of open-source teams, we have a bunch of open-source weights that we can use for commercial purposes. Lit-Parrot, being a GPT-NeoX implementation, supports StableLM, Pythia, and RedPajama-INCITE weights. We use the RedPajama-INCITE 3B parameter weights in this tutorial. You can find the instructions to download other weights in the howto section.


# download the model weights
python scripts/download.py --repo_id togethercomputer/RedPajama-INCITE-Base-3B-v1

# convert the weights to Lit-Parrot format
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1

You will see the gpt_neox layers being mapped to the Lit-Parrot layers in the terminal. After this step, you will find the downloaded weights in the checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1 folder.
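If you want to verify that the download and conversion succeeded, you can simply list that directory (a minimal sketch; the exact file names depend on the Lit-Parrot version you cloned):


# List the converted checkpoint directory to confirm the weights are in place.
from pathlib import Path

checkpoint_dir = Path("checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1")
for f in sorted(checkpoint_dir.iterdir()):
    print(f"{f.name}  ({f.stat().st_size / 1e9:.2f} GB)")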

Prepare the dataset

In this tutorial, we will use the Dolly 2.0 instruction dataset by Databricks for fine-tuning. Finetuning involves two main steps: first, we process the dataset into the Lit-Parrot format, and then we run the fine-tuning script on the processed dataset.

Instruction datasets typically have three keys: instruction, input (optional context for the given instruction), and the expected response from the LLM. Below is a sample example of instruction data:


[
    {
        "instruction": "Arrange the given numbers in ascending order.",
        "input": "2, 4, 0, 8, 3",
        "output": "0, 2, 3, 4, 8"
    },
    ...
]
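During preparation, these keys are combined into a single text prompt that the model is trained on. The Lit-* prepare scripts use an Alpaca-style template for this; the sketch below is a simplified illustration of the idea, not the exact wording in the repository:


def generate_prompt(example: dict) -> str:
    """Build an Alpaca-style prompt from an instruction record (simplified illustration)."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input that provides "
            "further context. Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            "### Response:"
        )
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        "### Response:"
    )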

The Dolly 2.0 dataset comes in JSON Lines format, which is, plainly speaking, a text file with rows of JSON data. It is a convenient format when processing one record at a time. The Dolly dataset contains the following keys:


{
    "instruction": "When did Virgin Australia start operating?",
    "context": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
    "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.",
    "category": "closed_qa"
}
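If you want to peek at the raw file before processing, something like the following works (this assumes you have already downloaded the JSONL locally as databricks-dolly-15k.jsonl; the prepare script below downloads it for you):


# Inspect the first record of the raw Dolly JSONL to confirm its keys.
import json

with open("databricks-dolly-15k.jsonl", "r") as f:
    first_record = json.loads(f.readline())

print(list(first_record.keys()))  # ['instruction', 'context', 'response', 'category']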

We need to rename context to input and response to output, and we are all set to process our data.


import json

with open(file_path, "r") as file:
    data = file.readlines()
    data = [json.loads(line) for line in data]
for item in data:
    item["input"] = item.pop("context")
    item["output"] = item.pop("response")

We can modify the existing Alpaca script for our data preparation. This script downloads data from tloen's Alpaca-LoRA project and saves the processed data. It includes a prepare function that loads the raw instruction dataset, creates prompts, and tokenizes them using the model tokenizer provided in the checkpoint_dir. The tokenized data is split into training and test sets based on the test_split_size provided and saved to the destination_path.

To modify the Alpaca script, open it from here and edit the prepare function. This is how our final function looks after mapping the keys appropriately.


DATA_FILE = "https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/foremost/databricks-dolly-15k.jsonl"
DATA_FILE_NAME = "dolly_data_cleaned_archive.json" def put together(
destination_path: Path = Path("information/dolly"),
checkpoint_dir: Path = Path("checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1"),
test_split_size: int = 2000,
max_seq_length: int = 256,
seed: int = 42,
mask_inputs: bool = False, # as in alpaca-lora
data_file_name: str = DATA_FILE_NAME,
) -> None:
"""Put together the Dolly dataset for instruction tuning. The output is a coaching and validation dataset saved as `prepare.pt` and `val.pt`,
which shops the preprocessed and tokenized prompts and labels.
"""
destination_path.mkdir(mother and father=True, exist_ok=True)
file_path = destination_path / data_file_name
obtain(file_path) tokenizer = Tokenizer(checkpoint_dir / "tokenizer.json", checkpoint_dir / "tokenizer_config.json") with open(file_path, "r") as file:
information = file.readlines()
information = [json.loads(line) for line in data]
for merchandise in information:
merchandise["input"] = merchandise.pop("context")
merchandise["output"] = merchandise.pop("response") # Partition the dataset into prepare and take a look at
train_split_size = len(information) - test_split_size
train_set, test_set = random_split(
information, lengths=(train_split_size, test_split_size), generator=torch.Generator().manual_seed(seed)
)
train_set, test_set = record(train_set), record(test_set) print(f"prepare has {len(train_set):,} samples")
print(f"val has {len(test_set):,} samples") print("Processing prepare cut up ...")
train_set = [prepare_sample(sample, tokenizer, max_seq_length, mask_inputs) for sample in tqdm(train_set)]
torch.save(train_set, file_path.guardian / "prepare.pt") print("Processing take a look at cut up ...")
test_set = [prepare_sample(sample, tokenizer, max_seq_length, mask_inputs) for sample in tqdm(test_set)]
torch.save(test_set, file_path.guardian / "take a look at.pt")
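For reference, prepare_sample is a helper that already ships with the Alpaca script: it builds the full prompt from the instruction and input, tokenizes the prompt with and without the response appended, and (if mask_inputs is set) masks the prompt tokens out of the labels so the loss is only computed on the response. A simplified sketch of that idea, not the exact implementation, looks like this:


def prepare_sample(example: dict, tokenizer, max_length: int, mask_inputs: bool) -> dict:
    """Simplified illustration of how one record becomes a tokenized training sample."""
    full_prompt = generate_prompt(example)                 # prompt without the answer
    full_prompt_and_response = full_prompt + example["output"]

    # The exact Tokenizer.encode arguments differ in the repo; this is only a sketch.
    encoded_prompt = tokenizer.encode(full_prompt, max_length=max_length)
    encoded_full = tokenizer.encode(full_prompt_and_response, eos=True, max_length=max_length)

    labels = encoded_full.clone()
    if mask_inputs:
        labels[: len(encoded_prompt)] = -1                 # ignore index used by the loss

    return {**example, "input_ids": encoded_full, "labels": labels}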

Finally, let's run the script by providing the data path and the model checkpoint directory.



python scripts/prepare_yourscript.py \
  --destination_path data/dolly \
  --checkpoint_dir checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1

Finetuning the RedPajama-INCITE model

Once you have completed all of the above steps, it is simple to start fine-tuning. You need to run the finetune_adapter.py script, providing your data path.


python finetune_adapter.py \
  --data_dir data/dolly \
  --checkpoint_dir checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1

You can update the default number of GPUs, micro-batch size, and all the other hyperparameters in the fine-tuning script here.
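For illustration, these hyperparameters are plain module-level constants near the top of finetune_adapter.py, so changing them is just an edit. The names and default values below are only a sketch of what that section looks like; check the script itself for the actual ones:


# Illustrative sketch of the constants defined at the top of finetune_adapter.py
# (the real names and defaults live in the script itself).
devices = 1              # number of GPUs to train on
micro_batch_size = 4     # samples per forward/backward pass
batch_size = 64          # effective batch size reached via gradient accumulation
gradient_accumulation_iters = batch_size // micro_batch_size
learning_rate = 3e-3
max_iters = 50_000       # total number of training iterations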

You can play with your fine-tuned model using the generate_adapter.py script by trying different prompts and tuning the model temperature.


python generate_adapter.py \
  --adapter_path out/adapter/alpaca/iter-015999.pth \
  --checkpoint_dir checkpoints/togethercomputer/RedPajama-INCITE-Base-3B-v1 \
  --prompt "Who is the author of Game of Thrones?"

We would love to hear what you have built with Lit-Parrot. Do share your favorite prompt and response with us on Twitter or in the Discord community!


