
Way Enough – Fine-tuning gpt-3.5-turbo to learn to play “Connections”

2024-01-15 11:02:33

I started playing the NYTimes word game “Connections” recently, on the recommendation of some friends.
It has the kind of freshness that Wordle lost for me a long time ago.
After playing Connections for a few days, I wondered if an OpenAI language model could solve the game (the objective is to group the 16 words into 4 categories of 4 words).
I tried with gpt-4-32k and gpt-4-1106-preview, tweaking prompts for a few hours, and wasn’t able to make much progress.
It’s certainly possible that prompt engineering alone could solve this problem, but it wasn’t easy for me to find a path forward.
I imagine it would involve a bit of creativity.
I decided this was as good a time as any to try to fine-tune a model to do something I couldn’t easily get it to do with prompts.

Getting the dataset

I remembered seeing at some point that Wordle had an API to return the day’s word, and it does, at https://www.nytimes.com/svc/wordle/v2/yyyy-mm-dd.json.
I figured Connections might have a similar JSON API, and it does.
I wrote a script to get all the solutions to past games and put them in the connections_data folder with the naming convention yyyy-mm-dd.json.

import json
import httpx
from datetime import timedelta, datetime


url = "https://www.nytimes.com/svc/connections/v1/{date}.json"
start_date = datetime.strptime("2023-06-12", "%Y-%m-%d")

def main():
    end_date = datetime.now()
    dates_generated = [
        start_date + timedelta(days=x)
        for x in range((end_date - start_date).days + 1)
    ]
    for date in dates_generated:
        formatted_date = date.strftime("%Y-%m-%d")
        response = httpx.get(url.format(date=formatted_date))
        response_object = response.json()
        with open(f"connections_data/{formatted_date}.json" , "w") as f:
            f.write(json.dumps(response_object, indent=2))


if __name__ == "__main__":
    main()

It turns out the first game was published on 2023-06-12.
Next, I wrote some not-so-pretty code to create a jsonl file to upload to OpenAI.
They describe the structure of the file needed to fine-tune the gpt-3.5-turbo model here.

import os
import json

SYSTEM_PROMPT = """The game "Connections" is a word game where you start with 16 words and need to group them into 4 groups of 4. Each grouping has a category that unambiguously groups the four words together. Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories. You will be given 16 words. Output 4 groupings of 4 words and the categories to which they belong."""

OUT_FILE = "connections_prompts.jsonl"

def main():
    with open(OUT_FILE, "w") as writef:
        for file in os.listdir("connections_data"):
            if file.endswith(".json"):
                print(file)
                with open(f"connections_data/{file}", "r") as f:
                    data = json.load(f)
                categories = [c for c in data["groups"].keys()]
                categories_and_members = {c: data["groups"][c] for c in categories}
                assistant_str = ""
                for c, words in categories_and_members.items():
                    assistant_str += f"category: {c.lower()}\nwords: {' '.join(words)}\n\n"
                all_words = [word.lower() for group in categories_and_members.values() for word in group["members"]]
                all_words_str = ', '.join(all_words)
                out_obj = {
                    "messages": [
                        {
                            "role": "system",
                            "content": SYSTEM_PROMPT,
                        },
                        {
                            "role": "user",
                            "content": f"Here are the 16 words: {all_words_str}",
                        },
                        {
                            "role": "assistant",
                            "content": assistant_str.strip(),
                        },
                    ]
                }
                writef.write(f"{json.dumps(out_obj)}\n")


if __name__ == "__main__":
    main()

Estimating Price

I estimated the price of the job by counting the number of tokens in my fine-tune file.
The OpenAI pricing page lists the price of the fine-tune at $0.0080 / 1K tokens.
Additionally, the fine-tuning guide notes

To estimate the costs for a specific fine-tuning job, use the following formula:

base cost per 1k tokens * number of tokens in the input file * number of epochs trained

For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be ~$2.40 USD.

Using tiktoken to count tokens

import tiktoken

# gpt-3.5-turbo	$0.0080 / 1K tokens

with open("connections_prompts.jsonl", "r") as file:
    knowledge = file.learn()

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = len(encoding.encode(knowledge))
value = (tokens / 1000) * 0.0080
print(f"Variety of tokens: {tokens}.nWorth: {value:.2f}")

which output

Number of tokens: 46814.
Price: 0.37
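
That figure prices a single pass over the training file; the guide’s formula also multiplies by the number of epochs. Here’s a minimal, self-contained extension of the snippet above, assuming the 3 epochs my job ended up running (see below):

import tiktoken

# Total-cost estimate per the fine-tuning guide's formula:
# base cost per 1k tokens * tokens in the input file * number of epochs
with open("connections_prompts.jsonl", "r") as file:
    tokens = len(tiktoken.encoding_for_model("gpt-3.5-turbo").encode(file.read()))

epochs = 3  # assumption: the epoch count the job ended up using
estimated_total = (tokens / 1000) * 0.0080 * epochs
print(f"Estimated total for {epochs} epochs: ${estimated_total:.2f}")  # ~$1.12 for 46,814 tokens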

Running the fine-tune job

I took 15 of the rows (number chosen fairly arbitrarily) from the result and separated them out of the training data into a test file, so that OpenAI could use it to calculate loss during the fine-tuning (I think).
I called these files connections_prompts_train.jsonl and connections_prompts_test.jsonl.
I went to OpenAI and created a new fine-tuning job.
I selected the gpt-3.5-turbo-1106 model, uploaded my training and test data, clicked “create”, and off it went.
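
I created the job through the web UI, but the same upload-and-create flow can also be done with the Python SDK. Here’s a rough sketch (not what I ran), with hyperparameters left at their defaults:

from openai import OpenAI

client = OpenAI()

# Upload the training and test files for fine-tuning
train_file = client.files.create(
    file=open("connections_prompts_train.jsonl", "rb"), purpose="fine-tune"
)
test_file = client.files.create(
    file=open("connections_prompts_test.jsonl", "rb"), purpose="fine-tune"
)

# Create the fine-tuning job against gpt-3.5-turbo-1106
job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo-1106",
    training_file=train_file.id,
    validation_file=test_file.id,
)
print(job.id)  # poll with client.fine_tuning.jobs.retrieve(job.id)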
As the job ran, I could watch this dashboard.
Here’s what it looked like when it finished.

Fine-tuning progress: a loss graph with two lines, slowly decreasing from around 3 down to around 0.2

And it cost $0.90.
According to the fine-tune metrics, it ran for 3 epochs (given my estimated price of $0.37 per epoch), as mentioned in the fine-tune training guide, though I don’t recall setting that or seeing it anywhere else.

Fine-tuning cost screenshot

Testing the fine-tune

With the fine-tuning finished, I went to the playground to try it out.
I fetched the words for the next day’s puzzle, which was mercifully already available.
I forgot to make a validation set.
Shame on me.
It would have taken a bit longer to fine-tune again, and I had a lot of anticipation to see the results, it being my first fine-tune, not really understanding the loss graph or knowing if it had worked.

I copied the same system prompt from the training

The sport "Connections" is a phrase sport the place you begin with 16 phrases and must group them into 4 teams of 4. Every grouping has a class that unambiguously teams the 4 phrases collectively. Every puzzle has precisely one answer. Be careful for phrases that appear to belong to a number of classes. You'll be given 16 phrases. Output 4 groupings of 4 phrases and the classes to which they belong.

then added the user message with the words

Here are the 16 words: punt, badge, tackle, run, direct, pass, end, table, ticket, stall, chair, center, lead, safety, invite, hold

and clicked submit.
The model output

category: american football terms
words: level members

category: football actions
words: level members

category: things that can be numbered
words: level members

category: field positions
words: level members

A definite bummer.
I did a quick search in my project directory for “words: level members” and found it everywhere.
Oops!
I had built the dataset improperly.

Fixing a dataset bug

After a bit of a reset, I found some issues in the prompt generation Python script and fixed them here

import os
import json

SYSTEM_PROMPT = """The game "Connections" is a word game where you start with 16 words and need to group them into 4 groups of 4. Each grouping has a category that unambiguously groups the four words together. Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories. You will be given 16 words. Output 4 groupings of 4 words and the categories to which they belong."""

OUT_FILE = "connections_prompts.jsonl"

def main():
    with open(OUT_FILE, "w") as writef:
        for file in os.listdir("connections_data"):
            if file.endswith(".json"):
                print(file)
                with open(f"connections_data/{file}", "r") as f:
                    data = json.load(f)
                categories = [c for c in data["groups"].keys()]
                categories_and_members = {c: data["groups"][c]["members"] for c in categories}
                assistant_str = ""
                for c, words in categories_and_members.items():
                    lower_words = [w.lower() for w in words]
                    assistant_str += f"category: {c.lower()}\nwords: {', '.join(lower_words)}\n\n"
                all_words = [word.lower() for group in categories_and_members.values() for word in group]
                all_words_str = ', '.join(all_words)
                out_obj = {
                    "messages": [
                        {
                            "role": "system",
                            "content": SYSTEM_PROMPT,
                        },
                        {
                            "role": "user",
                            "content": f"Here are the 16 words: {all_words_str}",
                        },
                        {
                            "role": "assistant",
                            "content": assistant_str.strip(),
                        },
                    ]
                }
                writef.write(f"{json.dumps(out_obj)}\n")


if __name__ == "__main__":
    main()

I checked my data and validated that the correct words for each category were in the output jsonl file.
This time around, I remembered to create an external validation set so I had a few cases to run once the fine-tune was ready.
I split the data up: 60% to train, 20% to validate during the fine-tune, and 20% to use myself to validate after the fine-tune.
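
The split itself was just a bit of throwaway code, roughly like this sketch (the three output file names match the ones used elsewhere in this post, but the splitting code here is illustrative, not exactly what I ran):

import random

random.seed(0)  # stable seed so the split is reproducible

with open("connections_prompts.jsonl", "r") as f:
    rows = f.readlines()

random.shuffle(rows)
n = len(rows)
splits = {
    "connections_prompts_train.jsonl": rows[: int(n * 0.6)],               # 60% to train
    "connections_prompts_test.jsonl": rows[int(n * 0.6) : int(n * 0.8)],   # 20% for fine-tune validation
    "validation.jsonl": rows[int(n * 0.8) :],                              # 20% held out for my own checks
}
for name, subset in splits.items():
    with open(name, "w") as f:
        f.writelines(subset)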

I put up my feet and waited for the fine-tune to run.

The first run of the fine-tune.

fine-tune run model output

I’m very glad I had extra validation data because I couldn’t believe it when I saw it work.
I ran several more, and each time it got the word groups correct.
Though occasionally it seemed to struggle to get the category right for “x ___” or “___ x” categories.
But honestly, I was a little shocked.
Before this fine-tuning, the model couldn’t come all that close to solving Connections puzzles, and now it could get the four word groups right every time and the categories right most of the time.
It seemed too good to be true.
It was!
Can you spot the problem?
The input words

Here are the 16 words: fido, lucky, rover, spot, catch, notice, note, see, bait, chum, fly, sinker, bone, rex, shirt, storm

and the output categories

category: common dog names
words: fido, lucky, rover, spot

category: perceive
words: catch, notice, note, see

category: fishing gear
words: bait, chum, fly, sinker

category: ___ in a teacup
words: bone, rex, shirt, storm

are in the same order.

When I change the input order of the words for this fine-tune

Here are the 16 words: lucky, chum, shirt, storm, spot, catch, rex, fly, note, rover, see, bait, fido, notice, sinker, bone

it all falls apart.


category: pal
words: lucky, chum, shirt, storm

category: notice
words: spot, catch, rex

category: ways to call
words: fly, note, rover, see

category: dog, e.g.
words: bait, fido, notice, sinker

The categories stop making sense.
The model hallucinates words.
Some categories don’t even have four words in them.
This fine-tune is toast.

Fixing another dataset bug

Back to the code.
I modified my dataset generation code, then split my data into three sets again.
I used a stable random seed so that the generation would be the same each time.

import os
import json
import random

random.seed(42)

SYSTEM_PROMPT = """The game "Connections" is a word game where you start with 16 words and need to group them into 4 groups of 4. Each grouping has a category that unambiguously groups the four words together. Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories. You will be given 16 words. Output 4 groupings of 4 words and the categories to which they belong."""

OUT_FILE = "connections_prompts.jsonl"

def main():
    with open(OUT_FILE, "w") as writef:
        for file in os.listdir("connections_data"):
            if file.endswith(".json"):
                with open(f"connections_data/{file}", "r") as f:
                    data = json.load(f)
                categories = [c for c in data["groups"].keys()]
                categories_and_members = {c: data["groups"][c]["members"] for c in categories}
                assistant_str = ""
                for c, words in categories_and_members.items():
                    lower_words = [w.lower() for w in words]
                    assistant_str += f"category: {c.lower()}\nwords: {', '.join(lower_words)}\n\n"
                all_words = [word.lower() for group in categories_and_members.values() for word in group]
                random.shuffle(all_words)
                all_words_str = ', '.join(all_words)
                out_obj = {
                    "messages": [
                        {
                            "role": "system",
                            "content": SYSTEM_PROMPT,
                        },
                        {
                            "role": "user",
                            "content": f"Here are the 16 words: {all_words_str}",
                        },
                        {
                            "role": "assistant",
                            "content": assistant_str.strip(),
                        },
                    ]
                }
                writef.write(f"{json.dumps(out_obj)}\n")


if __name__ == "__main__":
    main()

I ran a few tests

Here are the 16 words: adult, height, kiss, top, pair, tease, times, josh, peak, ten, score, gross, baby, dozen, max, rib
category: nudge, as for information
words: hint, josh, rib, tease

category: mountain ___
words: adult, gross, height, peak

category: 12
words: dozen, max, score, times

category: ___ number
words: baby, kiss, pair, top

and

Here are the 16 words: get, presto, safe, tada, fire, voila, carpenter, adam, win, behold, surprise, anger, land, fear, happiness, pink
category: eureka!
words: behold, presto, tada, voila

category: “inside out” emotions
words: anger, fear, happiness, sadness

category: ___ and peace
words: land, order, pink, safe

category: ___’s rib
words: adam, carpenter, get, surprise

The results are okay, but not incredible.
It does get some groupings correct.
There are hallucinations, adding words that don’t exist in the puzzle.
Also, most categories and groupings aren’t correct.

Evaluating the results

To determine whether this fine-tune was worthwhile, I needed to benchmark it against the performance of gpt-3.5-turbo-1106.

I wrote some quick and dirty code to read the validation.jsonl, make an OpenAI call, parse the response, and compare it to the known, correct answer.
I decided to measure the percentage of puzzles correct (all 4 categories) and the percentage of categories correct (4 words correctly grouped).
I also augmented the user prompt a bit to get the un-fine-tuned model to output its response the same way the fine-tuned model would, for consistent parsing.

import json
from openai import OpenAI


client = OpenAI()

data = []
puzzles_correct = 0
puzzles_incorrect = 0
with open("validation.jsonl", "r") as file:
    for row in file:
        data.append(json.loads(row))

total_puzzles = len(data)
categories_correct = 0
total_categories = total_puzzles * 4

for da in data:
    system_prompt = da["messages"][0]["content"]
    user_prompt = da["messages"][1]["content"] + """\nOutput your response in exact adherence to the following form

category: <category>
words: <word1>, <word2>, <word3>, <word4>

for example:

category: height
words: height, top, peak, max
"""

    expected_result = da["messages"][2]

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        # for the fine-tune:
        # model="ft:gpt-3.5-turbo-1106:personal::<my_id>",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
    )
    # parse the model's response into {category: (sorted words)}
    result = completion.choices[0].message.content
    categories_with_words = {}
    for line in result.strip().split("\n\n"):
        category, words_str = line.lower().split("\nwords: ")
        category_name = category.split(": ")[1].strip()
        words = [word.strip() for word in words_str.split(", ")]
        categories_with_words[category_name] = tuple(sorted(words))

    # compare to the correct answer
    correct_categories_with_words = {}
    correct_str = da["messages"][2]["content"]
    for line in correct_str.strip().split("\n\n"):
        category, words_str = line.split("\nwords: ")
        category_name = category.split(": ")[1].strip()
        words = [word.strip() for word in words_str.split(", ")]
        correct_categories_with_words[category_name] = tuple(sorted(words))

    num_correct = 0
    for word_list in categories_with_words.values():
        if word_list in list(correct_categories_with_words.values()):
            print(word_list)
            num_correct += 1
    categories_correct += num_correct
    if num_correct == 4:
        puzzles_correct += 1
    else:
        puzzles_incorrect += 1

# Calculate and print the percentages
percent_correct = (puzzles_correct / total_puzzles) * 100
percent_incorrect = (puzzles_incorrect / total_puzzles) * 100
print(f"Correct: {percent_correct:.2f}%")
print(f"Incorrect: {percent_incorrect:.2f}%")

percent_categories_correct = (categories_correct / total_categories) * 100
print(f"Total Categories Correct: {percent_categories_correct:.2f}%")

I started by running the code against gpt-3.5-turbo-1106.

Correct: 0.00%
Incorrect: 100.00%
Total Categories Correct: 20.24%

Next, I ran it against my fine-tuned model.

Correct: 4.76%
Incorrect: 95.24%
Total Categories Correct: 23.81%

Not a huge difference.
I did a few more runs and none of the results were too out of the ordinary.
Other than consistent formatting, it’s not clear the model got all that much better at the game after fine-tuning.

Wrap up

This experience was an interesting introduction to model fine-tuning.
The results weren’t that fantastic, but I learned a lot about easy pitfalls and mistakes one can make, and I had some fun.

Future work

To see if there is more to probe on this project, I ran the validation set through gpt-4, with the following results

Correct: 9.52%
Incorrect: 90.48%
Total Categories Correct: 39.29%

This improvement is notable over gpt-3.5’s ~20-25%.
I requested access to fine-tune gpt-4, so we’ll see if that comes through and I can try it out.

