Way Enough – Fine-tuning gpt-3.5-turbo to learn to play “Connections”
I started playing the NYTimes word game “Connections” recently, at the recommendation of a few friends.
It has the kind of freshness that Wordle lost for me a long time ago.
After playing Connections for a few days, I wondered if an OpenAI language model could solve the game (the objective is to group the 16 words into 4 categories of 4 words).
I tried with gpt-4-32k
and gpt-4-1106-preview
, tweaking prompts for a few hours, and wasn’t able to make much progress.
It’s certainly possible that prompt engineering alone could solve this problem, but it wasn’t easy for me to find a path forward.
I imagine it would involve a bit of creativity.
I decided this was as good a time as any to try to fine-tune a model to do a thing I couldn’t easily get it to do with prompts.
Getting the dataset
I remembered seeing at some point that Wordle had an API to return the day’s word and it does at https://www.nytimes.com/svc/wordle/v2/yyyy-mm-dd.json
.
I figured Connections might have a similar JSON API and it does.
I wrote a script to get all of the solutions to past games and put them in the connections_data
folder with the naming convention yyyy-mm-dd.json
.
import json
import httpx
from datetime import timedelta, datetime

url = "https://www.nytimes.com/svc/connections/v1/{date}.json"
start_date = datetime.strptime("2023-06-12", "%Y-%m-%d")

def main():
    end_date = datetime.now()
    dates_generated = [
        start_date + timedelta(days=x)
        for x in range((end_date - start_date).days + 1)
    ]
    for date in dates_generated:
        formatted_date = date.strftime("%Y-%m-%d")
        response = httpx.get(url.format(date=formatted_date))
        response_object = response.json()
        with open(f"connections_data/{formatted_date}.json", "w") as f:
            f.write(json.dumps(response_object, indent=2))

if __name__ == "__main__":
    main()
It turns out the first game was published on 2023-06-12.
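Each day’s response contains more than I need; the only part the rest of this post relies on is the "groups" object, where each category name maps to an object with a "members" list holding its four words. A minimal sketch for inspecting one of the saved files:

import json

# Peek at one saved puzzle to see the structure the later scripts depend on.
with open("connections_data/2023-06-12.json", "r") as f:
    data = json.load(f)

# data["groups"] maps each category name to an object with a "members" list.
for category, group in data["groups"].items():
    print(category, "->", group["members"])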
Next, I wrote some not-quite-correct code to create a jsonl file to upload to OpenAI.
They describe the structure of the file to fine-tune the gpt-3.5-turbo
model here.
import os
import json

SYSTEM_PROMPT = """The game "Connections" is a word game where you start with 16 words and need to group them into 4 groups of 4. Each grouping has a category that unambiguously groups the four words together. Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories. You will be given 16 words. Output 4 groupings of 4 words and the categories to which they belong."""

OUT_FILE = "connections_prompts.jsonl"

def main():
    with open(OUT_FILE, "w") as writef:
        for file in os.listdir("connections_data"):
            if file.endswith(".json"):
                print(file)
                with open(f"connections_data/{file}", "r") as f:
                    data = json.load(f)
                categories = [c for c in data["groups"].keys()]
                categories_and_members = {c: data["groups"][c] for c in categories}
                assistant_str = ""
                for c, words in categories_and_members.items():
                    assistant_str += f"category: {c.lower()}\nwords: {' '.join(words)}\n\n"
                all_words = [word.lower() for group in categories_and_members.values() for word in group["members"]]
                all_words_str = ', '.join(all_words)
                out_obj = {
                    "messages": [
                        {
                            "role": "system",
                            "content": SYSTEM_PROMPT,
                        },
                        {
                            "role": "user",
                            "content": f"Here are the 16 words: {all_words_str}",
                        },
                        {
                            "role": "assistant",
                            "content": assistant_str.strip(),
                        },
                    ]
                }
                writef.write(f"{json.dumps(out_obj)}\n")

if __name__ == "__main__":
    main()
Estimating price
I estimated the price of the job by counting the number of tokens in my fine-tune file.
The OpenAI pricing page lists the price of the fine-tune at $0.0080 / 1K tokens.
Additionally, the fine-tuning guide notes
To estimate the costs for a specific fine-tuning job, use the following formula:
base cost per 1k tokens * number of tokens in the input file * number of epochs trained
For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be ~$2.40 USD.
Using tiktoken
to count tokens
import tiktoken

# gpt-3.5-turbo $0.0080 / 1K tokens
with open("connections_prompts.jsonl", "r") as file:
    data = file.read()

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = len(encoding.encode(data))
price = (tokens / 1000) * 0.0080
print(f"Number of tokens: {tokens}.\nPrice: {price:.2f}")
which output
Number of tokens: 46814.
Price: 0.37
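That $0.37 covers a single pass over the training data; per the formula from the fine-tuning guide, the whole-job estimate also multiplies by the number of epochs. Continuing the snippet above (assuming 3 epochs, which is what my job ended up running):

epochs = 3  # assumption: my job ended up running 3 epochs (see below)
estimated_job_price = (tokens / 1000) * 0.0080 * epochs
print(f"Estimated fine-tune job price: {estimated_job_price:.2f}")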
Running the fine-tune job
I took 15 of the rows (number chosen fairly randomly) from the result and separated them out into their own file, so that OpenAI could use it to calculate loss during the fine-tuning (I think).
I called these files connections_prompts_train.jsonl
and connections_prompts_test.jsonl
.
I went to OpenAI and created a new fine-tuning job.
I selected the gpt-3.5-turbo-1106
model, then uploaded my training and test data, clicked “create”, and off it went.
As the job ran, I could see this dashboard.
Here’s what it looked like when it finished.
And it cost $0.90.
According to the fine-tune metrics, the job ran for 3 epochs (consistent with my estimated price of $0.37 per epoch), as mentioned in the fine-tune training guide, though I don’t recall setting that or seeing it anywhere else.
Testing the fine-tune
With the fine-tuning finished, I went to the playground to try it out.
I fetched the words for the next day’s puzzle, which was mercifully already available.
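Something like the following, reusing the endpoint from the download script, is enough to pull them (a rough sketch; note the words come back grouped by category):

from datetime import datetime, timedelta
import httpx

# Fetch tomorrow's puzzle from the same NYT endpoint used to build the dataset.
tomorrow = (datetime.now() + timedelta(days=1)).strftime("%Y-%m-%d")
puzzle = httpx.get(
    f"https://www.nytimes.com/svc/connections/v1/{tomorrow}.json"
).json()

# Flatten the groups into the 16 words to paste into the playground.
words = [w.lower() for group in puzzle["groups"].values() for w in group["members"]]
print("Here are the 16 words: " + ", ".join(words))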
I forgot to make a validation set.
Shame on me.
It would have taken a bit longer to fine-tune again, and I had a lot of anticipation to see the results, it being my first fine-tune and not really understanding the loss graph or knowing if it worked.
I copied the same system prompt from the training
The game "Connections" is a word game where you start with 16 words and need to group them into 4 groups of 4. Each grouping has a category that unambiguously groups the four words together. Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories. You will be given 16 words. Output 4 groupings of 4 words and the categories to which they belong.
then added the user message with the words
Here are the 16 words: punt, badge, tackle, run, direct, pass, end, table, ticket, stall, chair, center, lead, safety, invite, hold
and clicked submit.
The model output
category: american football terms
words: level members
category: football actions
words: level members
category: things that can be numbered
words: level members
category: field positions
words: level members
A definite bummer.
I did a quick search in my project directory for “words: level members”, and found it everywhere.
Oops!
I built the dataset improperly.
Fixing a dataset bug
After a bit of a reset, I found some issues in the prompt generation Python script and fixed them here
import os
import json

SYSTEM_PROMPT = """The game "Connections" is a word game where you start with 16 words and need to group them into 4 groups of 4. Each grouping has a category that unambiguously groups the four words together. Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories. You will be given 16 words. Output 4 groupings of 4 words and the categories to which they belong."""

OUT_FILE = "connections_prompts.jsonl"

def main():
    with open(OUT_FILE, "w") as writef:
        for file in os.listdir("connections_data"):
            if file.endswith(".json"):
                print(file)
                with open(f"connections_data/{file}", "r") as f:
                    data = json.load(f)
                categories = [c for c in data["groups"].keys()]
                categories_and_members = {c: data["groups"][c]["members"] for c in categories}
                assistant_str = ""
                for c, words in categories_and_members.items():
                    lower_words = [w.lower() for w in words]
                    assistant_str += f"category: {c.lower()}\nwords: {', '.join(lower_words)}\n\n"
                all_words = [word.lower() for group in categories_and_members.values() for word in group]
                all_words_str = ', '.join(all_words)
                out_obj = {
                    "messages": [
                        {
                            "role": "system",
                            "content": SYSTEM_PROMPT,
                        },
                        {
                            "role": "user",
                            "content": f"Here are the 16 words: {all_words_str}",
                        },
                        {
                            "role": "assistant",
                            "content": assistant_str.strip(),
                        },
                    ]
                }
                writef.write(f"{json.dumps(out_obj)}\n")

if __name__ == "__main__":
    main()
I checked my data and validated that the correct words for each category were in the output jsonl file.
This time around, I remembered to create an external validation set so I had a few cases to run once the fine-tune was ready.
I split the data up: 60% to train, 20% to validate during the fine-tune, and 20% to use myself to validate after the fine-tune.
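Roughly how the split looked (a minimal sketch; the filenames are the ones referenced elsewhere in this post, and the shuffle just uses an arbitrary seed):

import random

random.seed(42)  # arbitrary, just to make the split reproducible

with open("connections_prompts.jsonl", "r") as f:
    rows = f.readlines()

random.shuffle(rows)
n = len(rows)
splits = {
    "connections_prompts_train.jsonl": rows[: int(n * 0.6)],  # 60% to train
    "connections_prompts_test.jsonl": rows[int(n * 0.6) : int(n * 0.8)],  # 20% to validate during the fine-tune
    "validation.jsonl": rows[int(n * 0.8) :],  # 20% held back for my own checks afterwards
}
for name, split_rows in splits.items():
    with open(name, "w") as f:
        f.writelines(split_rows)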
I put up my feet and waited for the fine-tune to run.
The first run of the fine-tune.
I’m very glad I had extra validation data because I couldn’t believe it when I saw it work.
I ran several more and each time, it got the word groups correct.
Though occasionally it seemed to struggle to get the category right with x ___
or ___ x
categories.
But really, I was a little shocked.
Before this fine-tuning, the model couldn’t come all that close to solving Connections puzzles, and now it could get the four word groups right every time and the categories right most of the time.
It seemed too good to be true.
It was!
Can you spot the problem?
The input words
Here are the 16 words: fido, lucky, rover, spot, catch, discover, observe, see, bait, chum, fly, sinker, bone, rex, shirt, storm
and the output categories
category: common dog names
words: fido, lucky, rover, spot
category: understand
words: catch, discover, observe, see
category: fishing gear
words: bait, chum, fly, sinker
category: ___ in a teacup
words: bone, rex, shirt, storm
are in the same order.
When I change the input order of the words for this fine-tune
Here are the 16 words: lucky, chum, shirt, storm, spot, catch, rex, fly, observe, rover, see, bait, fido, discover, sinker, bone
it all falls apart.
category: pal
words: lucky, chum, shirt, storm
category: discover
words: spot, catch, rex
category: ways to call
words: fly, observe, rover, see
category: dog, e.g.
words: bait, fido, discover, sinker
The categories stop making sense.
The model hallucinates words.
Some categories don’t even have four words in them.
This fine-tune is toast.
Fixing another dataset bug
Back to the code.
I modified my dataset generation code, then split my data into three sets again.
I used a stable random seed, so that the generation would be the same each time.
import os
import json
import random

random.seed(42)

SYSTEM_PROMPT = """The game "Connections" is a word game where you start with 16 words and need to group them into 4 groups of 4. Each grouping has a category that unambiguously groups the four words together. Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories. You will be given 16 words. Output 4 groupings of 4 words and the categories to which they belong."""

OUT_FILE = "connections_prompts.jsonl"

def main():
    with open(OUT_FILE, "w") as writef:
        for file in os.listdir("connections_data"):
            if file.endswith(".json"):
                with open(f"connections_data/{file}", "r") as f:
                    data = json.load(f)
                categories = [c for c in data["groups"].keys()]
                categories_and_members = {c: data["groups"][c]["members"] for c in categories}
                assistant_str = ""
                for c, words in categories_and_members.items():
                    lower_words = [w.lower() for w in words]
                    assistant_str += f"category: {c.lower()}\nwords: {', '.join(lower_words)}\n\n"
                all_words = [word.lower() for group in categories_and_members.values() for word in group]
                random.shuffle(all_words)
                all_words_str = ', '.join(all_words)
                out_obj = {
                    "messages": [
                        {
                            "role": "system",
                            "content": SYSTEM_PROMPT,
                        },
                        {
                            "role": "user",
                            "content": f"Here are the 16 words: {all_words_str}",
                        },
                        {
                            "role": "assistant",
                            "content": assistant_str.strip(),
                        },
                    ]
                }
                writef.write(f"{json.dumps(out_obj)}\n")

if __name__ == "__main__":
    main()
I ran a few tests
Here are the 16 words: adult, top, kiss, prime, pair, tease, times, josh, peak, ten, score, gross, kid, dozen, max, rib
category: nudge, as for info
words: hint, josh, rib, tease
category: mountain ___
words: adult, gross, top, peak
category: 12
words: dozen, max, score, times
category: ___ number
words: kid, kiss, pair, prime
and
Here are the 16 words: get, presto, secure, tada, fire, voila, carpenter, adam, win, behold, surprise, anger, land, fear, happiness, red
category: eureka!
words: behold, presto, tada, voila
category: “inside out” emotions
words: anger, fear, happiness, sadness
category: ___ and peace
words: land, order, red, secure
category: ___’s rib
words: adam, carpenter, get, surprise
The results are okay, but not incredible.
It does get some groupings correct.
There are hallucinations, including words that aren’t in the puzzle.
Also, most categories and groupings aren’t correct.
Evaluating the results
To figure out if this fine-tune was worthwhile, I needed to benchmark against the performance of gpt-3.5-turbo-1106
.
I wrote some quick and dirty code to read the validation.jsonl
, make an OpenAI call, parse the response, and compare it to the known, correct answer.
I decided to measure the percentage of puzzles correct (all 4 categories) and the percentage of categories correct (4 words correctly grouped).
I also augmented the user prompt a bit to get the un-fine-tuned model to output its response the same way the fine-tuned model would, for consistent parsing.
import json
from openai import OpenAI

client = OpenAI()

data = []
puzzles_correct = 0
puzzles_incorrect = 0

with open("validation.jsonl", "r") as file:
    for row in file:
        data.append(json.loads(row))

total_puzzles = len(data)
categories_correct = 0
total_categories = total_puzzles * 4

for da in data:
    system_prompt = da["messages"][0]["content"]
    user_prompt = da["messages"][1]["content"] + """\nOutput your response in exact adherence to the following format
category: <category>
words: <word1>, <word2>, <word3>, <word4>
for example:
category: top
words: top, prime, peak, max
"""
    expected_result = da["messages"][2]
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        # for the fine-tune:
        # model="ft:gpt-3.5-turbo-1106:personal::<my_id>",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )

    # parse the model's response into {category: sorted tuple of words}
    result = completion.choices[0].message.content
    categories_with_words = {}
    for line in result.strip().split("\n\n"):
        category, words_str = line.lower().split("\nwords: ")
        category_name = category.split(": ")[1].strip()
        words = [word.strip() for word in words_str.split(", ")]
        categories_with_words[category_name] = tuple(sorted(words))

    # compare to the correct answer
    correct_categories_with_words = {}
    correct_str = da["messages"][2]["content"]
    for line in correct_str.strip().split("\n\n"):
        category, words_str = line.split("\nwords: ")
        category_name = category.split(": ")[1].strip()
        words = [word.strip() for word in words_str.split(", ")]
        correct_categories_with_words[category_name] = tuple(sorted(words))

    num_correct = 0
    for word_list in categories_with_words.values():
        if word_list in list(correct_categories_with_words.values()):
            print(word_list)
            num_correct += 1

    categories_correct += num_correct
    if num_correct == 4:
        puzzles_correct += 1
    else:
        puzzles_incorrect += 1

# Calculate and print the percentages
percent_correct = (puzzles_correct / total_puzzles) * 100
percent_incorrect = (puzzles_incorrect / total_puzzles) * 100
print(f"Correct: {percent_correct:.2f}%")
print(f"Incorrect: {percent_incorrect:.2f}%")
percent_categories_correct = (categories_correct / total_categories) * 100
print(f"Total Categories Correct: {percent_categories_correct:.2f}%")
I started by running the code against gpt-3.5-turbo-1106
.
Correct: 0.00%
Incorrect: 100.00%
Total Categories Correct: 20.24%
Next, I ran it against my fine-tuned model.
Correct: 4.76%
Incorrect: 95.24%
Total Categories Correct: 23.81%
Not a huge difference.
I did a few more runs and none of the results were too out of the ordinary.
Aside from consistent formatting, it’s not clear the model got all that much better at the game after fine-tuning.
Wrap up
This experience was an interesting introduction to model fine-tuning.
The results weren’t that fantastic, but I learned a lot about easy pitfalls and mistakes one can make, and I had some fun.
Future work
To see if there is more to probe for this project, I ran the validation set through gpt-4
with the following results
Correct: 9.52%
Incorrect: 90.48%
Total Categories Correct: 39.29%
This is a notable improvement over gpt-3.5’s ~20-25%.
I requested access to fine-tune gpt-4
, so we’ll see if that comes through and I can try it out.