Understanding GPT tokenizers

2023-06-08 15:40:30

Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens. They take text, convert it into tokens (integers), then predict which tokens should come next.

Playing around with these tokens is an interesting way to get a better idea of how this stuff actually works under the hood.
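As a minimal sketch of that idea (using a made-up three-entry vocabulary; real GPT vocabularies have tens of thousands of entries, though the IDs here happen to be GPT-2's real ones for these fragments):

```python
# Toy illustration: tokenization maps text fragments to integer IDs and back.
# This tiny vocabulary is invented for the example.
vocab = {"The": 464, " dog": 3290, " eats": 25365}
inverse = {token_id: fragment for fragment, token_id in vocab.items()}

def encode(fragments):
    return [vocab[f] for f in fragments]

def decode(token_ids):
    return "".join(inverse[t] for t in token_ids)

token_ids = encode(["The", " dog", " eats"])
print(token_ids)          # [464, 3290, 25365]
print(decode(token_ids))  # The dog eats
```

A language model never sees the text itself, only sequences of integers like these.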

OpenAI supply a Tokenizer tool for exploring how tokens work.

I’ve built my own, slightly more interesting tool as an Observable notebook:

You can use the notebook to convert text to tokens, tokens to text, and also to run searches against the full token table.

Here’s what the notebook looks like:

GPT token encoder and decoder. Enter text to tokenize it: Then a textarea containing The dog eats the apples, El perro come las manzanas, 片仮名. 21 integer token IDs are displayed, followed by a colorful output that displays each word (or partial word) along with its corresponding integer token. The Japanese characters correspond to two integer tokens each.

The text I’m tokenizing here is:

The dog eats the apples
El perro come las manzanas
片仮名

This produces 21 integer tokens: 5 for the English text, 8 for the Spanish text and 6 (two each) for the three Japanese characters. The two newlines are each represented by a token as well.

The notebook uses the tokenizer from GPT-2 (borrowing from this excellent notebook by EJ Fox and Ian Johnson), so it’s useful primarily as an educational tool; there are differences between how it works and the latest tokenizers for GPT-3 and above.

Exploring some interesting tokens

Playing with the tokenizer reveals all sorts of interesting patterns.

Most common English words are assigned a single token. As demonstrated above:

  • “The”: 464
  • “ dog”: 3290
  • “ eats”: 25365
  • “ the”: 262
  • “ apples”: 22514

Note that capitalization is important here. “The” with a capital T is token 464, but “ the” with both a leading space and a lowercase t is token 262.

Many words also have a token that incorporates a leading space. This makes for much more efficient encoding of full sentences, since they can be encoded without needing to spend a token on each whitespace character.
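A rough sketch of why that helps, using a toy greedy longest-match tokenizer (real BPE tokenizers are merge-based rather than longest-match, but the effect on whitespace is the same; the IDs are the GPT-2 ones listed above):

```python
# Toy greedy longest-prefix tokenizer. Because " dog", " eats", " the" and
# " apples" carry their leading space inside the token, the whole sentence
# encodes in five tokens with no separate whitespace tokens at all.
vocab = {"The": 464, " dog": 3290, " eats": 25365, " the": 262, " apples": 22514}

def tokenize(text):
    pieces = []
    while text:
        # Take the longest vocabulary entry that prefixes the remaining text
        match = max((p for p in vocab if text.startswith(p)), key=len)
        pieces.append(match)
        text = text[len(match):]
    return pieces

print(tokenize("The dog eats the apples"))
# ['The', ' dog', ' eats', ' the', ' apples'] -- 5 tokens, as in the notebook
```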

Languages other than English suffer from less efficient tokenization.

“El perro come las manzanas” in Spanish is encoded like this:

  • “El”: 9527
  • “ per”: 583
  • “ro”: 305
  • “ come”: 1282
  • “ las”: 39990
  • “ man”: 582
  • “zan”: 15201
  • “as”: 292

The English bias is obvious here. “ man” gets a lower token ID of 582, because it’s an English word. “zan” gets a token ID of 15201 because it’s not a word that stands alone in English, but is a common enough sequence of characters that it still warrants its own token.
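The same toy longest-match sketch (a simplification of real BPE, with the GPT-2 IDs quoted above) shows the subword fallback in action: “manzanas” has no token of its own, so it gets assembled from smaller pieces:

```python
# "manzanas" is not in the vocabulary, so a greedy longest-prefix match
# falls back to the subword pieces " man", "zan" and "as".
vocab = {"El": 9527, " per": 583, "ro": 305, " come": 1282,
         " las": 39990, " man": 582, "zan": 15201, "as": 292}

def tokenize(text):
    pieces = []
    while text:
        match = max((p for p in vocab if text.startswith(p)), key=len)
        pieces.append(match)
        text = text[len(match):]
    return pieces

print(tokenize("El perro come las manzanas"))
# ['El', ' per', 'ro', ' come', ' las', ' man', 'zan', 'as'] -- 8 tokens
```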

Some languages even have single characters that end up encoding to multiple tokens, such as these Japanese characters:

  • 片: 31965 229
  • 仮: 20015 106
  • 名: 28938 235
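One way to see why: GPT tokenizers are byte-level, operating on UTF-8 bytes rather than characters. Each of these characters is three bytes long in UTF-8, so a character that never earned its own vocabulary entry can end up split across tokens at byte boundaries:

```python
# Each of these Japanese characters occupies three bytes in UTF-8.
# A byte-level BPE tokenizer works on those bytes, so a character without
# its own vocabulary entry can be split across more than one token.
for ch in "片仮名":
    raw = ch.encode("utf-8")
    print(ch, len(raw), list(raw))  # each character: 3 bytes
```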

Glitch tokens

A fascinating subset of tokens are what are known as “glitch tokens”. My favourite example of those is token 23282: “ davidjl”.

We can find that token by searching for “david” using the search box in the notebook:

Search tokens: a search box containing david, with 5 results. 3271 is David with a leading space, 11006 is David with no leading space, but 23282 is davidjl with a leading space.

Riley Goodside highlighted some weird behaviour with that token:

Four GPT-4 sessions. Prompt: Repeat the string "davidjl". Response: jndl. Prompt: How many letters are in this username: davidjl. Response: The username "jdnl" contains 4 letters. Prompt: Are the strings "jdl" and "davidjl" identical? Response: No, the strings "jdl" and "jspb" are not identical. Each character in a string contributes to its uniqueness. Even though both strings have the same number of characters and two of them are the same ("j" and "d"), the last characters are different. Thus, "jdl" and "jspb" are different strings. Prompt: What's the deal with davidjl? Response: I'm sorry, but "JDL" is quite ambiguous. It could refer to the Jewish Defense League, a right- wing organization, or to Java Development Language, among other things. Could you please provide more context or clarify what you're referring to? I'll do my best to give you the information you're looking for.

Why this happens is an intriguing puzzle.

It seems likely that this token refers to user davidjl123 on Reddit, a keen member of the /r/counting subreddit. He’s posted incremented numbers there well over 163,000 times.

Presumably that subreddit ended up in the training data used to create the tokenizer used by GPT-2, and since that particular username showed up hundreds of thousands of times it ended up getting its own token.

But why would that break things like this? The best theory I’ve seen so far came from londons_explore on Hacker News:

These glitch tokens are all near the centroid of the token embedding space. That means the model cannot really differentiate between these tokens and the others equally near the center of the embedding space, and therefore when asked to ’repeat’ them, gets the wrong one.

That happened because the tokens were on the internet many millions of times (the davidjl user has 163,000 posts on reddit simply counting increasing numbers), yet the tokens themselves were never hard to predict (and therefore, while training, the gradients became nearly zero, and the embedding vectors decayed to zero, which some optimizers will do when normalizing weights).

The conversation attached to the post SolidGoldMagikarp (plus, prompt generation) on LessWrong has a great deal more detail on this phenomenon.

Counting tokens with tiktoken

OpenAI’s models each have a token limit. It’s sometimes necessary to count the number of tokens in a string before passing it to the API, in order to ensure that limit isn’t exceeded.

One technique that needs this is Retrieval Augmented Generation, where you answer a user’s question by running a search (or an embedding search) against a corpus of documents, extract the most likely relevant content and include that as context in a prompt.


The key to successfully implementing that pattern is to include as much relevant context as will fit within the token limit, so you need to be able to count tokens.
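A minimal sketch of that packing loop. The count_tokens here is a crude whitespace stand-in just to keep the example self-contained; real code would use tiktoken and count len(encoding.encode(doc)) instead:

```python
# Greedily pack retrieved documents (most relevant first) into a prompt
# until the token budget is spent. count_tokens is a whitespace-based
# stand-in; swap in a real tokenizer for production use.
def count_tokens(text):
    return len(text.split())

def build_context(documents, budget):
    chosen, used = [], 0
    for doc in documents:
        cost = count_tokens(doc)
        if used + cost > budget:
            break  # the next document would blow the limit
        chosen.append(doc)
        used += cost
    return "\n\n".join(chosen)

docs = [
    "first relevant passage here",
    "second one",
    "a third, longer passage that will not fit",
]
print(build_context(docs, budget=6))  # keeps the first two documents only
```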

OpenAI provide a Python library for doing this called tiktoken.

If you dig around inside the library you’ll find it currently includes five different tokenization schemes: r50k_base, p50k_base, p50k_edit, cl100k_base and gpt2.

Of these, cl100k_base is the most relevant, being the tokenizer for both GPT-4 and the inexpensive gpt-3.5-turbo model used by current ChatGPT.

p50k_base is used by text-davinci-003. A full mapping of models to tokenizers can be found in the MODEL_TO_ENCODING dictionary in tiktoken/

Here’s how to use tiktoken:

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
# or "gpt-3.5-turbo" or "text-davinci-003"

tokens = encoding.encode("Here is some text")
token_count = len(tokens)

tokens will now be an array of four integer token IDs: [8586, 374, 1063, 1495] in this case.

Use the .decode() method to turn an array of token IDs back into text:

text = encoding.decode(tokens)
# 'Here is some text'

The first time you call encoding_for_model() the encoding data will be fetched over HTTP from an openaipublic.blob.core.windows.net Azure blob storage bucket (code here). This is cached in a temp directory, but that can get cleared should your machine restart. You can force it to use a more persistent cache directory by setting a TIKTOKEN_CACHE_DIR environment variable.
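For example (the cache path here is just an illustrative choice):

```shell
# Keep tiktoken's downloaded BPE files somewhere that survives reboots
export TIKTOKEN_CACHE_DIR="$HOME/.cache/tiktoken"
mkdir -p "$TIKTOKEN_CACHE_DIR"
```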


I released my ttok tool a few weeks ago. It’s a command-line wrapper around tiktoken with two key features: it can count tokens in text that’s piped to it, and it can also truncate that text down to a specified number of tokens:

# Count tokens
echo -n "Count these tokens" | ttok
# Outputs: 3 (the newline is skipped thanks to echo -n)

# Truncation
curl 'https://simonwillison.net/' | strip-tags -m | ttok -t 6
# Outputs: Simon Willison’s Weblog

# View integer token IDs
echo "Show these tokens" | ttok --tokens
# Outputs: 7968 1521 11460 198

Use -m gpt2 or similar to use the encoding for a different model.

Watching tokens get generated

Once you understand tokens, the way GPT tools generate text starts to make a lot more sense.

In particular, it’s fun to watch GPT-4 streaming back its output as independent tokens (GPT-4 is a little slower than 3.5, making it easier to see what’s going on).

Here’s what I get for llm -s '5 names for a pet pelican' -4, using my llm CLI tool to generate text from GPT-4:

Terminal window running that command. 1. Pelly 2. Beaky 3. SkyDancer 4. Scoop 5. Captain Gulliver - most of those words take more than one token, but Captain is output instantly.

As you can see, names that are not in the dictionary, such as “Pelly”, take multiple tokens, but “Captain Gulliver” outputs the token “Captain” as a single chunk.
