Understanding GPT tokenizers

2023-06-08 15:40:30

Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens. They take text, convert it into tokens (integers), then predict which tokens should come next.

Playing around with these tokens is an interesting way to get a better idea of how this stuff actually works under the hood.

OpenAI offer a Tokenizer tool for exploring how tokens work.

I’ve built my own, slightly more interesting tool as an Observable notebook:

https://observablehq.com/@simonw/gpt-tokenizer

You can use the notebook to convert text to tokens, tokens to text and also to run searches against the full token table.

Here’s what the notebook looks like:

GPT token encoder and decoder. Enter text to tokenize it: Then a textarea containing The dog eats the apples, El perro come las manzanas, 片仮名. 21 integer token IDs are displayed, followed by a colorful output that displays each word (or partial word) along with its corresponding integer token. The Japanese characters correspond to two integer tokens each.

The text I’m tokenizing here is:

The dog eats the apples
El perro come las manzanas
片仮名

This produces 21 integer tokens: 5 for the English text, 8 for the Spanish text and 6 (two each) for those three Japanese characters. The two newlines are each represented by tokens as well.

The notebook uses the tokenizer from GPT-2 (borrowing from this excellent notebook by EJ Fox and Ian Johnson), so it’s useful primarily as an educational tool: there are differences between how it works and the latest tokenizers for GPT-3 and above.

Exploring some interesting tokens

Playing with the tokenizer reveals all sorts of interesting patterns.

Most common English words are assigned a single token. As demonstrated above:

  • “The”: 464
  • “ dog”: 3290
  • “ eats”: 25365
  • “ the”: 262
  • “ apples”: 22514

Note that capitalization is important here. “The” with a capital T is token 464, but “ the” with both a leading space and a lowercase t is token 262.

Many words also have a token that incorporates a leading space. This makes for much more efficient encoding of full sentences, since they can be encoded without needing to spend a token on each whitespace character.
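To make the leading-space point concrete, here is a toy sketch. It is not the real GPT-2 algorithm (which uses byte pair encoding over a vocabulary of roughly 50,000 entries); it is a greedy longest-match lookup over only the five token IDs listed above. The names TOY_VOCAB, toy_encode and toy_decode are made up for this illustration.

```python
# Toy illustration only: a five-entry vocabulary taken from the token IDs
# listed in the article. The real tokenizer uses byte pair encoding, not
# this greedy longest-match lookup.
TOY_VOCAB = {
    "The": 464,
    " dog": 3290,
    " eats": 25365,
    " the": 262,
    " apples": 22514,
}
ID_TO_TEXT = {token_id: text for text, token_id in TOY_VOCAB.items()}


def toy_encode(text):
    """Encode text by repeatedly taking the longest matching vocabulary entry."""
    ids = []
    pos = 0
    while pos < len(text):
        candidates = [piece for piece in TOY_VOCAB if text.startswith(piece, pos)]
        if not candidates:
            raise ValueError(f"no toy token matches at position {pos}")
        piece = max(candidates, key=len)
        ids.append(TOY_VOCAB[piece])
        pos += len(piece)
    return ids


def toy_decode(ids):
    """Decoding is plain concatenation: the spaces live inside the tokens."""
    return "".join(ID_TO_TEXT[token_id] for token_id in ids)


sentence = "The dog eats the apples"
ids = toy_encode(sentence)
print(ids)
print(toy_decode(ids) == sentence)
```

Because the spaces are part of the tokens themselves, five tokens cover the whole sentence and decoding is a simple join. In this toy vocabulary a bare "the" with no leading space has no entry at all, which mirrors how “The” and “ the” are different tokens in the real table.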

Languages other than English suffer from less efficient tokenization.

“El perro come las manzanas” in Spanish is encoded like this:

  • “El”: 9527
  • “ per”: 583
  • “ro”: 305
  • “ come”: 1282
  • “ las”: 39990
  • “ man”: 582
  • “zan”: 15201
  • “as”: 292

The English bias is obvious here. “ man” gets a lower token ID of 582, because it’s an English word. “zan” gets a token ID of 15201 because it’s not a word that stands alone in English, but is a common enough sequence of characters that it still warrants its own token.

Some languages even have single characters that end up encoding to multiple tokens, such as these Japanese characters:

  • 片: 31965 229
  • 仮: 20015 106
  • 名: 28938 235
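One likely reason a single character can turn into two tokens: the GPT-2 tokenizer works on UTF-8 bytes rather than on characters, and each of these characters is several bytes long. If the vocabulary has no single entry covering all of a character's bytes, the character is split. The sketch below only shows the byte lengths; it does not run the tokenizer, and byte_lengths is a name chosen for this example.

```python
# Show how many UTF-8 bytes each character occupies. A byte-level
# tokenizer sees these bytes, not the characters themselves.
byte_lengths = {}
for ch in "片仮名":
    raw = ch.encode("utf-8")
    byte_lengths[ch] = len(raw)
    print(ch, len(raw), "bytes:", raw.hex(" "))

# For comparison, a plain ASCII letter is a single byte.
print("a", len("a".encode("utf-8")), "byte")
```

Each character here is three bytes, so a split into a two-byte token followed by a one-byte token is a plausible reading of pairs like 31965 229, though that is an inference rather than something read out of the vocabulary file.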

Glitch tokens

A fascinating subset of tokens are what are known as “glitch tokens”. My favourite example of those is token 23282: “ davidjl”.

We can find that token by searching for “david” using the search box in the notebook:

Search tokens: a search box containing david, with 5 results. 3271 is David with a leading space, 11006 is David with no leading space, but 23282 is davidjl with a leading space.

Riley Goodside highlighted some weird behaviour with that token:

Four GPT-4 sessions. Prompt: Repeat the string "davidjl". Response: jndl. Prompt: How many letters are in this username: davidjl. Response: The username "jdnl" contains 4 letters. Prompt: Are the strings "jdl" and "davidjl" identical? Response: No, the strings "jdl" and "jspb" are not identical. Each character in a string contributes to its uniqueness. Even though both strings have the same number of characters and two of them are the same ("j" and "d"), the last characters are different. Thus, "jdl" and "jspb" are different strings. Prompt: What's the deal with davidjl? Response: I'm sorry, but "JDL" is quite ambiguous. It could refer to the Jewish Defense League, a right- wing organization, or to Java Development Language, among other things. Could you please provide more context or clarify what you're referring to? I'll do my best to give you the information you're looking for.

Why this happens is an intriguing puzzle.

It looks likely that this token refers to user davidjl123 on Reddit, a keen member of the /r/counting subreddit. He’s posted incremented numbers there well over 163,000 times.

Presumably that subreddit ended up in the training data used to create the tokenizer used by GPT-2, and since that particular username showed up hundreds of thousands of times it ended up getting its own token.

But why would that break things like this? The best theory I’ve seen so far came from londons_explore on Hacker News:

These glitch tokens are all near the centroid of the token embedding space. That means that the model cannot really differentiate between these tokens and the others equally near the center of the embedding space, and therefore when asked to 'repeat' them, gets the wrong one.

That happened because the tokens were on the internet many millions of times (the davidjl user has 163,000 posts on reddit simply counting increasing numbers), yet the tokens themselves were never hard to predict (and therefore while training, the gradients became nearly zero, and the embedding vectors decayed to zero, which some optimizers will do when normalizing weights).

The conversation attached to the post SolidGoldMagikarp (plus, prompt generation) on LessWrong has a great deal more detail on this phenomenon.

Counting tokens with tiktoken

OpenAI’s models each have a token limit. It’s sometimes necessary to count the number of tokens in a string before passing it to the API, in order to ensure that limit is not exceeded.

One technique that needs this is Retrieval Augmented Generation, where you answer a user’s question by running a search (or an embedding search) against a corpus of documents, extract the most likely content and include that as context in a prompt.

The key to successfully implementing that pattern is to include as much relevant context as will fit within the token limit, so you need to be able to count tokens.
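As a rough sketch of that packing step, the function below keeps adding passages, in ranked order, until the next one would overflow the budget. Everything here is illustrative: pack_context, rough_count and the sample passages are hypothetical, and rough_count is a whitespace word count standing in for a real token counter.

```python
from typing import Callable, Iterable, List


def pack_context(
    passages: Iterable[str],
    budget: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Return the leading passages whose combined token count fits the budget."""
    chosen = []
    used = 0
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > budget:
            break  # stop at the first passage that does not fit
        chosen.append(passage)
        used += cost
    return chosen


def rough_count(text: str) -> int:
    # Stand-in counter for the example: words, NOT real tokens.
    return len(text.split())


ranked = [
    "tokens are integers",
    "models predict the next token",
    "pelicans are large water birds",
]
print(pack_context(ranked, budget=8, count_tokens=rough_count))
```

With a real tokenizer you would pass a counter built on it instead of rough_count. The sketch also ignores the tokens used by the prompt template, the question itself and any separators between passages, all of which need to come out of the same budget.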

OpenAI provide a Python library for doing this called tiktoken.

If you dig around inside the library you’ll find it currently includes five different tokenization schemes: r50k_base, p50k_base, p50k_edit, cl100k_base and gpt2.

Of these cl100k_base is the most relevant, being the tokenizer for both GPT-4 and the inexpensive gpt-3.5-turbo model used by current ChatGPT.

p50k_base is used by text-davinci-003. A full mapping of models to tokenizers can be found in the MODEL_TO_ENCODING dictionary in tiktoken/model.py.

Here’s how to use tiktoken:

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
# or "gpt-3.5-turbo" or "text-davinci-003"

tokens = encoding.encode("Here is some text")
token_count = len(tokens)

tokens will now be an array of four integer token IDs: [8586, 374, 1063, 1495] in this case.

Use the .decode() method to turn an array of token IDs back into text:

text = encoding.decode(tokens)
# 'Here is some text'

The first time you call encoding_for_model() the encoding data will be fetched over HTTP from an openaipublic.blob.core.windows.net Azure blob storage bucket (code here). This is cached in a temp directory, but that will get cleared should your machine restart. You can force it to use a more persistent cache directory by setting a TIKTOKEN_CACHE_DIR environment variable.
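One way to set that variable from Python itself, rather than in your shell, is sketched below. The helper name use_persistent_tiktoken_cache and the example path are assumptions for illustration. I believe tiktoken reads the variable when it loads an encoding, so the call needs to happen before the first encoding_for_model() call, but treat that timing as an assumption.

```python
import os
from pathlib import Path


def use_persistent_tiktoken_cache(directory):
    """Create the directory if needed and point TIKTOKEN_CACHE_DIR at it."""
    path = Path(directory).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["TIKTOKEN_CACHE_DIR"] = str(path)
    return path


# Example usage (the path is a hypothetical choice, pick your own):
#   use_persistent_tiktoken_cache("~/.cache/tiktoken")
#   import tiktoken
#   encoding = tiktoken.encoding_for_model("gpt-4")
```

Exporting TIKTOKEN_CACHE_DIR in your shell profile does the same job without any code.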

ttok

I introduced my ttok tool a few weeks ago. It’s a command-line wrapper around tiktoken with two key features: it can count tokens in text that is piped to it, and it can also truncate that text down to a specified number of tokens:

# Count tokens
echo -n "Count these tokens" | ttok
# Outputs: 3 (the newline is skipped thanks to echo -n)

# Truncation
curl 'https://simonwillison.net/' | strip-tags -m | ttok -t 6
# Outputs: Simon Willison’s Weblog

# View integer token IDs
echo "Show these tokens" | ttok --tokens
# Outputs: 7968 1521 11460 198

Use -m gpt2 or similar to use an encoding for a different model.
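Truncating to a token count can be expressed in Python as encode, slice, decode. The sketch below shows that idea only; it is not taken from ttok's source. To keep it runnable without any downloads it uses a stand-in codec where every UTF-8 byte counts as one "token"; truncate_to_tokens, byte_encode and byte_decode are hypothetical names.

```python
from typing import Callable, List


def truncate_to_tokens(
    text: str,
    max_tokens: int,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> str:
    """Keep only the first max_tokens tokens of text."""
    return decode(encode(text)[:max_tokens])


def byte_encode(text: str) -> List[int]:
    # Stand-in tokenizer: one "token" per UTF-8 byte.
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    # Drop any partial character left at the cut point.
    return bytes(ids).decode("utf-8", errors="ignore")


print(truncate_to_tokens("Count these tokens", 5, byte_encode, byte_decode))
```

With tiktoken, the encode and decode methods of an encoding object can be passed in place of the stand-ins. Cutting in the middle of a multi-byte character is a real concern there too, since a token boundary does not have to line up with a character boundary.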

Watching tokens get generated

Once you understand tokens, the way GPT tools generate text starts to make a lot more sense.

In particular, it’s fun to watch GPT-4 streaming back its output as independent tokens (GPT-4 is slightly slower than 3.5, making it easier to see what’s going on).

Here’s what I get for llm -s '5 names for a pet pelican' -4, using my llm CLI tool to generate text from GPT-4:

Terminal window running that command. 1. Pelly 2. Beaky 3. SkyDancer 4. Scoop 5. Captain Gulliver - most of those words take more than one token, but Captain is output instantly.

As you can see, names that are not in the dictionary such as “Pelly” take multiple tokens, but “Captain Gulliver” outputs the token “Captain” as a single chunk.


