
Non-determinism in GPT-4 is caused by Sparse MoE

2023-08-04 16:37:09

It’s well-known at this point that GPT-4/GPT-3.5-turbo is non-deterministic, even at temperature=0.0. This is odd behavior if you’re used to dense decoder-only models, where temp=0 should imply greedy sampling, which should imply full determinism, because the logits for the next token should be a pure function of the input sequence & the model weights.
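
As an illustration of that expectation: for a dense model, temperature=0 simply means taking the argmax of the logits at every step. A minimal toy sketch (my own, not OpenAI code):

import numpy as np

def greedy_next_token(logits: np.ndarray) -> int:
    # At temperature=0, sampling collapses to argmax over the logits, so the
    # same input sequence + weights should always yield the same next token.
    return int(np.argmax(logits))

# hypothetical logits over a 5-token vocabulary
logits = np.array([1.2, 3.4, 0.1, 3.3, -2.0])
assert greedy_next_token(logits) == 1  # deterministic: always token 1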

When asked about this behavior at the developer roundtables during OpenAI’s World Tour, the responses from members of technical staff were something along the lines of,

Honestly, we’re confused as well. We think there might be some bug in our systems, or some non-determinism in optimized floating point calculations…

And internally, I was thinking – okay, I know the latter point is true in general, and maybe OpenAI doesn’t have enough engineers to look into a problem as small as this. I felt a little bit confused when I noticed a reference to this behavior over 3 years ago – 3 years, and this couldn’t be fixed?

But I didn’t have a meaningful alternative explanation for the phenomenon. After all, why would you want to keep things random? Ilya’s always going on about reliability, right? There was no way OpenAI wanted to keep determinism bugged, so an unresolvable hardware limitation was the best explanation.


3 months later, reading a paper while on board a boring flight home, I have my answer.

In the recent Soft MoE paper, there was an interesting blurb in Section 2.2 that sparked a connection:

Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens often compete against each other for available spots in expert buffers. As a consequence, the model is no longer deterministic at the sequence-level, but only at the batch-level, as some input sequences may affect the final prediction for other inputs
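
To make the quoted mechanism concrete, here is a toy sketch of top-1 routing under a fixed expert capacity – my own construction with made-up numbers, not the Soft MoE paper’s or GPT-4’s actual router – showing how the same sequence can be routed differently depending on what else shares the batch:

import numpy as np

def route_with_capacity(token_scores, n_experts, capacity):
    # Toy top-1 router: each token goes to its highest-scoring expert, but each
    # expert only keeps the first `capacity` tokens assigned to it; overflow
    # tokens are dropped (one common Sparse MoE capacity policy).
    kept = []
    counts = np.zeros(n_experts, dtype=int)
    for tok_id, scores in enumerate(token_scores):
        expert = int(np.argmax(scores))
        if counts[expert] < capacity:
            counts[expert] += 1
            kept.append((tok_id, expert))
    return kept

n_experts, capacity = 2, 2
seq_a = np.array([[2.0, 0.1], [1.5, 0.2], [1.0, 0.3]])  # sequence A: prefers expert 0
seq_b = np.array([[3.0, 0.0], [2.5, 0.0]])              # sequence B: also prefers expert 0

# Sequence A routed on its own vs. batched behind sequence B:
alone = route_with_capacity(seq_a, n_experts, capacity)
batched = route_with_capacity(np.vstack([seq_b, seq_a]), n_experts, capacity)
a_in_batch = [(t - len(seq_b), e) for t, e in batched if t >= len(seq_b)]

print("A alone:   ", alone)       # [(0, 0), (1, 0)] -- token 2 dropped
print("A in batch:", a_in_batch)  # []               -- B filled expert 0 first

Alone, two of A’s tokens reach expert 0; behind sequence B, none do – exactly the kind of batch-level dependence the quote describes.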

It is currently public knowledge that GPT-4 is a Mixture of Experts model. Given that GPT-4 was trained before Q2 2022, and that Sparse Mixture-of-Experts models have existed long before that, I think the following hypothesis is justified:

The GPT-4 API is hosted with a backend that does batched inference. Although some of the randomness may be explained by other factors, the vast majority of non-determinism in the API is explainable by its Sparse MoE architecture failing to enforce per-sequence determinism.

This is either completely wrong, or something that was already obvious and well-known to people developing MoE models. How can we verify this?

Are you really sure it isn’t hardware?

Not yet. Let’s ask GPT-4 to write a script to test our hypothesis:

import os
import json
import tqdm
import openai
from time import sleep
from pathlib import Path

chat_models = ["gpt-4", "gpt-3.5-turbo"]
message_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a unique, surprising, extremely randomized story with highly unpredictable changes of events."}
]

completion_models = ["text-davinci-003", "text-davinci-001", "davinci-instruct-beta", "davinci"]
prompt = "[System: You are a helpful assistant]\n\nUser: Write a unique, surprising, extremely randomized story with highly unpredictable changes of events.\n\nAI:"

results = []

import time
class TimeIt:
    def __init__(self, name): self.name = name
    def __enter__(self): self.start = time.time()
    def __exit__(self, *args): print(f"{self.name} took {time.time() - self.start} seconds")

C = 30  # number of completions to make per model
N = 128 # max_tokens
# Testing chat models
for model in chat_models:
    sequences = set()
    errors = 0 # although I track errors, at no point were any errors ever emitted
    with TimeIt(model):
        for _ in range(C):
            try:
                completion = openai.ChatCompletion.create(
                    model=model,
                    messages=message_history,
                    max_tokens=N,
                    temperature=0,
                    logit_bias={"100257": -100.0}, # this doesn't really do anything, because chat models don't emit <|endoftext|> much
                )
                sequences.add(completion.choices[0].message['content'])
                sleep(1) # cheaply avoid rate limiting
            except Exception as e:
                print('something went wrong for', model, e)
                errors += 1
    print(f"\nModel {model} created {len(sequences)} ({errors=}) unique sequences:")
    print(json.dumps(list(sequences)))
    results.append((len(sequences), model))
# Testing completion models
for model in completion_models:
    sequences = set()
    errors = 0
    with TimeIt(model):
        for _ in range(C):
            try:
                completion = openai.Completion.create(
                    model=model,
                    prompt=prompt,
                    max_tokens=N,
                    temperature=0,
                    logit_bias = {"50256": -100.0}, # prevent EOS
                )
                sequences.add(completion.choices[0].text)
                sleep(1)
            except Exception as e:
                print('something went wrong for', model, e)
                errors += 1
    print(f"\nModel {model} created {len(sequences)} ({errors=}) unique sequences:")
    print(json.dumps(list(sequences)))
    results.append((len(sequences), model))

# Printing table of results
print("\nTable of Results:")
print("Num_Sequences\tModel_Name")
for num_sequences, model_name in results:
    print(f"{num_sequences}\t{model_name}")

This final script is slightly different from what you’d see if you clicked on the share link. I had to redo the script a few times along the way, because of a few problems:

  • the OpenAI API was taking very long to respond. I had to add timestamp logging to make sure I wasn’t doing something wrong – I wasn’t, the API was just really slow, with nearly 10 seconds of delay to call even 3.5-turbo. I wonder why?

  • some completion models were truncating their responses very early. I added a logit bias against EOS to try to fix this.

  • Related: there is no equivalent bias against the <|im_end|> token; the API returns, Invalid key in 'logit_bias': 100265. Maximum value is 100257. 100265 is the correct value for <|im_end|>.

    I figured this lack-of-logit-bias problem for the chat models was a non-issue – most completions reached the max token length, and they were absurdly more non-deterministic anyway (adding the logit bias would realistically only increase the number of unique sequences). A quick check of the token ids involved is sketched after this list.
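
As a sanity check of those token ids – assuming the cl100k_base tokenizer used by the chat models, via the tiktoken library:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# <|endoftext|> is a registered special token in cl100k_base; this prints [100257],
# matching the maximum key the logit_bias error message allows.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
# 100265 for <|im_end|> is taken from the API error message above; it is not a
# key the public logit_bias parameter will accept.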

An hour of waiting and scripting later, and I got confirmation:

Empirical Results

Here are the results (3 attempts, 30 completions per model per attempt, max_tokens=128):

Model Name            | Unique Completions (/30) | Average (/30) | Notes
gpt-4                 | 12, 11, 12               | 11.67         |
gpt-3.5-turbo         | 4, 4, 3                  | 3.67          |
text-davinci-003      | 3, 2, 4                  | 3.00          |
text-davinci-001      | 2, 2, 2                  | 2.00          |
davinci-instruct-beta | 1, 1, 1                  | deterministic | Outputs deteriorated into a repeated loop
davinci               | 1, 1, 1                  | deterministic | Outputs deteriorated into a repeated loop

Before I noticed the logit_bias problem, I also obtained the following results (max_tokens=256):

Model Name       | Unique Completions (/30) | Notes
gpt-4            | 30                       |
gpt-3.5-turbo    | 9                        |
text-davinci-003 | 5                        |
text-davinci-001 | 2                        | Noticed the logit bias problem at this point

Yes, I’m sure

The number of unique completions from GPT-4 is ridiculously high – almost always non-deterministic with longer outputs. This almost certainly confirms that something is up with GPT-4.


Furthermore, all other models that don’t collapse into a repetitive useless loop experience some degree of non-determinism as well. This lines up with the common claim that unreliable GPU calculations are responsible for some degree of randomness (a minimal sketch of that floating-point effect follows the list below). However,

  1. I’m still partially confused by the gradual increase in randomness from text-davinci-001 up to gpt-3.5-turbo. I don’t have a neat explanation for why 003 is reliably more random than 001, or turbo more so than 003. Although I expect only the chat models to be MoE models, and not any of the 3.5 completion models, I don’t feel confident based on the current evidence available.
  2. This is only evidence that something is causing GPT-4 to be much, much more non-deterministic than other models. Maybe I’m still completely wrong about the MoE part. Maybe it’s just because of parameter count. (but then – why would Turbo be more unpredictable than davinci? Turbo’s faster; if you assumed the same architecture, Turbo should be smaller)
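
For concreteness on the floating-point claim: float addition is not associative, so a change in accumulation order (which can vary across GPU kernels, batch shapes, etc.) shifts the low-order bits. A minimal CPU-side sketch of the effect – not OpenAI’s infrastructure:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000).astype(np.float32)

# Same numbers, different accumulation order -> usually a slightly different sum,
# because float32 addition is not associative.
s1 = np.sum(x)
s2 = np.sum(rng.permutation(x))
print(s1, s2, s1 == s2)
# In a large network, differences like these can occasionally flip an argmax,
# which is the usual "GPU floating point" story for temperature=0 randomness.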

Implications

It’s actually quite crazy to me that this appears to be true. For a few reasons:

We’re so far behind

If the non-determinism is an inherent feature of batched inference with Sparse MoE, then this fact should be visibly obvious to anyone that works with models in that vein.

Given that the vast majority of GPT-4 users still don’t know what’s causing their API calls to be unreliable, it should be concluded that (I’m completely wrong, OR) too few people know anything about MoE models to release this explanation into the public consciousness.

It means that Google DeepMind knew, and found it trivial enough to write as a throwaway sentence in a paper. It means that I should be much more bullish on them, and much more bearish against every other wannabe foundation model org that’s still working on dense models only.

GPT-3.5-Turbo may be MoE too

I heard a rumour, once, about 3.5-turbo sharing the same architecture as GPT-4; just with much, much fewer parameters than it, or even GPT-3.

And, when I heard it, I was thinking: Nah, that sounds too complicated for a small public model. Why wouldn’t they just use a dense one? Fits on one GPU, no complexity overhead, really easy to optimise…

Fast forward to now, and we’re still stuck in a regime where it takes 70B parameters to match Turbo’s performance – a number which just doesn’t make sense for how much traffic OpenAI’s handling, and how much speed they get.

It’s also easy to notice that Turbo is the only other model in the API that has its logprobs restricted from public view. The common explanation was that they were restricted to prevent higher-accuracy distillation, something which sounds a little bit naive today given Orca and others. OpenAI has also publicly stated that they’re working on getting logprobs integrated into ChatCompletions, making “prevent distillation” much less likely than “this is hard to engineer reliably because they’re inherently too random right now”.

But, still, as I said earlier – not fully confident on this one. Maybe someone should open a prediction market?

Conclusion

  • Everyone knows that OpenAI’s GPT models are non-deterministic at temperature=0
  • It’s usually attributed to non-deterministic, CUDA-optimised floating point op inaccuracies
  • I present a different hypothesis: batched inference in sparse MoE models is the root cause of most non-determinism in the GPT-4 API. I explain why this is a neater hypothesis than the previous one.
  • I empirically demonstrate that API calls to GPT-4 (and potentially some 3.5 models) are significantly more non-deterministic than other OpenAI models.
  • I speculate that GPT-3.5-turbo may be MoE as well, due to speed + non-det + logprobs removal.


