Benchmarking GPT-4 Turbo – A Cautionary Tale
Since we first launched Mentat, GPT-4 has been the default model. While Mentat can run with GPT-3.5 and even local models, GPT-4's higher quality code edits are well worth the extra cost. We're excited to unlock GPT-4 Turbo's full potential (it's 2-3x cheaper and has a significantly larger context window), but does it match the quality of GPT-4, especially for editing code?
Benchmarking GPT-4 Turbo
To answer this question, we ran both GPT-4 and GPT-4 Turbo on our Exercism benchmarks, compiled from a set of 122 Exercism programming exercises. We ran GPT-4 on each task and gave it two attempts to succeed (on the second attempt, it is shown why it failed). GPT-4 ended up solving 86/122, or 70%, of the JavaScript exercises.
Running that same benchmark with GPT-4 Turbo, we got 84/122, or 68.8%, of the exercises. Slightly worse, but close enough that it could easily be statistical noise. But wait! Looking closer at the results, there was a large difference: GPT-4 solved 76 on the first attempt and only an additional 10 on the second attempt, while GPT-4 Turbo solved only 56 on the first attempt and an additional 28 on the second attempt.
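The harness logic looks roughly like this (a simplified sketch; run_model, run_tests, and build_retry_prompt are hypothetical stand-ins for our actual harness internals):

```python
# Simplified sketch of the two-attempt benchmark loop.
# run_model, run_tests, and build_retry_prompt are hypothetical
# stand-ins for the actual harness.

def benchmark(exercises, model):
    first_pass, second_pass = 0, 0
    for exercise in exercises:
        # Attempt 1: the model sees the instructions and function stubs.
        solution = run_model(model, exercise.prompt)
        result = run_tests(exercise, solution)
        if result.passed:
            first_pass += 1
            continue
        # Attempt 2: the model is also shown why the first attempt failed.
        retry = build_retry_prompt(exercise.prompt, solution, result.errors)
        if run_tests(exercise, run_model(model, retry)).passed:
            second_pass += 1
    return first_pass, second_pass
```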
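A quick back-of-the-envelope check bears out the "statistical noise" point: a two-proportion z-test on the overall pass rates (86/122 vs. 84/122) gives z ≈ 0.28, nowhere near significance. A minimal sketch:

```python
from math import sqrt

# Two-proportion z-test on the overall pass rates: 86/122 vs. 84/122.
n = 122
p1, p2 = 86 / n, 84 / n
pooled = (86 + 84) / (2 * n)                 # pooled pass rate, ~0.697
se = sqrt(pooled * (1 - pooled) * (2 / n))   # standard error, ~0.059
z = (p1 - p2) / se
print(f"z = {z:.2f}")  # z ≈ 0.28, far below the ~1.96 needed for p < 0.05
```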
Interpreting Results
Why would GPT-4 Turbo solve fewer tasks on the first attempt? We examined individual exercises and the code it wrote for them, and found that it often wrote reasonable solutions but failed on the first attempt due to unclear or ambiguous instructions. But then how does GPT-4 solve them? Our theory was that GPT-4 had significantly memorized the Exercism exercises, but that when GPT-4 was downsized (most likely via distillation) to GPT-4 Turbo, it lost some of this raw memorization capability.
We designed a test for this theory: we reran the benchmarks without showing the models the instructions for each exercise. Instead, we simply told them that they were Exercism exercises, and gave them the exercise names and function stubs. This isn't enough to solve the problem unless the model has it memorized. Due to rate limits on GPT-4 Turbo, we only ran the first 40 exercises. GPT-4 solved 23/40, or 57.5%, on the first attempt and an additional 5 on the second attempt. GPT-4 Turbo, on the other hand, solved only 12/40, or 30%, on the first attempt and an additional 11 on the second attempt. We interpret this as confirming that GPT-4 has more of the exercises memorized than GPT-4 Turbo.
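In prompt terms, the memorization test looked roughly like this (a hypothetical reconstruction, not our exact wording; build_memorization_prompt is illustrative):

```python
# Hypothetical reconstruction of the stripped-down prompt used in the
# memorization test: the exercise instructions are withheld, leaving only
# the exercise name and the function stubs.
def build_memorization_prompt(exercise_name: str, function_stubs: str) -> str:
    return (
        f"This is the Exercism exercise '{exercise_name}'.\n"
        "Complete the function stubs below so the exercise's tests pass.\n\n"
        f"{function_stubs}"
    )
```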
Our results look similar to this finding:
Although the author OCR'ed the SAT questions and believes that they weren't in the training data, we think it's fairly likely that these questions, and certainly some highly similar ones, ended up in the training data at some point, making the drop in GPT-4 Turbo's measured performance hard to interpret.
Future Benchmarks
Benchmarks derived from content in GPT's training data are still useful for determining how good GPT is at responding in the correct edit format, as well as for comparing fine-tuned models to one another. But these results make it quite clear that they aren't an accurate test for comparing models trained on separate datasets, or for distilled models, which is what we suspect GPT-4 Turbo is.
These results emphasize the need for better benchmarks, something we have already begun building for Mentat: real-world, realistic coding tasks based on recent commits to open source repositories made after the training cutoff. Although no benchmark will ever be perfect, we're confident that these improvements will help us gauge the relative accuracy of different models going forward, as in the sketch below.
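As an illustration of the idea (a minimal sketch; the cutoff date and repository layout are assumptions, not our actual pipeline):

```python
import subprocess

# Assumed training cutoff; substitute the cutoff of the model under test.
CUTOFF = "2023-04-30"

def commits_after_cutoff(repo_path: str) -> list[str]:
    """List commits made after CUTOFF as candidate benchmark tasks."""
    out = subprocess.run(
        ["git", "log", f"--since={CUTOFF}", "--pretty=format:%H %s"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()
```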