
Phi-2: The surprising power of small language models

2023-12-12 10:29:11

Contributors

Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tauman Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, Yi Zhang

Satya Nadella on stage at Microsoft Ignite 2023 announcing Phi-2.
Figure 1. Satya Nadella announcing Phi-2 at Microsoft Ignite 2023.

Over the past few months, our Machine Learning Foundations team at Microsoft Research has released a suite of small language models (SLMs) called "Phi" that achieve remarkable performance on a variety of benchmarks. Our first model, the 1.3 billion parameter Phi-1, achieved state-of-the-art performance on Python coding among existing SLMs (specifically on the HumanEval and MBPP benchmarks). We then extended our focus to common sense reasoning and language understanding, and created a new 1.3 billion parameter model named Phi-1.5, with performance comparable to models 5x larger.

We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 available in the Azure AI Studio model catalog to foster research and development on language models.

Key Insights Behind Phi-2

The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.

Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:

Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on "textbook-quality" data, following upon our prior work "Textbooks Are All You Need." Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality (a sketch of this style of filtering appears below).

Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but shows a clear boost in Phi-2 benchmark scores.
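The post does not disclose the actual filtering pipeline, but the idea of threshold-filtering web documents by a learned quality score can be illustrated with a minimal, hypothetical sketch (educational_value below is an assumed stand-in for a trained classifier, not a component described in the post):

```python
# Hypothetical sketch of educational-value filtering; the real Phi-2
# pipeline is not public. `educational_value` stands in for a trained
# quality classifier that returns a score in [0, 1].
from typing import Callable, Iterable, Iterator

def filter_corpus(
    documents: Iterable[str],
    educational_value: Callable[[str], float],
    threshold: float = 0.9,
) -> Iterator[str]:
    """Yield only documents whose estimated educational value clears the threshold."""
    for doc in documents:
        if educational_value(doc) >= threshold:
            yield doc

# Toy usage with a trivial stand-in scorer:
toy_scorer = lambda text: 1.0 if "theorem" in text.lower() else 0.1
docs = ["A theorem and its proof ...", "Click here to win a prize!"]
print(list(filter_corpus(docs, toy_scorer)))  # keeps only the first document
```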

A bar plot comparing the performance of Phi-2 (with 2.7B parameters) and Phi-1.5 (with 1.3B parameters) on common sense reasoning, language understanding, math, coding, and the Bigbench-hard benchmark. Phi-2 outperforms Phi-1.5 in all categories. The commonsense reasoning tasks are PIQA, WinoGrande, ARC easy and challenge, and SIQA. The language understanding tasks are HellaSwag, OpenBookQA, MMLU, SQuADv2, and BoolQ. The math task is GSM8k, and coding includes the HumanEval and MBPP benchmarks.
Figure 2. Comparison between Phi-2 (2.7B) and Phi-1.5 (1.3B) models. All tasks are evaluated in 0-shot except for BBH and MMLU, which use 3-shot CoT and 5-shot, respectively.

Training Details

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes on a mixture of Synthetic and Web datasets for NLP and coding. The training for Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruct fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 due to our tailored data curation technique; see our previous tech report for more details on this. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio.
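Since Phi-2 is a base completion model, it can be prompted directly for next-word prediction. A minimal sketch, assuming the weights are also exposed through the Hugging Face transformers API under an identifier such as "microsoft/phi-2" (an assumption; the post itself points to the Azure AI Studio catalog):

```python
# Minimal sketch of prompting the base model; the model identifier is an
# assumption, since the post points readers to Azure AI Studio.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed Hugging Face identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Phi-2 is a base model (no RLHF, no instruction tuning), so treat it as a
# completion model rather than a chat model.
prompt = "A skier slides down a frictionless slope of height 40 m."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```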

A barplot comparing the safety score of Phi-1.5, Phi-2, and Llama-7B models on 13 categories of the ToxiGen benchmark. Phi-1.5 achieves the highest score on all categories, Phi-2 achieves the second-highest scores and Llama-7B achieves the lowest scores across all categories.
Figure 3. Safety scores computed on 13 demographics from ToxiGen. A subset of 6541 sentences are selected and scored between 0 and 1 based on scaled perplexity and sentence toxicity. A higher score indicates the model is less likely to produce toxic sentences compared to benign ones.
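The exact "scaled perplexity" formula is not spelled out in the post, but its core primitive is per-sentence perplexity under the model. A minimal sketch of that computation, assuming a generic Hugging Face causal LM and tokenizer:

```python
# Sketch of per-sentence perplexity under a causal LM; the exact scaling
# behind the ToxiGen safety scores in Figure 3 is not specified in the post.
import torch

def sentence_perplexity(model, tokenizer, text: str) -> float:
    """Perplexity of `text` under a causal language model."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # With labels supplied, the model returns the mean token cross-entropy.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return float(torch.exp(loss))

# A model that assigns markedly higher perplexity to toxic sentences than to
# demographically similar benign ones would, under any monotone scaling,
# receive a higher safety score.
```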

Phi-2 Evaluation

Below, we summarize Phi-2's performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)).
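For context, most of these category scores rest on the standard log-likelihood recipe for multiple-choice benchmarks: score each answer option by the total log-probability the model assigns to it given the question, then take the argmax. A generic sketch of that recipe (illustrative only, not necessarily the exact harness behind the numbers below):

```python
# Generic 0-shot multiple-choice scoring; illustrative, not the exact
# evaluation harness used for the reported numbers.
import torch
import torch.nn.functional as F

def choice_logprob(model, tokenizer, context: str, choice: str) -> float:
    """Total log-probability the model assigns to `choice` following `context`.

    Simplification: assumes tokenizing `context` yields a prefix of the
    tokenization of `context + choice`.
    """
    n_ctx = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..T-1
    targets = full_ids[0, 1:]
    idx = torch.arange(n_ctx - 1, targets.shape[0], device=targets.device)
    return logprobs[idx, targets[n_ctx - 1:]].sum().item()  # continuation tokens only

def predict(model, tokenizer, context: str, choices: list[str]) -> int:
    """Index of the answer option with the highest total log-probability."""
    scores = [choice_logprob(model, tokenizer, context, c) for c in choices]
    return max(range(len(scores)), key=scores.__getitem__)
```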

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance compared to the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.

Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we did an extensive decontamination study to discard this possibility, which can be found in our first report, "Textbooks Are All You Need." Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. Following that spirit, we also evaluated Phi-2 using several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e., on average, Phi-2 outperforms Mistral-7B, and the latter outperforms the Llama-2 models (7B, 13B, and 70B).

Model    Size  BBH   Commonsense Reasoning  Language Understanding  Math  Coding
Llama-2  7B    40.0  62.2                   56.7                    16.5  21.0
Llama-2  13B   47.8  65.0                   61.9                    34.2  25.4
Llama-2  70B   66.5  69.2                   67.6                    64.1  38.3
Mistral  7B    57.2  66.4                   63.7                    46.4  39.4
Phi-2    2.7B  59.2  68.8                   62.0                    61.1  53.7

Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.
Model          Size  BBH   BoolQ  MBPP  MMLU
Gemini Nano 2  3.2B  42.4  79.3   27.2  55.8
Phi-2          2.7B  59.3  83.3   59.1  56.7

Table 2. Comparison between Phi-2 and the Gemini Nano 2 model on Gemini's reported benchmarks.

In addition to these benchmarks, we also performed extensive testing on commonly used prompts from the research community. We observed behavior in line with the expectations we had given the benchmark results. For example, we tested a prompt used to probe a model's ability to solve physics problems, most recently used to evaluate the capabilities of the Gemini Ultra model, and achieved the following result:

An example prompt is given to Phi-2 which says “A skier slides down a frictionless slope of height 40m and length 80m. What's the skier’s speed at the bottom?”. Phi-2 then answers the prompt by explaining the conversion of potential energy to kinetic energy and providing the formulas to compute each one. It then proceeds to compute the correct speed using the energy formulas.
Figure 4. Phi-2's output on a simple physics problem, which includes an approximately correct square root calculation.
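For reference, the standard energy-conservation solution to this problem is:

$$ mgh = \tfrac{1}{2}mv^{2} \quad\Rightarrow\quad v = \sqrt{2gh} = \sqrt{2 \cdot 9.8\ \mathrm{m/s^2} \cdot 40\ \mathrm{m}} \approx 28\ \mathrm{m/s} $$

Note that the slope length (80 m) is irrelevant on a frictionless slope; only the height matters.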
The model is then provided with a student's wrong answer to the skier physics problem and asked if it can correct the student's mistake. Phi-2 identifies the student's mistake, i.e., using the wrong formula for potential energy, and provides the correct formula.
Figure 5. Similarly to Gemini's test, we also further queried Phi-2 with a student's wrong answer to see if Phi-2 could identify where the mistake is (it did, despite Phi-2 not being fine-tuned for chat or instruction-following). We note, however, that it isn't fully an apples-to-apples comparison with the Gemini Ultra output described in the Gemini report; in particular, in the latter case the student's answer was given as an image with handwritten text, whereas in our case it was raw text.


