Phi-2: The surprising power of small language models

Contributors
Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Taumann Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, Yi Zhang

Over the past few months, our Machine Learning Foundations team at Microsoft Research has released a suite of small language models (SLMs) called "Phi" that achieve remarkable performance on a variety of benchmarks. Our first model, the 1.3 billion parameter Phi-1, achieved state-of-the-art performance on Python coding among existing SLMs (specifically on the HumanEval and MBPP benchmarks). We then extended our focus to common sense reasoning and language understanding, and created a new 1.3 billion parameter model named Phi-1.5, with performance comparable to models 5x larger.
We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks, Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.
With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 available in the Azure AI Studio model catalog to foster research and development on language models.
Key Insights Behind Phi-2
The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.
Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:
Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on "textbook-quality" data, following upon our prior work "Textbooks Are All You Need." Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality.

Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but shows a clear boost in Phi-2 benchmark scores.
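
The exact mechanics of this knowledge transfer have not been published. Purely as an illustration of the general idea of seeding a larger model with a smaller one's weights, here is a hypothetical PyTorch sketch (the function and the block-copy strategy are assumptions, not the Phi-2 recipe):

```python
import torch

@torch.no_grad()
def seed_from_smaller(small_state: dict, large_model: torch.nn.Module):
    """Hypothetical sketch: copy each tensor from a smaller checkpoint
    into the leading slice of the matching parameter in a larger model,
    leaving the remaining entries at their fresh initialization."""
    large_state = large_model.state_dict()
    for name, small_t in small_state.items():
        if name not in large_state:
            continue  # layer exists only in the larger model
        large_t = large_state[name]
        if small_t.dim() != large_t.dim():
            continue  # incompatible parameter shapes
        if all(s <= l for s, l in zip(small_t.shape, large_t.shape)):
            idx = tuple(slice(0, s) for s in small_t.shape)
            large_t[idx].copy_(small_t)
    large_model.load_state_dict(large_state)
```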

Training Details
Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes on a mixture of Synthetic and Web datasets for NLP and coding. The training for Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruct fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 thanks to our tailored data curation technique; see our previous tech report for more details on this. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio.
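
As a quick illustration of working with the base model, a minimal sketch using the Hugging Face Transformers library follows, assuming the checkpoint is mirrored on the Hugging Face Hub as `microsoft/phi-2` (the model is officially distributed through Azure AI Studio):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint (repo name assumed; adjust to your source).
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)

# Phi-2 is a base model (no RLHF, no instruction tuning), so plain
# next-word completion of a well-formed prompt works best.
prompt = "Instruct: Explain why the sky is blue.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```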

Phi-2 Evaluation
Below, we summarize Phi-2 performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)).
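
For readers unfamiliar with the k-shot notation above: each test question is preceded by k solved exemplars in the prompt. A toy sketch of how such a prompt is assembled (the helper name and format are hypothetical, not our evaluation harness):

```python
def build_k_shot_prompt(exemplars, question, k):
    """Hypothetical illustration of k-shot prompting: prepend k solved
    (question, answer) pairs before the actual test question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars[:k])
    return f"{shots}\n\nQ: {question}\nA:"

# Example: an 8-shot GSM8k-style prompt would pass k=8 with eight
# worked math problems as exemplars.
```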
With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance compared to the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently-announced Google Gemini Nano 2, despite being smaller in size.
Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we did an extensive decontamination study to discard this possibility, which can be found in our first report, "Textbooks Are All You Need." Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. Following that spirit, we also evaluated Phi-2 using several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e., on average, Phi-2 outperforms Mistral-7B, and the latter outperforms the Llama-2 models (7B, 13B, and 70B).
| Model | Size | BBH | Commonsense Reasoning | Language Understanding | Math | Coding |
|---|---|---|---|---|---|---|
| Llama-2 | 7B | 40.0 | 62.2 | 56.7 | 16.5 | 21.0 |
| Llama-2 | 13B | 47.8 | 65.0 | 61.9 | 34.2 | 25.4 |
| Llama-2 | 70B | 66.5 | 69.2 | 67.6 | 64.1 | 38.3 |
| Mistral | 7B | 57.2 | 66.4 | 63.7 | 46.4 | 39.4 |
| Phi-2 | 2.7B | 59.2 | 68.8 | 62.0 | 61.1 | 53.7 |

Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.
| Model | Size | BBH | BoolQ | MBPP | MMLU |
|---|---|---|---|---|---|
| Gemini Nano 2 | 3.2B | 42.4 | 79.3 | 27.2 | 55.8 |
| Phi-2 | 2.7B | 59.3 | 83.3 | 59.1 | 56.7 |

Table 2. Comparison between Phi-2 and Gemini Nano 2 on Gemini's reported benchmarks.
In addition to these benchmarks, we also performed extensive testing on commonly used prompts from the research community. We observed behavior in accordance with the expectation we had given the benchmark results. For example, we tested a prompt used to probe a model's ability to solve physics problems, most recently used to evaluate the capabilities of the Gemini Ultra model, and achieved the following result:


(Figure: Phi-2's answer to the physics problem prompt.)