LLaVA-1.6: Improved reasoning, OCR, and world knowledge

In October 2023, we released LLaVA-1.5 with a simple and efficient design, along with great performance on a benchmark suite of 12 datasets. It has since served as the foundation of many comprehensive studies of data, models, and capabilities of large multimodal models (LMMs), and has enabled various new applications.
Today, we are thrilled to present LLaVA-1.6, with improved reasoning, OCR, and world knowledge. LLaVA-1.6 even exceeds Gemini Pro on several benchmarks.
Compared with LLaVA-1.5, LLaVA-1.6 has several improvements:
- Increasing the input image resolution to 4x more pixels. This allows it to grasp more visual details. It supports three aspect ratios, up to 672×672, 336×1344, and 1344×336 resolution.
- Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
- Better visual conversation for more scenarios, covering different applications. Better world knowledge and logical reasoning.
- Efficient deployment and inference with SGLang (a minimal serving sketch follows the next paragraph).
Along with these performance improvements, LLaVA-1.6 maintains the minimalist design and data efficiency of LLaVA-1.5. It re-uses the pretrained connector of LLaVA-1.5, and still uses less than 1M visual instruction tuning samples. The largest 34B variant finishes training in ~1 day with 32 A100s.
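As a minimal sketch of the SGLang-based deployment path (the server flags, model path, and frontend primitives below follow SGLang's Python frontend as we understand it, and may differ across versions), image question answering can be served roughly like this:

```python
# Launch a local SGLang runtime first, e.g. (exact flags may vary by version):
#   python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --port 30000
from sglang import RuntimeEndpoint, assistant, function, gen, image, set_default_backend, user

# Point the SGLang frontend at the locally launched runtime.
set_default_backend(RuntimeEndpoint("http://localhost:30000"))

@function
def image_qa(s, image_path, question):
    # One user turn containing the image and the question, then generate the answer.
    s += user(image(image_path) + question)
    s += assistant(gen("answer", max_tokens=256))

state = image_qa.run(image_path="example.jpg", question="Explain this image in detail.")
print(state["answer"])
```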
Open-Source Release
We open-source LLaVA-1.6 to facilitate future development of LMMs in the community. Code, data, and models will be made publicly available.
Results
Data (PT) | Data (IT) | Model | MMMU (val) | MathVista | MMB-ENG | MMB-CN | MM-Vet | LLaVA-Wild | SEED-IMG |
---|---|---|---|---|---|---|---|---|---|
N/A | N/A | GPT-4V | 56.8 | 49.9 | 75.8 | 73.9 | 67.6 | – | 71.6 |
N/A | N/A | Gemini Ultra | 59.4 | 53 | – | – | – | – | – |
N/A | N/A | Gemini Pro | 47.9 | 45.2 | 73.6 | 74.3 | 64.3 | – | 70.7 |
1.4B | 50M | Qwen-VL-Plus | 45.2 | 43.3 | – | – | 55.7 | – | 65.7 |
1.5B | 5.12M | CogVLM-30B | 32.1 | – | – | – | 56.8 | – | – |
125M | ~1M | Yi-VL-34B | 45.9 | – | – | – | – | – | – |
558K | 665K | LLaVA-1.5-13B | 36.4 | 27.6 | 67.8 | 63.3 | 36.3 | 72.5 | 68.2 |
558K | 760K | LLaVA-1.6-34B | 51.1 | 46.5 | 79.3 | 79 | 57.4 | 89.6 | 75.9 |
For more results, please see the expanded tables below.
More benchmarks will be added soon.
$^\dagger$: specialist
Grey: supervised finetuned
$^*$: training images observed
Normal text: zero-shot
Model | VQAv2 | GQA | VizWiz | TextVQA | ScienceQA |
---|---|---|---|---|---|
GPT-4V | 77.2 | – | – | 78.0 | – |
Gemini Ultra | 77.8 | – | – | 82.3 | – |
Gemini Pro | 71.2 | – | – | 74.6 | – |
PaLI-X | 86.0$^\dagger$ | – | 70.9$^\dagger$ | 71.4$^\dagger$ | – |
CogVLM-30B | 83.4 (84.7$^\dagger$) | 65.2$^\dagger$ | 76.4$^\dagger$ | 68.1 (69.3$^\dagger$) | 92.7$^\dagger$ |
LLaVA-1.5-13B | 80 | 63.3 | 53.6 | 61.3$^*$ | 71.6 |
LLaVA-1.6-Vicuna-7B | 81.8 | 64.2 | 57.6 | 64.9 | 70.1 |
LLaVA-1.6-Vicuna-13B | 82.8 | 65.4 | 60.5 | 67.1 | 73.6 |
LLaVA-1.6-Mistral-7B | 82.2 | 64.8 | 60.0 | 65.7 | 72.8 |
LLaVA-1.6-34B | 83.7 | 67.1 | 63.8 | 69.5 | 81.8 |
Data (PT) | Data (IT) | Model | MMMU (val) | MMMU (test) | MathVista | MMB-ENG | MMB-CN | MM-Vet | LLaVA-Wild | SEED-IMG | MME | POPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|
N/A | N/A | GPT-4V | 56.8 | 55.7 | 49.9 | 75.8 | 73.9 | 67.6 | – | 71.6 | – | – |
N/A | N/A | Gemini Ultra | 59.4 | – | 53 | – | – | – | – | – | – | – |
N/A | N/A | Gemini Pro | 47.9 | – | 45.2 | 73.6 | 74.3 | 64.3 | – | 70.7 | – | – |
1.4B | 50M | Qwen-VL-Plus | 45.2 | 40.8 | 43.3 | – | – | – | – | – | – | – |
1.5B | 5.12M | CogVLM-30B | 32.1 | 30.1 | – | – | – | 56.8 | – | – | – | – |
558K | 665K | LLaVA-1.5-13B (336$^2$) | 36.4 | 33.6 | 27.6 | 67.8 | 63.3 | 36.3 | 72.5 | 68.2 | 1531/295 | 85.93 |
558K | 760K | LLaVA-1.6-Vicuna-7B (672$^2$) | 35.8 | – | 34.6 | 67.4 | 60.6 | 43.9 | 81.6 | 70.2 | 1519/332 | 86.53 |
558K | 760K | LLaVA-1.6-Mistral-7B (672$^2$) | 35.3 | – | 37.7 | 68.7 | 61.2 | 47.3 | 83.2 | 72.2 | 1498/321 | 86.73 |
558K | 760K | LLaVA-1.6-13B (672$^2$) | 36.2 | – | 35.3 | 70 | 64.4 | 48.4 | 87.3 | 71.9 | 1575/326 | 86.23 |
558K | 760K | LLaVA-1.6-34B (672$^2$) | 51.1 | 44.7 | 46.5 | 79.3 | 79 | 57.4 | 89.6 | 75.9 | 1631/397 | 87.73 |
✨ Highlights:
- SoTA Performance! LLaVA-1.6 achieves the best performance compared with open-source LMMs such as CogVLM or Yi-VL. Compared with commercial ones, it catches up to Gemini Pro and outperforms Qwen-VL-Plus on selected benchmarks.
- Zero-shot Chinese capability. LLaVA-1.6's Chinese capability is an emerging zero-shot capability (i.e., only English multimodal data is considered in training). Its performance on Chinese multimodal scenarios is surprisingly good, e.g., SoTA on MMBench-CN.
- Low Training Cost. LLaVA-1.6 is trained with 32 GPUs for ~1 day, with 1.3M data samples in total. The compute / training data cost is 100-1000 times smaller than that of other methods.
Qualitative Results
Mark Zuckerberg talks about Llama-3 and 600K H100s
User: Explain this to me in detail.
LLaVA-1.6-34B
The image you've provided appears to be a screenshot of a social media post, specifically from Instagram, featuring a person speaking and a text overlay. The person in the image is wearing a black t-shirt and appears to be in the middle of a conversation or presentation. The text overlay is a quote from a post by a user named "zuck," which is presumably a reference to Mark Zuckerberg, the CEO of Meta (formerly known as Facebook).
The text overlay reads:
"Some updates on our AI efforts. Our long term vision is to build general intelligence, open source it responsibly, and make it widely available so everyone can benefit. We're bringing our two major AI research efforts (FAIR and GenAI) closer together to support this. We're currently training our next-gen model Llama 3, and we're building massive compute infrastructure to support our future roadmap, including 35k H100s by the end of this year — and overall almost 600k H100s equivalents of compute if you include other GPUs. Also really excited about our progress building new AI-centric computing devices like Ray Ban Meta smart glasses. Lots more to come."
This text provides insights into Meta's artificial intelligence (AI) initiatives. It mentions that the company's long-term goal is to develop general intelligence and make it accessible to everyone. The post also discusses the integration of Meta's two AI research efforts, FAIR (Facebook AI Research) and GenAI, to support this goal.
The text also mentions the training of a next-generation AI model called Llama 3 and the construction of massive compute infrastructure to support Meta's future plans. It highlights the company's investment in NVIDIA's H100 GPUs, with a total of 35,000 H100s by the end of the year and almost 600,000 H100s equivalents of compute if other GPUs are included.
Finally, the post expresses excitement about Meta's progress in developing AI-centric computing devices, specifically Ray Ban Meta smart glasses. The post concludes with a statement that there is more to come, suggesting that Meta has more AI-related announcements or developments in the pipeline.
The image of the person speaking could be from a video or a live stream where he is discussing these AI updates. The context of the image suggests that it is part of a larger conversation or announcement related to Meta's AI strategy and developments.
When to leave for pickup based on flight info?
User: I need to pick up my wife. I live in San Jose. When should I leave?
LLaVA-1.6-34B
Based on the information provided in the image, the flight is scheduled to arrive at 11:51 AM at San Francisco International Airport (SFO). If you live in San Jose, you should consider the travel time between San Jose and San Francisco, which is approximately 45-60 minutes depending on traffic conditions.
To ensure you have enough time to pick up your wife, you should leave San Jose no later than 11:00 AM to account for traffic and any unexpected delays. However, it's always a good idea to leave earlier than necessary to make sure you arrive on time.
Detailed Technical Improvements
We detail the findings we have and the improvements we make to LLaVA-1.5. More implementation details will be released and documented here in the coming days.
(1) Dynamic High Resolution
We design our model at high resolution with the goal of preserving its data efficiency.
When provided with high-resolution images and representations that preserve these details, the model's capacity to grasp intricate details in an image is significantly improved. It reduces model hallucination, i.e., conjecturing imagined visual content when confronted with low-resolution images. Our 'AnyRes' technique is designed to accommodate images of various high resolutions. We employ a grid configuration of $\{2\times2, 1\times\{2,3,4\}, \{2,3,4\}\times1\}$, balancing performance efficiency with operational costs. See our updated LLaVA-1.5 technical report for more details.
Illustration of the dynamic high resolution scheme: a grid configuration of $\{2\times2\}$
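As a rough illustration of this scheme (our own simplified sketch, not the exact LLaVA-1.6 implementation, which also handles padding and resolution-aware grid selection), the snippet below picks the candidate grid whose aspect ratio best matches the input image, resizes the image onto that grid, and cuts it into 336×336 tiles alongside a low-resolution base view:

```python
from PIL import Image

TILE = 336  # base input resolution of the vision encoder
# Candidate grids {2x2, 1x{2,3,4}, {2,3,4}x1}, expressed as (cols, rows).
GRIDS = [(2, 2), (1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (4, 1)]

def select_grid(width, height):
    """Pick the candidate grid whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(GRIDS, key=lambda g: abs(g[0] / g[1] - ratio))

def split_into_tiles(img: Image.Image):
    """Resize the image onto the selected grid and cut it into 336x336 tiles."""
    cols, rows = select_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]
    # A downsampled view of the whole image is kept as well, so the model sees
    # both global context and high-resolution local details.
    base = img.resize((TILE, TILE))
    return base, tiles
```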
(2) Data Mixture
- High-quality User Instruct Data. Our definition of high-quality visual instruction-following data hinges on two principal criteria: First, the diversity of task instructions, ensuring they adequately represent the broad spectrum of user intents likely to be encountered in real-world scenarios, particularly during the model's deployment phase. Second, the quality of responses is critical, with the objective of soliciting favorable user feedback. To achieve this, we consider two data sources: (1) Existing GPT-V data: LAION-GPT-V and ShareGPT-4V. (2) To further facilitate better visual conversation for more scenarios, we collect a small 15K visual instruction tuning dataset covering different applications. The instructions and images come from the LLaVA demo, i.e., real-world user requests. We carefully filter samples that may have privacy concerns or are potentially harmful, and generate the responses with GPT-4V.
- Multimodal Document/Chart Data. (1) We remove TextCaps from our training data as we realize that TextCaps uses the same set of training images as TextVQA. This allows us to better understand the zero-shot OCR capability of our model when evaluating on TextVQA during development. To maintain and further improve our model's OCR capability, we replace TextCaps with DocVQA and SynDog-EN. (2) Motivated by Qwen-VL-7B-Chat, we further add ChartQA, DVQA, and AI2D for better chart and diagram understanding. A loose sketch of assembling the overall mixture follows this list.
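As a loose sketch of how such a data mixture might be assembled (the file names and paths below are hypothetical placeholders, not the released data recipe), the per-dataset instruction files can simply be concatenated into a single visual instruction tuning list:

```python
import json

# Hypothetical file layout; the actual released data recipe may differ.
MIXTURE = {
    "sharegpt4v": "data/sharegpt4v_instruct.json",
    "laion_gpt4v": "data/laion_gpt4v_instruct.json",
    "llava_demo_15k": "data/llava_demo_15k.json",  # curated real-user requests, GPT-4V responses
    "docvqa": "data/docvqa_train.json",
    "syndog_en": "data/syndog_en_train.json",
    "chartqa": "data/chartqa_train.json",
    "dvqa": "data/dvqa_train.json",
    "ai2d": "data/ai2d_train.json",
}

def build_mixture(paths=MIXTURE, out_path="data/llava_1_6_mix.json"):
    """Concatenate per-dataset instruction files into one training list."""
    samples = []
    for name, path in paths.items():
        with open(path) as f:
            samples.extend(json.load(f))
    with open(out_path, "w") as f:
        json.dump(samples, f)
    return len(samples)
```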
(3) Scaling the LLM Backbone
In addition to Vicuna-1.5 (7B and 13B), we consider more LLMs, including Mistral-7B and Nous-Hermes-2-Yi-34B. These LLMs possess nice properties, flexible commercial-use terms, strong bilingual support, and larger language model capacity. This allows LLaVA to support a wider spectrum of users and more scenarios in the community. The LLaVA recipe works well with various LLMs, and scales up smoothly with the LLM up to 34B.
Model Card
| Name | | LLaVA-1.6-7B | LLaVA-1.6-13B | LLaVA-1.6-34B |
|---|---|---|---|---|
| Model Size | Total | 7.06B | 13.35B | 34.75B |
| | Vision Encoder | 303.5M | 303.5M | 303.5M |
| | Connector | 21M | 31.5M | 58.7M |
| | LLM | 6.74B | 13B | 34.39B |
| Resolution | | 336 x [(2,2), (1,2), (2,1), (1,3), (3,1), (1,4), (4,1)] | | |
| Stage-1 | Training Data | 558K | | |
| | Trainable Module | Connector | | |
| Stage-2 | Training Data | 760K | | |
| | Trainable Module | Full model | | |
| Compute (#GPU x #Hours) | | 8×20 | 16×24 | 32×30 |
| Training Data (#Samples) | | 1318K | | |
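To make the Resolution row concrete, here is a small back-of-the-envelope sketch (our own, assuming a ViT-L/14 encoder at 336 px, i.e. 24×24 = 576 patch tokens per 336×336 tile, plus one downsampled base view, and ignoring any separator or pooling tokens the actual implementation may add) of the tile and visual-token counts each grid option implies:

```python
TILE_TOKENS = 24 * 24  # 576 patch tokens per 336x336 tile for a ViT-L/14 at 336 px

# (cols, rows) grid options from the Resolution row above.
grids = [(2, 2), (1, 2), (2, 1), (1, 3), (3, 1), (1, 4), (4, 1)]

for cols, rows in grids:
    tiles = cols * rows + 1  # high-resolution tiles plus one low-resolution base view
    print(f"{cols}x{rows}: {cols * 336}x{rows * 336} px, "
          f"{tiles} tiles, ~{tiles * TILE_TOKENS} visual tokens")
```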
Team
Acknowledgement
- A16Z Open Source AI Grants Program.
- We thank Lianmin Zheng, Ying Sheng, and Shiyi Cao for the integration of LLaVA into SGLang.
- This work was supported in part by NSF CAREER IIS2150012, Microsoft Accelerate Foundation Models Research, and Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration) and (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training).
Citation
@misc{liu2024llava16,
title={LLaVA-1.6: Improved reasoning, OCR, and world knowledge},
url={https://llava-vl.github.io/blog/2024-01-30-llava-1-6/},
author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
month={January},
year={2024}
}
@misc{liu2023improvedllava,
title={Improved Baselines with Visual Instruction Tuning},
author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
publisher={arXiv:2310.03744},
year={2023},
}
@misc{liu2023llava,
title={Visual Instruction Tuning},
author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
publisher={NeurIPS},
year={2023},
}