LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on the user's input to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions of these skills. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded in and actively engaged throughout the entire human-AI interaction session, significantly improving tool-use performance and enabling new scenarios.
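To make the tool-use pattern described above concrete, below is a minimal Python sketch of one interaction turn: the multimodal model inspects the image and query, optionally selects a skill from the repository, and then folds the tool output back into its final answer. All names here (`Skill`, `answer_with_tools`, `plan_fn`, etc.) are hypothetical placeholders for illustration, not the authors' released code or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Plan:
    """What the multimodal model decides for this step: either a direct
    answer, or the name of a skill to invoke plus its arguments."""
    answer: Optional[str] = None
    tool: Optional[str] = None
    args: str = ""


@dataclass
class Skill:
    """A pre-trained vision or vision-language tool in the skill repository."""
    name: str
    description: str
    run: Callable[[str, str], str]  # (image_path, args) -> tool output


def answer_with_tools(
    image_path: str,
    user_query: str,
    skills: Dict[str, Skill],
    plan_fn: Callable[[str, str, str], Plan],  # (image, query, tool_output) -> Plan
) -> str:
    """One interaction turn: the model may call a skill, then summarizes its output."""
    # Step 1: the model reads the image and query and either answers directly
    # or names a tool to call.
    plan = plan_fn(image_path, user_query, "")
    if plan.tool is None:
        return plan.answer or ""

    # Step 2: run the selected skill on the original image, keeping the visual
    # input grounded throughout the interaction.
    tool_output = skills[plan.tool].run(image_path, plan.args)

    # Step 3: the model aggregates the tool output into a natural-language reply.
    final = plan_fn(image_path, user_query, tool_output)
    return final.answer or ""
```

In this sketch the skill repository is just a dictionary of callables, so adding a new capability (e.g. a detector or an image generator) only requires registering another `Skill`; the model's planning step decides when each one is used.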


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| LMM real-life tasks | Leaderboard | LLaVA-Plus (13B) | ELO Rating | 1203 | #1 |
| LMM real-life tasks | Leaderboard | LLaVA-Plus (13B) | Win rate | 0.3507 | #1 |
| Visual Question Answering | MM-Vet | LLaVA-Plus-7B (All Tools) | GPT-4 score | 27.5±0.3 | #84 |
| Visual Question Answering | MM-Vet | LLaVA-Plus-7B (All Tools) | Params | 7B | #1 |
| Visual Question Answering | MM-Vet | LLaVA-Plus-13B (All Tools, V1.3, 336px) | GPT-4 score | 35.0±0.0 | #63 |
| Visual Question Answering | MM-Vet | LLaVA-Plus-13B (All Tools, V1.3, 336px) | Params | 13B | #1 |

Methods


No methods listed for this paper.