Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

14 Nov 2023 · Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan

Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations. However, existing methods encounter challenges in effectively handling both image and video understanding, particularly with limited visual tokens. In this work, we introduce Chat-UniVi, a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos through a unified visual representation. Specifically, we employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationship required for videos. Moreover, we leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details. Notably, Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications. Extensive experimental results demonstrate that Chat-UniVi consistently outperforms even existing methods exclusively designed for either images or videos. Code is available at https://github.com/PKU-YuanGroup/Chat-UniVi.
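The core idea described above is to compress the many patch features produced by a vision encoder into a small, fixed budget of "dynamic visual tokens" that works for both a single image and a multi-frame video, plus a coarser multi-scale view. Below is a minimal illustrative sketch of that idea, not the authors' exact implementation (see the linked repository for that): a plain k-means pass stands in for the paper's token-merging procedure, and the token counts, feature dimension, and frame count are hypothetical.

```python
# Illustrative sketch of dynamic visual token merging (assumption: simple k-means
# stands in for the paper's merging procedure; sizes below are hypothetical).
import torch

def merge_tokens(patch_feats: torch.Tensor, num_tokens: int, iters: int = 10) -> torch.Tensor:
    """Cluster patch features of shape (N, D) into `num_tokens` merged tokens (num_tokens, D)."""
    N, _ = patch_feats.shape
    # Initialize cluster centers from randomly chosen patches.
    centers = patch_feats[torch.randperm(N)[:num_tokens]].clone()
    for _ in range(iters):
        # Assign each patch to its nearest center, then recompute centers as cluster means.
        assign = torch.cdist(patch_feats, centers).argmin(dim=1)  # (N,)
        for k in range(num_tokens):
            mask = assign == k
            if mask.any():
                centers[k] = patch_feats[mask].mean(dim=0)
    return centers

# Toy usage: an image yields a grid of patch features (e.g., 24x24 = 576, D = 1024),
# while a video concatenates patches from several frames. Both are reduced to the
# same small token budget before being passed to the language model.
image_patches = torch.randn(576, 1024)
video_patches = torch.randn(8 * 576, 1024)   # 8 frames (hypothetical)

image_tokens = merge_tokens(image_patches, num_tokens=64)   # (64, 1024)
video_tokens = merge_tokens(video_patches, num_tokens=64)   # (64, 1024)

# Multi-scale view: merge again at a coarser level and concatenate, so the LLM
# receives both higher-level concepts and finer visual details.
coarse_tokens = merge_tokens(image_tokens, num_tokens=16)
multi_scale = torch.cat([image_tokens, coarse_tokens], dim=0)  # (80, 1024)
```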

Results

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Video Question Answer | ActivityNet-QA | Chat-UniVi | Confidence Score | 3.3 | #5 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Chat-UniVi | Accuracy | 46.1 | #9 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Chat-UniVi-13B | Confidence Score | 3.6 | #2 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Chat-UniVi-13B | Accuracy | 46.4 | #8 |
| Video Question Answering | ActivityNet-QA | Chat-UniVi-13B | Accuracy | 46.4 | #13 |
| Video Question Answering | ActivityNet-QA | Chat-UniVi-13B | Confidence Score | 3.3 | #2 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-7B | Conversation | 84.1 | #1 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-7B | Detail description | 74.2 | #3 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-7B | Complex reasoning | 93.7 | #3 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-7B | All | 84.2 | #3 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-13B | Conversation | 83.1 | #3 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-13B | Detail description | 75.3 | #2 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-13B | Complex reasoning | 96.5 | #1 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-13B | All | 85.1 | #2 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-13B | Conversation | 84.1 | #1 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-13B | Detail description | 79.4 | #1 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-13B | Complex reasoning | 94.7 | #2 |
| Image-based Generative Performance Benchmarking | ImageInstruct | Chat-UniVi-13B | All | 86.1 | #1 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-7B | Conversation | 70.3 | #4 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-7B | Detail description | 56.6 | #4 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-7B | Complex reasoning | 83.3 | #4 |
| Image-based Generative Performance Benchmarking | ImageInstruct | LLaVA-7B | All | 70.1 | #4 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Chat-UniVi-7B | Accuracy | 55.0 | #12 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Chat-UniVi-7B | Confidence Score | 3.1 | #12 |
| Zero-Shot Video Question Answer | MSVD-QA | Chat-UniVi-7B | Accuracy | 69.3 | #10 |
| Zero-Shot Video Question Answer | MSVD-QA | Chat-UniVi-7B | Confidence Score | 3.7 | #6 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Natural Science | 90.41 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Social Science | 95.05 | #2 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Language Science | 88.91 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Text Context | 89.64 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Image Context | 88.05 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | No Context | 90.94 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Grades 1-6 | 91.19 | #3 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Grades 7-12 | 90.64 | #2 |
| Science Question Answering | ScienceQA | Chat-UniVi-13B | Avg. Accuracy | 90.99 | #4 |
| Zero-Shot Video Question Answer | TGIF-QA | Chat-UniVi-7B | Accuracy | 69.0 | #4 |
| Zero-Shot Video Question Answer | TGIF-QA | Chat-UniVi-7B | Confidence Score | 3.8 | #4 |
| Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | Chat-UniVi | gpt-score | 2.81 | #2 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | Correctness of Information | 2.89 | #9 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | Detail Orientation | 2.91 | #8 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | Contextual Understanding | 3.46 | #8 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | Temporal Understanding | 2.39 | #10 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | Consistency | 2.81 | #5 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Chat-UniVi | Mean | 2.99 | #6 |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | Chat-UniVi | gpt-score | 2.89 | #4 |
| Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | Chat-UniVi | gpt-score | 2.39 | #5 |
| Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | Chat-UniVi | gpt-score | 3.46 | #4 |
| Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | Chat-UniVi | gpt-score | 2.91 | #5 |

Methods


No methods listed for this paper.