TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Question Answering	ActivityNet-QA	MA-LMM	Accuracy	49.8	# 7
Video Classification	Breakfast	MA-LMM	Accuracy (%)	93.0	# 1
Video Classification	COIN	MA-LMM	Accuracy (%)	93.2	# 1
Video Question Answering	MSRVTT-QA	MA-LMM	Accuracy	48.5	# 5
Visual Question Answering (VQA)	MSVD-QA	MA-LMM	Accuracy	0.606	# 2
Video Captioning	YouCook2	MA-LMM	METEOR	17.6	# 6
Video Captioning	YouCook2	MA-LMM	CIDEr	1.31	# 6

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

8 Apr 2024 · Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim

With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has attracted growing interest. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames, which restricts them to short video understanding. In this study, we focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously, as most existing work does, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model achieves state-of-the-art performance across multiple datasets. Code is available at https://boheumd.github.io/MA-LMM/.
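The online processing idea in the abstract can be illustrated with a small sketch. The abstract does not say how the memory bank stays bounded, so everything here is an assumption made for illustration: the `MemoryBank` class, the `process_stream` helper, and the rule of averaging the two most similar adjacent features once capacity is exceeded are hypothetical and are not taken from the official code.

```python
import math
from typing import Iterable, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryBank:
    """Hypothetical fixed-capacity store of per-frame feature vectors."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.features: List[Vector] = []

    def add(self, feature: Vector) -> None:
        """Append one frame feature, compressing if the bank overflows."""
        self.features.append(list(feature))
        if len(self.features) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        # Assumed compression rule: find the most similar pair of
        # temporally adjacent features and replace it with its average.
        sims = [
            cosine(self.features[i], self.features[i + 1])
            for i in range(len(self.features) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(self.features[i], self.features[i + 1])]
        self.features[i:i + 2] = [merged]


def process_stream(frames: Iterable[Vector], capacity: int) -> MemoryBank:
    """Feed frame features one at a time, as in online processing."""
    bank = MemoryBank(capacity)
    for feature in frames:
        bank.add(feature)
    return bank
```

Under these assumptions the bank never holds more than `capacity` vectors regardless of video length, which is the property the abstract relies on to stay within context length and GPU memory limits. A real model would store learned visual tokens rather than plain Python lists.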

Code

boheumd/MA-LMM (official, 114 GitHub stars)

Tasks

Question Answering

Video Captioning

Video Classification

Video Question Answering

Video Understanding

Visual Question Answering (VQA)

Datasets

MSR-VTT

MSVD

YouCook2

Breakfast

ActivityNet-QA

COIN

MSRVTT-QA

MSVD-QA

Results from the Paper

Ranked #1 on Video Classification on COIN

Methods

Focus
