TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Recognition	EPIC-KITCHENS-100	AVT	Action@1	47.2	# 12
Action Recognition	EPIC-KITCHENS-100	AVT	Verb@1	70.4	# 9
Action Recognition	EPIC-KITCHENS-100	AVT	Noun@1	59.3	# 11
Audio Classification	VGGSound	AVT (Audio-Visual)	Top 1 Accuracy	63.9	# 8
Audio Classification	VGGSound	AVT (Audio-Visual)	Top 5 Accuracy	85.0	# 3
Audio Classification	VGGSound	AVT (V)	Top 1 Accuracy	53.2	# 16
Audio Classification	VGGSound	AVT (V)	Top 5 Accuracy	74.8	# 8
Multi-modal Classification	VGG-Sound	AVT	Top-1 Accuracy	63.9	# 4
Multi-modal Classification	VGG-Sound	AVT	Top-5 Accuracy	85.0	# 2

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/avt-audio-video-transformer-for-multimodal/multi-modal-classification-on-vgg-sound)](https://paperswithcode.com/sota/multi-modal-classification-on-vgg-sound?p=avt-audio-video-transformer-for-multimodal)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/avt-audio-video-transformer-for-multimodal/audio-classification-on-vggsound)](https://paperswithcode.com/sota/audio-classification-on-vggsound?p=avt-audio-video-transformer-for-multimodal)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/avt-audio-video-transformer-for-multimodal/action-recognition-on-epic-kitchens-100)](https://paperswithcode.com/sota/action-recognition-on-epic-kitchens-100?p=avt-audio-video-transformer-for-multimodal)`

AVT: Audio-Video Transformer for Multimodal Action Recognition

Submitted to ICLR 2022 · Wentao Zhu, Jingru Yi, Kevin Hsu, Xiaohang Sun, Xiang Hao, Linda Liu, Mohamed Omar

Action recognition is an essential field for video understanding. To learn from heterogeneous data sources effectively, in this work, we propose a novel multimodal action recognition approach termed Audio-Video Transformer (AVT). AVT uses a combination of video and audio signals to improve action recognition accuracy, leveraging the effective spatio-temporal representation of the video Transformer. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources; instead, we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on the Kinetics-Sounds and Epic-Kitchens-100 datasets by 8% and 1%, respectively, without external training data. AVT also surpasses one of the previous state-of-the-art video Transformers by 10% on the VGGSound dataset by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal Transformers, AVT is 1.3x more efficient in terms of FLOPs and improves the accuracy by 4.2% on Epic-Kitchens-100. Visualization results further demonstrate that the audio provides complementary and discriminative features, and that AVT can effectively understand the action from a combination of audio and video.
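To make the audio-video contrastive objective named in the abstract more concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired audio and video clip embeddings. This is not the authors' implementation: the page lists no code, and the abstract does not specify the exact loss form. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def audio_video_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (illustrative, hypothetical helper).

    audio_emb[i] and video_emb[i] are assumed to come from the same clip;
    every other pairing in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in audio_emb]
    v = [l2_normalize(x) for x in video_emb]
    # Cosine similarity of every audio/video pair, scaled by the temperature.
    logits = [
        [sum(p * q for p, q in zip(ai, vj)) / temperature for vj in v]
        for ai in a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # video-to-audio direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch of two clips with 2-D embeddings (made-up values).
audio = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[0.9, 0.1], [0.1, 0.9]]   # matches the audio ordering
video_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = audio_video_contrastive_loss(audio, video_aligned)
loss_swapped = audio_video_contrastive_loss(audio, video_swapped)
print(loss_aligned, loss_swapped)
```

In this toy batch the loss is lower when each audio embedding is paired with the video embedding from the same clip, which is the behavior a contrastive objective of this kind is meant to reward when mapping both modalities into a common representation space.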

Code

No code implementations yet.

Tasks

Action Recognition

Audio Classification

Contrastive Learning

Multi-modal Classification

Video Understanding

Datasets

Kinetics

VGG-Sound

EPIC-KITCHENS-100

Results from the Paper

Ranked #4 on Multi-modal Classification on VGG-Sound

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
