AVT: Audio-Video Transformer for Multimodal Action Recognition

Action recognition is an essential task in video understanding. To learn effectively from heterogeneous data sources, we propose a novel multimodal action recognition approach termed Audio-Video Transformer (AVT). AVT combines video and audio signals to improve action recognition accuracy, leveraging the effective spatio-temporal representation of the video Transformer. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources; instead, we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on the Kinetics-Sounds and Epic-Kitchens-100 datasets by 8% and 1%, respectively, without external training data. AVT also surpasses one of the previous state-of-the-art video Transformers by 10% on the VGGSound dataset by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal Transformers, AVT is 1.3x more efficient in terms of FLOPs and improves accuracy by 4.2% on Epic-Kitchens-100. Visualization results further demonstrate that audio provides complementary and discriminative features, and that AVT can effectively understand an action from a combination of audio and video.
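The audio-video contrastive objective named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a symmetric InfoNCE-style loss over paired clip-level embeddings, and the function names, the temperature value of 0.07, and the toy embeddings are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def audio_video_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed form, not taken from the paper).

    audio_emb[i] and video_emb[i] are treated as a positive pair; every other
    combination in the batch is treated as a negative.
    """
    a = [l2_normalize(x) for x in audio_emb]
    v = [l2_normalize(x) for x in video_emb]
    n = len(a)
    # sim[i][j]: scaled cosine similarity between audio i and video j.
    sim = [
        [sum(p * q for p, q in zip(a[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-video direction: each row should peak on its diagonal entry.
    a2v = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Video-to-audio direction: each column should peak on its diagonal entry.
    v2a = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (a2v + v2a)


# Toy batch of two clips: aligned pairs versus deliberately swapped pairs.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = audio_video_contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = audio_video_contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairs give a loss close to zero, while the swapped pairs give a much larger loss, which is the behavior that pulls matching audio and video embeddings toward a shared space.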

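| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | EPIC-KITCHENS-100 | AVT | Action@1 | 47.2 | #12 |
| Action Recognition | EPIC-KITCHENS-100 | AVT | Verb@1 | 70.4 | #9 |
| Action Recognition | EPIC-KITCHENS-100 | AVT | Noun@1 | 59.3 | #11 |
| Audio Classification | VGGSound | AVT (Audio-Visual) | Top 1 Accuracy | 63.9 | #8 |
| Audio Classification | VGGSound | AVT (Audio-Visual) | Top 5 Accuracy | 85.0 | #3 |
| Audio Classification | VGGSound | AVT (V) | Top 1 Accuracy | 53.2 | #16 |
| Audio Classification | VGGSound | AVT (V) | Top 5 Accuracy | 74.8 | #8 |
| Multi-modal Classification | VGG-Sound | AVT | Top-1 Accuracy | 63.9 | #4 |
| Multi-modal Classification | VGG-Sound | AVT | Top-5 Accuracy | 85.0 | #2 |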

Methods