AVT: Audio-Video Transformer for Multimodal Action Recognition
Action recognition is an essential field for video understanding. To learn from heterogeneous data sources effectively, we propose a novel multimodal action recognition approach termed Audio-Video Transformer (AVT). AVT combines video and audio signals to improve action recognition accuracy, leveraging the effective spatio-temporal representation of the video Transformer. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources; instead, we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on the Kinetics-Sounds and Epic-Kitchens-100 datasets by 8% and 1%, respectively, without external training data. AVT also surpasses one of the previous state-of-the-art video Transformers by 10% on the VGGSound dataset by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal Transformers, AVT is 1.3x more efficient in terms of FLOPs and improves accuracy by 4.2% on Epic-Kitchens-100. Visualization results further demonstrate that audio provides complementary and discriminative features, and that AVT can effectively understand an action from a combination of audio and video.
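The abstract names audio-video contrastive learning as one of the self-supervised objectives but does not give its formula. As a rough illustration only, the sketch below uses the common symmetric InfoNCE form, which may differ from what AVT actually optimizes. The function names, the `temperature` value of 0.07, and the toy embeddings are all assumptions made for this example, not details from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_row(row):
    """Numerically stable log-softmax over one row of similarity scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def audio_video_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clips (assumed form, not AVT's exact loss).

    audio_emb[i] and video_emb[i] are embeddings of the same clip (a positive pair);
    every other pairing in the batch is treated as a negative.
    """
    audio = [l2_normalize(a) for a in audio_emb]
    video = [l2_normalize(v) for v in video_emb]
    n = len(audio)

    # sim[i][j]: temperature-scaled cosine similarity of audio i and video j.
    sim = [
        [sum(x * y for x, y in zip(a, v)) / temperature for v in video]
        for a in audio
    ]

    # Audio-to-video direction: each audio clip should pick out its own video.
    a2v = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n

    # Video-to-audio direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    v2a = -sum(log_softmax_row(sim_t[i])[i] for i in range(n)) / n

    return 0.5 * (a2v + v2a)


# Toy batch of two clips with 2-D embeddings (hypothetical values).
aligned_loss = audio_video_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                            [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = audio_video_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                            [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, `aligned_loss` is close to zero because each audio embedding matches its own video, while `swapped_loss` is large because every audio embedding is most similar to the wrong video. Minimizing a loss of this kind is one way to pull paired audio and video representations toward a common space, which is the effect the abstract describes.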
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Action Recognition | EPIC-KITCHENS-100 | AVT | Action@1 | 47.2 | # 12
Action Recognition | EPIC-KITCHENS-100 | AVT | Verb@1 | 70.4 | # 9
Action Recognition | EPIC-KITCHENS-100 | AVT | Noun@1 | 59.3 | # 11
Audio Classification | VGGSound | AVT (Audio-Visual) | Top 1 Accuracy | 63.9 | # 8
Audio Classification | VGGSound | AVT (Audio-Visual) | Top 5 Accuracy | 85.0 | # 3
Audio Classification | VGGSound | AVT (V) | Top 1 Accuracy | 53.2 | # 16
Audio Classification | VGGSound | AVT (V) | Top 5 Accuracy | 74.8 | # 8
Multi-modal Classification | VGG-Sound | AVT | Top-1 Accuracy | 63.9 | # 4
Multi-modal Classification | VGG-Sound | AVT | Top-5 Accuracy | 85.0 | # 2