MoViNets: Mobile Video Networks for Efficient Video Recognition

We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets and do not support online inference, making them difficult to deploy on mobile devices. We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs. First, we design a video network search space and employ neural architecture search to generate efficient and diverse 3D CNN architectures. Second, we introduce the Stream Buffer technique, which decouples memory from video clip duration, allowing 3D CNNs to embed arbitrary-length streaming video sequences for both training and inference with a small constant memory footprint. Third, we propose a simple ensembling technique to further improve accuracy without sacrificing efficiency. These three progressive techniques allow MoViNets to achieve state-of-the-art accuracy and efficiency on the Kinetics, Moments in Time, and Charades video action recognition datasets. For instance, MoViNet-A5-Stream achieves the same accuracy as X3D-XL on Kinetics-600 while requiring 80% fewer FLOPs and 65% less memory. Code will be made available at https://github.com/tensorflow/models/tree/master/official/vision.
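The constant-memory claim for the Stream Buffer can be illustrated with a toy example. The sketch below is not the released implementation: it assumes the buffer simply caches the trailing frames that a causal temporal convolution needs from the previous chunk, and it uses one scalar per frame as a stand-in for a feature map. The function name `causal_temporal_conv` and the kernel values are hypothetical.

```python
from typing import List, Optional, Tuple


def causal_temporal_conv(
    frames: List[float],
    kernel: List[float],
    buffer: Optional[List[float]] = None,
) -> Tuple[List[float], List[float]]:
    """Causal 1D convolution along time with a carried-over buffer.

    `frames` holds one scalar feature per frame (illustrative stand-in).
    `buffer` holds the last len(kernel) - 1 frames seen so far; None marks
    the start of the stream and is treated as zero padding.
    Returns (outputs, new_buffer).
    """
    pad = len(kernel) - 1
    if buffer is None:
        buffer = [0.0] * pad
    padded = buffer + frames
    # Output t only looks at frame t and the `pad` frames before it.
    outputs = [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]
    # Keep only the trailing `pad` frames: the size is fixed by the kernel,
    # not by how much video has been processed.
    new_buffer = padded[len(padded) - pad:]
    return outputs, new_buffer


kernel = [0.25, 0.5, 0.25]
clip = [float(t) for t in range(10)]

# Process the whole clip at once.
full, _ = causal_temporal_conv(clip, kernel)

# Process the same clip in chunks of 4 frames, carrying the buffer forward.
streamed, buf = [], None
for start in range(0, len(clip), 4):
    out, buf = causal_temporal_conv(clip[start:start + 4], kernel, buf)
    streamed.extend(out)

assert streamed == full
```

In this toy setting the chunked outputs match the whole-clip outputs, while the state carried between chunks never grows beyond `len(kernel) - 1` frames, regardless of clip length.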

CVPR 2021

Results from the Paper


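| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Classification | Charades | MoViNet-A4 | MAP | 48.5 | #12 |
| Action Classification | Charades | MoViNet-A2 | MAP | 32.5 | #40 |
| Action Classification | Charades | MoViNet-A6 | MAP | 63.2 | #3 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A2 | Action@1 | 41.2 | #23 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A2 | Verb@1 | 67.1 | #17 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A2 | Noun@1 | 52.3 | #23 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A2 | GFLOPs | 7.59x1 | #1 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A6 | Action@1 | 47.7 | #11 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A6 | Verb@1 | 72.2 | #3 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A6 | Noun@1 | 57.3 | #16 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A6 | GFLOPs | 117x1 | #1 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A0 | Action@1 | 36.8 | #26 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A0 | Verb@1 | 64.8 | #23 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A0 | Noun@1 | 47.4 | #24 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A0 | GFLOPs | 1.74x1 | #1 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A4 | Action@1 | 44.4 | #18 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A4 | Verb@1 | 68.8 | #15 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A4 | Noun@1 | 56.2 | #19 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A4 | GFLOPs | 42.2x1 | #1 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A5 | Action@1 | 44.5 | #15 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A5 | Verb@1 | 69.1 | #13 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A5 | Noun@1 | 55.1 | #20 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A5 | GFLOPs | 74.9x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A1 | Acc@1 | 72.7 | #164 |
| Action Classification | Kinetics-400 | MoViNet-A1 | Acc@5 | 91.2 | #112 |
| Action Classification | Kinetics-400 | MoViNet-A1 | FLOPs (G) x views | 6.0x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A0 | Acc@1 | 65.8 | #183 |
| Action Classification | Kinetics-400 | MoViNet-A0 | Acc@5 | 87.4 | #124 |
| Action Classification | Kinetics-400 | MoViNet-A0 | FLOPs (G) x views | 2.7x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A4 | Acc@1 | 80.5 | #88 |
| Action Classification | Kinetics-400 | MoViNet-A4 | Acc@5 | 94.5 | #65 |
| Action Classification | Kinetics-400 | MoViNet-A4 | FLOPs (G) x views | 105x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A5 | Acc@1 | 80.9 | #83 |
| Action Classification | Kinetics-400 | MoViNet-A5 | Acc@5 | 94.9 | #56 |
| Action Classification | Kinetics-400 | MoViNet-A5 | FLOPs (G) x views | 281x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A3 | Acc@1 | 78.2 | #119 |
| Action Classification | Kinetics-400 | MoViNet-A3 | Acc@5 | 93.8 | #83 |
| Action Classification | Kinetics-400 | MoViNet-A3 | FLOPs (G) x views | 56.9x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A2 | Acc@1 | 75.0 | #150 |
| Action Classification | Kinetics-400 | MoViNet-A2 | Acc@5 | 92.3 | #105 |
| Action Classification | Kinetics-400 | MoViNet-A2 | FLOPs (G) x views | 10.3x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A6 | Acc@1 | 81.5 | #74 |
| Action Classification | Kinetics-400 | MoViNet-A6 | FLOPs (G) x views | 386x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A5 (AutoAugment) | Top-1 Accuracy | 84.3 | #33 |
| Action Classification | Kinetics-600 | MoViNet-A5 (AutoAugment) | Top-5 Accuracy | 96.4 | #26 |
| Action Classification | Kinetics-600 | MoViNet-A5 (AutoAugment) | GFLOPs | 281x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A6 | Top-1 Accuracy | 83.5 | #38 |
| Action Classification | Kinetics-600 | MoViNet-A6 | Top-5 Accuracy | 96.5 | #23 |
| Action Classification | Kinetics-600 | MoViNet-A6 | GFLOPs | 386x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A0 | Top-1 Accuracy | 71.5 | #62 |
| Action Classification | Kinetics-600 | MoViNet-A0 | Top-5 Accuracy | 90.4 | #48 |
| Action Classification | Kinetics-600 | MoViNet-A0 | GFLOPs | 2.7x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A1 | Top-1 Accuracy | 76.0 | #59 |
| Action Classification | Kinetics-600 | MoViNet-A1 | Top-5 Accuracy | 92.6 | #46 |
| Action Classification | Kinetics-600 | MoViNet-A1 | GFLOPs | 6.0x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A2 | Top-1 Accuracy | 77.5 | #57 |
| Action Classification | Kinetics-600 | MoViNet-A2 | Top-5 Accuracy | 93.4 | #45 |
| Action Classification | Kinetics-600 | MoViNet-A2 | GFLOPs | 10.3x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A3 | Top-1 Accuracy | 80.8 | #50 |
| Action Classification | Kinetics-600 | MoViNet-A3 | Top-5 Accuracy | 80.8 | #49 |
| Action Classification | Kinetics-600 | MoViNet-A3 | GFLOPs | 56.9x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A4 | Top-1 Accuracy | 81.2 | #48 |
| Action Classification | Kinetics-600 | MoViNet-A4 | Top-5 Accuracy | 94.9 | #41 |
| Action Classification | Kinetics-600 | MoViNet-A4 | GFLOPs | 105x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A5 | Top-1 Accuracy | 82.7 | #43 |
| Action Classification | Kinetics-600 | MoViNet-A5 | Top-5 Accuracy | 95.7 | #33 |
| Action Classification | Kinetics-600 | MoViNet-A5 | GFLOPs | 281x1 | #1 |
| Action Classification | Kinetics-700 | MoViNet-A0 | Top-1 Accuracy | 58.5 | #30 |
| Action Classification | Kinetics-700 | MoViNet-A5 | Top-1 Accuracy | 71.7 | #21 |
| Action Classification | Kinetics-700 | MoViNet-A4 | Top-1 Accuracy | 70.7 | #23 |
| Action Classification | Kinetics-700 | MoViNet-A6 | Top-1 Accuracy | 72.3 | #20 |
| Action Classification | Kinetics-700 | MoViNet-A3 | Top-1 Accuracy | 68.0 | #26 |
| Action Classification | Kinetics-700 | MoViNet-A2 | Top-1 Accuracy | 66.7 | #28 |
| Action Classification | Kinetics-700 | MoViNet-A1 | Top-1 Accuracy | 63.5 | #29 |
| Action Classification | MiT | MoViNet-A0 | Top 1 Accuracy | 27.5 | #30 |
| Action Classification | MiT | MoViNet-A2 | Top 1 Accuracy | 34.3 | #18 |
| Action Classification | MiT | MoViNet-A3 | Top 1 Accuracy | 35.6 | #17 |
| Action Classification | MiT | MoViNet-A4 | Top 1 Accuracy | 37.9 | #14 |
| Action Classification | MiT | MoViNet-A5 | Top 1 Accuracy | 39.1 | #13 |
| Action Classification | MiT | MoViNet-A6 | Top 1 Accuracy | 40.2 | #12 |
| Action Classification | MiT | MoViNet-A1 | Top 1 Accuracy | 32.0 | #22 |
| Action Recognition | Something-Something V2 | MoViNet-A3 | Parameters | 5.3M | #5 |
| Action Recognition | Something-Something V2 | MoViNet-A3 | GFLOPs | 23.7x1 | #6 |
| Action Recognition | Something-Something V2 | MoViNet-A2 | Top-1 Accuracy | 63.5 | #96 |
| Action Recognition | Something-Something V2 | MoViNet-A2 | Top-5 Accuracy | 89.0 | #74 |
| Action Recognition | Something-Something V2 | MoViNet-A2 | Parameters | 4.8M | #6 |
| Action Recognition | Something-Something V2 | MoViNet-A2 | GFLOPs | 10.3x1 | #6 |
| Action Recognition | Something-Something V2 | MoViNet-A1 | Top-1 Accuracy | 62.7 | #99 |
| Action Recognition | Something-Something V2 | MoViNet-A1 | Top-5 Accuracy | 89.0 | #74 |
| Action Recognition | Something-Something V2 | MoViNet-A1 | Parameters | 4.6M | #7 |
| Action Recognition | Something-Something V2 | MoViNet-A1 | GFLOPs | 6.0x1 | #6 |
| Action Recognition | Something-Something V2 | MoViNet-A0 | Top-1 Accuracy | 61.3 | #107 |
| Action Recognition | Something-Something V2 | MoViNet-A0 | Top-5 Accuracy | 88.2 | #79 |
| Action Recognition | Something-Something V2 | MoViNet-A0 | Parameters | 3.1M | #11 |
| Action Recognition | Something-Something V2 | MoViNet-A0 | GFLOPs | 2.7x1 | #6 |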
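The cost entries in the table above use a compact form such as `281x1`. A small helper can turn these into plain numbers for comparison. This sketch assumes the usual reading of that notation, GFLOPs per view multiplied by the number of evaluation views; the page itself does not define it, and the helper name `total_gflops` is hypothetical. The three sample rows are copied from the Kinetics-600 results.

```python
def total_gflops(cost: str) -> float:
    """Parse an entry like '281x1' into total GFLOPs.

    Assumes the form '<GFLOPs per view>x<number of views>'.
    """
    per_view, views = cost.lower().split("x")
    return float(per_view) * int(views)


# (model, top-1 accuracy in %, cost entry) from the Kinetics-600 rows.
rows = [
    ("MoViNet-A0", 71.5, "2.7x1"),
    ("MoViNet-A2", 77.5, "10.3x1"),
    ("MoViNet-A5", 82.7, "281x1"),
]

for model, top1, cost in rows:
    print(f"{model}: {top1:.1f}% top-1 at {total_gflops(cost):.1f} GFLOPs")
```

Because every MoViNet entry here is evaluated on a single view, the total equals the per-view cost; the multiplication only matters when comparing against models that report several views.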

Methods