OmniVec: Learning robust representations with cross modal sharing

7 Nov 2023 · Siddharth Srivastava, Gaurav Sharma

The majority of research on learning-based methods has focused on designing and training networks for specific tasks. However, many learning-based tasks, across modalities, share commonalities and could potentially be tackled in a joint framework. We present an approach in this direction, learning multiple tasks in multiple modalities with a unified architecture. The proposed network is composed of task-specific encoders, a common trunk in the middle, and task-specific prediction heads. We first pre-train it with self-supervised masked training, followed by sequential training for the different tasks. We train the network on all major modalities, e.g. visual, audio, text and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, which allows us to achieve state-of-the-art results on most of the benchmarks. We also show generalization of the trained network on cross-modal tasks as well as unseen datasets and tasks.
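To make the encoder/trunk/head layout described in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class name `SharedTrunkModel`, the helpers `linear`, `apply` and `mask_input`, the modality and task names, and all dimensions are hypothetical, and random linear maps stand in for the real encoders and trunk. No training is performed; the sketch only shows how inputs of different sizes can be routed through one shared trunk, and how a random masking step for masked pre-training might look.

```python
import random


def linear(in_dim, out_dim, rng):
    """Random weight matrix with out_dim rows and in_dim columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, x, relu=True):
    """Matrix-vector product, optionally followed by a ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [max(0.0, v) for v in out] if relu else out


class SharedTrunkModel:
    """Toy layout (assumed): per-modality encoders, one shared trunk, per-task heads."""

    def __init__(self, modality_dims, task_dims, shared_dim=8, seed=0):
        rng = random.Random(seed)
        # One encoder per modality, each mapping into the same shared width.
        self.encoders = {m: linear(d, shared_dim, rng) for m, d in modality_dims.items()}
        # A single trunk reused by every modality and task.
        self.trunk = linear(shared_dim, shared_dim, rng)
        # One prediction head per task.
        self.heads = {t: linear(shared_dim, d, rng) for t, d in task_dims.items()}

    def forward(self, modality, task, x):
        h = apply(self.encoders[modality], x)
        h = apply(self.trunk, h)
        return apply(self.heads[task], h, relu=False)


def mask_input(x, ratio, rng):
    """Zero out a random subset of positions; return the masked copy and the masked indices."""
    n_mask = int(len(x) * ratio)
    idx = set(rng.sample(range(len(x)), n_mask))
    return [0.0 if i in idx else v for i, v in enumerate(x)], sorted(idx)


model = SharedTrunkModel(
    modality_dims={"image": 16, "audio": 12},
    task_dims={"image_cls": 5, "audio_cls": 3},
)
image_logits = model.forward("image", "image_cls", [0.5] * 16)
audio_logits = model.forward("audio", "audio_cls", [0.5] * 12)
masked, masked_idx = mask_input([1.0] * 16, ratio=0.5, rng=random.Random(1))
```

Both forward passes go through the same `trunk` weights, which is the part of the design where cross-modal sharing would take place; only the encoder and head differ per call.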

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Audio Classification | AudioSet | OmniVec | Test mAP | 0.548 | # 1 |
| Audio Classification | ESC-50 | OmniVec | Top-1 Accuracy | 98.4 | # 2 |
| Audio Classification | ESC-50 | OmniVec | Pre-training Dataset | Multiple | # 1 |
| Audio Classification | ESC-50 | OmniVec | Accuracy (5-fold) | 98.4 | # 2 |
| Image Classification | ImageNet | OmniVec(ViT) | Top 1 Accuracy | 92.4% | # 1 |
| Image Classification | iNaturalist 2018 | OmniVec | Top-1 Accuracy | 93.8 | # 1 |
| Action Classification | Kinetics-400 | OmniVec | Acc@1 | 91.1 | # 3 |
| 3D Point Cloud Classification | ModelNet40-C | OmniVec | Error Rate | 0.156 | # 1 |
| Action Classification | Moments in Time | OmniVec | Top 1 Accuracy | 49.8 | # 1 |
| Video Retrieval | MSR-VTT-1kA | OmniVec (pretrained) | text-to-video R@10 | 78.6 | # 38 |
| Video Retrieval | MSR-VTT-1kA | OmniVec | text-to-video R@10 | 89.4 | # 2 |
| Semantic Segmentation | NYU Depth v2 | OmniVec | Mean IoU | 60.8 | # 1 |
| Fine-Grained Image Classification | Oxford-IIIT Pet Dataset | OmniVec | Accuracy | 99.2 | # 1 |
| Image Classification | Places365 | OmniVec(ViT) | Top 1 Accuracy | 63.5 | # 1 |
| Semantic Segmentation | S3DIS Area5 | OmniVec | mIoU | 75.9 | # 1 |
| 3D Point Cloud Classification | ScanObjectNN | OmniVec | Overall Accuracy | 96.1 | # 1 |
| Action Recognition | UCF101 | OmniVec | 3-fold Accuracy | 99.6 | # 1 |
| Video Retrieval | YouCook2 | OmniVec | text-to-video R@10 | 70.8 | # 6 |
| Video Retrieval | YouCook2 | OmniVec (pretrained) | text-to-video R@10 | 64.2 | # 9 |
