TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Audio Classification	AudioSet	OmniVec	Test mAP	0.548	# 1
Audio Classification	ESC-50	OmniVec	Top-1 Accuracy	98.4	# 2
Audio Classification	ESC-50	OmniVec	PRE-TRAINING DATASET	Multiple	# 1
Audio Classification	ESC-50	OmniVec	Accuracy (5-fold)	98.4	# 2
Image Classification	ImageNet	OmniVec(ViT)	Top 1 Accuracy	92.4%	# 1
Image Classification	iNaturalist 2018	OmniVec	Top-1 Accuracy	93.8	# 1
Action Classification	Kinetics-400	OmniVec	Acc@1	91.1	# 3
3D Point Cloud Classification	ModelNet40-C	OmniVec	Error Rate	0.156	# 1
Action Classification	Moments in Time	OmniVec	Top 1 Accuracy	49.8	# 1
Video Retrieval	MSR-VTT-1kA	OmniVec (pretrained)	text-to-video R@10	78.6	# 38
Video Retrieval	MSR-VTT-1kA	OmniVec	text-to-video R@10	89.4	# 2
Semantic Segmentation	NYU Depth v2	OmniVec	Mean IoU	60.8	# 1
Fine-Grained Image Classification	Oxford-IIIT Pet Dataset	OmniVec	Accuracy	99.2	# 1
Image Classification	Places365	OmniVec(ViT)	Top 1 Accuracy	63.5	# 1
Semantic Segmentation	S3DIS Area5	OmniVec	mIoU	75.9	# 1
3D Point Cloud Classification	ScanObjectNN	OmniVec	Overall Accuracy	96.1	# 1
Action Recognition	UCF101	OmniVec	3-fold Accuracy	99.6	# 1
Video Retrieval	YouCook2	OmniVec	text-to-video R@10	70.8	# 6
Video Retrieval	YouCook2	OmniVec (pretrained)	text-to-video R@10	64.2	# 9
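
The Video Retrieval rows above report text-to-video R@10 (recall at rank 10). As a point of reference, here is a minimal sketch of how recall at K is commonly computed. It assumes a square similarity matrix in which text query i has exactly one correct video, video i. The function name `recall_at_k` and the toy scores are illustrative, not taken from the paper, and the function returns a fraction, whereas leaderboards typically show a percentage.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of text queries whose matching video is among the top-k scores.

    Assumes similarity[i][j] scores text query i against video j, and that
    video i is the single correct match for query i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not similarity:
        raise ValueError("similarity matrix must not be empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy example: 3 text queries against 3 videos (made-up scores).
scores = [
    [0.9, 0.1, 0.3],  # correct video 0 is ranked first
    [0.8, 0.2, 0.5],  # correct video 1 is ranked third
    [0.1, 0.4, 0.7],  # correct video 2 is ranked first
]
r_at_1 = recall_at_k(scores, k=1)
r_at_3 = recall_at_k(scores, k=3)
print(r_at_1, r_at_3)
```

With these toy scores, two of the three queries retrieve their video at rank 1, and all three do so within the top 3.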

OmniVec: Learning robust representations with cross modal sharing

7 Nov 2023 · Siddharth Srivastava, Gaurav Sharma

The majority of research in learning-based methods has focused on designing and training networks for specific tasks. However, many learning-based tasks, across modalities, share commonalities and could potentially be tackled in a joint framework. We present an approach in this direction that learns multiple tasks, in multiple modalities, with a unified architecture. The proposed network is composed of task-specific encoders, a common trunk in the middle, and task-specific prediction heads. We first pre-train it by self-supervised masked training, followed by sequential training for the different tasks. We train the network on all major modalities, e.g. visual, audio, text and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, which allows us to achieve state-of-the-art results on most of the benchmarks. We also show generalization of the trained network on cross-modal tasks as well as unseen datasets and tasks.
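
The layout described in the abstract (task-specific encoders, a common trunk, task-specific prediction heads, and masked inputs for pre-training) can be pictured with a small toy sketch. This is not the authors' implementation: every name below (`make_layer`, `SharedTrunkModel`, `mask_input`), the layer sizes, the random weights, and the example tasks are assumptions chosen for illustration, and no training is performed. It only shows how inputs of different sizes can pass through one shared component.

```python
import random
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def make_layer(in_dim: int, out_dim: int, rng: random.Random,
               relu: bool = True) -> Callable[[Vector], Vector]:
    """Random linear map (optionally followed by ReLU); a stand-in for a real sub-network."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def forward(x: Vector) -> Vector:
        if len(x) != in_dim:
            raise ValueError(f"expected {in_dim} inputs, got {len(x)}")
        out = [sum(w * v for w, v in zip(row, x)) for row in weights]
        return [max(0.0, v) for v in out] if relu else out

    return forward


class SharedTrunkModel:
    """Toy layout: task-specific encoder -> shared trunk -> task-specific head."""

    def __init__(self, tasks: Dict[str, Tuple[int, int]],
                 embed_dim: int = 8, seed: int = 0) -> None:
        # tasks maps a task name to (input size, output size); both are hypothetical.
        rng = random.Random(seed)
        self.encoders = {t: make_layer(d_in, embed_dim, rng) for t, (d_in, _) in tasks.items()}
        self.trunk = make_layer(embed_dim, embed_dim, rng)  # the only part shared by all tasks
        self.heads = {t: make_layer(embed_dim, d_out, rng, relu=False)
                      for t, (_, d_out) in tasks.items()}

    def forward(self, task: str, x: Vector) -> Vector:
        embedding = self.encoders[task](x)   # task-specific
        shared = self.trunk(embedding)       # common to every task
        return self.heads[task](shared)      # task-specific


def mask_input(x: Vector, mask_ratio: float, rng: random.Random) -> Vector:
    """Zero out a random subset of entries, as a masked pre-training step would hide them."""
    n_masked = int(len(x) * mask_ratio)
    hidden = set(rng.sample(range(len(x)), n_masked))
    return [0.0 if i in hidden else v for i, v in enumerate(x)]


model = SharedTrunkModel({"image_classification": (12, 5), "audio_classification": (6, 3)})
image_out = model.forward("image_classification", [0.1] * 12)
audio_out = model.forward("audio_classification", [0.2] * 6)
masked = mask_input([1.0] * 10, mask_ratio=0.5, rng=random.Random(0))
print(len(image_out), len(audio_out), masked.count(0.0))
```

Because both tasks route through the same `trunk`, any update to it during training would affect every task, which is the kind of cross-modal sharing the abstract refers to.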

Code

No code implementations yet.

Tasks

3D Point Cloud Classification

Action Classification

Audio Classification

Fine-Grained Image Classification

Image Classification

Semantic Segmentation

Datasets

ImageNet

UCF101

ModelNet

Kinetics

Places

NYUv2

HMDB51

Kinetics 400

AudioSet

MSR-VTT

iNaturalist

SUN RGB-D

S3DIS

ESC-50

ScanObjectNN

YouCook2

SAMSum

Oxford-IIIT Pet Dataset

Places365

Oxford-IIIT Pets

ModelNet40-C

Results from the Paper

Ranked #1 on Image Classification on ImageNet

Methods

No methods listed for this paper.
