TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Transfer 3D Point Cloud Classification	ModelNet40	ViT-Lens	Accuracy (%)	87.6	# 2
Zero-shot 3D classification	Objaverse LVIS	ViT-Lens	Top 1 Accuracy	52.0	# 4
Zero-Shot Transfer 3D Point Cloud Classification	ScanObjectNN	ViT-Lens	OBJ_ONLY Accuracy(%)	60.1	# 3

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vit-lens-towards-omni-modal-representations/zero-shot-transfer-3d-point-cloud)](https://paperswithcode.com/sota/zero-shot-transfer-3d-point-cloud?p=vit-lens-towards-omni-modal-representations)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vit-lens-towards-omni-modal-representations/zero-shot-transfer-3d-point-cloud-2)](https://paperswithcode.com/sota/zero-shot-transfer-3d-point-cloud-2?p=vit-lens-towards-omni-modal-representations)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vit-lens-towards-omni-modal-representations/zero-shot-3d-classification-on-objaverse-lvis)](https://paperswithcode.com/sota/zero-shot-3d-classification-on-objaverse-lvis?p=vit-lens-towards-omni-modal-representations)`

ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

20 Aug 2023 · Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou

Despite the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio) is limited by their reliance on large-scale data, which is expensive to collect or even unavailable for rare modalities. In this paper, we present ViT-Lens, which facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, a modality-specific lens is tuned to project multimodal signals into a shared embedding space; the projected signals are then processed by a strong ViT that carries pretrained image knowledge. The encoded multimodal representations are optimized to align with the modality-independent space pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning over a growing number of modalities, with two appealing benefits: (i) it exploits the pretrained ViT effectively across tasks and domains in a data-efficient regime; (ii) novel modalities gain emergent downstream capabilities thanks to the modality alignment space. We evaluate ViT-Lens on 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over the previous state of the art, with 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore, we enable zero-shot 3D question answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release results of ViT-Lens on more modalities in the near future.
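The alignment and zero-shot steps described in the abstract can be illustrated with a minimal sketch. It assumes a CLIP-style (InfoNCE) contrastive objective and cosine-similarity classification, neither of which is spelled out in the abstract. All function names are hypothetical, and plain Python lists stand in for the lens + ViT outputs and the foundation-model embeddings. This is not the authors' implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(queries, keys, temperature=0.07):
    """Pairwise cosine similarities, divided by a temperature."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    return [[sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
            for qi in q]

def alignment_loss(modal_embs, anchor_embs, temperature=0.07):
    """One-directional InfoNCE loss (assumed objective).

    The i-th new-modality embedding (e.g. 3D) is pulled toward the i-th
    anchor embedding (image or text from a frozen foundation model) and
    pushed away from the other anchors in the batch.
    """
    logits = cosine_logits(modal_embs, anchor_embs, temperature)
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_denominator - row[i]
    return total / len(logits)

def zero_shot_classify(modal_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar."""
    scores = cosine_logits([modal_emb], class_text_embs)[0]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy 2-D embeddings standing in for real encoder outputs.
point_cloud_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_anchors = [[1.0, 0.0], [0.0, 1.0]]
mismatched_anchors = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = alignment_loss(point_cloud_embs, matched_anchors)
loss_mismatched = alignment_loss(point_cloud_embs, mismatched_anchors)
predicted_class = zero_shot_classify([0.9, 0.1], matched_anchors)
```

In this toy setup the loss is lower when each embedding is paired with its own anchor than when the pairs are swapped, which is the signal that would tune the lens. Zero-shot classification then needs no 3D labels: class names are embedded as text and the nearest one is chosen.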

Code

TencentARC/ViT-Lens (official) · 130 stars

Tasks

3D Classification

Question Answering

Representation Learning

Training-free 3D Point Cloud Classification

Zero-shot 3D classification

Zero-Shot Transfer 3D Point Cloud Classification

Datasets

ShapeNet

ModelNet

LVIS

ScanObjectNN

Objaverse

ABO

3D-FUTURE

Results from the Paper

Ranked #2 on Zero-Shot Transfer 3D Point Cloud Classification on ModelNet40 (using extra training data)
