ImageBind: One Embedding Space To Bind Them All

We present ImageBind, an approach for learning a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and it extends their zero-shot capabilities to new modalities simply by using their natural pairing with images. It enables novel emergent applications "out of the box", including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state of the art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results that outperform prior work, and demonstrate that ImageBind serves as a new way to evaluate vision models on visual and non-visual tasks.
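The cross-modal retrieval and embedding-arithmetic applications described above can be sketched with simple vector operations. This is a minimal illustration, assuming each modality encoder already maps its input into the shared embedding space; the function names are hypothetical and numpy arrays stand in for real encoder outputs, which are not shown.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so that dot products
    # equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(query_emb, gallery_embs, k=1):
    # Cross-modal retrieval: rank gallery items (of any modality) by
    # cosine similarity to the query embedding in the shared space.
    sims = l2_normalize(gallery_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)[:k]

def compose(emb_a, emb_b):
    # Embedding-space arithmetic: summing two normalized embeddings
    # yields a query reflecting both inputs (e.g. an image of a beach
    # plus the sound of rain), which can then be passed to retrieve().
    return l2_normalize(l2_normalize(emb_a) + l2_normalize(emb_b))
```

Because all modalities land in one space, the same `retrieve` call works whether the query is text, audio, or a composed embedding.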

Published at CVPR 2023.
Benchmark results (leaderboard ranks as listed at time of scraping):

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Zero-shot Text to Audio Retrieval | AudioCaps | ImageBind | R@10 | 42.3 | #4 |
| Zero-shot Audio-to-Text Retrieval | AudioCaps | ImageBind | R@1 | 9.3 | #5 |
| Zero-shot Audio Classification | AudioSet | ImageBind | Test mAP | 17.6 | #3 |
| Zero-shot Text to Audio Retrieval | Clotho | ImageBind | text-to-audio R@1 | 6.0 | #6 |
| Zero-shot Text to Audio Retrieval | Clotho | ImageBind | text-to-audio R@10 | 28.4 | #4 |
| Zero-Shot Environment Sound Classification | ESC-50 | ImageBind | Accuracy | 66.9 | #5 |
| Zero-shot Classification (unified classes) | LLVIP | ImageBind | Balanced Accuracy | 63.4 | #2 |
| Zero-Shot Video Retrieval | MSR-VTT | ImageBind | text-to-video R@1 | 36.8 | #12 |
| Zero-Shot Video Retrieval | MSR-VTT | ImageBind | text-to-video R@5 | 61.8 | #11 |
| Zero-Shot Video Retrieval | MSR-VTT | ImageBind | text-to-video R@10 | 70.0 | #11 |
| Zero-shot Scene Classification (unified classes) | NYU Depth v2 | ImageBind | Balanced Accuracy | 54 | #2 |
| Zero-shot Audio Classification | VGG-Sound | ImageBind | Acc@1 | 27.8 | #4 |
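The emergent zero-shot classification numbers above are typically obtained by comparing a sample's embedding against text embeddings of the class names. The sketch below illustrates that recipe under the assumption that text and modality encoders share the joint space; the function name is hypothetical and the arrays stand in for real encoder outputs.

```python
import numpy as np

def zero_shot_classify(sample_emb, class_text_embs):
    # Embed each class name with the text encoder (rows of
    # class_text_embs), then pick the class whose text embedding is
    # most similar to the sample embedding (audio, depth, thermal,
    # or IMU) in the shared space. No task-specific training needed.
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = norm(class_text_embs) @ norm(sample_emb)
    return int(np.argmax(sims))
```

The same scoring rule applies across modalities, which is why a single image-aligned model can report results on audio, depth, and thermal benchmarks without per-task supervision.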
