Text-to-Audiovisual Retrieval
2 papers with code • 1 benchmark • 0 datasets
Most implemented papers
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner.
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundation model named VAST, which can perceive and process the vision, audio, and subtitle modalities of video, and which better supports a range of tasks, including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA).