text-to-audiovisual retrieval

2 papers with code • 1 benchmarks • 0 datasets

This task has no description! Would you like to contribute one?

Benchmarks

Add a Result

These leaderboards are used to track progress in text-to-audiovisual retrieval

Trend	Dataset	Best Model	Paper	Code	Compare
	VALOR-32K	VALOR			See all

Most implemented papers

Most implemented Social Latest No code

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

TXH-mercury/VALOR • • 17 Apr 2023

Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.

Paper
Code

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

txh-mercury/vast • • NeurIPS 2023

Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).