Text to Audio Retrieval
9 papers with code • 4 benchmarks • 5 datasets
Most implemented papers
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way for building a general representation model toward unlimited modalities.
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Audio Retrieval with Natural Language Queries
We consider the task of retrieving audio using free-form natural language queries.
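Systems for this task typically embed the text query and each candidate audio clip into a shared space and rank clips by similarity. The following is a minimal sketch of that retrieval step, assuming the embeddings already exist; the `retrieve` helper and the random vectors are purely illustrative, not code from any of the listed papers.

```python
import numpy as np

def retrieve(text_emb, audio_embs, k=3):
    """Rank gallery audio clips for one text query by cosine similarity.

    text_emb: (d,) query embedding; audio_embs: (n, d) gallery embeddings.
    Returns indices of the top-k clips, best first.
    """
    # L2-normalise so a dot product equals cosine similarity.
    q = text_emb / np.linalg.norm(text_emb)
    g = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    sims = g @ q                   # (n,) similarity scores
    return np.argsort(-sims)[:k]   # highest similarity first

# Toy example with random vectors standing in for encoder outputs
# (a real system would use, e.g., a text transformer and an audio
# transformer trained with a contrastive objective).
rng = np.random.default_rng(0)
audio = rng.normal(size=(10, 8))
query = audio[4] + 0.01 * rng.normal(size=8)  # query close to clip 4
print(retrieve(query, audio))                 # clip 4 should rank first
```

Retrieval metrics such as recall@k are then computed over these ranked lists.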
Audio Retrieval with Natural Language Queries: A Benchmark Study
Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho.
Cross Modal Retrieval with Querybank Normalisation
In this work we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding "hubness problem" in which a small number of gallery embeddings form the nearest neighbours of many queries.
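The hubness problem and the querybank idea can be illustrated with a simplified inverted-softmax normalisation: each gallery item's score for a test query is discounted by how strongly that item attracts a bank of held-out queries. This is only a sketch of the general idea (the paper's full method adds a dynamic variant); the function name, `beta` value, and toy numbers are illustrative assumptions.

```python
import numpy as np

def qb_norm(query_sims, bank_sims, beta=20.0):
    """Querybank-normalise retrieval scores (simplified sketch).

    query_sims: (n,) similarities of the test query to n gallery items.
    bank_sims:  (m, n) similarities of m querybank queries to the gallery.
    Dividing by each item's total attraction on the bank penalises "hub"
    gallery items that appear near many queries.
    """
    # Per-gallery-item normaliser: exponentiated attraction over the bank.
    normaliser = np.exp(beta * bank_sims).sum(axis=0)   # (n,)
    return np.exp(beta * query_sims) / normaliser

# Toy gallery where item 0 is a hub: highly similar to every bank query.
bank = np.array([[0.8, 0.2, 0.1],
                 [0.8, 0.1, 0.3],
                 [0.8, 0.3, 0.2]])
query = np.array([0.8, 0.75, 0.1])       # item 1 is the true match
print(np.argmax(query))                  # raw scores pick the hub: 0
print(np.argmax(qb_norm(query, bank)))   # normalised scores pick 1
```

Because the normalisation only rescales scores per gallery item, it can be applied at test time without retraining the joint embedding.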
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundation model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).
Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets
This work presents a text-to-audio retrieval system based on pre-trained text and spectrogram transformers.