Text to Audio Retrieval
9 papers with code • 4 benchmarks • 5 datasets
Most implemented papers
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way for building a general representation model toward unlimited modalities.
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Audio Retrieval with Natural Language Queries
We consider the task of retrieving audio using free-form natural language queries.
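Systems for this task typically embed the text query and each candidate audio clip into a shared space and rank clips by similarity. The following is a minimal sketch of that retrieval step, assuming the embeddings already exist; the `retrieve` helper and the random vectors are purely illustrative, not code from any of the listed papers.

```python
import numpy as np

def retrieve(text_emb, audio_embs, k=3):
    """Rank gallery audio clips for one text query by cosine similarity.

    text_emb: (d,) query embedding; audio_embs: (n, d) gallery embeddings.
    Returns indices of the top-k clips, best first.
    """
    # L2-normalise so a dot product equals cosine similarity.
    q = text_emb / np.linalg.norm(text_emb)
    g = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    sims = g @ q                   # (n,) similarity scores
    return np.argsort(-sims)[:k]   # highest similarity first

# Toy example with random vectors standing in for encoder outputs
# (a real system would use, e.g., a text transformer and an audio
# transformer trained with a contrastive objective).
rng = np.random.default_rng(0)
audio = rng.normal(size=(10, 8))
query = audio[4] + 0.01 * rng.normal(size=8)  # query close to clip 4
print(retrieve(query, audio))                 # clip 4 should rank first
```

Retrieval metrics such as recall@k are then computed over these ranked lists.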
Audio Retrieval with Natural Language Queries: A Benchmark Study
Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho.
Cross Modal Retrieval with Querybank Normalisation
In this work we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding "hubness problem" in which a small number of gallery embeddings form the nearest neighbours of many queries.
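The hubness problem and the querybank idea can be illustrated with a simplified inverted-softmax normalisation: each gallery item's score for a test query is discounted by how strongly that item attracts a bank of held-out queries. This is only a sketch of the general idea (the paper's full method adds a dynamic variant); the function name, `beta` value, and toy numbers are illustrative assumptions.

```python
import numpy as np

def qb_norm(query_sims, bank_sims, beta=20.0):
    """Querybank-normalise retrieval scores (simplified sketch).

    query_sims: (n,) similarities of the test query to n gallery items.
    bank_sims:  (m, n) similarities of m querybank queries to the gallery.
    Dividing by each item's total attraction on the bank penalises "hub"
    gallery items that appear near many queries.
    """
    # Per-gallery-item normaliser: exponentiated attraction over the bank.
    normaliser = np.exp(beta * bank_sims).sum(axis=0)   # (n,)
    return np.exp(beta * query_sims) / normaliser

# Toy gallery where item 0 is a hub: highly similar to every bank query.
bank = np.array([[0.8, 0.2, 0.1],
                 [0.8, 0.1, 0.3],
                 [0.8, 0.3, 0.2]])
query = np.array([0.8, 0.75, 0.1])       # item 1 is the true match
print(np.argmax(query))                  # raw scores pick the hub: 0
print(np.argmax(qb_norm(query, bank)))   # normalised scores pick 1
```

Because the normalisation only rescales scores per gallery item, it can be applied at test time without retraining the joint embedding.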
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundation model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).
Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets
This work presents a text-to-audio retrieval system based on pre-trained text and spectrogram transformers.