2 dataset results for audio-visual learning AND Videos AND English

Acappella comprises around 46 hours of a cappella solo singing videos sourced from YouTbe, sampled across different singers and languages. Four languages are considered: English, Spanish, Hindi and others.

5 PAPERS • NO BENCHMARKS YET

AVMIT (Audiovisual Moments in Time)

Audiovisual Moments in Time (AVMIT) is a large-scale dataset of audiovisual action events. The dataset includes the annotation of 57,177 audiovisual videos from the Moments in Time dataset, each independently evaluated by 3 of 11 trained participants. Each annotation pertains to whether the labelled audiovisual action event is present and whether it is the most prominent feature of the video. The dataset also provides a curated test set of 960 videos across 16 classes, suitable for comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.

1 PAPER • NO BENCHMARKS YET

Datasets

2 dataset results for audio-visual learning AND Videos AND English