BAVL (Blind Audio-Visual Localization)

The Blind Audio-Visual Localization (BAVL) dataset consists of 20 audio-visual recordings of sound sources, which can be talking faces or musical instruments. Most of the recordings (19) are videos from YouTube; the exception is V8, which comes from [1]. Video V7 was also used in [2][3], and V16 in [3]. All 20 videos were annotated by us in a uniform manner. Details of the video sequences are listed in Table 1.

The videos in the dataset have an average duration of 10 seconds, and each was recorded with a single camera and a single microphone. The audio files (.wav) are sampled at 16 kHz for V7, V8, and V16, and at 44.1 kHz for the rest. The video frames contain the sound-making object (sound source) as well as distracting objects (e.g., pedestrians on the street), while the audio signals consist of the sound produced by the source (human speech or instrumental music), environmental noise, and occasionally other sounds. Distracting objects and irrelevant noise/sounds are not present in every video. The primary purpose of the dataset is to evaluate the performance of sound source localization methods in the presence of distracting motion and noise.
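Because the recordings mix two sample rates, a common preprocessing step is to resample all audio to a single rate before evaluation. The sketch below illustrates this with librosa; the file naming (V1.wav ... V20.wav) and the helper name are assumptions for illustration, not part of the dataset release.

```python
import librosa  # assumed dependency; any resampling library would do

# Hypothetical layout: one .wav per clip, named V1.wav ... V20.wav.
NATIVE_16K = {"V7", "V8", "V16"}   # clips recorded at 16 kHz
TARGET_SR = 16000                  # normalize to the lowest common rate

def load_clip_audio(wav_path: str, target_sr: int = TARGET_SR):
    """Load one BAVL .wav file and resample it to a common rate.

    The dataset mixes 16 kHz (V7, V8, V16) and 44.1 kHz recordings,
    so normalizing the sample rate simplifies batched evaluation of
    sound source localization methods.
    """
    audio, sr = librosa.load(wav_path, sr=None)  # sr=None keeps the native rate
    if sr != target_sr:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    return audio, target_sr
```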

[1] Kidron, Einat, Yoav Y. Schechner, and Michael Elad. "Pixels that sound." Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 1. IEEE, 2005.

[2] Izadinia, Hamid, Imran Saleemi, and Mubarak Shah. "Multimodal analysis for identification and segmentation of moving-sounding objects." IEEE Transactions on Multimedia 15.2 (2013): 378-390.

[3] Li, Kai, Jun Ye, and Kien A. Hua. "What's making that sound?" Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014.

Source: Sound of Pixels

License


  • Unknown
