VGS (VideoGazeSpeech)

Gaze following has attracted considerable attention in recent years, yet existing datasets commonly lack audio information. In this work, we collect the first gaze-following dataset that includes audio, the VideoGazeSpeech (VGS) dataset. The dataset is used to evaluate our method and to encourage future research in multi-modal gaze following. It comprises a total of $35,231$ frames across $29$ videos. Each video has an average duration of approximately $20$ seconds and is recorded at $25$ frames per second (fps). The resolution of each video is $1280 \times 720$ pixels, and the entire dataset occupies $7.2$ GB of storage.
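Since no official data loader is listed for this dataset, the snippet below is only a minimal sketch of how the videos could be inspected against the stated statistics using OpenCV. The directory layout (VGS/videos/*.mp4) is a hypothetical assumption, not taken from the dataset description, and OpenCV decodes only the visual stream; the audio track would need to be handled separately (e.g., extracted with ffmpeg).

from pathlib import Path

import cv2  # pip install opencv-python

VIDEO_DIR = Path("VGS/videos")  # hypothetical layout: one .mp4 per video


def summarize(video_dir: Path) -> dict:
    """Count videos and frames, and report fps/resolution per video."""
    n_videos, n_frames = 0, 0
    for path in sorted(video_dir.glob("*.mp4")):
        cap = cv2.VideoCapture(str(path))
        if not cap.isOpened():
            continue
        n_videos += 1
        n_frames += int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        fps = cap.get(cv2.CAP_PROP_FPS)                   # expected ~25 fps
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))    # expected 1280
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))  # expected 720
        print(f"{path.name}: {fps:.1f} fps, {width}x{height}")
        cap.release()
    # Expected totals per the dataset description: 29 videos, 35,231 frames.
    return {"videos": n_videos, "frames": n_frames}


if __name__ == "__main__":
    print(summarize(VIDEO_DIR))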

License

  • Unknown
