Gaze following has attracted considerable attention in recent years, yet existing datasets commonly lack audio information. In this work, we collect the first gaze following dataset that contains audio, the VideoGazeSpeech dataset. We use this dataset to evaluate our method and to encourage future research on multi-modal gaze following. The dataset comprises $35,231$ frames across $29$ videos. Each video has an average duration of approximately $20$ seconds and is recorded at $25$ frames per second (fps). Every video has a resolution of $1280 \times 720$ pixels, and the entire dataset occupies $7.2$ GB of storage.