HiREST (HIerarchical REtrieval and STep-captioning) dataset is a benchmark that covers hierarchical information retrieval and visual/textual stepwise summarization from an instructional video corpus. It consists of 3.4K text-video pairs from a video dataset, where 1.1K videos have annotations of moment spans relevant to text query and breakdown of each moment into key instruction steps with caption and timestamps (totaling 8.6K step captions). The dataset consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
Source: Hierarchical Video-Moment Retrieval and Step-CaptioningPaper | Code | Results | Date | Stars |
---|