We use the Something-Something v2 dataset to obtain generation prompts and ground-truth masks from real action videos. We filter the dataset down to a set of 295 prompts; the details of this filtering are given in the paper "Peekaboo: Interactive Video Generation via Masked-Diffusion". We then use an off-the-shelf OWL-ViT-large open-vocabulary object detector to obtain bounding box (bbox) annotations of the object in each video. The resulting set of bbox and prompt pairs from real-world videos serves as a test bed for both the quality and the control of methods that generate realistic videos with spatio-temporal control.
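To make the shape of a bbox and prompt pair concrete, the following is a minimal sketch of how one sample might be represented and how its per-frame boxes could be rasterized into binary masks. The record layout, the names `PromptBoxSample`, `bbox_to_mask`, and `sample_to_masks`, the `(x0, y0, x1, y1)` pixel convention, and the example prompt are all assumptions made for illustration; they are not taken from the dataset release or the Peekaboo code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed convention: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
BBox = Tuple[int, int, int, int]


@dataclass
class PromptBoxSample:
    """Hypothetical record: one text prompt plus one bbox per video frame."""
    prompt: str
    boxes: List[BBox]


def bbox_to_mask(box: BBox, height: int, width: int) -> List[List[int]]:
    """Rasterize one bbox into a height x width binary mask (1 inside the box)."""
    x0, y0, x1, y1 = box
    # Clip the box to the frame so out-of-range detections do not fail.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (y0 <= y < y1 and x0 <= x < x1) else 0 for x in range(width)]
        for y in range(height)
    ]


def sample_to_masks(sample: PromptBoxSample, height: int, width: int) -> List[List[List[int]]]:
    """Build one binary mask per frame from the sample's per-frame boxes."""
    return [bbox_to_mask(box, height, width) for box in sample.boxes]


# Illustrative two-frame sample; the prompt text is made up for this sketch.
example = PromptBoxSample(
    prompt="pushing a cup from left to right",
    boxes=[(1, 1, 3, 3), (2, 1, 4, 3)],
)
example_masks = sample_to_masks(example, height=4, width=4)
```

Nested lists keep the sketch dependency-free; in practice an array library would be the more natural choice for frame-sized masks.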