Vript (🎬 Vript: Refine Video Captioning into Video Scripting)

We construct a fine-grained video-text dataset with 12K annotated high-resolution videos (~400k clips). The annotation of this dataset is inspired by video scripts. To make a video, one first writes a script that organizes how the scenes will be shot: for each scene, the script specifies the content, the shot type (medium shot, close-up, etc.), and how the camera moves (panning, tilting, etc.). We therefore extend video captioning to video scripting by annotating videos in the format of video scripts. Unlike previous video-text datasets, we densely annotate entire videos without discarding any scenes, and each scene receives a caption of ~145 words. Beyond the visual modality, we also transcribe the voice-over into text and provide it, together with the video title, as additional background information when annotating the videos.
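As a rough illustration of this scripting-style annotation, the sketch below shows what a single scene record might look like. The field names (`clip_id`, `shot_type`, `camera_movement`, `content`, `voiceover`), the sample values, and the helper `scene_to_script` are illustrative assumptions, not the dataset's actual schema.

```python
# A minimal sketch of one scene-level record in the video-scripting format.
# All field names and values here are illustrative assumptions, not the
# dataset's actual schema.
scene_annotation = {
    "video_id": "example_video",                    # hypothetical source video id
    "clip_id": "example_video_scene_003",           # one scene/clip within the video
    "video_title": "How to brew pour-over coffee",  # background info for annotation
    "voiceover": "Now pour the hot water in slow circles over the grounds.",
    "shot_type": "close-up",                        # e.g. medium shot, close-up
    "camera_movement": "panning",                   # e.g. panning, tilting, static
    "content": (
        "A close-up of a gooseneck kettle pouring hot water in slow circles over "
        "coffee grounds in a paper filter; steam rises as the pour stays steady."
        # the real captions average ~145 words per scene
    ),
}

def scene_to_script(scene: dict) -> str:
    """Render one annotated scene as a short script-style block of text."""
    return (
        f"[Scene {scene['clip_id']}]\n"
        f"Shot type: {scene['shot_type']} | Camera: {scene['camera_movement']}\n"
        f"Content: {scene['content']}\n"
    )

print(scene_to_script(scene_annotation))
```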

In addition, we propose Vript-Bench, a new benchmark consisting of three challenging video understanding tasks, each carefully double-checked by humans:

  • Vript-CAP (Caption): A captioning benchmark built on detailed captions rather than short ones.

  • Vript-RR (Retrieve then Reason): A video reasoning benchmark that first gives a detailed description of a scene as a hint and then asks questions about details in that scene.

  • Vript-ERO (Event Re-ordering): A benchmark that tests temporal understanding by providing descriptions of scenes drawn from two or four different points in the same video and asking the model to recover their correct temporal order (see the sketch after this list).
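To make the re-ordering task concrete, here is a minimal sketch of how a Vript-ERO-style example could be represented and scored. The example structure, the sample scenes, and the `exact_match` and `pairwise_accuracy` helpers are assumptions for illustration only; they do not reflect the benchmark's actual data format or official evaluation code.

```python
from itertools import combinations

# Hypothetical Vript-ERO-style example: scene descriptions shown in shuffled
# order, plus the ground-truth temporal order as indices into that list.
ero_example = {
    "video_id": "example_video",
    "shuffled_scenes": [
        "The finished cup of coffee is placed on the table.",
        "Hot water is poured over the coffee grounds.",
        "Coffee beans are ground in a hand grinder.",
        "A paper filter is rinsed and set in the dripper.",
    ],
    "correct_order": [2, 3, 1, 0],
}

def exact_match(predicted: list, gold: list) -> bool:
    """Strict metric: the full predicted ordering must equal the gold ordering."""
    return predicted == gold

def pairwise_accuracy(predicted: list, gold: list) -> float:
    """Softer metric: fraction of scene pairs placed in the correct relative order."""
    pred_rank = {scene: i for i, scene in enumerate(predicted)}
    pairs = list(combinations(gold, 2))  # gold pairs are already in temporal order
    correct = sum(pred_rank[a] < pred_rank[b] for a, b in pairs)
    return correct / len(pairs)

model_prediction = [2, 3, 0, 1]  # hypothetical model output
print(exact_match(model_prediction, ero_example["correct_order"]))        # False
print(pairwise_accuracy(model_prediction, ero_example["correct_order"]))  # ~0.83
```

The exact-match variant corresponds to a strict re-ordering score, while the pairwise variant gives partial credit when only part of the ordering is correct.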
