no code implementations • 30 Nov 2023 • Rohan Myer Krishnan, Zitian Tang, Zhiqiu Yu, Chen Sun
To do this, video-language models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize the understandings to novel domains.