21 Nov 2023 • Gengyuan Zhang, Jinhe Bi, Jindong Gu, Yanyu Chen, Volker Tresp
This raises a question: with such weak supervision, can the video representations in video-language models learn to distinguish even factual discrepancies in textual descriptions and to understand fine-grained events?