20 Feb 2024 • Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling
OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns.
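
The two-tracker idea can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`attend`, `update_dialog_state`), the use of plain dot-product attention, and the concatenation of the two tracker outputs into one state vector are all assumptions made for illustration, and the feature vectors are hand-picked toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Dot-product attention: pool `items` weighted by similarity to `query`."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    pooled = [
        sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)
    ]
    return pooled, weights


def update_dialog_state(question, object_feats, turn_feats):
    """Hypothetical global state: object-tracker output joined with language-tracker output."""
    # Stand-in for the OST: attend to the objects most relevant to the question.
    object_state, object_weights = attend(question, object_feats)
    # Stand-in for the LST: attend to the previous dialog turns most relevant to the question.
    language_state, turn_weights = attend(question, turn_feats)
    return {
        "state": object_state + language_state,  # list concatenation (assumed fusion)
        "object_weights": object_weights,
        "turn_weights": turn_weights,
    }


# Toy inputs: a 2-d question embedding, two video objects, two previous turns.
question = [1.0, 0.0]
object_feats = [[5.0, 0.0], [0.0, 5.0]]
turn_feats = [[0.0, 1.0], [3.0, 0.0]]
result = update_dialog_state(question, object_feats, turn_feats)
```

In this toy setup the first object and the second dialog turn align with the question, so they receive the larger attention weights, and the resulting state vector has one half per tracker.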