no code implementations • 24 Sep 2022 • Thao Minh Le
Visual perception and language understanding are fundamental components of human intelligence, enabling humans to understand and reason about objects and their interactions.
1 code implementation • 8 Jul 2022 • Hoang-Anh Pham, Thao Minh Le, Vuong Le, Tu Minh Phuong, Truyen Tran
To tackle these challenges, we present a new object-centric framework for video dialog that supports neural reasoning, dubbed COST, which stands for Conversation about Objects in Space-Time.
no code implementations • 25 May 2022 • Thao Minh Le, Vuong Le, Sunil Gupta, Svetha Venkatesh, Truyen Tran
This grounding guides the attention mechanism inside VQA models through a duality of mechanisms: pre-training the attention weight calculation, and directly guiding the weights at inference time on a case-by-case basis.
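The inference-time half of this duality can be sketched as interpolating a model's attention distribution toward an external grounding signal. The blending function, the `alpha` hyper-parameter, and the toy region weights below are all assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def guided_attention(scores, grounding, alpha=0.5):
    """Blend model attention with external grounding weights (hypothetical sketch).

    scores    : raw attention logits over image regions (model-computed)
    grounding : normalized grounding weights over the same regions
    alpha     : interpolation strength (assumed hyper-parameter)
    """
    attn = softmax(scores)
    blended = (1 - alpha) * attn + alpha * grounding
    return blended / blended.sum()  # renormalize to a distribution

# Toy example: 4 image regions; grounding shifts mass onto region 1.
scores = np.array([2.0, 0.5, 0.1, -1.0])
grounding = np.array([0.1, 0.6, 0.2, 0.1])
weights = guided_attention(scores, grounding)
```

At `alpha=0`, the model's own attention is returned unchanged; at `alpha=1`, the grounding fully overrides it, so `alpha` trades model evidence against external guidance.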
no code implementations • 25 Jun 2021 • Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran
Toward this goal, we propose an object-oriented reasoning approach in which the video is abstracted as a dynamic stream of interacting objects.
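One minimal reading of "a dynamic stream of interacting objects" is a tensor of per-frame object features plus a per-frame pairwise interaction map. The shapes, the dot-product affinity, and the random stand-in features below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical video: T frames, each with N detected objects of feature dim D.
T, N, D = 4, 3, 8
video_objects = rng.normal(size=(T, N, D))  # stand-in for detector features

def frame_interactions(objs):
    """Pairwise object-object affinity within one frame (dot-product sketch)."""
    return objs @ objs.T  # shape (N, N)

def temporal_stream(video_objects):
    """Abstract the video as a sequence of per-frame interaction maps."""
    return np.stack([frame_interactions(f) for f in video_objects])

stream = temporal_stream(video_objects)  # shape (T, N, N)
```

A reasoning module would then consume this (T, N, N) stream rather than raw pixels, which is what makes the representation object-centric.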
no code implementations • 12 Apr 2021 • Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran
Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors.
no code implementations • 18 Oct 2020 • Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran
Video QA challenges modelers on multiple fronts.
1 code implementation • 25 Sep 2020 • Tri Minh Nguyen, Thin Nguyen, Thao Minh Le, Truyen Tran
In addition, previous DTA methods learn protein representations solely from the small number of protein sequences in DTA datasets, neglecting proteins outside those datasets.
1 code implementation • 30 Apr 2020 • Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran
We present Language-binding Object Graph Network, the first neural reasoning method with dynamic relational structures across both visual and textual domains with applications in visual question answering.
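A "dynamic relational structure across visual and textual domains" could be sketched as an object graph whose edge weights are modulated per question by a language context vector. The pooling, the outer-product bias, and the random features below are illustrative assumptions, not the LOGNet architecture itself.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 6
objects = rng.normal(size=(4, D))  # visual object features (assumed)
words = rng.normal(size=(5, D))    # question word embeddings (assumed)

def softmax(x, axis=-1):
    """Row-wise numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def language_conditioned_graph(objects, words):
    """Object graph whose edges are biased by language relevance (sketch).

    Pool the word embeddings into a crude context vector, then bias each
    pairwise visual affinity by how strongly both endpoints relate to it.
    """
    ctx = words.mean(axis=0)                  # language summary vector
    relevance = objects @ ctx                 # per-object language affinity
    visual = objects @ objects.T              # pairwise visual affinity
    edges = visual + np.outer(relevance, relevance)
    return softmax(edges, axis=-1)            # row-normalized adjacency

adj = language_conditioned_graph(objects, words)
```

Because the adjacency depends on the question embeddings, a different question yields a different graph over the same objects, which is the sense in which the structure is dynamic.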
1 code implementation • CVPR 2020 • Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran
Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts.
Ranked #3 on Audio-Visual Question Answering (AVQA) on AVQA
no code implementations • 10 Jul 2019 • Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran
While recent advances in language and visual question answering have enabled sophisticated representations and neural reasoning mechanisms, major challenges in Video QA remain in the dynamic grounding of concepts, relations, and actions to support the reasoning process.
no code implementations • 12 Sep 2018 • Thao Minh Le, Nobuyuki Shimizu, Takashi Miyazaki, Koichi Shinoda
With the widespread use of intelligent systems such as smart speakers, addressee recognition has become a concern in human-computer interaction, as more and more people expect such systems to understand complicated social scenes, including those outdoors, in cafeterias, and in hospitals.
no code implementations • 30 May 2018 • Thao Minh Le, Nakamasa Inoue, Koichi Shinoda
This paper presents a new framework for human action recognition from a 3D skeleton sequence.
Ranked #99 on Skeleton Based Action Recognition on NTU RGB+D