1 code implementation • 4 Apr 2024 • Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding.
Ranked #3 on Zero-Shot Video Question Answer on TVQA
1 code implementation • 14 Oct 2023 • Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny
Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others.
Ranked #10 on Visual Question Answering on BenchLMM
no code implementations • 1 Jun 2023 • Jun Chen, Deyao Zhu, Guocheng Qian, Bernard Ghanem, Zhicheng Yan, Chenchen Zhu, Fanyi Xiao, Mohamed Elhoseiny, Sean Chang Culatana
Although these VL models have acquired extensive knowledge of visual concepts, it is non-trivial to exploit that knowledge for semantic segmentation, as they are usually trained at the image level.
5 code implementations • 20 Apr 2023 • Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
Our work, for the first time, uncovers that properly aligning visual features with an advanced large language model can yield numerous advanced multi-modal abilities demonstrated by GPT-4, such as generating detailed image descriptions and creating websites from hand-drawn drafts.
Ranked #9 on Visual Question Answering on BenchLMM
1 code implementation • 9 Apr 2023 • Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, Mohamed Elhoseiny
Video captioning aims to convey dynamic scenes from videos using natural language, facilitating the understanding of spatiotemporal information within our environment.
1 code implementation • 12 Mar 2023 • Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, Mohamed Elhoseiny
By continually acquiring new visual information from BLIP-2's answers, ChatCaptioner is able to generate richer image descriptions.
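The iterative scheme described above can be sketched as a simple dialog loop. This is an illustrative outline only, not the paper's implementation: `ask_question` and `answer_question` are hypothetical stand-ins for the questioner LLM and the VQA model (BLIP-2 in the paper).

```python
def chat_captioner_sketch(ask_question, answer_question, max_rounds=5):
    """Hypothetical sketch of a ChatCaptioner-style loop: a questioner
    model keeps asking about the image, a VQA model answers, and the
    accumulated dialog supplies material for an enriched caption.
    ask_question/answer_question are stand-in callables, not real APIs."""
    dialog = []
    for _ in range(max_rounds):
        question = ask_question(dialog)   # propose a question given the dialog so far
        answer = answer_question(question)  # VQA model answers from the image
        dialog.append((question, answer))
    return dialog
```

In practice the final caption would be produced by summarizing the returned dialog with the questioner model.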
1 code implementation • 30 Jan 2023 • Deyao Zhu, Yuhui Wang, Jürgen Schmidhuber, Mohamed Elhoseiny
In this paper, we investigate the potential of using action-free offline datasets to improve online reinforcement learning, and name this problem Reinforcement Learning with Action-Free Offline Pretraining (AFP-RL).
no code implementations • ICCV 2023 • Jun Chen, Deyao Zhu, Guocheng Qian, Bernard Ghanem, Zhicheng Yan, Chenchen Zhu, Fanyi Xiao, Sean Chang Culatana, Mohamed Elhoseiny
Semantic segmentation is a crucial task in computer vision that involves segmenting images into semantically meaningful regions at the pixel level.
1 code implementation • 9 Jun 2022 • Deyao Zhu, Li Erran Li, Mohamed Elhoseiny
In some complex environments with continuous state-action spaces, sparse rewards, and/or long temporal horizons, learning a good policy in the original environments can be difficult.
1 code implementation • 6 Mar 2022 • Abduallah Mohamed, Deyao Zhu, Warren Vu, Mohamed Elhoseiny, Christian Claudel
AMD is a metric that quantifies how close the whole set of generated samples is to the ground truth.
Ranked #1 on Trajectory Prediction on Stanford Drone (ADE (in world coordinates) metric)
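A distributional metric of this kind can be sketched as an average Mahalanobis-style distance from the ground truth to the distribution of generated samples. This is a minimal illustrative sketch under that assumption, not the paper's exact definition of AMD.

```python
import numpy as np

def amd_sketch(generated, ground_truth):
    """Hedged sketch: Mahalanobis distance from the ground-truth point
    to the empirical distribution of generated samples, so that a
    prediction set is judged as a whole rather than by its best sample.
    generated: (n_samples, dim) array; ground_truth: (dim,) array."""
    mu = generated.mean(axis=0)
    cov = np.cov(generated, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize for invertibility
    diff = ground_truth - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

A lower value means the generated distribution as a whole sits closer to the ground truth, which is the property the abstract attributes to AMD.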
no code implementations • 29 Sep 2021 • Deyao Zhu, Li Erran Li, Mohamed Elhoseiny
Deep reinforcement learning agents trained to perform manipulation tasks in real-world environments with limited diversity of object properties tend to overfit and fail to generalize to unseen testing environments.
1 code implementation • CVPR 2022 • Jun Chen, Aniket Agarwal, Sherif Abdelkarim, Deyao Zhu, Mohamed Elhoseiny
This paper shows that modeling an effective message-passing flow through an attention mechanism can be critical to tackling the compositionality and long-tail challenges in visual relationship recognition (VRR).
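The core idea of attention-based message passing can be illustrated in a few lines. This is a generic sketch of scaled dot-product attention over a set of node features, not the paper's architecture; the learned query/key/value projections are replaced by identities for brevity.

```python
import numpy as np

def attention_message_passing(node_feats):
    """Illustrative sketch (not the paper's model): each node aggregates
    messages from all nodes, weighted by softmax-normalized dot-product
    affinity, so information can flow along the most relevant edges.
    node_feats: (n_nodes, dim) array of node representations."""
    d = node_feats.shape[-1]
    scores = node_feats @ node_feats.T / np.sqrt(d)  # pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ node_feats                      # updated node features
```

In a full model these projections would be learned and the update repeated over several layers, letting rare (long-tail) relationships borrow evidence from related, more frequent ones.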
no code implementations • ICLR 2021 • Deyao Zhu, Mohamed Zahran, Li Erran Li, Mohamed Elhoseiny
Our model's learned representation leads to better and more semantically meaningful coverage of the trajectory distribution.
no code implementations • 1 Jan 2021 • Deyao Zhu, Mohamed Zahran, Li Erran Li, Mohamed Elhoseiny
We propose a new objective, unlikelihood training, which forces generated trajectories that conflict with contextual information to be assigned a lower probability by our model.
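The objective described above combines a standard likelihood term with a penalty on implausible samples. The sketch below is illustrative of the general unlikelihood-training idea, not the paper's exact formulation; the probabilities and the `alpha` weight are hypothetical inputs.

```python
import math

def unlikelihood_loss(p_observed, p_conflicting, alpha=1.0):
    """Hedged sketch of an unlikelihood-style objective: maximize the
    likelihood of the observed trajectory while pushing down the
    probability the model assigns to trajectories that conflict with
    context (e.g. leave the road or collide with another agent).
    p_observed: model probability of the ground-truth trajectory.
    p_conflicting: probabilities assigned to conflicting samples."""
    eps = 1e-12  # guard against log(0)
    nll = -math.log(p_observed + eps)  # standard likelihood term
    # Unlikelihood term: -log(1 - p) grows as the model puts more
    # mass on a conflicting trajectory, driving that mass down.
    unlikelihood = -sum(math.log(1.0 - p + eps) for p in p_conflicting)
    return nll + alpha * unlikelihood
```

Minimizing this loss rewards a model that concentrates probability on the observed trajectory and away from context-violating ones.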