no code implementations • 15 Apr 2024 • Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal
Ctrl-Adapter provides diverse capabilities including image control, video control, video control with sparse frames, multi-condition control, compatibility with different backbones, adaptation to unseen control conditions, and video editing.
no code implementations • 18 Mar 2024 • Abhay Zala, Jaemin Cho, Han Lin, Jaehong Yoon, Mohit Bansal
Instead of directly employing LLMs as agents, can we use LLMs' reasoning capabilities to adaptively create training environments to help smaller embodied RL agents learn useful skills that they are weak at?
no code implementations • 18 Oct 2023 • Abhay Zala, Han Lin, Jaemin Cho, Mohit Bansal
In the first stage, we use LLMs to generate and iteratively refine 'diagram plans' (in a planner-auditor feedback loop) which describe all the entities (objects and text labels), their relationships (arrows or lines), and their bounding box layouts.
no code implementations • 26 Sep 2023 • Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
Our experiments demonstrate that VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving competitive performance with SOTAs in open-domain single-scene T2V generation.
no code implementations • 24 May 2023 • Jaemin Cho, Abhay Zala, Mohit Bansal
First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation.
1 code implementation • CVPR 2023 • Abhay Zala, Jaemin Cho, Satwik Kottur, Xilun Chen, Barlas Oğuz, Yasher Mehdad, Mohit Bansal
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
1 code implementation • NAACL 2022 • Hyounghun Kim, Abhay Zala, Mohit Bansal
Next, a counterfactual imagined scene change (in textual form) is applied, and the model has to predict the new response to the initial question based on this scene change.
2 code implementations • ICCV 2023 • Jaemin Cho, Abhay Zala, Mohit Bansal
In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models.
1 code implementation • 4 Apr 2021 • Hyounghun Kim, Abhay Zala, Graham Burri, Mohit Bansal
During the correctional-captioning task, models must generate descriptions of how to move from the current to target pose image, whereas in the retrieval task, models should select the correct target pose given the initial pose and correctional description.
no code implementations • Findings of the Association for Computational Linguistics 2020 • Hyounghun Kim, Abhay Zala, Graham Burri, Hao Tan, Mohit Bansal
During this task, the agent (similar to a PokeMON GO player) is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment, but then also ARRAnge the collected objects part-by-part in an egocentric grid-layout environment.