no code implementations • 18 Feb 2024 • Tanzila Rahman, Shweta Mahajan, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Leonid Sigal
We illustrate that such joint alternating refinement leads to the learning of better tokens for concepts and, as a by-product, latent masks.
no code implementations • 19 Dec 2023 • Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, Leonid Sigal
Further, we leverage the findings that different timesteps of the diffusion process cater to different levels of detail in an image.
1 code implementation • CVPR 2023 • Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, Leonid Sigal
Our experiments for story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state of the art in generating frames with high visual quality that are consistent with the story, but also models appropriate correspondences between the characters and the background.
1 code implementation • NeurIPS 2021 • Tanzila Rahman, Mengyu Yang, Leonid Sigal
In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, with the use of flexible co-attention.
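The tri-modal co-attention idea can be sketched minimally: queries come from one modality while keys and values come from another, so each stream is refined with context from the other two. The sketch below is an illustrative assumption, not the TriBERT implementation (which uses learned projections and transformer layers); all names and dimensions here are hypothetical.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention (illustrative, no learned weights).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def co_attend(x, y):
    # Co-attention: queries from modality x, keys/values from modality y.
    return attention(x, y, y)

rng = np.random.default_rng(0)
d = 8  # hypothetical feature dimension
vision, pose, audio = (rng.standard_normal((4, d)) for _ in range(3))

# Each modality gathers context from the other two; the real model
# combines these with learned projections rather than a plain sum.
vision_ctx = co_attend(vision, pose) + co_attend(vision, audio)
print(vision_ctx.shape)
```

The same `co_attend` pattern would be applied symmetrically to the pose and audio streams.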
no code implementations • 25 Mar 2021 • Tanzila Rahman, Leonid Sigal
Learning to localize and separate individual object sounds in the audio channel of a video is a difficult task.
1 code implementation • 4 Nov 2020 • Tanzila Rahman, Shih-Han Chou, Leonid Sigal, Giuseppe Carenini
We also propose a multimodal fusion module to combine both visual and textual information.
no code implementations • ICCV 2019 • Tanzila Rahman, Bicheng Xu, Leonid Sigal
Multi-modal learning, particularly among imaging and linguistic modalities, has made remarkable strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning.
no code implementations • 9 Apr 2019 • Tanzila Rahman, Mrigank Rochan, Yang Wang
A common approach for person re-identification is to first extract image features for all frames in the video, then aggregate all the features to form a video-level feature.
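The simplest instance of this pipeline is temporal average pooling: extract one feature vector per frame, then mean-pool over time to get a single video-level feature. The snippet below is a minimal sketch of that baseline, assuming hypothetical frame features; the paper studies learned aggregation beyond this.

```python
import numpy as np

# Hypothetical per-frame features: T frames, each a D-dim embedding
# (in practice these would come from a CNN backbone).
T, D = 10, 16
rng = np.random.default_rng(1)
frame_feats = rng.standard_normal((T, D))

# Baseline aggregation: average pooling over the time axis collapses
# the T frame features into one video-level descriptor.
video_feat = frame_feats.mean(axis=0)
print(video_feat.shape)
```

Video-level features aggregated this way can then be compared with a distance metric (e.g. cosine or Euclidean) to match identities across cameras.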
1 code implementation • 26 Oct 2018 • Shivansh Rao, Tanzila Rahman, Mrigank Rochan, Yang Wang
The goal is to identify a person from videos captured under different cameras.