no code implementations • 4 Jan 2024 • Zihao Xiao, Longlong Jing, Shangxuan Wu, Alex Zihao Zhu, Jingwei Ji, Chiyu Max Jiang, Wei-Chih Hung, Thomas Funkhouser, Weicheng Kuo, Anelia Angelova, Yin Zhou, Shiwei Sheng
3D panoptic segmentation is a challenging perception task, especially in autonomous driving.
1 code implementation • 29 Sep 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection.
Ranked #1 on Open Vocabulary Object Detection on LVIS v1.0
no code implementations • ICCV 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD).
Ranked #5 on Open Vocabulary Object Detection on LVIS v1.0
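The masking idea behind contrastive feature masking can be illustrated with a toy sketch: randomly drop a fraction of patch-token embeddings during pretraining, keeping track of which positions were masked. This is a minimal illustration of the masking step only, not the CFM-ViT objective; the function name and shapes are hypothetical.

```python
import numpy as np

def mask_patch_tokens(tokens, mask_ratio, rng):
    """Randomly drop a fraction of patch tokens, as in masked pretraining.

    tokens: (num_patches, dim) array of patch embeddings.
    Returns (kept_tokens, kept_idx, masked_idx).
    """
    n = tokens.shape[0]
    n_mask = int(n * mask_ratio)
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_mask])
    kept_idx = np.sort(perm[n_mask:])
    return tokens[kept_idx], kept_idx, masked_idx

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))  # 14x14 patch grid, toy embedding dim
kept, kept_idx, masked_idx = mask_patch_tokens(tokens, 0.75, rng)
```

In the paper's setting, the surviving tokens would feed both the contrastive image-text objective and a reconstruction-style objective over the masked positions.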
2 code implementations • CVPR 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.
Ranked #5 on Zero-Shot Cross-Modal Retrieval on Flickr30k
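One region-aware ingredient in this line of work is to treat each pretraining image as a random crop of a larger scene by cropping and resizing the positional embeddings. The sketch below shows that operation with a nearest-neighbor resize in numpy; the function name and crop parameterization are illustrative, not the paper's exact recipe.

```python
import numpy as np

def cropped_positional_embedding(pos_grid, crop_box):
    """Crop a region of the full positional-embedding grid and resize it
    back to the full grid size (nearest-neighbor), so positions behave as
    if the image were a random crop of a larger scene.

    pos_grid: (H, W, dim) grid of positional embeddings.
    crop_box: (y0, x0, h, w) in grid cells.
    """
    H, W, _ = pos_grid.shape
    y0, x0, h, w = crop_box
    crop = pos_grid[y0:y0 + h, x0:x0 + w]
    # Nearest-neighbor upsample of the crop back to the full (H, W) grid.
    ys = np.arange(H) * h // H
    xs = np.arange(W) * w // W
    return crop[np.ix_(ys, xs)]

pos_grid = np.random.default_rng(0).normal(size=(14, 14, 64))
out = cropped_positional_embedding(pos_grid, (2, 3, 7, 7))
```

The intuition is that downstream detection consumes region crops, so pretraining positions should not be tied to whole images only.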
no code implementations • 12 Apr 2023 • Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo
We present RECLIP (Resource-efficient CLIP), a simple method that minimizes computational resource footprint for CLIP (Contrastive Language Image Pretraining).
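A back-of-the-envelope way to see why training on small images saves resources: a ViT's token count shrinks quadratically with image side length, and self-attention cost grows quadratically in the token count. This is a rough cost sketch, not RECLIP's actual training schedule.

```python
def vit_patch_tokens(image_size, patch_size):
    """Number of patch tokens a ViT produces for a square image."""
    return (image_size // patch_size) ** 2

full = vit_patch_tokens(224, 16)   # standard-resolution pretraining
small = vit_patch_tokens(64, 16)   # low-resolution pretraining

# Self-attention FLOPs scale roughly with tokens**2, so the attention-cost
# ratio between the two settings is approximately:
attn_ratio = (full / small) ** 2
```

Even accounting for the linear (MLP) terms, shrinking inputs this way yields order-of-magnitude savings per training step.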
no code implementations • 29 Mar 2023 • Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova
We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.

Ranked #1 on Video Captioning on MSVD
1 code implementation • CVPR 2023 • AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
We present a simple approach that turns a ViT encoder into an efficient video model and works seamlessly with both image and video inputs.
Ranked #2 on Action Classification on Kinetics-600 (using extra training data)
1 code implementation • 30 Sep 2022 • Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.
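Building a detector on a frozen vision-language model typically means the detector head and the VLM each score candidate regions, and the two scores are combined at inference. A common fusion rule is a geometric mean; the sketch below shows that rule only, and the specific weighting is an assumption rather than F-VLM's exact configuration (which weights base and novel categories differently).

```python
def fuse_scores(det_score, vlm_score, alpha=0.5):
    """Geometric-mean fusion of a detection score with a frozen-VLM
    region classification score. alpha controls how much the VLM score
    dominates; an illustrative default, not the paper's setting.
    """
    return det_score ** (1.0 - alpha) * vlm_score ** alpha

fused = fuse_scores(0.8, 0.5)
```

The appeal of this design is that the VLM backbone needs no detection-specific finetuning, preserving its open-vocabulary knowledge.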
1 code implementation • 14 Sep 2022 • Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages.
no code implementations • 9 Sep 2022 • AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
no code implementations • 1 Aug 2022 • AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova
Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video.
Ranked #4 on Video Question Answering on iVQA
no code implementations • 2 May 2022 • AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, Anelia Angelova
We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning.
no code implementations • 31 Mar 2022 • Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova
We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection.
no code implementations • ICCV 2021 • Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai
3D perception of object shapes from RGB image input is fundamental to semantic scene understanding, grounding image-based perception in our spatially three-dimensional real-world environments.
3 code implementations • 15 Aug 2021 • Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo
In this paper, we identify the core problem: the binary classifiers in existing proposal methods tend to overfit to the training categories.
Ranked #2 on Open World Object Detection on COCO VOC to non-VOC
4 code implementations • ICLR 2022 • Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui
On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP.
Ranked #2 on Open Vocabulary Object Detection on Objects365
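The open-vocabulary classification head used in ViLD-style detectors can be sketched simply: L2-normalize region embeddings and the text embeddings of category names, take cosine similarities as logits, and apply a temperature-scaled softmax. The data and temperature below are illustrative; the real method distills a pretrained image-text model into the region embeddings.

```python
import numpy as np

def classify_regions(region_embs, text_embs, temperature=0.01):
    """Cosine-similarity classification of region embeddings against
    text embeddings of category names (open-vocabulary head sketch)."""
    r = region_embs / np.linalg.norm(region_embs, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    logits = r @ t.T / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
text_embs = rng.normal(size=(3, 8))               # 3 toy category names
region_embs = text_embs[[2, 0]] + 0.05 * rng.normal(size=(2, 8))
probs = classify_regions(region_embs, text_embs)
```

Because categories enter only through their text embeddings, the same head classifies novel categories at test time without retraining.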