no code implementations • 4 Jan 2024 • Zihao Xiao, Longlong Jing, Shangxuan Wu, Alex Zihao Zhu, Jingwei Ji, Chiyu Max Jiang, Wei-Chih Hung, Thomas Funkhouser, Weicheng Kuo, Anelia Angelova, Yin Zhou, Shiwei Sheng
3D panoptic segmentation is a challenging perception task, especially in autonomous driving.
1 code implementation • 29 Sep 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection.
Ranked #1 on Open Vocabulary Object Detection on LVIS v1.0
no code implementations • ICCV 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD).
Ranked #5 on Open Vocabulary Object Detection on LVIS v1.0
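The masking idea behind contrastive feature masking can be illustrated with a toy sketch: randomly drop a fraction of patch-token embeddings during pretraining, keeping track of which positions were masked. This is a minimal illustration of the masking step only, not the CFM-ViT objective; the function name and shapes are hypothetical.

```python
import numpy as np

def mask_patch_tokens(tokens, mask_ratio, rng):
    """Randomly drop a fraction of patch tokens, as in masked pretraining.

    tokens: (num_patches, dim) array of patch embeddings.
    Returns (kept_tokens, kept_idx, masked_idx).
    """
    n = tokens.shape[0]
    n_mask = int(n * mask_ratio)
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_mask])
    kept_idx = np.sort(perm[n_mask:])
    return tokens[kept_idx], kept_idx, masked_idx

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))  # 14x14 patch grid, toy embedding dim
kept, kept_idx, masked_idx = mask_patch_tokens(tokens, 0.75, rng)
```

In the paper's setting, the surviving tokens would feed both the contrastive image-text objective and a reconstruction-style objective over the masked positions.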
2 code implementations • CVPR 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.
Ranked #5 on Zero-Shot Cross-Modal Retrieval on Flickr30k
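One region-aware ingredient in this line of work is to treat each pretraining image as a random crop of a larger scene by cropping and resizing the positional embeddings. The sketch below shows that operation with a nearest-neighbor resize in numpy; the function name and crop parameterization are illustrative, not the paper's exact recipe.

```python
import numpy as np

def cropped_positional_embedding(pos_grid, crop_box):
    """Crop a region of the full positional-embedding grid and resize it
    back to the full grid size (nearest-neighbor), so positions behave as
    if the image were a random crop of a larger scene.

    pos_grid: (H, W, dim) grid of positional embeddings.
    crop_box: (y0, x0, h, w) in grid cells.
    """
    H, W, _ = pos_grid.shape
    y0, x0, h, w = crop_box
    crop = pos_grid[y0:y0 + h, x0:x0 + w]
    # Nearest-neighbor upsample of the crop back to the full (H, W) grid.
    ys = np.arange(H) * h // H
    xs = np.arange(W) * w // W
    return crop[np.ix_(ys, xs)]

pos_grid = np.random.default_rng(0).normal(size=(14, 14, 64))
out = cropped_positional_embedding(pos_grid, (2, 3, 7, 7))
```

The intuition is that downstream detection consumes region crops, so pretraining positions should not be tied to whole images only.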
no code implementations • 12 Apr 2023 • Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo
We present RECLIP (Resource-efficient CLIP), a simple method that minimizes computational resource footprint for CLIP (Contrastive Language Image Pretraining).
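A back-of-the-envelope way to see why training on small images saves resources: a ViT's token count shrinks quadratically with image side length, and self-attention cost grows quadratically in the token count. This is a rough cost sketch, not RECLIP's actual training schedule.

```python
def vit_patch_tokens(image_size, patch_size):
    """Number of patch tokens a ViT produces for a square image."""
    return (image_size // patch_size) ** 2

full = vit_patch_tokens(224, 16)   # standard-resolution pretraining
small = vit_patch_tokens(64, 16)   # low-resolution pretraining

# Self-attention FLOPs scale roughly with tokens**2, so the attention-cost
# ratio between the two settings is approximately:
attn_ratio = (full / small) ** 2
```

Even accounting for the linear (MLP) terms, shrinking inputs this way yields order-of-magnitude savings per training step.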
no code implementations • 29 Mar 2023 • Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova
We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.

Ranked #1 on Video Captioning on MSVD
1 code implementation • CVPR 2023 • AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
We present a simple approach that turns a ViT encoder into an efficient video model and works seamlessly with both image and video inputs.
Ranked #2 on Action Classification on Kinetics-600 (using extra training data)
1 code implementation • 30 Sep 2022 • Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.
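Building a detector on a frozen vision-language model typically means the detector head and the VLM each score candidate regions, and the two scores are combined at inference. A common fusion rule is a geometric mean; the sketch below shows that rule only, and the specific weighting is an assumption rather than F-VLM's exact configuration (which weights base and novel categories differently).

```python
def fuse_scores(det_score, vlm_score, alpha=0.5):
    """Geometric-mean fusion of a detection score with a frozen-VLM
    region classification score. alpha controls how much the VLM score
    dominates; an illustrative default, not the paper's setting.
    """
    return det_score ** (1.0 - alpha) * vlm_score ** alpha

fused = fuse_scores(0.8, 0.5)
```

The appeal of this design is that the VLM backbone needs no detection-specific finetuning, preserving its open-vocabulary knowledge.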
1 code implementation • 14 Sep 2022 • Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages.
no code implementations • 9 Sep 2022 • AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
no code implementations • 1 Aug 2022 • AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova
Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video.
Ranked #4 on Video Question Answering on iVQA
no code implementations • 2 May 2022 • AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, Anelia Angelova
We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning.
no code implementations • 31 Mar 2022 • Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova
We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection.
no code implementations • ICCV 2021 • Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai
3D perception of object shapes from RGB image input is fundamental to semantic scene understanding, grounding image-based perception in our spatially three-dimensional real-world environments.
3 code implementations • 15 Aug 2021 • Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo
In this paper, we identify the core problem: the binary classifiers in existing proposal methods tend to overfit to the training categories.
Ranked #2 on Open World Object Detection on COCO VOC to non-VOC
4 code implementations • ICLR 2022 • Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui
On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP.
Ranked #2 on Open Vocabulary Object Detection on Objects365
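The open-vocabulary classification head used in ViLD-style detectors can be sketched simply: L2-normalize region embeddings and the text embeddings of category names, take cosine similarities as logits, and apply a temperature-scaled softmax. The data and temperature below are illustrative; the real method distills a pretrained image-text model into the region embeddings.

```python
import numpy as np

def classify_regions(region_embs, text_embs, temperature=0.01):
    """Cosine-similarity classification of region embeddings against
    text embeddings of category names (open-vocabulary head sketch)."""
    r = region_embs / np.linalg.norm(region_embs, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    logits = r @ t.T / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
text_embs = rng.normal(size=(3, 8))               # 3 toy category names
region_embs = text_embs[[2, 0]] + 0.05 * rng.normal(size=(2, 8))
probs = classify_regions(region_embs, text_embs)
```

Because categories enter only through their text embeddings, the same head classifies novel categories at test time without retraining.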