Search Results for author: Weicheng Kuo

Found 16 papers, 7 papers with code

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

1 code implementation · 29 Sep 2023 · Dahun Kim, Anelia Angelova, Weicheng Kuo

We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection.

Tasks: Contrastive Learning, Object, +2 more
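
The entry mentions contrastive image-text pretraining as the image-level starting point the detection-oriented recipe builds on. As a hedged illustration (not the paper's actual objective), here is a minimal symmetric image-text contrastive loss in PyTorch; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # Cosine similarities between every image and text in the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: matched pairs sit on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = clip_style_loss(torch.randn(4, 512), torch.randn(4, 512))
```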

Contrastive Feature Masking Open-Vocabulary Vision Transformer

no code implementations · ICCV 2023 · Dahun Kim, Anelia Angelova, Weicheng Kuo

We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that jointly learns image- and region-level representations for open-vocabulary object detection (OVD).

Tasks: Contrastive Learning, object-detection, +3 more
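
The name points to masking in feature space during image-text pretraining. A minimal sketch of that general idea, assuming a student/target pair of patch-token tensors and an arbitrary mask ratio (both illustrative, not the paper's recipe):

```python
import torch
import torch.nn.functional as F

def masked_feature_loss(student_tokens, target_tokens, mask_ratio=0.5):
    # student_tokens, target_tokens: (B, N, D) patch features.
    B, N, _ = student_tokens.shape
    num_masked = int(N * mask_ratio)
    # Pick random token positions to treat as masked.
    idx = torch.rand(B, N, device=student_tokens.device).argsort(dim=1)
    idx = idx[:, :num_masked]
    rows = torch.arange(B, device=student_tokens.device).unsqueeze(1)
    # Regress the student's masked tokens onto the target features.
    return F.mse_loss(student_tokens[rows, idx], target_tokens[rows, idx])

loss = masked_feature_loss(torch.randn(2, 49, 256), torch.randn(2, 49, 256))
```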

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

2 code implementations · CVPR 2023 · Dahun Kim, Anelia Angelova, Weicheng Kuo

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.

Tasks: Contrastive Learning, object-detection, +4 more
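
The "region-aware" part of RO-ViT hinges on making pretrained positional embeddings resemble the region crops a detector sees. Below is a sketch of cropping and resizing a positional-embedding grid; the grid size and crop sampling are simplified placeholders:

```python
import torch
import torch.nn.functional as F

def cropped_pos_embed(pos_embed, grid=14):
    # pos_embed: (1, grid*grid, D) learned positional embeddings.
    D = pos_embed.size(-1)
    pe = pos_embed.reshape(1, grid, grid, D).permute(0, 3, 1, 2)  # (1,D,H,W)
    # Sample a random crop covering at least half the grid on each side.
    h = int(torch.randint(grid // 2, grid + 1, (1,)))
    w = int(torch.randint(grid // 2, grid + 1, (1,)))
    top = int(torch.randint(0, grid - h + 1, (1,)))
    left = int(torch.randint(0, grid - w + 1, (1,)))
    crop = pe[:, :, top:top + h, left:left + w]
    # Resize back to the full grid, mimicking a region blown up to
    # full image size at detection time.
    crop = F.interpolate(crop, size=(grid, grid), mode="bilinear",
                         align_corners=False)
    return crop.permute(0, 2, 3, 1).reshape(1, grid * grid, D)

region_like_pe = cropped_pos_embed(torch.randn(1, 14 * 14, 256))  # (1,196,256)
```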

RECLIP: Resource-efficient CLIP by Training with Small Images

no code implementations · 12 Apr 2023 · Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo

We present RECLIP (Resource-efficient CLIP), a simple method that minimizes the computational resource footprint of CLIP (Contrastive Language-Image Pretraining).

Tasks: Contrastive Learning, Retrieval, +3 more
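
The resource saving comes from doing the bulk of contrastive training on heavily downsized inputs, with a brief high-resolution phase afterwards. A minimal sketch, with resolutions and schedule as placeholder values rather than the paper's settings:

```python
import torch
import torch.nn.functional as F

def shrink(images, size):
    # images: (B, 3, H, W) -> (B, 3, size, size) via bilinear downsampling.
    return F.interpolate(images, size=(size, size), mode="bilinear",
                         align_corners=False)

small = shrink(torch.randn(4, 3, 224, 224), 64)   # cheap pretraining inputs
# Hypothetical schedule: most contrastive-training steps on 64px inputs,
# then a short final adaptation phase at the full 224px resolution.
```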

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

no code implementations · 29 Mar 2023 · Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova

We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.

Tasks: Cross-Modal Retrieval, Image Retrieval, +7 more
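
The decoder-only design can serve both contrastive and generative objectives by toggling the attention mask between two passes over the same weights. A runnable toy sketch of that idea; the module, sizes, and the omission of image cross-attention are all simplifications:

```python
import torch
import torch.nn as nn

class TwoPassDecoder(nn.Module):
    """Hypothetical minimal decoder; image cross-attention omitted for brevity."""
    def __init__(self, vocab=1000, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_vocab = nn.Linear(dim, vocab)

    def forward(self, tokens, causal):
        x = self.embed(tokens)                          # (B, T, D)
        T = tokens.size(1)
        # Causal mask for the generative pass, full attention otherwise.
        mask = (torch.triu(torch.ones(T, T, dtype=torch.bool), 1)
                if causal else None)
        x, _ = self.attn(x, x, x, attn_mask=mask)
        return x

decoder = TwoPassDecoder()
tokens = torch.randint(0, 1000, (2, 8))
text_emb = decoder(tokens, causal=False).mean(dim=1)      # contrastive pass
logits = decoder.to_vocab(decoder(tokens, causal=True))   # generative pass
```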

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

1 code implementation · CVPR 2023 · AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

We present a simple approach that turns a ViT encoder into an efficient video model and works seamlessly with both image and video inputs.

Ranked #2 on Action Classification on Kinetics-600 (using extra training data)

Tasks: Action Classification, Action Recognition In Videos
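
The core trick is tokenizing a video with a few 3D "tube" convolutions of different shapes and large (sparse) strides, producing a modest token set that a standard ViT can consume alongside image patches. A sketch with illustrative tube shapes:

```python
import torch
import torch.nn as nn

dim = 256
tubes = nn.ModuleList([
    nn.Conv3d(3, dim, kernel_size=(8, 8, 8), stride=(16, 32, 32)),
    nn.Conv3d(3, dim, kernel_size=(16, 4, 4), stride=(6, 32, 32)),
    nn.Conv3d(3, dim, kernel_size=(1, 16, 16), stride=(32, 16, 16)),
])

video = torch.randn(1, 3, 32, 128, 128)   # (B, C, T, H, W)
# Each tube yields a small, sparsely strided token set; concatenate them all.
tokens = torch.cat(
    [t(video).flatten(2).transpose(1, 2) for t in tubes], dim=1)  # (B, N, D)
```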

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

1 code implementation · 30 Sep 2022 · Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova

We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.

Tasks: Knowledge Distillation, object-detection, +1 more
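
With the backbone frozen, F-VLM-style inference can combine the trained detector head's score with the frozen VLM's region-text similarity. A sketch of geometric score fusion; the exponent is a placeholder, and the actual method weights base and novel categories differently:

```python
import torch

def fuse_scores(det_probs, vlm_probs, alpha=0.65):
    # det_probs: detector-head class probabilities, (num_regions, num_classes).
    # vlm_probs: frozen VLM region-text similarities mapped into [0, 1].
    # Geometric interpolation between the two score sources.
    return det_probs ** (1 - alpha) * vlm_probs ** alpha

fused = fuse_scores(torch.rand(5, 20), torch.rand(5, 20))
```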

Pre-training image-language transformers for open-vocabulary tasks

no code implementations · 9 Sep 2022 · AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

We present a pre-training approach for vision-and-language transformer models based on a mixture of diverse tasks.

Tasks: Question Answering, Visual Entailment, +1 more
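
A mixture of diverse tasks can be realized as simply as sampling each batch's task from fixed weights. A toy sketch, with task names and weights purely illustrative:

```python
import random

# Illustrative task mixture; the paper's actual task set and weights may differ.
TASKS = {"captioning": 0.4, "vqa": 0.3, "matching": 0.3}

def sample_task(rng=random):
    names = list(TASKS)
    # Draw one task per batch according to the mixture weights.
    return rng.choices(names, weights=[TASKS[n] for n in names], k=1)[0]

task = sample_task()
```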

Video Question Answering with Iterative Video-Text Co-Tokenization

no code implementations · 1 Aug 2022 · AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video.

Tasks: Question Answering, Video Question Answering, +1 more
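
Iterative co-tokenization suggests a small set of learned fusion tokens repeatedly cross-attending to video and text features to refine a compact joint representation. A hedged toy sketch of that pattern (dimensions and iteration count are arbitrary):

```python
import torch
import torch.nn as nn

dim, iters = 256, 3
fuse = nn.MultiheadAttention(dim, 4, batch_first=True)
joint = nn.Parameter(torch.randn(1, 16, dim))   # learned fusion tokens

video = torch.randn(2, 64, dim)    # video patch features
text = torch.randn(2, 20, dim)     # text token features
context = torch.cat([video, text], dim=1)

tokens = joint.expand(2, -1, -1)
for _ in range(iters):
    # Fusion tokens repeatedly attend to the combined video-text context.
    tokens, _ = fuse(tokens, context, context)
```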

Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering

no code implementations · 2 May 2022 · AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, Anelia Angelova

We present Answer-Me, a task-aware multi-task framework that unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning.

Tasks: Image Captioning, Question Answering, +4 more
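
Unifying heterogeneous QA tasks usually amounts to casting each one as text-in, text-out with a task-specific prompt. A toy sketch; the templates are invented for illustration, not the paper's:

```python
def to_text_example(task, question, answer):
    # Hypothetical prompt templates mapping each task to one generation format.
    prompts = {
        "vqa": f"answer: {question}",
        "entailment": f"is it true that {question}?",
        "reasoning": f"reason: {question}",
    }
    return {"input_text": prompts[task], "target_text": answer}

ex = to_text_example("vqa", "what color is the car?", "red")
```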

FindIt: Generalized Localization with Natural Language Queries

no code implementations · 31 Mar 2022 · Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova

We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection.

Tasks: Natural Language Queries, Object, +5 more
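
A unified grounding model exposes one interface: (image features, text query) in, boxes out, whether the query is a referring expression, a category name, or empty for plain detection. A minimal hypothetical fusion-plus-box-head sketch:

```python
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    """Hypothetical fusion head: text-conditioned boxes from image features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.box_head = nn.Linear(dim, 4)   # (cx, cy, w, h) per image token

    def forward(self, image_feats, text_feats):
        # Image tokens attend to the text query, then regress boxes.
        fused, _ = self.fuse(image_feats, text_feats, text_feats)
        return self.box_head(fused).sigmoid()   # normalized coordinates

img = torch.randn(2, 196, 256)     # image patch features
txt = torch.randn(2, 12, 256)      # query tokens (expression or class name)
boxes = GroundingHead()(img, txt)  # (2, 196, 4)
```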

Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image

no code implementations · ICCV 2021 · Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai

3D perception of object shapes from RGB image input is fundamental to semantic scene understanding, grounding image-based perception in our 3-dimensional real-world environments.

Tasks: Retrieval, Scene Understanding
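
The patchwise idea from the title can be illustrated as scoring an image against a CAD candidate by matching patch embeddings rather than a single global vector. A hedged sketch with an invented aggregation (mean of per-patch best matches):

```python
import torch
import torch.nn.functional as F

def patch_retrieval_score(img_patches, cad_patches):
    # img_patches: (N, D) image patch embeddings; cad_patches: (M, D).
    sim = (F.normalize(img_patches, dim=-1)
           @ F.normalize(cad_patches, dim=-1).t())   # (N, M) cosine sims
    # Invented aggregation: each image patch's best CAD match, averaged.
    return sim.max(dim=1).values.mean()

score = patch_retrieval_score(torch.randn(49, 128), torch.randn(49, 128))
```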

Learning Open-World Object Proposals without Learning to Classify

3 code implementations · 15 Aug 2021 · Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo

In this paper, we identify the core problem: the binary classifiers in existing proposal methods tend to overfit to the training categories.

Tasks: Object, object-detection, +4 more
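
The fix the abstract implies is to score proposals by estimated localization quality (e.g. predicted IoU with the nearest ground-truth box) instead of a learned binary objectness classifier. A minimal sketch of such a quality head and its regression loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

quality_head = nn.Linear(256, 1)   # predicts a localization-quality score

def quality_loss(proposal_feats, iou_with_gt):
    # proposal_feats: (N, 256) RoI features; iou_with_gt: (N,) targets in [0, 1].
    pred = quality_head(proposal_feats).squeeze(-1).sigmoid()
    # Regress quality instead of classifying object vs. background, so the
    # score need not depend on the training categories.
    return F.l1_loss(pred, iou_with_gt)

loss = quality_loss(torch.randn(8, 256), torch.rand(8))
```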
