Search Results for author: Ronghang Hu

Found 25 papers, 17 papers with code

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

10 code implementations • CVPR 2023 • Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie

This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation.

Ranked #45 on Semantic Segmentation on ADE20K

Object Detection Representation Learning +2

29,774

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

no code implementations • ICCV 2023 • Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, Angel X. Chang

Performing 3D dense captioning and visual grounding requires a common and shared understanding of the underlying multimodal relationships.

3D dense captioning Dense Captioning +1

Scaling Language-Image Pre-training via Masking

4 code implementations • CVPR 2023 • Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He

We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP.

8,451

Exploring Long-Sequence Masked Autoencoders

1 code implementation • 13 Oct 2022 • Ronghang Hu, Shoubhik Debnath, Saining Xie, Xinlei Chen

Masked Autoencoding (MAE) has emerged as an effective approach for pre-training representations across multiple domains.

Object Detection Segmentation +1

FLAVA: A Foundational Language And Vision Alignment Model

3 code implementations • CVPR 2022 • Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.

Ranked #4 on Image Retrieval on MS COCO

Image Retrieval Image-to-Text Retrieval +3

1,294

UniT: Multimodal Multitask Learning with a Unified Transformer

1 code implementation • ICCV 2021 • Ronghang Hu, Amanpreet Singh

We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning.

Multimodal Reasoning Multi-Task Learning +4

5,415

Worldsheet: Wrapping the World in a 3D Sheet for View Synthesis from a Single Image

1 code implementation • ICCV 2021 • Ronghang Hu, Nikhila Ravi, Alexander C. Berg, Deepak Pathak

We present Worldsheet, a method for novel view synthesis using just a single RGB image as input.

Novel View Synthesis

TextCaps: a Dataset for Image Captioning with Reading Comprehension

no code implementations • ECCV 2020 • Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh

Image descriptions can help visually impaired people to quickly understand the image content.

Image Captioning Optical Character Recognition +3

Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

1 code implementation • CVPR 2020 • Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question.

General Classification

Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation

no code implementations • ACL 2019 • Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, Kate Saenko

The actual grounding can connect language to the environment through multiple modalities, e.g. "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route.

Vision and Language Navigation

Language-Conditioned Graph Networks for Relational Reasoning

1 code implementation • ICCV 2019 • Ronghang Hu, Anna Rohrbach, Trevor Darrell, Kate Saenko

E.g., conditioning on the "on" relationship to the plate, the object "mug" gathers messages from the object "plate" to update its representation to "mug on the plate", which can be easily consumed by a simple classifier for answer prediction.

Ranked #3 on Referring Expression Comprehension on CLEVR-Ref+

Object Referring Expression Comprehension +2

Grounding Visual Explanations

no code implementations • ECCV 2018 • Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata

Our model improves the textual explanation quality of fine-grained classification decisions on the CUB dataset by mentioning phrases that are grounded in the image.

General Classification Sentence

Explainable Neural Computation via Stack Neural Module Networks

1 code implementation • ECCV 2018 • Ronghang Hu, Jacob Andreas, Trevor Darrell, Kate Saenko

In complex inferential tasks like question answering, machine learning models must confront two challenges: the need to implement a compositional reasoning process, and, in many applications, the need for this reasoning process to be interpretable to assist users in both development and prediction.

Ranked #14 on Referring Expression Comprehension on Talk2Car

Decision Making Question Answering +1

Generating Counterfactual Explanations with Natural Language

no code implementations • 26 Jun 2018 • Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata

We call such textual explanations counterfactual explanations, and propose an intuitive method to generate counterfactual explanations by inspecting which evidence in an input is missing, but might contribute to a different classification decision if present in the image.

Classification counterfactual +2

Speaker-Follower Models for Vision-and-Language Navigation

1 code implementation • NeurIPS 2018 • Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, Trevor Darrell

We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction.

Data Augmentation Vision and Language Navigation

124

Learning to Segment Every Thing

3 code implementations • CVPR 2018 • Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, Ross Girshick

Most methods for object instance segmentation require all training examples to be labeled with segmentation masks.

Instance Segmentation Segmentation +1

26,139

Grounding Visual Explanations (Extended Abstract)

no code implementations • 17 Nov 2017 • Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata

Existing models which generate textual explanations enforce task relevance through a discriminative term loss function, but such mechanisms only weakly constrain mentioned object parts to actually be present in the image.

Attribute

Learning to Reason: End-to-End Module Networks for Visual Question Answering

1 code implementation • ICCV 2017 • Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko

Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.

Ranked #43 on Visual Question Answering (VQA) on VQA v2 test-dev

Visual Dialog Visual Question Answering

270

Modeling Relationships in Referential Expressions with Compositional Modular Networks

2 code implementations • CVPR 2017 • Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, Kate Saenko

In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene.

Ranked #1 on Visual Question Answering (VQA) on Visual7W

Visual Question Answering (VQA)

745

Utilizing Large Scale Vision and Text Datasets for Image Segmentation from Referring Expressions

no code implementations • 30 Aug 2016 • Ronghang Hu, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell

Image segmentation from referring expressions is a joint vision and language modeling task, where the input is an image and a textual expression describing a particular region in the image; and the goal is to localize and segment the specific image region based on the given expression.

Image Captioning Image Segmentation +3

Segmentation from Natural Language Expressions

4 code implementations • 20 Mar 2016 • Ronghang Hu, Marcus Rohrbach, Trevor Darrell

To produce pixelwise segmentation for the language expression, we propose an end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information.

Ranked #16 on Referring Expression Segmentation on J-HMDB

Referring Expression Segmentation Segmentation +1

Natural Language Object Retrieval

1 code implementation • CVPR 2016 • Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, Trevor Darrell

In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object.

Ranked #12 on Referring Expression Comprehension on Talk2Car

Image Captioning Image Retrieval +4

112

Grounding of Textual Phrases in Images by Reconstruction

3 code implementations • 12 Nov 2015 • Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.

Ranked #12 on Phrase Grounding on Flickr30k Entities Test

Language Modelling Natural Language Visual Grounding +2

218

Spatial Semantic Regularisation for Large Scale Object Detection

no code implementations • ICCV 2015 • Damian Mrowca, Marcus Rohrbach, Judy Hoffman, Ronghang Hu, Kate Saenko, Trevor Darrell

Our approach proves to be especially useful in large scale settings with thousands of classes, where spatial and semantic interactions are very frequent and only weakly supervised detectors can be built due to a lack of bounding box annotations.

Clustering Object +2

LSDA: Large Scale Detection Through Adaptation

1 code implementation • NeurIPS 2014 • Judy Hoffman, Sergio Guadarrama, Eric Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, Kate Saenko

A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories.

Classification General Classification +2
