no code implementations • 24 Apr 2024 • Zerui Chen, ShiZhe Chen, Cordelia Schmid, Ivan Laptev
In this work, we aim to learn a unified vision-based policy for a multi-fingered robot hand to manipulate different objects in diverse poses.
no code implementations • 23 Apr 2024 • Qingrong He, Kejun Lin, ShiZhe Chen, Anwen Hu, Qin Jin
The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules.
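The Think-then-Program loop can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in: `locate_object`, `measure_distance`, and the hard-coded plan substitute for the paper's LLM-driven decomposition and its carefully designed 3D perception modules.

```python
# Minimal sketch of a Think-then-Program loop (hypothetical helpers,
# not the paper's actual modules).

def locate_object(scene, name):
    """Hypothetical 3D perception module: return the 3D position of `name`."""
    return scene[name]

def measure_distance(box_a, box_b):
    """Hypothetical module: Euclidean distance between object centers."""
    return sum((a - b) ** 2 for a, b in zip(box_a, box_b)) ** 0.5

def think(question):
    # Think phase: decompose the compositional question into steps.
    # In the paper this decomposition is generated; here it is hard-coded.
    return [("locate", "chair"), ("locate", "table"), ("distance", None)]

def program(steps, scene):
    # Program phase: ground each step to a call into a perception module.
    results = []
    for op, arg in steps:
        if op == "locate":
            results.append(locate_object(scene, arg))
        elif op == "distance":
            results.append(measure_distance(results[-2], results[-1]))
    return results[-1]

scene = {"chair": (0.0, 0.0, 0.0), "table": (1.0, 2.0, 0.0)}
print(program(think("How far is the chair from the table?"), scene))
```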
no code implementations • 1 Apr 2024 • ShiZhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid
SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks, namely cross-modal knowledge distillation for semantic learning, masked point modeling to understand geometry structures, grasping pose synthesis for object affordance, 3D instance segmentation and referring expression grounding to analyze cluttered scenes.
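A rough sketch of how such multi-task pre-training can be organized: a shared backbone feeds one head per objective, and per-task losses are combined with weights. The module names and weighting scheme below are illustrative assumptions, not SUGAR's actual code.

```python
# Sketch of combining several pre-training objectives on a shared
# transformer backbone, in the spirit of a multi-task setup.
import torch.nn as nn

class MultiTaskWrapper(nn.Module):
    def __init__(self, backbone, heads):
        super().__init__()
        self.backbone = backbone           # shared point-cloud transformer
        self.heads = nn.ModuleDict(heads)  # one loss head per task, e.g.
                                           # distillation, masked point
                                           # modeling, grasp synthesis, ...

    def forward(self, points, targets, weights):
        feats = self.backbone(points)
        # Weighted sum of per-task losses over the shared features.
        return sum(w * self.heads[name](feats, targets[name])
                   for name, w in weights.items())
```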
1 code implementation • 27 Sep 2023 • ShiZhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev
The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics.
Ranked #5 on Robot Manipulation on RLBench
no code implementations • ICCV 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin
To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints.
no code implementations • 10 Aug 2023 • ShiZhe Chen, Thomas Chabal, Ivan Laptev, Cordelia Schmid
Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments.
no code implementations • 28 Jul 2023 • Ricardo Garcia, Robin Strudel, ShiZhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid
While previous work mainly evaluates DR for disembodied tasks, such as pose estimation and object detection, here we systematically explore visual domain randomization methods and benchmark them on a rich set of challenging robotic manipulation tasks.
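A minimal sketch of per-episode visual domain randomization; the `sim` API below is hypothetical and stands in for whichever simulator is used.

```python
# Randomize textures, lighting, and camera pose once per training
# episode so the policy never overfits to a single rendering.
import random

def randomize_visuals(sim, texture_pool, rng=random):
    for obj in sim.objects():                        # every visible object
        sim.set_texture(obj, rng.choice(texture_pool))
    sim.set_light(intensity=rng.uniform(0.3, 1.5),   # random lighting
                  azimuth=rng.uniform(0, 360))
    sim.jitter_camera(pos_noise=0.02, rot_noise=2.0) # small camera shake
```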
1 code implementation • 10 May 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin
Existing metrics provide only a single score to measure caption quality, which makes them less explainable and informative.
1 code implementation • CVPR 2023 • Zerui Chen, ShiZhe Chen, Cordelia Schmid, Ivan Laptev
In particular, we address reconstruction of hands and manipulated objects from monocular RGB images.
Ranked #5 on hand-object pose on DexYCB
no code implementations • 31 Dec 2022 • Xu Gu, Yuchong Sun, Feiyue Ni, ShiZhe Chen, Xihua Wang, Ruihua Song, Boyuan Li, Xiang Cao
In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images as the video storyboard to visualize the text synopsis.
1 code implementation • 17 Nov 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
2 code implementations • 11 Sep 2022 • Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid
In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.
Ranked #2 on Robot Manipulation on RLBench (Succ. Rate (10 tasks, 100 demos/task) metric)
1 code implementation • 24 Aug 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.
Ranked #1 on Visual Navigation on SOON Test
1 code implementation • CVPR 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.
Ranked #4 on Visual Navigation on SOON Test
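The dynamic fine/coarse combination above might be sketched as follows. This minimal PyTorch illustration omits the language conditioning and graph structure of the full model, and all layer sizes are assumptions.

```python
# Sketch of fusing a fine-scale encoding of local observations with a
# coarse-scale encoding of the global map via a learned gate.
import torch
import torch.nn as nn

class DualScaleFusion(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.fine = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.coarse = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.gate = nn.Linear(2 * dim, 1)   # dynamic weighting of the scales

    def forward(self, local_views, map_nodes):
        f = self.fine(local_views).mean(1)  # fine: current local observation
        c = self.coarse(map_nodes).mean(1)  # coarse: global map nodes
        w = torch.sigmoid(self.gate(torch.cat([f, c], -1)))
        return w * f + (1 - w) * c          # dynamically combined feature

fusion = DualScaleFusion()
out = fusion(torch.randn(2, 36, 768), torch.randn(2, 10, 768))
print(out.shape)  # torch.Size([2, 768])
```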
no code implementations • CVPR 2022 • Sipeng Zheng, ShiZhe Chen, Qin Jin
Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatiotemporal contexts across stages and also suffers from inefficiency.
1 code implementation • NeurIPS 2021 • ShiZhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
Ranked #3 on Vision and Language Navigation on RxR
1 code implementation • 25 Aug 2021 • Yuqing Song, ShiZhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang
Firstly, product descriptions contain many specialized jargon terms, which are ambiguous to translate without the product image.
2 code implementations • ICCV 2021 • Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid
Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.
Ranked #3 on Vision and Language Navigation on VLN Challenge
1 code implementation • ICCV 2021 • ShiZhe Chen, Dong Huang
However, due to the complexity and diversity of actions, it remains challenging to semantically represent action classes and transfer knowledge from seen data.
Ranked #3 on Zero-Shot Action Recognition on Olympics
1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin
To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap).
1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin
In this work, we focus on the entity-aware news image captioning task which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image.
no code implementations • CVPR 2021 • Chaorui Deng, ShiZhe Chen, Da Chen, Yuan He, Qi Wu
The dense video captioning task aims to detect and describe a sequence of events in a video for detailed and coherent storytelling.
1 code implementation • 11 Jun 2021 • Ludan Ruan, Jieting Chen, Yuqing Song, ShiZhe Chen, Qin Jin
For object grounding, we fine-tune the state-of-the-art detection model MDETR and design a post-processing method to make the grounding results more faithful.
1 code implementation • CVPR 2021 • Yuqing Song, ShiZhe Chen, Qin Jin
Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs.
2 code implementations • 11 Mar 2021 • Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li, Junyi Li, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, ShiZhe Chen, Zhiwu Lu, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen
We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model.
Ranked #1 on Image Retrieval on RUC-CAS-WenLan
1 code implementation • 12 Apr 2020 • Shizhe Chen, Weiying Wang, Ludan Ruan, Linli Yao, Qin Jin
The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos, e.g., makeup instructional videos.
1 code implementation • CVPR 2020 • Shizhe Chen, Qin Jin, Peng Wang, Qi Wu
From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph, and therefore generate desired captions according to the graph structure.
4 code implementations • CVPR 2020 • Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.
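One way to sketch the global-to-local matching idea: compute similarities at several semantic levels and sum them. The random per-level embeddings below are placeholders standing in for HGR's event, action, and entity encoders.

```python
# Hierarchical video-text matching sketch: sum weighted cosine
# similarities across semantic levels (global-to-local).
import torch
import torch.nn.functional as F

def level_sim(video_feat, text_feat):
    # Cosine similarity between L2-normalized embeddings at one level.
    return (F.normalize(video_feat, dim=-1) *
            F.normalize(text_feat, dim=-1)).sum(-1)

def hierarchical_score(video_levels, text_levels, weights=(1.0, 1.0, 1.0)):
    # Levels might correspond to events, actions, and entities.
    return sum(w * level_sim(v, t)
               for w, (v, t) in zip(weights, zip(video_levels, text_levels)))

v = [torch.randn(4, 512) for _ in range(3)]  # per-level video embeddings
t = [torch.randn(4, 512) for _ in range(3)]  # per-level text embeddings
print(hierarchical_score(v, t).shape)        # torch.Size([4])
```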
no code implementations • 24 Nov 2019 • Shizhe Chen, Bei Liu, Jianlong Fu, Ruihua Song, Qin Jin, Pingping Lin, Xiaoyu Qi, Chunting Wang, Jin Zhou
A storyboard is a sequence of images that illustrates a story consisting of multiple sentences, and storyboarding has long been a key step in creating various story products.
no code implementations • 15 Oct 2019 • Shizhe Chen, Yida Zhao, Yuqing Song, Qin Jin, Qi Wu
This notebook paper presents our model in the VATEX video captioning challenge.
no code implementations • 11 Jul 2019 • Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with a 9.91 METEOR score on the challenge testing set.
no code implementations • 3 Jun 2019 • Shizhe Chen, Qin Jin, Jianlong Fu
However, a picture tells a thousand words: multilingual sentences pivoted by the same image are therefore noisy as mutual translations, which hinders the learning of the translation model.
no code implementations • 2 Jun 2019 • Shizhe Chen, Qin Jin, Alexander Hauptmann
The linguistic feature is learned from sentence contexts with visual semantic constraints, which helps to learn translations for words that are less visually relevant.
no code implementations • 22 Jun 2018 • Shizhe Chen, Yuqing Song, Yida Zhao, Jiarong Qiu, Qin Jin, Alexander Hauptmann
This notebook paper presents our system in the ActivityNet Dense Captioning in Video task (task 3).
no code implementations • 4 Sep 2017 • Shizhe Chen, Qin Jin
Continuous dimensional emotion prediction is a challenging task in which fusing multiple modalities, e.g., via early or late fusion, usually achieves state-of-the-art performance.
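To make the early/late distinction concrete, here is a toy sketch with illustrative features and a simple ridge regressor, not the paper's models: early fusion concatenates modality features before a single predictor, while late fusion averages per-modality predictions.

```python
# Toy contrast of early vs. late fusion for continuous prediction.
import numpy as np

rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 20))   # toy audio features
video = rng.normal(size=(100, 30))   # toy visual features
y = rng.normal(size=100)             # continuous emotion labels

def ridge_fit_predict(X, y, lam=1.0):
    # Closed-form ridge regression, then predict on the same features.
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return X @ w

# Early fusion: one model on concatenated features.
early_pred = ridge_fit_predict(np.hstack([audio, video]), y)

# Late fusion: average the predictions of per-modality models.
late_pred = (ridge_fit_predict(audio, y) + ridge_fit_predict(video, y)) / 2
print(early_pred.shape, late_pred.shape)
```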
no code implementations • 31 Aug 2017 • Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann
For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from multimodal contents of videos.
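A minimal distillation sketch of this teacher-student setup, assuming illustrative feature and topic dimensions: mined topic assignments act as soft targets for a small student network trained with a KL-divergence loss.

```python
# Student network learns to predict mined (teacher) topic distributions
# from multimodal video features.
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(),
                        nn.Linear(256, 50))         # 50 latent topics

feats = torch.randn(8, 1024)                        # multimodal video features
teacher_topics = F.softmax(torch.randn(8, 50), -1)  # mined topic distributions

# KL divergence between the student prediction and the teacher topics.
loss = F.kl_div(F.log_softmax(student(feats), -1),
                teacher_topics, reduction="batchmean")
loss.backward()
```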
no code implementations • 31 Aug 2017 • Shizhe Chen, Jia Chen, Qin Jin
In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way from training captions using an unsupervised topic mining model.
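A minimal sketch of data-driven topic mining over training captions, using scikit-learn's LDA as one plausible unsupervised topic model; the paper's actual model may differ.

```python
# Mine latent topics from a few toy captions with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

captions = ["a man slices a tomato in the kitchen",
            "a woman applies lipstick in front of a mirror",
            "a chef stirs soup in a large pot"]

X = CountVectorizer(stop_words="english").fit_transform(captions)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # per-caption topic distributions
```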