1 code implementation • 1 Apr 2024 • Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video.
no code implementations • 11 Jan 2024 • Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan
Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
no code implementations • 14 Dec 2023 • Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid
When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.
1 code implementation • 11 Dec 2023 • Abdullah Rashwan, Jiageng Zhang, Ali Taalimi, Fan Yang, Xingyi Zhou, Chaochao Yan, Liang-Chieh Chen, Yeqing Li
With a ResNet50 backbone, our MaskConver achieves 53.6% PQ on the COCO panoptic val set, outperforming the modern convolution-based model, Panoptic FCN, by 9.3%, as well as transformer-based models such as Mask2Former (+1.7% PQ) and kMaX-DeepLab (+0.6% PQ).
Ranked #8 on Panoptic Segmentation on COCO test-dev
no code implementations • NeurIPS 2023 • Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid
A positive result would refute the common belief that explicit visual abstraction (e.g. object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.
1 code implementation • CVPR 2023 • Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.
1 code implementation • 20 Jun 2023 • Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video.
1 code implementation • 12 Dec 2022 • Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl
Our detector, which trains Deformable-DETR with traditional IoU-based label assignment, achieved 50.2 COCO mAP within 12 epochs (1x schedule) with a ResNet50 backbone, outperforming all existing traditional or transformer-based detectors in this setting.
Ranked #2 on Object Detection on COCO-O (using extra training data)
1 code implementation • CVPR 2022 • Xingyi Zhou, Tianwei Yin, Vladlen Koltun, Philipp Krähenbühl
The transformer encodes object features from all frames, and uses trajectory queries to group them into trajectories.
Ranked #13 on Multi-Object Tracking on SportsMOT (using extra training data)
1 code implementation • 7 Jan 2022 • Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra
For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning.
Ranked #2 on Open Vocabulary Object Detection on OpenImages-v4
1 code implementation • NeurIPS 2021 • Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl
For autonomous driving, this means that large objects close to the sensors are easily visible, but far-away or small objects comprise only one or two measurements.
Ranked #63 on 3D Object Detection on nuScenes
2 code implementations • 12 Mar 2021 • Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
We develop a probabilistic interpretation of two-stage object detection.
Ranked #20 on Object Detection on COCO-O
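The probabilistic interpretation above can be illustrated with a minimal sketch (not the authors' implementation; the function name is hypothetical): the final detection score is the product of the first-stage objectness probability and the second-stage conditional class probability, i.e. P(class) = P(object) × P(class | object).

```python
def detection_score(p_objectness: float, p_class_given_object: float) -> float:
    """Combine the two stages probabilistically:
    P(class) = P(object) * P(class | object)."""
    return p_objectness * p_class_given_object

# A proposal with a weak first stage is down-weighted even if the
# second-stage classifier is confident, and vice versa.
strong = detection_score(0.5, 0.5)   # both stages moderately confident
weak = detection_score(0.1, 0.9)     # confident classifier, weak proposal
```

Training both stages to output calibrated probabilities is what lets the scores be multiplied rather than taken from the second stage alone.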
1 code implementation • CVPR 2022 • Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
Experiments show our learned taxonomy outperforms an expert-designed taxonomy on all datasets.
no code implementations • 1 Jan 2021 • Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
These labels span many diverse datasets with potentially inconsistent semantic labels.
11 code implementations • CVPR 2021 • Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl
Three-dimensional objects are commonly represented as 3D boxes in a point-cloud.
Ranked #1 on Robust 3D Object Detection on nuScenes-C
7 code implementations • ECCV 2020 • Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
Nowadays, tracking is dominated by pipelines that perform object detection followed by temporal association, also known as tracking-by-detection.
Ranked #4 on Multiple Object Tracking on KITTI Tracking test
77 code implementations • 16 Apr 2019 • Xingyi Zhou, Dequan Wang, Philipp Krähenbühl
We model an object as a single point -- the center point of its bounding box.
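The center-point idea can be sketched as peak extraction on a per-class heatmap: an object is a cell whose score is a local maximum above a threshold. This is a simplified illustration, not the paper's implementation (the real model also regresses box size and a sub-pixel offset at each center, and extracts peaks with max-pooling).

```python
def decode_centers(heatmap, threshold=0.5):
    """Return (row, col, score) for every cell that is a local maximum
    over its 8-neighbourhood and scores above the threshold."""
    h, w = len(heatmap), len(heatmap[0])
    centers = []
    for r in range(h):
        for c in range(w):
            s = heatmap[r][c]
            if s < threshold:
                continue
            # Gather the in-bounds 8-neighbourhood of (r, c).
            neighbours = [heatmap[rr][cc]
                          for rr in range(max(r - 1, 0), min(r + 2, h))
                          for cc in range(max(c - 1, 0), min(c + 2, w))
                          if (rr, cc) != (r, c)]
            if all(s >= n for n in neighbours):
                centers.append((r, c, s))
    return centers

# A single 0.9 peak in a 3x3 heatmap decodes to one center at (1, 1).
peaks = decode_centers([[0.1, 0.2, 0.1],
                        [0.2, 0.9, 0.2],
                        [0.1, 0.2, 0.1]])
```

Because peaks replace anchor boxes, no non-maximum suppression over overlapping boxes is needed; the local-maximum test plays that role.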
2 code implementations • CVPR 2019 • Xingyi Zhou, Jiacheng Zhuo, Philipp Krähenbühl
With the advent of deep learning, object detection drifted from a bottom-up to a top-down recognition problem.
Ranked #128 on Object Detection on COCO minival
1 code implementation • ECCV 2018 • Xingyi Zhou, Arjun Karpur, Linjie Luo, Qi-Xing Huang
Existing methods define semantic keypoints separately for each category with a fixed number of semantic labels in fixed indices.
Ranked #2 on Keypoint Detection on Pascal3D+
1 code implementation • ECCV 2018 • Xingyi Zhou, Arjun Karpur, Chuang Gan, Linjie Luo, Qi-Xing Huang
In this paper, we introduce a novel unsupervised domain adaptation technique for the task of 3D keypoint prediction from a single depth scan or image.
6 code implementations • ICCV 2017 • Xingyi Zhou, Qi-Xing Huang, Xiao Sun, Xiangyang Xue, Yichen Wei
We propose a weakly-supervised transfer learning method that uses mixed 2D and 3D labels in a unified deep neural network with a two-stage cascaded structure.
2D Pose Estimation, 3D Multi-Person Pose Estimation (absolute), +4
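The mixed 2D/3D supervision above can be sketched as a masked loss (a hypothetical illustration, not the paper's code): every sample contributes a 2D keypoint error, while the depth error is applied only to samples that carry a 3D label.

```python
def mixed_loss(pred_2d, gt_2d, pred_depth, gt_depth, has_3d, w_depth=0.1):
    """Sum a 2D error over all samples plus a weighted depth error
    over the subset of samples that have 3D annotations."""
    loss = 0.0
    for i in range(len(gt_2d)):
        loss += abs(pred_2d[i] - gt_2d[i])            # 2D term: always on
        if has_3d[i]:                                 # 3D term: only if labelled
            loss += w_depth * abs(pred_depth[i] - gt_depth[i])
    return loss
```

Masking the 3D term is what lets abundant 2D-labelled (in-the-wild) images and scarcer 3D-labelled (lab) images be mixed in one training batch.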
no code implementations • 17 Sep 2016 • Xingyi Zhou, Xiao Sun, Wei Zhang, Shuang Liang, Yichen Wei
In this work, we propose to directly embed a kinematic object model into deep neural network learning for general articulated object pose estimation.
Ranked #307 on 3D Human Pose Estimation on Human3.6M
1 code implementation • 22 Jun 2016 • Xingyi Zhou, Qingfu Wan, Wei Zhang, Xiangyang Xue, Yichen Wei
For the first time, we show that embedding such a non-linear generative process in deep learning is feasible for hand pose estimation.