1 code implementation • 1 Apr 2024 • Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video.
no code implementations • 11 Jan 2024 • Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan
Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
no code implementations • 14 Dec 2023 • Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid
When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.
1 code implementation • 11 Dec 2023 • Abdullah Rashwan, Jiageng Zhang, Ali Taalimi, Fan Yang, Xingyi Zhou, Chaochao Yan, Liang-Chieh Chen, Yeqing Li
With a ResNet50 backbone, our MaskConver achieves 53.6% PQ on the COCO panoptic val set, outperforming the modern convolution-based model, Panoptic FCN, by 9.3%, as well as transformer-based models such as Mask2Former (+1.7% PQ) and kMaX-DeepLab (+0.6% PQ).
Ranked #8 on Panoptic Segmentation on COCO test-dev
no code implementations • NeurIPS 2023 • Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid
A positive result would refute the common belief that explicit visual abstraction (e.g. object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.
1 code implementation • CVPR 2023 • Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.
1 code implementation • 20 Jun 2023 • Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video.
1 code implementation • 12 Dec 2022 • Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl
Our detector, which trains Deformable-DETR with traditional IoU-based label assignment, achieved 50.2 COCO mAP within 12 epochs (1x schedule) with a ResNet50 backbone, outperforming all existing traditional or transformer-based detectors in this setting.
Ranked #2 on Object Detection on COCO-O (using extra training data)
1 code implementation • CVPR 2022 • Xingyi Zhou, Tianwei Yin, Vladlen Koltun, Philipp Krähenbühl
The transformer encodes object features from all frames, and uses trajectory queries to group them into trajectories.
Ranked #13 on Multi-Object Tracking on SportsMOT (using extra training data)
1 code implementation • 7 Jan 2022 • Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra
For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning.
Ranked #2 on Open Vocabulary Object Detection on OpenImages-v4
1 code implementation • NeurIPS 2021 • Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl
For autonomous driving, this means that large objects close to the sensors are easily visible, but far-away or small objects comprise only one or two measurements.
Ranked #63 on 3D Object Detection on nuScenes
2 code implementations • 12 Mar 2021 • Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
We develop a probabilistic interpretation of two-stage object detection.
Ranked #20 on Object Detection on COCO-O
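The probabilistic interpretation above can be illustrated with a minimal sketch (not the authors' implementation; the function name is hypothetical): the final detection score is the product of the first-stage objectness probability and the second-stage conditional class probability, i.e. P(class) = P(object) × P(class | object).

```python
def detection_score(p_objectness: float, p_class_given_object: float) -> float:
    """Combine the two stages probabilistically:
    P(class) = P(object) * P(class | object)."""
    return p_objectness * p_class_given_object

# A proposal with a weak first stage is down-weighted even if the
# second-stage classifier is confident, and vice versa.
strong = detection_score(0.5, 0.5)   # both stages moderately confident
weak = detection_score(0.1, 0.9)     # confident classifier, weak proposal
```

Training both stages to output calibrated probabilities is what lets the scores be multiplied rather than taken from the second stage alone.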
1 code implementation • CVPR 2022 • Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
Experiments show our learned taxonomy outperforms an expert-designed taxonomy on all datasets.
no code implementations • 1 Jan 2021 • Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
These labels span many diverse datasets with potentially inconsistent semantic labels.
11 code implementations • CVPR 2021 • Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl
Three-dimensional objects are commonly represented as 3D boxes in a point-cloud.
Ranked #1 on Robust 3D Object Detection on nuScenes-C
7 code implementations • ECCV 2020 • Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
Nowadays, tracking is dominated by pipelines that perform object detection followed by temporal association, also known as tracking-by-detection.
Ranked #4 on Multiple Object Tracking on KITTI Tracking test
77 code implementations • 16 Apr 2019 • Xingyi Zhou, Dequan Wang, Philipp Krähenbühl
We model an object as a single point -- the center point of its bounding box.
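The center-point idea can be sketched as peak extraction on a per-class heatmap: an object is a cell whose score is a local maximum above a threshold. This is a simplified illustration, not the paper's implementation (the real model also regresses box size and a sub-pixel offset at each center, and extracts peaks with max-pooling).

```python
def decode_centers(heatmap, threshold=0.5):
    """Return (row, col, score) for every cell that is a local maximum
    over its 8-neighbourhood and scores above the threshold."""
    h, w = len(heatmap), len(heatmap[0])
    centers = []
    for r in range(h):
        for c in range(w):
            s = heatmap[r][c]
            if s < threshold:
                continue
            # Gather the in-bounds 8-neighbourhood of (r, c).
            neighbours = [heatmap[rr][cc]
                          for rr in range(max(r - 1, 0), min(r + 2, h))
                          for cc in range(max(c - 1, 0), min(c + 2, w))
                          if (rr, cc) != (r, c)]
            if all(s >= n for n in neighbours):
                centers.append((r, c, s))
    return centers

# A single 0.9 peak in a 3x3 heatmap decodes to one center at (1, 1).
peaks = decode_centers([[0.1, 0.2, 0.1],
                        [0.2, 0.9, 0.2],
                        [0.1, 0.2, 0.1]])
```

Because peaks replace anchor boxes, no non-maximum suppression over overlapping boxes is needed; the local-maximum test plays that role.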
2 code implementations • CVPR 2019 • Xingyi Zhou, Jiacheng Zhuo, Philipp Krähenbühl
With the advent of deep learning, object detection drifted from a bottom-up to a top-down recognition problem.
Ranked #128 on Object Detection on COCO minival
1 code implementation • ECCV 2018 • Xingyi Zhou, Arjun Karpur, Linjie Luo, Qi-Xing Huang
Existing methods define semantic keypoints separately for each category with a fixed number of semantic labels in fixed indices.
Ranked #2 on Keypoint Detection on Pascal3D+
1 code implementation • ECCV 2018 • Xingyi Zhou, Arjun Karpur, Chuang Gan, Linjie Luo, Qi-Xing Huang
In this paper, we introduce a novel unsupervised domain adaptation technique for the task of 3D keypoint prediction from a single depth scan or image.
6 code implementations • ICCV 2017 • Xingyi Zhou, Qi-Xing Huang, Xiao Sun, Xiangyang Xue, Yichen Wei
We propose a weakly-supervised transfer learning method that uses mixed 2D and 3D labels in a unified deep neural network with a two-stage cascaded structure.
2D Pose Estimation, 3D Multi-Person Pose Estimation (absolute), +4
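The mixed 2D/3D supervision above can be sketched as a masked loss (a hypothetical illustration, not the paper's code): every sample contributes a 2D keypoint error, while the depth error is applied only to samples that carry a 3D label.

```python
def mixed_loss(pred_2d, gt_2d, pred_depth, gt_depth, has_3d, w_depth=0.1):
    """Sum a 2D error over all samples plus a weighted depth error
    over the subset of samples that have 3D annotations."""
    loss = 0.0
    for i in range(len(gt_2d)):
        loss += abs(pred_2d[i] - gt_2d[i])            # 2D term: always on
        if has_3d[i]:                                 # 3D term: only if labelled
            loss += w_depth * abs(pred_depth[i] - gt_depth[i])
    return loss
```

Masking the 3D term is what lets abundant 2D-labelled (in-the-wild) images and scarcer 3D-labelled (lab) images be mixed in one training batch.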
no code implementations • 17 Sep 2016 • Xingyi Zhou, Xiao Sun, Wei Zhang, Shuang Liang, Yichen Wei
In this work, we propose to directly embed a kinematic object model into deep neural network learning for general articulated object pose estimation.
Ranked #307 on 3D Human Pose Estimation on Human3.6M
1 code implementation • 22 Jun 2016 • Xingyi Zhou, Qingfu Wan, Wei Zhang, Xiangyang Xue, Yichen Wei
For the first time, we show that embedding such a non-linear generative process in deep learning is feasible for hand pose estimation.