no code implementations • 15 Dec 2023 • Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to referring to multiple objects in one expression, or to identifying empty targets that are absent from the image.
Tasks: Generalized Referring Expression Segmentation, Referring Expression, +1
1 code implementation • 4 Sep 2023 • Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang
On the one hand, using dense attention in ViT leads to excessive memory and computational cost, and features can be influenced by irrelevant parts beyond the regions of interest.
Ranked #4 on Object Detection on COCO 2017
1 code implementation • ICCV 2023 • Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, Gao Huang
The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks.
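The quadratic cost mentioned above, and the kernelized reordering that linear-attention methods use to avoid it, can be illustrated with a minimal NumPy sketch (a generic linear-attention variant for illustration, not this paper's specific method; the feature map `phi` is a placeholder assumption):

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the N x N score matrix makes the cost quadratic
    # in sequence length N (for images, N = H*W patches).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # (N, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Kernelized variant: computing phi(K)^T V first reorders the matmuls,
    # so the cost is O(N d^2) instead of O(N^2 d).
    Qp, Kp = phi(Q), phi(K)
    context = Kp.T @ V                              # (d, d)
    norm = Qp @ Kp.sum(axis=0, keepdims=True).T     # (N, 1)
    return (Qp @ context) / norm

rng = np.random.default_rng(0)
N, d = 196, 64                                      # 14x14 patches, head dim 64
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = softmax_attention(Q, K, V)
lin = linear_attention(Q, K, V)
print(out.shape, lin.shape)                         # (196, 64) (196, 64)
```

Doubling N quadruples the work of `softmax_attention` but only doubles that of `linear_attention`, which is why linear variants matter for high-resolution vision inputs.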
1 code implementation • ICCV 2023 • Yizeng Han, Dongchen Han, Zeyu Liu, Yulin Wang, Xuran Pan, Yifan Pu, Chao Deng, Junlan Feng, Shiji Song, Gao Huang
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
1 code implementation • CVPR 2023 • Xuran Pan, Tianzhu Ye, Zhuofan Xia, Shiji Song, Gao Huang
The self-attention mechanism has been a key factor in the recent progress of Vision Transformers (ViT), enabling adaptive feature extraction from global contexts.
no code implementations • 18 Jan 2023 • Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang
During the pre-training stage, we establish the correspondence of images and point clouds based on the readily available RGB-D data and use contrastive learning to align the image and point cloud representations.
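The contrastive alignment step described above can be sketched with a symmetric InfoNCE-style objective (a generic sketch of cross-modal contrastive learning, not the paper's exact loss; batch size, temperature, and feature dimensions are illustrative assumptions):

```python
import numpy as np

def info_nce(img_feats, pc_feats, temperature=0.07):
    # Symmetric InfoNCE: matched image/point-cloud pairs (row i of each
    # matrix come from the same RGB-D frame) are pulled together, while all
    # other pairs in the batch are pushed apart.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    pc = pc_feats / np.linalg.norm(pc_feats, axis=1, keepdims=True)
    logits = img @ pc.T / temperature               # (B, B) similarity matrix
    labels = np.arange(len(logits))                 # positives on the diagonal

    def xent(lg):                                   # row-wise cross-entropy
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))    # image->pc and pc->image

rng = np.random.default_rng(0)
B, d = 8, 128
loss = info_nce(rng.standard_normal((B, d)), rng.standard_normal((B, d)))
print(float(loss) > 0)                              # True
```

Minimizing this loss drives paired image and point-cloud features toward the same region of the shared embedding space, which is what makes the pre-trained representations transferable.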
no code implementations • 17 Oct 2022 • Xuran Pan, Tianzhu Ye, Dongchen Han, Shiji Song, Gao Huang
Recent years have witnessed the fast development of large-scale pre-training frameworks that can extract multi-modal representations in a unified form and achieve promising performance when transferred to downstream tasks.
1 code implementation • 18 Sep 2022 • Xuran Pan, Zihang Lai, Shiji Song, Gao Huang
In this paper, we present a novel learning framework, ActiveNeRF, aiming to model a 3D scene with a constrained input budget.
2 code implementations • CVPR 2022 • Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang
On the one hand, using dense attention, e.g. in ViT, leads to excessive memory and computational cost, and features can be influenced by irrelevant parts beyond the regions of interest.
Ranked #107 on Object Detection on COCO test-dev
2 code implementations • CVPR 2022 • Xuran Pan, Chunjiang Ge, Rui Lu, Shiji Song, Guanfu Chen, Zeyi Huang, Gao Huang
In this paper, we show that there exists a strong underlying relation between them, in the sense that the bulk of the computations in these two paradigms is in fact performed with the same operation.
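One concrete way to see the shared operation: a 1x1 convolution is a per-pixel linear projection, exactly the operation that produces queries, keys, and values in self-attention, and a k x k convolution can be decomposed into k^2 such projections followed by shift-and-sum. A minimal sketch of this decomposition (an illustration of the observation, not the paper's full module):

```python
import numpy as np

def conv1x1(x, w):
    # x: (H, W, Cin), w: (Cin, Cout). A 1x1 convolution is a per-pixel
    # linear projection -- the same operation that computes Q/K/V.
    return x @ w

def conv3x3_via_1x1(x, w3):
    # w3: (3, 3, Cin, Cout). Decompose a 3x3 conv into nine 1x1
    # projections whose outputs are shifted and summed.
    H, W, _ = x.shape
    out = np.zeros((H, W, w3.shape[-1]))
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    for di in range(3):
        for dj in range(3):
            proj = conv1x1(xp, w3[di, dj])          # shared 1x1 projection
            out += proj[di:di + H, dj:dj + W]       # shift and aggregate
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
w3 = rng.standard_normal((3, 3, 4, 6))

# Reference dense 3x3 convolution (cross-correlation) for comparison.
ref = np.zeros((8, 8, 6))
xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
for i in range(8):
    for j in range(8):
        ref[i, j] = np.einsum('abc,abcd->d', xp[i:i+3, j:j+3], w3)

print(np.allclose(conv3x3_via_1x1(x, w3), ref))     # True
```

Since the projection stage dominates the cost in both paradigms, it can be computed once and its outputs reused by both a convolutional and an attentional aggregation path.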
no code implementations • 1 Jan 2021 • Xuran Pan, Shiji Song, Gao Huang
In this paper, we take a step forward to establish a unified framework for convolution-based graph neural networks, by formulating the basic graph convolution operation as an optimization problem in the graph Fourier space.
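A well-known illustrative instance of this view (shown here as a sketch of the general idea, not necessarily the paper's exact formulation) is graph convolution as Laplacian smoothing: for input features $X$ and graph Laplacian $L$, consider

```latex
\min_{Z} \; \lVert Z - X \rVert_F^2 \;+\; \lambda \, \operatorname{tr}\!\left( Z^\top L Z \right),
```

whose closed-form minimizer is $Z^{*} = (I + \lambda L)^{-1} X$; a first-order expansion in $\lambda$ gives $Z \approx (I - \lambda L)X$, the familiar GCN-style propagation. Different choices of the regularizer in the graph Fourier space then recover different convolution-based architectures.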
1 code implementation • CVPR 2021 • Xuran Pan, Zhuofan Xia, Shiji Song, Li Erran Li, Gao Huang
In this paper, we propose Pointformer, a Transformer backbone designed for 3D point clouds to learn features effectively.
1 code implementation • 21 Jul 2020 • Yulin Wang, Gao Huang, Shiji Song, Xuran Pan, Yitong Xia, Cheng Wu
The proposed method is inspired by the intriguing property that deep networks are effective in learning linearized features, i.e., certain directions in the deep feature space correspond to meaningful semantic transformations, e.g., changing the background or view angle of an object.
1 code implementation • NeurIPS 2019 • Yulin Wang, Xuran Pan, Shiji Song, Hong Zhang, Cheng Wu, Gao Huang
Our work is motivated by the intriguing property that deep networks are surprisingly good at linearizing features, such that certain directions in the deep feature space correspond to meaningful semantic transformations, e.g., adding sunglasses or changing backgrounds.
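The feature-space augmentation idea shared by the two entries above can be sketched as follows: translate a deep feature along random directions drawn from a class-conditional covariance, so the augmented features approximate semantically transformed samples without touching pixels. A hedged sketch (the function name, the covariance estimate, and the `strength` parameter are illustrative assumptions, not the papers' exact algorithm):

```python
import numpy as np

def semantic_augment(features, labels, class_cov, strength=0.5, n_aug=1,
                     rng=None):
    # Translate each deep feature along directions sampled from its class's
    # covariance; such directions tend to correspond to semantic changes
    # (background, viewpoint, ...). class_cov[c] is a (d, d) covariance
    # estimated from the deep features of class c.
    rng = rng or np.random.default_rng()
    aug_f, aug_y = [], []
    for f, y in zip(features, labels):
        for _ in range(n_aug):
            direction = rng.multivariate_normal(np.zeros(len(f)),
                                                strength * class_cov[y])
            aug_f.append(f + direction)             # augmented feature
            aug_y.append(y)                         # label is preserved
    return np.array(aug_f), np.array(aug_y)

rng = np.random.default_rng(0)
d, C = 16, 3
class_cov = {c: np.eye(d) * 0.1 for c in range(C)}  # toy isotropic covariances
feats = rng.standard_normal((6, d))
labels = rng.integers(0, C, size=6)
new_f, new_y = semantic_augment(feats, labels, class_cov, n_aug=2, rng=rng)
print(new_f.shape, new_y.shape)                     # (12, 16) (12,)
```

In practice one would train the classifier on the augmented features (or, as in the implicit variant, integrate over the augmentation distribution in closed form rather than sampling).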