1 code implementation • 19 Mar 2024 • Elaine Sui, Xiaohan Wang, Serena Yeung-Levy
Advancements in vision-language models (VLMs) have propelled the field of computer vision, particularly in the zero-shot learning setting.
no code implementations • 15 Mar 2024 • Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences.
Ranked #1 on Zero-Shot Video Question Answer on NExT-QA
1 code implementation • 10 Mar 2024 • Xiaohan Wang, Shengyu Mao, Ningyu Zhang, Shumin Deng, Yunzhi Yao, Yue Shen, Lei Liang, Jinjie Gu, Huajun Chen
Recently, there has been a growing interest in knowledge editing for Large Language Models (LLMs).
1 code implementation • 19 Jan 2024 • Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang
(2) Equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap.
1 code implementation • 5 Dec 2023 • Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy
To aid in this discovery process, we explore the task of automatically describing the differences between two sets of images, which we term Set Difference Captioning.
no code implementations • 29 Nov 2023 • Yuebing Liang, Yichao Liu, Xiaohan Wang, Zhan Zhao
Accurate human mobility prediction for public events is thus crucial for event planning as well as traffic or crowd management.
no code implementations • IEEE Transactions on Multimedia 2023 • Yuanzhi Liang, Linchao Zhu, Xiaohan Wang, Yi Yang
Video captioning is a more challenging task compared to image captioning, primarily due to differences in content density.
Ranked #5 on Video Captioning on VATEX (using extra training data)
1 code implementation • 3 Oct 2023 • Shengyu Mao, Xiaohan Wang, Mengru Wang, Yong Jiang, Pengjun Xie, Fei Huang, Ningyu Zhang
This task seeks to adjust the models' responses to opinion-related questions on specified topics since an individual's personality often manifests in the form of their expressed opinions, thereby showcasing different personality traits.
1 code implementation • 4 Sep 2023 • Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, Yi Yang
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions while preserving motion diversity. Despite the recent significant progress in text-based human motion generation, existing methods often prioritize fitting training motions at the expense of action diversity.
Ranked #3 on Motion Synthesis on HumanML3D (using extra training data)
2 code implementations • 14 Aug 2023 • Peng Wang, Ningyu Zhang, Bozhong Tian, Zekun Xi, Yunzhi Yao, Ziwen Xu, Mengru Wang, Shengyu Mao, Xiaohan Wang, Siyuan Cheng, Kangwei Liu, Yuansheng Ni, Guozhou Zheng, Huajun Chen
Large Language Models (LLMs) usually suffer from knowledge cutoff or fallacy issues, which means they are unaware of unseen events or generate text with incorrect facts owing to outdated/noisy data.
1 code implementation • ICCV 2023 • Rui Liu, Xiaohan Wang, Wenguan Wang, Yi Yang
Vision-language navigation (VLN), which requires an agent to navigate 3D environments following human instructions, has shown great advances.
no code implementations • 9 Aug 2023 • Liping Wang, Jiawei Li, Lifan Zhao, Zhizhuo Kou, Xiaohan Wang, Xinyi Zhu, Hao Wang, Yanyan Shen, Lei Chen
Predicting stock prices presents a challenging research problem due to the inherent volatility and non-linear nature of the stock market.
1 code implementation • ICCV 2023 • Jiahao Li, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, Yi Yang
Our method includes an encoder-decoder transformer architecture to fuse 2D and 3D representations, achieving 2D&3D-aligned results in a coarse-to-fine manner, and a novel 3D joint contrastive learning approach that adds explicit global supervision for the 3D feature space.
1 code implementation • ICCV 2023 • Tuo Feng, Wenguan Wang, Xiaohan Wang, Yi Yang, Qinghua Zheng
The mined patterns are, in turn, used to repaint the embedding space, so as to respect the underlying distribution of the entire training dataset and improve the robustness to the variations.
1 code implementation • 25 Jul 2023 • Haitian Zeng, Xiaohan Wang, Wenguan Wang, Yi Yang
We introduce a novel speaker model KEFA for navigation instruction generation.
1 code implementation • 15 Jun 2023 • Jiayi Shao, Xiaohan Wang, Ruijie Quan, Yi Yang
This report presents ReLER submission to two tracks in the Ego4D Episodic Memory Benchmark in CVPR 2023, including Natural Language Queries and Moment Queries.
Ranked #1 on Moment Queries on Ego4D
no code implementations • 3 Jun 2023 • Xu Zhang, Zhedong Zheng, Xiaohan Wang, Yi Yang
We propose a novel Consensus Network (Css-Net) that self-adaptively learns from noisy triplets to minimize the negative effects of triplet ambiguity.
1 code implementation • 29 May 2023 • Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
Given a single test sample, the VLM is forced to maximize the CLIP reward between the input and sampled results from the VLM output distribution.
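The reward-maximization step described above can be sketched as a REINFORCE-style weighting: score each sampled output by its CLIP similarity to the input, subtract a mean-reward baseline, and use the centered reward as the weight on that sample's log-probability. The snippet below is a minimal illustration with toy embeddings; the function names and interface are assumptions, not the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def reinforce_weights(input_emb, sample_embs):
    """Centered CLIP-style rewards for sampled outputs.

    Samples scoring above the batch-mean reward get positive weight
    (their log-probability would be pushed up); below-mean samples
    get negative weight. Weights sum to zero by construction.
    """
    rewards = [cosine(input_emb, s) for s in sample_embs]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]
```

In the actual method the embeddings would come from a frozen CLIP model; here they are plain lists so the arithmetic is visible.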
1 code implementation • 28 May 2023 • Wenjie Zhuo, Yifan Sun, Xiaohan Wang, Linchao Zhu, Yi Yang
Consequently, using multiple positive samples with enhanced diversity further improves contrastive learning due to better alignment.
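A multi-positive contrastive objective of this kind can be written as the negative log of the positives' share of the total similarity mass. A minimal sketch in pure Python (the function name and temperature value are illustrative, not taken from the paper):

```python
import math

def multi_positive_nce(sims, positive_idx, tau=0.1):
    """InfoNCE generalized to several positives per anchor.

    sims: similarity of the anchor to every candidate in the batch.
    positive_idx: indices of the candidates treated as positives.
    Loss = -log( sum over positives of exp(s/tau) / sum over all ).
    """
    exps = [math.exp(s / tau) for s in sims]
    pos = sum(exps[i] for i in positive_idx)
    return -math.log(pos / sum(exps))
```

When the positives carry high similarity the ratio approaches 1 and the loss approaches 0; diverse, well-aligned positives therefore lower the loss jointly rather than competing with each other.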
no code implementations • ICCV 2023 • Jiayi Shao, Xiaohan Wang, Ruijie Quan, Junjun Zheng, Jiang Yang, Yi Yang
Temporal action localization (TAL), which involves recognizing and locating action instances, is a challenging task in video understanding.
Ranked #9 on Temporal Action Localization on THUMOS’14
1 code implementation • 22 May 2023 • Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, Yi Yang
In this paper, we tackle the problem of sign language translation (SLT) without gloss annotations.
1 code implementation • 22 May 2023 • Yuqi Zhu, Xiaohan Wang, Jing Chen, Shuofei Qiao, Yixin Ou, Yunzhi Yao, Shumin Deng, Huajun Chen, Ningyu Zhang
We engage in experiments across eight diverse datasets, focusing on four representative tasks encompassing entity and relation extraction, event extraction, link prediction, and question-answering, thereby thoroughly exploring LLMs' performance in the domain of construction and inference.
1 code implementation • 15 May 2023 • Xiang Chen, Ningyu Zhang, Jintian Zhang, Xiaohan Wang, Tongtong Wu, Xi Chen, Yongheng Wang, Huajun Chen
Multimodal Knowledge Graph Construction (MKGC) involves creating structured representations of entities and relations using multiple modalities, such as text and images.
2 code implementations • 2 May 2023 • Xin Xu, Yuqi Zhu, Xiaohan Wang, Ningyu Zhang
Scaling language models has revolutionized many NLP tasks, yet few-shot relation extraction with large language models has received little comprehensive exploration.
1 code implementation • CVPR 2023 • Xiaolong Shen, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, Yi Yang
However, it is difficult for a single kind of modeling structure to balance the learning of short-term and long-term temporal correlations; the network may be biased toward one of them, leading to undesirable predictions such as global location shift, temporal inconsistency, and insufficient local detail.
Ranked #46 on 3D Human Pose Estimation on 3DPW
1 code implementation • CVPR 2023 • Xiaohan Wang, Wenguan Wang, Jiayi Shao, Yi Yang
Recently, visual-language navigation (VLN) -- requiring robot agents to follow navigation instructions -- has shown great advances.
1 code implementation • ICCV 2023 • Yuanzhi Liang, Xiaohan Wang, Linchao Zhu, Yi Yang
Experimental results and visualizations, based on a large-scale dataset PartNet-Mobility, show the effectiveness of MAAL in learning multi-modal data and solving the 3D articulated object affordance problem.
no code implementations • CVPR 2023 • Guangrui Li, Guoliang Kang, Xiaohan Wang, Yunchao Wei, Yi Yang
With the help of adversarial training, the masking module can learn to generate source masks to mimic the pattern of irregular target noise, thereby narrowing the domain gap.
5 code implementations • CVPR 2023 • Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition.
Ranked #1 on Zero-Shot Action Recognition on ActivityNet
1 code implementation • 7 Dec 2022 • Zheng Zhang, Qingrui Zhang, Bo Zhu, Xiaohan Wang, Tianjiang Hu
In this paper, a novel algorithm named EASpace (Enhanced Action Space) is proposed, which formulates macro actions in an alternative form to accelerate the learning process using multiple available sub-optimal expert policies.
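The macro-action idea can be illustrated by an augmented action space in which an action is either a primitive or a commitment to one expert policy for a fixed number of steps. The sketch below is a toy illustration of that formulation, not EASpace itself; all names and the environment interface are hypothetical.

```python
def execute_action(state, action, primitives, experts, step_fn):
    """Execute either a primitive action or a macro action.

    A macro action is a tuple (expert_id, k): follow the chosen
    expert policy for k consecutive environment steps, which lets
    the learner reuse sub-optimal expert behavior as one decision.
    """
    if isinstance(action, tuple):            # macro: (expert index, duration)
        expert_id, k = action
        policy = experts[expert_id]
        for _ in range(k):
            state = step_fn(state, policy(state))
        return state
    return step_fn(state, primitives[action])  # ordinary primitive action
```

Because one macro covers many time steps, credit assignment over long horizons is shortened, which is the intuition behind the reported speed-up.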
1 code implementation • IEEE Transactions on Neural Networks and Learning Systems 2022 • Yuanzhi Liang, Linchao Zhu, Xiaohan Wang, Yi Yang
Second, we instantiate the loss function and provide a strong baseline for FGVC, where the performance of a naive backbone can be boosted and be comparable with recent methods.
Ranked #28 on Fine-Grained Image Classification on CUB-200-2011
Fine-Grained Image Classification Fine-Grained Visual Recognition
1 code implementation • 17 Nov 2022 • Jiayi Shao, Xiaohan Wang, Yi Yang
Moreover, in order to better capture the long-term temporal dependencies in the long videos, we propose a segment-level recurrence mechanism.
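Segment-level recurrence can be sketched as chunking the frame sequence and threading a hidden state across chunks, so each segment sees a summary of everything before it. The `cell` interface below stands in for whatever recurrent unit is actually used; it is an assumption for illustration.

```python
def segment_recurrence(frames, seg_len, cell):
    """Process a long video segment by segment.

    The hidden state carried between calls to `cell` summarizes all
    earlier segments, giving each segment access to long-range
    temporal context without attending over the full sequence.
    """
    hidden, outputs = None, []
    for start in range(0, len(frames), seg_len):
        segment = frames[start:start + seg_len]
        hidden, out = cell(segment, hidden)   # hidden summarizes the past
        outputs.append(out)
    return outputs
```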
2 code implementations • 1 Oct 2022 • Xin Xie, Zhoubo Li, Xiaohan Wang, Zekun Xi, Ningyu Zhang
Knowledge Graphs (KGs) often have two characteristics: heterogeneous graph structure and text-rich entity/relation information.
no code implementations • 30 Sep 2022 • Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
In this work, we present a one-stage solution to obtain pre-trained small models without the need for extra teachers, namely, slimmable networks for contrastive self-supervised learning (SlimCLR).
no code implementations • 8 Jul 2022 • Yucheng Suo, Zhedong Zheng, Xiaohan Wang, Bang Zhang, Yi Yang
We optimize the two losses and keypoint detector network in an end-to-end manner.
1 code implementation • 1 Jul 2022 • Naiyuan Liu, Xiaohan Wang, Xiaobo Li, Yi Yang, Yueting Zhuang
In this report, we present the ReLER@ZJU-Alibaba submission to the Ego4D Natural Language Queries (NLQ) Challenge in CVPR 2022.
Ranked #3 on Natural Language Queries on Ego4D
1 code implementation • 2 May 2022 • Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang
In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
Ranked #11 on Video Retrieval on MSVD (using extra training data)
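Clustering tokens to remove redundancy can be illustrated with a plain k-means pass that collapses near-duplicate tokens into their centroid. This toy version (not the paper's multi-segment algorithm) uses a naive first-k initialization and a fixed iteration count:

```python
def cluster_tokens(tokens, k, iters=10):
    """Reduce a token set to k representative tokens via k-means.

    Near-duplicate (redundant) tokens fall into the same cluster and
    are replaced by a single centroid, shrinking the token count
    while keeping the distinct content.
    """
    centers = [list(t) for t in tokens[:k]]        # naive init: first k tokens
    for _ in range(iters):
        groups = [[] for _ in centers]
        for t in tokens:
            # assign each token to its nearest current center
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(t, centers[i])))
            groups[j].append(t)
        # recompute each center as the mean of its assigned tokens
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers
```

A production version would initialize more carefully (e.g. k-means++) and run per segment, but the redundancy-removal mechanism is the same.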
2 code implementations • 22 Mar 2022 • Zongxin Yang, Jiaxu Miao, Yunchao Wei, Wenguan Wang, Xiaohan Wang, Yi Yang
This paper delves into the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS).
no code implementations • 9 Mar 2022 • Zheng Zhang, Xiaohan Wang, Qingrui Zhang, Tianjiang Hu
It is shown by numerical simulations that the proposed hybrid design outperforms the pursuit policies either learned from vanilla reinforcement learning or designed by the potential field method.
no code implementations • 17 Jan 2022 • Xu Chen, Yahong Han, Xiaohan Wang, Yifan Sun, Yi Yang
An effective approach is to select informative content from the holistic video, yielding a popular family of dynamic video recognition methods.
Ranked #42 on Action Recognition on Something-Something V1
1 code implementation • 14 Jan 2022 • Peng Wang, Xin Xie, Xiaohan Wang, Ningyu Zhang
Previous knowledge graph embedding approaches usually map entities to representations and utilize score functions to predict the target entities, yet they typically struggle to reason rare or emerging unseen entities.
Ranked #1 on Link Prediction on FB15k-237-ind
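A classic instance of the score-function approach mentioned above is TransE, which scores a triple by how closely head + relation lands on tail in embedding space; a minimal sketch (TransE is named here as a standard example, not as this paper's model):

```python
import math

def transe_score(h, r, t):
    """TransE-style triple score: negative distance ||h + r - t||.

    A correct triple (head + relation ~ tail) scores near zero;
    implausible triples score more negative, so higher is better.
    """
    return -math.sqrt(sum((hi + ri - ti) ** 2
                          for hi, ri, ti in zip(h, r, t)))
```

Unseen entities are exactly where such fixed embedding tables struggle: a new entity has no learned vector, so no score can be computed for it, which motivates the text-aware approach the paper proposes.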
1 code implementation • CVPR 2022 • Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, Yi Yang
In contrast, our large-scale VIdeo Panoptic Segmentation in the Wild (VIPSeg) dataset provides 3,536 videos and 84,750 frames with pixel-level panoptic annotations, covering a wide range of real-world scenarios and categories.
2 code implementations • CVPR 2022 • Yuanzhi Liang, Linchao Zhu, Xiaohan Wang, Yi Yang
In this paper, we propose an episodic linear probing (ELP) classifier to reflect the generalization of visual representations in an online manner.
Ranked #13 on Fine-Grained Image Classification on CUB-200-2011
1 code implementation • 1 Sep 2021 • Chao Sun, Zhedong Zheng, Xiaohan Wang, Mingliang Xu, Yi Yang
Albeit simple, the pre-trained encoder can capture the key points of an unseen point cloud and surpasses the encoder trained from scratch on downstream tasks.
Ranked #43 on 3D Part Segmentation on ShapeNet-Part
no code implementations • ICCV 2021 • Haitian Zeng, Yuchao Dai, Xin Yu, Xiaohan Wang, Yi Yang
As NRSfM is a highly under-constrained problem, we propose two new pairwise regularization terms to further regularize the reconstruction.
no code implementations • 3 Jun 2021 • Kezhou Lin, Xiaohan Wang, Zhedong Zheng, Linchao Zhu, Yi Yang
Obtaining viewer responses from videos can be useful for creators and streaming platforms to analyze the video performance and improve the future user experience.
1 code implementation • 31 May 2021 • Shuai Bai, Zhedong Zheng, Xiaohan Wang, Junyang Lin, Zhu Zhang, Chang Zhou, Yi Yang, Hongxia Yang
In this paper, we apply one new modality, i.e., the language description, to search the vehicle of interest and explore the potential of this task in the real-world scenario.
1 code implementation • CVPR 2021 • Xiaohan Wang, Linchao Zhu, Yi Yang
Moreover, a global alignment method is proposed to provide a global cross-modal measurement that is complementary to the local perspective.
no code implementations • 13 Jan 2021 • Yu Wu, Linchao Zhu, Xiaohan Wang, Yi Yang, Fei Wu
We further improve ImagineRNN by residual anticipation, i.e., changing its target to predicting the feature difference of adjacent frames instead of the frame content.
no code implementations • ICCV 2021 • Xiaohan Wang, Linchao Zhu, Heng Wang, Yi Yang
To avoid these additional costs, we propose an end-to-end Interactive Prototype Learning (IPL) framework to learn better active object representations by leveraging the motion cues from the actor.
no code implementations • 8 Feb 2020 • Tengyu Ma, Joel Michelson, James Ainooson, Deepayan Sanyal, Xiaohan Wang, Maithilee Kunda
For the problem of 3D object recognition, researchers using deep learning methods have developed several very different input representations, including "multi-view" snapshots taken from discrete viewpoints around an object, as well as "spherical" representations consisting of a dense map of essentially ray-traced samples of the object from all directions.
no code implementations • 8 Feb 2020 • Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang
Due to the large action vocabulary in egocentric video datasets, recent studies usually utilize a two-branch structure for action recognition, i.e., one branch for verb classification and the other branch for noun classification.
Ranked #4 on Egocentric Activity Recognition on EGTEA
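The two-branch design can be sketched by scoring each (verb, noun) action pair as the product of the two branch probabilities; treating the branches as independent at inference time is a common simplification and an assumption of this sketch, not necessarily the paper's fusion rule.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def action_probs(verb_logits, noun_logits):
    """Two-branch egocentric action recognition.

    An action is a (verb, noun) pair; its probability is taken as
    p(verb) * p(noun) from the two branch outputs, so a small verb
    head and a small noun head jointly cover a large action space.
    """
    pv, pn = softmax(verb_logits), softmax(noun_logits)
    return {(i, j): pv[i] * pn[j]
            for i in range(len(pv)) for j in range(len(pn))}
```

With V verbs and N nouns the two heads output V + N scores yet cover V * N actions, which is why this structure suits the large egocentric action vocabulary.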
no code implementations • 22 Jun 2019 • Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang
In this report, we present the Baidu-UTS submission to the EPIC-Kitchens Action Recognition Challenge in CVPR 2019.
no code implementations • 15 Jun 2018 • Xiaohan Wang, Tengyu Ma, James Ainooson, Seunghwan Cha, Xiaotian Wang, Azhar Molla, Maithilee Kunda
In object recognition research, many commonly used datasets (e.g., ImageNet and similar) contain relatively sparse distributions of object instances and views, e.g., one might see a thousand different pictures of a thousand different giraffes, mostly taken from a few conventionally photographed angles.