Search Results for author: Linchao Zhu

Found 75 papers, 36 papers with code

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

no code implementations22 Apr 2024 Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Hao Jiang, Fei Wu, Linchao Zhu

The rapidly developing Large Vision Language Models (LVLMs) have shown notable capabilities on a range of multi-modal tasks, but they still suffer from hallucination, where the generated text does not align with the given context, significantly restricting their use.

EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing

no code implementations24 Mar 2024 Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang

We find that the crux of the issue stems from the imprecise distribution of attention weights across designated regions, including inaccurate text-to-attribute control and attention leakage.

Attribute Video Editing

Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval

no code implementations24 Mar 2024 Yucheng Suo, Fan Ma, Linchao Zhu, Yi Yang

The pseudo-word tokens generated in this stream are explicitly aligned with fine-grained semantics in the text embedding space.

Attribute Image Retrieval +2

Ghost Sentence: A Tool for Everyday Users to Copyright Data from Large Language Models

no code implementations23 Mar 2024 Shuai Zhao, Linchao Zhu, Ruijie Quan, Yi Yang

These concealed passphrases in user documents are referred to as \textit{ghost sentences}; once they are identified in the generated content of LLMs, users can be sure that their data was used for training.

Sentence
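The ghost-sentence idea described above can be illustrated with a minimal sketch: a user plants a unique passphrase in their documents, then checks whether it resurfaces verbatim in LLM output. The wordlist, passphrase, and texts below are hypothetical, and the paper's actual identification procedure may differ.

```python
# Minimal sketch of the ghost-sentence idea: a user embeds a unique,
# low-probability passphrase in their documents; if it later appears
# verbatim in LLM output, the documents were likely used for training.
import secrets

WORDLIST = ["amber", "falcon", "orchid", "quartz", "willow", "zephyr"]

def make_ghost_sentence(num_words: int = 4) -> str:
    """Generate a random passphrase to embed in user documents."""
    return " ".join(secrets.choice(WORDLIST) for _ in range(num_words))

def contains_ghost(generated_text: str, ghost: str) -> bool:
    """Check whether generated text leaks the planted passphrase."""
    return ghost.lower() in generated_text.lower()

ghost = "amber falcon orchid quartz"  # pretend this was hidden in user docs
leaked = "...the model wrote: amber falcon orchid quartz, oddly enough..."
clean = "...an unrelated completion about video retrieval..."

assert contains_ghost(leaked, ghost)
assert not contains_ghost(clean, ghost)
```

A real deployment would need passphrases rare enough that verbatim generation is statistically conclusive evidence of memorization.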

CapHuman: Capture Your Moments in Parallel Universes

1 code implementation1 Feb 2024 Chao Liang, Fan Ma, Linchao Zhu, Yingying Deng, Yi Yang

Moreover, we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner.

Image Generation

DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval

1 code implementation19 Jan 2024 Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang

(2) Equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap.

Retrieval Video Retrieval

AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents

no code implementations12 Jan 2024 Yuanzhi Liang, Linchao Zhu, Yi Yang

To address this challenge, we introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.

Informativeness

FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax

no code implementations27 Nov 2023 Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang

Text-to-video (T2V) generation is a rapidly growing research area that aims to translate the scenes, objects, and actions within complex video text into a sequence of coherent visual frames.

Video Generation

Text Augmented Spatial-aware Zero-shot Referring Image Segmentation

no code implementations27 Oct 2023 Yucheng Suo, Linchao Zhu, Yi Yang

This task aims to identify the instance mask that is most related to a referring expression without training on pixel-level annotations.

Image Segmentation Referring Expression +4

IcoCap: Improving Video Captioning by Compounding Images

no code implementations IEEE Transactions on Multimedia 2023 Yuanzhi Liang, Linchao Zhu, Xiaohan Wang, Yi Yang

Video captioning is a more challenging task compared to image captioning, primarily due to differences in content density.

Ranked #5 on Video Captioning on VATEX (using extra training data)

Image Captioning Video Captioning

DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion

1 code implementation4 Sep 2023 Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, Yi Yang

We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions while preserving motion diversity. Despite the recent significant progress in text-based human motion generation, existing methods often prioritize fitting training motions at the expense of action diversity.

Ranked #3 on Motion Synthesis on HumanML3D (using extra training data)

Language Modelling Motion Synthesis

Tachikuma: Understanding Complex Interactions with Multi-Character and Novel Objects by Large Language Models

1 code implementation24 Jul 2023 Yuanzhi Liang, Linchao Zhu, Yi Yang

MOE challenges models to understand characters' intentions and accurately determine their actions within intricate contexts involving multi-character and novel object interactions.

Co-Learning Meets Stitch-Up for Noisy Multi-label Visual Recognition

1 code implementation3 Jul 2023 Chao Liang, Zongxin Yang, Linchao Zhu, Yi Yang

In real-world scenarios, collected and annotated data often exhibit the characteristics of multiple classes and long-tailed distribution.

Learning with noisy labels Multi-Label Classification +1

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models

1 code implementation29 May 2023 Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang

Given a single test sample, the VLM is forced to maximize the CLIP reward between the input and sampled results from the VLM output distribution.

Image Captioning Image Classification +5
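The reward-maximization step described above can be sketched with toy vectors: score sampled outputs by their similarity to the input embedding (standing in for the CLIP reward) and prefer the highest-reward sample. The embeddings and captions below are synthetic placeholders; a real implementation would use CLIP's image and text encoders and update the VLM with this reward signal.

```python
# Sketch of reward-guided selection at test time: rank sampled outputs
# by cosine similarity to the input embedding (a stand-in for the CLIP
# reward). Embeddings here are toy 3-d vectors, not real CLIP features.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, used as a proxy reward."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

input_emb = np.array([1.0, 0.0, 0.0])          # embedding of the test input
sampled = {
    "caption A": np.array([0.9, 0.1, 0.0]),    # close to the input
    "caption B": np.array([0.0, 1.0, 0.0]),    # unrelated
}

rewards = {text: cosine(input_emb, emb) for text, emb in sampled.items()}
best = max(rewards, key=rewards.get)
assert best == "caption A"
```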

Whitening-based Contrastive Learning of Sentence Embeddings

1 code implementation28 May 2023 Wenjie Zhuo, Yifan Sun, Xiaohan Wang, Linchao Zhu, Yi Yang

Consequently, using multiple positive samples with enhanced diversity further improves contrastive learning due to better alignment.

Contrastive Learning Semantic Textual Similarity +4
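Whitening, the transform the paper builds on, can be sketched in a few lines: map embeddings so they have zero mean and identity covariance, decorrelating the dimensions before contrastive training. The data below is synthetic and the ZCA formulation is one common choice, not necessarily the paper's exact recipe.

```python
# Minimal sketch of whitening sentence embeddings: after the transform,
# the empirical covariance is (approximately) the identity matrix.
import numpy as np

def whiten(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """ZCA-style whitening: zero mean, identity covariance."""
    x = x - x.mean(axis=0)
    cov = x.T @ x / len(x)
    u, s, _ = np.linalg.svd(cov)            # eigendecomposition of a PSD matrix
    w = u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T
    return x @ w

rng = np.random.default_rng(0)
# synthetic "embeddings" with correlated, unequal-variance dimensions
emb = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                            [0.0, 1.0, 0.3],
                                            [0.0, 0.0, 0.7]])
white = whiten(emb)
cov = white.T @ white / len(white)
assert np.allclose(cov, np.eye(3), atol=1e-3)
```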

CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

1 code implementation23 May 2023 Shuai Zhao, Xiaohan Wang, Linchao Zhu, Ruijie Quan, Yi Yang

With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.

Ranked #1 on Scene Text Recognition on WOST (using extra training data)

Language Modelling Scene Text Recognition

Gloss-Free End-to-End Sign Language Translation

1 code implementation22 May 2023 Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, Yi Yang

In this paper, we tackle the problem of sign language translation (SLT) without gloss annotations.

Sign Language Translation Translation

Efficient Multimodal Fusion via Interactive Prompting

no code implementations CVPR 2023 Yaowei Li, Ruijie Quan, Linchao Zhu, Yi Yang

Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.

DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training

1 code implementation6 Mar 2023 Wei Li, Linchao Zhu, Longyin Wen, Yi Yang

This decoder is both data-efficient and computation-efficient: 1) it requires only text data for training, easing the burden of collecting paired data.

Image Captioning Text Generation

Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding

no code implementations22 Jan 2023 Juncheng Li, Siliang Tang, Linchao Zhu, Wenqiao Zhang, Yi Yang, Tat-Seng Chua, Fei Wu, Yueting Zhuang

To systematically benchmark the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG.

Semantic correspondence Sentence

Temporal Perceiving Video-Language Pre-training

no code implementations18 Jan 2023 Fan Ma, Xiaojie Jin, Heng Wang, Jingjia Huang, Linchao Zhu, Jiashi Feng, Yi Yang

Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description, and text localization which matches the subset of texts with the video features.

Contrastive Learning Moment Retrieval +7

PointListNet: Deep Learning on 3D Point Lists

no code implementations CVPR 2023 Hehe Fan, Linchao Zhu, Yi Yang, Mohan Kankanhalli

Deep neural networks on regular 1D lists (e.g., natural languages) and irregular 3D sets (e.g., point clouds) have achieved tremendous success.

Discriminative Radial Domain Adaptation

1 code implementation1 Jan 2023 Zenan Huang, Jun Wen, Siheng Chen, Linchao Zhu, Nenggan Zheng

Domain adaptation methods reduce domain shift typically by learning domain-invariant features.

Domain Generalization Unsupervised Domain Adaptation

MAAL: Multimodality-Aware Autoencoder-Based Affordance Learning for 3D Articulated Objects

1 code implementation ICCV 2023 Yuanzhi Liang, Xiaohan Wang, Linchao Zhu, Yi Yang

Experimental results and visualizations, based on a large-scale dataset PartNet-Mobility, show the effectiveness of MAAL in learning multi-modal data and solving the 3D articulated object affordance problem.

Object

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

1 code implementation CVPR 2023 Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou

To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must.

Question Answering Video Question Answering +2

Slimmable Networks for Contrastive Self-supervised Learning

no code implementations30 Sep 2022 Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang

In this work, we present a one-stage solution to obtain pre-trained small models without the need for extra teachers, namely, slimmable networks for contrastive self-supervised learning (\emph{SlimCLR}).

Contrastive Learning Knowledge Distillation +1

AFE-CNN: 3D Skeleton-based Action Recognition with Action Feature Enhancement

no code implementations6 Aug 2022 Shannan Guan, Haiyan Lu, Linchao Zhu, Gengfa Fang

Existing 3D skeleton-based action recognition approaches reach impressive performance by encoding handcrafted action features to image format and decoding by CNNs.

Action Recognition Skeleton Based Action Recognition

Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos

1 code implementation3 Aug 2022 Juncheng Li, Junlin Xie, Linchao Zhu, Long Qian, Siliang Tang, Wenqiao Zhang, Haochen Shi, Shengyu Zhang, Longhui Wei, Qi Tian, Yueting Zhuang

In this paper, we introduce a new task, named Temporal Emotion Localization in videos (TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles.

Emotion Classification Temporal Action Localization +1

CenterCLIP: Token Clustering for Efficient Text-Video Retrieval

1 code implementation2 May 2022 Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang

In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.

Ranked #11 on Video Retrieval on MSVD (using extra training data)

Clustering Retrieval +1
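The token-reduction idea above can be sketched with a toy clustering pass: group redundant token vectors and keep one representative (the medoid) per cluster, dropping the rest. Plain k-means over synthetic vectors stands in here; CenterCLIP's actual multi-segment algorithm operates on transformer patch tokens per video segment.

```python
# Toy sketch of token clustering: cluster token vectors and keep only
# the medoid (the real token nearest each cluster center), discarding
# the redundant ones.
import numpy as np

def keep_representatives(tokens: np.ndarray, k: int, iters: int = 10) -> np.ndarray:
    """Cluster token vectors with plain k-means and return the k medoids."""
    # deterministic init: evenly spaced tokens as starting centers
    centers = tokens[:: max(1, len(tokens) // k)][:k]
    for _ in range(iters):
        # assign each token to its nearest center
        d = np.linalg.norm(tokens[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        # recompute centers, keeping the old one if a cluster empties
        centers = np.stack([
            tokens[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
    # medoid: the actual token closest to each final center
    d = np.linalg.norm(tokens[:, None] - centers[None], axis=-1)
    return tokens[d.argmin(axis=0)]

# two groups of redundant tokens; one representative survives per group
tokens = np.vstack([np.zeros((5, 4)), np.ones((5, 4))])
reps = keep_representatives(tokens, k=2)
assert reps.shape == (2, 4)
```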

Unified Transformer Tracker for Object Tracking

1 code implementation CVPR 2022 Fan Ma, Mike Zheng Shou, Linchao Zhu, Haoqi Fan, Yilei Xu, Yi Yang, Zhicheng Yan

Although UniTrack \cite{wang2021different} demonstrates that a shared appearance model with multiple heads can be used to tackle individual tracking tasks, it fails to exploit the large-scale tracking datasets for training and performs poorly on single object tracking.

Multiple Object Tracking Object

Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning

1 code implementation CVPR 2022 Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, Xin Eric Wang

To systematically measure the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG.

Semantic correspondence Sentence

Complex Video Action Reasoning via Learnable Markov Logic Network

no code implementations CVPR 2022 Yang Jin, Linchao Zhu, Yadong Mu

The main contributions of this work are two-fold: 1) Different from existing black-box models, the proposed model simultaneously implements the localization of temporal boundaries and the recognition of action categories by grounding the logical rules of MLN in videos.

Action Recognition Human-Object Interaction Detection +1

SEEG: Semantic Energized Co-Speech Gesture Generation

1 code implementation CVPR 2022 Yuanzhi Liang, Qianyu Feng, Linchao Zhu, Li Hu, Pan Pan, Yi Yang

Talking gesture generation is a practical yet challenging task which aims to synthesize gestures in line with speech.

Gesture Generation

Vector-Decomposed Disentanglement for Domain-Invariant Object Detection

1 code implementation ICCV 2021 Aming Wu, Rui Liu, Yahong Han, Linchao Zhu, Yi Yang

Secondly, domain-specific representations are introduced as the differences between the input and domain-invariant representations.

Disentanglement Object +2

Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference

no code implementations ICCV 2021 Juncheng Li, Siliang Tang, Linchao Zhu, Haochen Shi, Xuanwen Huang, Fei Wu, Yi Yang, Yueting Zhuang

Secondly, we introduce semantic coherence learning to explicitly encourage the semantic coherence of the adaptive hierarchical graph network from three hierarchies.

Less is More: Sparse Sampling for Dense Reaction Predictions

no code implementations3 Jun 2021 Kezhou Lin, Xiaohan Wang, Zhedong Zheng, Linchao Zhu, Yi Yang

Obtaining viewer responses from videos can be useful for creators and streaming platforms to analyze the video performance and improve the future user experience.

OR-Net: Pointwise Relational Inference for Data Completion under Partial Observation

no code implementations2 May 2021 Qianyu Feng, Linchao Zhu, Bang Zhang, Pan Pan, Yi Yang

Specifically, we expect to approximate the real joint distribution over the partial observation and latent variables, and thus infer the unseen targets.

T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

1 code implementation CVPR 2021 Xiaohan Wang, Linchao Zhu, Yi Yang

Moreover, a global alignment method is proposed to provide a global cross-modal measurement that is complementary to the local perspective.

Retrieval Video Retrieval

Universal-Prototype Enhancing for Few-Shot Object Detection

1 code implementation ICCV 2021 Aming Wu, Yahong Han, Linchao Zhu, Yi Yang

Thus, we develop a new framework of few-shot object detection with universal prototypes (FSOD^up) that owns the merit of feature generalization towards novel objects.

Few-Shot Object Detection Meta-Learning +3

Learning to Anticipate Egocentric Actions by Imagination

no code implementations13 Jan 2021 Yu Wu, Linchao Zhu, Xiaohan Wang, Yi Yang, Fei Wu

We further improve ImagineRNN by residual anticipation, i.e., changing its target to predicting the feature difference of adjacent frames instead of the frame content.

Action Anticipation Autonomous Driving +1

Asynchronous Modeling: A Dual-phase Perspective for Long-Tailed Recognition

no code implementations1 Jan 2021 Hu Zhang, Linchao Zhu, Yi Yang

Motivated by such phenomenon, we propose to disentangle the distinctive effects of data-rich and data-poor gradient and asynchronously train a model via a dual-phase learning process.

Classification General Classification +1

Interactive Prototype Learning for Egocentric Action Recognition

no code implementations ICCV 2021 Xiaohan Wang, Linchao Zhu, Heng Wang, Yi Yang

To avoid these additional costs, we propose an end-to-end Interactive Prototype Learning (IPL) framework to learn better active object representations by leveraging the motion cues from the actor.

Action Recognition Object +1

A Multi-Mode Modulator for Multi-Domain Few-Shot Classification

1 code implementation ICCV 2021 Yanbin Liu, Juho Lee, Linchao Zhu, Ling Chen, Humphrey Shi, Yi Yang

Most existing few-shot classification methods only consider generalization on one dataset (i.e., single-domain), failing to transfer across various seen and unseen domains.

Classification Domain Generalization

Feature-Robust Optimal Transport for High-Dimensional Data

no code implementations1 Jan 2021 Mathis Petrovich, Chao Liang, Ryoma Sato, Yanbin Liu, Yao-Hung Hubert Tsai, Linchao Zhu, Yi Yang, Ruslan Salakhutdinov, Makoto Yamada

To show the effectiveness of FROT, we propose using the FROT algorithm for the layer selection problem in deep neural networks for semantic correspondence.

feature selection Semantic correspondence +1

SemGloVe: Semantic Co-occurrences for GloVe from BERT

3 code implementations30 Dec 2020 Leilei Gan, Zhiyang Teng, Yue Zhang, Linchao Zhu, Fei Wu, Yi Yang

In this paper, we propose SemGloVe, which distills semantic co-occurrences from BERT into static GloVe word embeddings.

Language Modelling Word Embeddings +1

ActBERT: Learning Global-Local Video-Text Representations

1 code implementation CVPR 2020 Linchao Zhu, Yi Yang

In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data.

Action Segmentation Question Answering +5

Feature Robust Optimal Transport for High-dimensional Data

1 code implementation25 May 2020 Mathis Petrovich, Chao Liang, Ryoma Sato, Yanbin Liu, Yao-Hung Hubert Tsai, Linchao Zhu, Yi Yang, Ruslan Salakhutdinov, Makoto Yamada

To show the effectiveness of FROT, we propose using the FROT algorithm for the layer selection problem in deep neural networks for semantic correspondence.

feature selection Semantic correspondence +1

Motion-Excited Sampler: Video Adversarial Attack with Sparked Prior

1 code implementation ECCV 2020 Hu Zhang, Linchao Zhu, Yi Zhu, Yi Yang

Most previous work on adversarial attacks focuses on image models, while the vulnerability of video models remains less explored.

Adversarial Attack Video Classification

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

no code implementations8 Feb 2020 Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang

Due to the large action vocabulary in egocentric video datasets, recent studies usually utilize a two-branch structure for action recognition, i.e., one branch for verb classification and the other for noun classification.

Action Recognition Egocentric Activity Recognition +5

Connective Cognition Network for Directional Visual Commonsense Reasoning

1 code implementation NeurIPS 2019 Aming Wu, Linchao Zhu, Yahong Han, Yi Yang

Inspired by this idea, towards VCR, we propose a connective cognition network (CCN) to dynamically reorganize the visual neuron connectivity that is contextualized by the meaning of questions and answers.

Sentence Visual Commonsense Reasoning

Instance-Invariant Domain Adaptive Object Detection via Progressive Disentanglement

no code implementations20 Nov 2019 Aming Wu, Yahong Han, Linchao Zhu, Yi Yang

Most state-of-the-art methods of object detection suffer from poor generalization ability when the training and test data are from different domains, e.g., with different styles.

Disentanglement Object +2

Gated Channel Transformation for Visual Recognition

3 code implementations CVPR 2020 Zongxin Yang, Linchao Zhu, Yu Wu, Yi Yang

This lightweight layer incorporates a simple l2 normalization, making our transformation unit applicable at the operator level without much increase in parameters.

General Classification Image Classification +5

Learning to Transfer Learn: Reinforcement Learning-Based Selection for Adaptive Transfer Learning

no code implementations ECCV 2020 Linchao Zhu, Sercan O. Arik, Yi Yang, Tomas Pfister

We propose a novel adaptive transfer learning framework, learning to transfer learn (L2TL), to improve performance on a target dataset by careful extraction of the related information from a source dataset.

reinforcement-learning Reinforcement Learning (RL) +1

Baidu-UTS Submission to the EPIC-Kitchens Action Recognition Challenge 2019

no code implementations22 Jun 2019 Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang

In this report, we present the Baidu-UTS submission to the EPIC-Kitchens Action Recognition Challenge in CVPR 2019.

Action Recognition Object +2

FASTER Recurrent Networks for Efficient Video Classification

no code implementations10 Jun 2019 Linchao Zhu, Laura Sevilla-Lara, Du Tran, Matt Feiszli, Yi Yang, Heng Wang

FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities.

Action Classification Action Recognition +3

Cubic LSTMs for Video Prediction

no code implementations20 Apr 2019 Hehe Fan, Linchao Zhu, Yi Yang

Predicting future frames in videos has become a promising direction of research for both computer vision and robot learning communities.

motion prediction Video Prediction

Sim-Real Joint Reinforcement Transfer for 3D Indoor Navigation

no code implementations CVPR 2019 Fengda Zhu, Linchao Zhu, Yi Yang

Specifically, our method employs an adversarial feature adaptation model for visual representation transfer and a policy mimic strategy for policy behavior imitation.

Filter Pruning by Switching to Neighboring CNNs with Good Attributes

no code implementations8 Apr 2019 Yang He, Ping Liu, Linchao Zhu, Yi Yang

In addition, when evaluating the filter importance, only the magnitude information of the filters is considered.

Attribute Image Classification

Compound Memory Networks for Few-shot Video Classification

no code implementations ECCV 2018 Linchao Zhu, Yi Yang

In this paper, we propose a new memory network structure for few-shot video classification by making the following contributions.

Classification General Classification +1

Decoupled Novel Object Captioner

1 code implementation11 Apr 2018 Yu Wu, Linchao Zhu, Lu Jiang, Yi Yang

Thus, the sequence model can be decoupled from the novel object descriptions.

Image Captioning Novel Concepts +2

UTS submission to Google YouTube-8M Challenge 2017

1 code implementation13 Jul 2017 Linchao Zhu, Yanbin Liu, Yi Yang

In this paper, we present our solution to Google YouTube-8M Video Classification Challenge 2017.

Classification General Classification +1

Few-Shot Object Recognition from Machine-Labeled Web Images

no code implementations CVPR 2017 Zhongwen Xu, Linchao Zhu, Yi Yang

Then, we demonstrate that with our model, machine-labeled image annotations are very effective and abundant resources to perform object recognition on novel categories.

Few-Shot Learning Object +1

Uncovering Temporal Context for Video Question and Answering

no code implementations15 Nov 2015 Linchao Zhu, Zhongwen Xu, Yi Yang, Alexander G. Hauptmann

In this work, we introduce Video Question Answering in temporal domain to infer the past, describe the present and predict the future.

Multiple-choice Question Answering +1
