Search Results for author: Yehao Li

Found 31 papers, 12 papers with code

SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

no code implementations · 25 Mar 2024 · Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, Chang Wen Chen

Despite this progress, the mask strategy still suffers from two inherent limitations: (a) a training-inference discrepancy and (b) fuzzy relations between the mask reconstruction and generative diffusion processes, resulting in sub-optimal training of DiT.

Image Generation

HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs

no code implementations · 18 Mar 2024 · Ting Yao, Yehao Li, Yingwei Pan, Tao Mei

Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT) that upgrades the prevalent four-stage ViT to a five-stage ViT tailored to high-resolution inputs.

Control3D: Towards Controllable Text-to-3D Generation

no code implementations · 9 Nov 2023 · Yang Chen, Yingwei Pan, Yehao Li, Ting Yao, Tao Mei

In particular, a 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of a 3D scene parameterized as a NeRF, encouraging each view of the 3D scene to align with the given text prompt and hand-drawn sketch.

3D Generation · Text to 3D

HGNet: Learning Hierarchical Geometry From Points, Edges, and Surfaces

no code implementations · CVPR 2023 · Ting Yao, Yehao Li, Yingwei Pan, Tao Mei

Next, as every two neighbor edges compose a surface, we obtain the edge-level representation of each anchor edge via surface-to-edge aggregation over all neighbor surfaces.

3D Object Classification · Semantic Segmentation
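The surface-to-edge aggregation described in the snippet can be pictured as a symmetric pooling over per-surface features. The array shapes and the choice of max-pooling below are illustrative assumptions, not the paper's exact operator:

```python
import numpy as np

def surface_to_edge(nbr_surface_feats):
    """Illustrative surface-to-edge aggregation: each anchor edge pools
    the features of the surfaces formed by its neighboring edge pairs.

    nbr_surface_feats: (E, S, D) array -- for each of E anchor edges,
    features of its S neighbor surfaces, each D-dimensional.
    Returns an (E, D) edge-level representation.
    """
    # Max-pooling is a common order-invariant aggregator in point-cloud nets.
    return nbr_surface_feats.max(axis=1)
```

Because the pooling is order-invariant, the result does not depend on how the neighbor surfaces are enumerated.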

Semantic-Conditional Diffusion Networks for Image Captioning

1 code implementation · CVPR 2023 · Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Jianlin Feng, Hongyang Chao, Tao Mei

The rich semantics are further regarded as a semantic prior that triggers the learning of the Diffusion Transformer, which produces the output sentence through a diffusion process.

Cross-Modal Retrieval · Image Captioning +3

SPE-Net: Boosting Point Cloud Analysis via Rotation Robustness Enhancement

1 code implementation · 15 Nov 2022 · Zhaofan Qiu, Yehao Li, Yu Wang, Yingwei Pan, Ting Yao, Tao Mei

In this paper, we propose a novel deep architecture tailored for 3D point cloud applications, named as SPE-Net.

Dual Vision Transformer

1 code implementation · 11 Jul 2022 · Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei

Dual-ViT is thereby able to reduce computational complexity without significantly compromising accuracy.

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

2 code implementations · 11 Jul 2022 · Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, Tao Mei

Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that formulates invertible down-sampling with wavelet transforms and self-attention learning in a unified way.

Image Classification · Instance Segmentation +4
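The "invertible down-sampling with wavelet transforms" mentioned in the snippet can be illustrated with a one-level 2D Haar transform: it halves spatial resolution yet loses no information, since the four sub-bands reconstruct the input exactly. This is a generic wavelet sketch, not the Wave-ViT implementation:

```python
import numpy as np

def haar_dwt2(x):
    """One level of a 2D Haar wavelet transform.

    Splits an (H, W) map into four (H/2, W/2) sub-bands: an
    approximation (LL) plus horizontal/vertical/diagonal details.
    Together the sub-bands form a lossless 2x down-sampling.
    """
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse transform: reconstructs the original map exactly."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x
```

Invertibility is what distinguishes this from strided pooling: no detail is discarded when resolution is reduced.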

Comprehending and Ordering Semantics for Image Captioning

1 code implementation · CVPR 2022 · Yehao Li, Yingwei Pan, Ting Yao, Tao Mei

In this paper, we propose a new recipe of Transformer-style structure, namely Comprehending and Ordering Semantics Networks (COS-Net), that unifies an enriched semantic comprehending process and a learnable semantic ordering process in a single architecture.

Cross-Modal Retrieval · Image Captioning +2

Silver-Bullet-3D at ManiSkill 2021: Learning-from-Demonstrations and Heuristic Rule-based Methods for Object Manipulation

1 code implementation · 13 Jun 2022 · Yingwei Pan, Yehao Li, Yiheng Zhang, Qi Cai, Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei

This paper presents an overview and comparative analysis of our systems designed for the following two tracks in the SAPIEN ManiSkill Challenge 2021. No Interaction Track: this track targets learning policies from pre-collected demonstration trajectories.

Imitation Learning

Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

no code implementations · 11 Jan 2022 · Yehao Li, Jiahao Fan, Yingwei Pan, Ting Yao, Weiyao Lin, Tao Mei

Vision-language pre-training has been an emerging and fast-developing research topic that transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks.

Image Captioning · Language Modelling +3

CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

no code implementations · 14 Dec 2021 · Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Hongyang Chao, Tao Mei

BERT-type structure has led to the revolution of vision-language pre-training and the achievement of state-of-the-art results on numerous vision-language downstream tasks.

Cross-Modal Retrieval · Denoising +6

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

2 code implementations · 18 Aug 2021 · Yehao Li, Yingwei Pan, Jingwen Chen, Ting Yao, Tao Mei

Nevertheless, there has not been an open-source codebase that supports training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.

Cross-Modal Retrieval · Image Captioning +5

Contextual Transformer Networks for Visual Recognition

7 code implementations · 26 Jul 2021 · Yehao Li, Ting Yao, Yingwei Pan, Tao Mei

Such design fully capitalizes on the contextual information among input keys to guide the learning of dynamic attention matrix and thus strengthens the capacity of visual representation.

Image Classification · Instance Segmentation +3

Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network

1 code implementation · 27 Jan 2021 · Yehao Li, Yingwei Pan, Ting Yao, Jingwen Chen, Tao Mei

Despite having impressive vision-language (VL) pretraining with BERT-based encoder for VL understanding, the pretraining of a universal encoder-decoder for both VL understanding and generation remains challenging.

Pre-training for Video Captioning Challenge 2020 Summary

no code implementations · 27 Jul 2020 · Yingwei Pan, Jun Xu, Yehao Li, Ting Yao, Tao Mei

The Pre-training for Video Captioning Challenge 2020 Summary: results and challenge participants' technical reports.

Video Captioning

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training

no code implementations · 5 Jul 2020 · Yingwei Pan, Yehao Li, Jianjie Luo, Jun Xu, Ting Yao, Tao Mei

In this work, we present Auto-captions on GIF, which is a new large-scale pre-training dataset for generic video understanding.

Question Answering · Sentence +3

Exploring Category-Agnostic Clusters for Open-Set Domain Adaptation

no code implementations · CVPR 2020 · Yingwei Pan, Ting Yao, Yehao Li, Chong-Wah Ngo, Tao Mei

A clustering branch is capitalized on to ensure that the learnt representation preserves such underlying structure by matching the estimated assignment distribution over clusters to the inherent cluster distribution for each target sample.

Clustering · Unsupervised Domain Adaptation

X-Linear Attention Networks for Image Captioning

2 code implementations · CVPR 2020 · Yingwei Pan, Ting Yao, Yehao Li, Tao Mei

Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2nd-order interactions across multi-modal inputs.

Fine-Grained Visual Recognition · Image Captioning +3
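The "2nd-order interactions" that Bilinear Pooling models can be shown with a minimal sketch: the outer product of two feature vectors enumerates every pairwise interaction between their components. The toy features below are hypothetical, and this is the classic pooling operation rather than the X-Linear attention block itself:

```python
import numpy as np

def bilinear_pool(q, k):
    """Classic bilinear pooling: the outer product q k^T captures every
    pairwise (2nd-order) interaction between two feature vectors,
    flattened into one joint representation of length len(q) * len(k)."""
    return np.outer(q, k).reshape(-1)

# Hypothetical toy features standing in for two modality embeddings.
q = np.array([1.0, 2.0])
k = np.array([3.0, 4.0, 5.0])
v = bilinear_pool(q, k)  # length 2 * 3 = 6
```

The quadratic size of the output is why practical systems compress or factorize this representation.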

Multi-Source Domain Adaptation and Semi-Supervised Domain Adaptation with Focus on Visual Domain Adaptation Challenge 2019

2 code implementations · 8 Oct 2019 · Yingwei Pan, Yehao Li, Qi Cai, Yang Chen, Ting Yao

Semi-Supervised Domain Adaptation: For this task, we adopt a standard self-learning framework to construct a classifier based on the labeled source and target data, and generate the pseudo labels for unlabeled target data.

Domain Adaptation · Self-Learning +1
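The "standard self-learning framework" in the snippet boils down to fitting a classifier on labeled data and keeping only confident predictions on unlabeled target data as pseudo labels. The nearest-class-mean classifier and confidence threshold below are illustrative stand-ins, not the system's actual model:

```python
import numpy as np

def pseudo_label(x_lab, y_lab, x_unlab, threshold=0.8):
    """One round of a generic self-training loop (illustrative only):
    fit a nearest-class-mean classifier on the labeled data, then keep
    unlabeled points whose softmax confidence clears a threshold."""
    classes = np.unique(y_lab)
    means = np.stack([x_lab[y_lab == c].mean(axis=0) for c in classes])
    # Negative squared distances to class means act as logits.
    d = ((x_unlab[:, None, :] - means[None]) ** 2).sum(-1)
    logits = -d
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    conf, idx = p.max(axis=1), p.argmax(axis=1)
    keep = conf >= threshold
    return x_unlab[keep], classes[idx[keep]]
```

In a full pipeline the kept (sample, pseudo label) pairs are added to the training set and the classifier is refit for the next round.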

Hierarchy Parsing for Image Captioning

no code implementations · ICCV 2019 · Ting Yao, Yingwei Pan, Yehao Li, Tao Mei

It is widely believed that parsing an image into its constituent visual patterns is helpful for understanding and representing the image.

Image Captioning

Deep Metric Learning with Density Adaptivity

no code implementations · 9 Sep 2019 · Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, Tao Mei

The problem of distance metric learning is mostly considered from the perspective of learning an embedding space, where the distances between pairs of examples are in correspondence with a similarity metric.

Metric Learning

Trimmed Action Recognition, Dense-Captioning Events in Videos, and Spatio-temporal Action Localization with Focus on ActivityNet Challenge 2019

no code implementations · 14 Jun 2019 · Zhaofan Qiu, Dong Li, Yehao Li, Qi Cai, Yingwei Pan, Ting Yao

This notebook paper presents an overview and comparative analysis of our systems designed for the following three tasks in ActivityNet Challenge 2019: trimmed action recognition, dense-captioning events in videos, and spatio-temporal action localization.

Action Recognition · Dense Captioning +2

Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning

1 code implementation · 3 May 2019 · Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Hongyang Chao, Tao Mei

Moreover, the inherent recurrent dependency in RNNs prevents parallelization within a sequence during training and therefore limits computational efficiency.

Sentence · Video Captioning

Pointing Novel Objects in Image Captioning

no code implementations · CVPR 2019 · Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, Tao Mei

Image captioning has received significant attention, with remarkable improvements from recent advances.

Image Captioning · Object +2

Transferrable Prototypical Networks for Unsupervised Domain Adaptation

no code implementations · CVPR 2019 · Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, Tao Mei

Specifically, we present Transferrable Prototypical Networks (TPN) for adaptation such that the prototypes for each class in source and target domains are close in the embedding space and the score distributions predicted by prototypes separately on source and target data are similar.

Pseudo Label · Unsupervised Domain Adaptation
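The prototype idea in the snippet is concrete enough to sketch: a class prototype is the mean embedding of that class's samples, and classification assigns each embedding to its nearest prototype. This is the generic prototypical-network recipe, not TPN's full transfer objective:

```python
import numpy as np

def prototypes(emb, labels):
    """Class prototypes as per-class mean vectors in the embedding space."""
    classes = np.unique(labels)
    protos = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def nearest_prototype(x, classes, protos):
    """Assign each embedding to the class of its nearest prototype
    (squared Euclidean distance)."""
    d = ((x[:, None, :] - protos[None]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]
```

TPN's adaptation objective then pushes the source-domain and target-domain prototypes of each class toward each other so that this nearest-prototype rule transfers across domains.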

Exploring Visual Relationship for Image Captioning

no code implementations · ECCV 2018 · Ting Yao, Yingwei Pan, Yehao Li, Tao Mei

Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections.

Image Captioning · Sentence

Boosting Image Captioning with Attributes

no code implementations · ICCV 2017 · Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, Tao Mei

Automatically describing an image with a natural language has been an emerging challenge in both fields of computer vision and natural language processing.

Image Captioning
