Search Results for author: Hongfa Wang

Found 19 papers, 11 papers with code

Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts

2 code implementations • 13 Mar 2024 • Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Chenyang Qi, Chengfei Cai, Xiu Li, Zhifeng Li, Heung-Yeung Shum, Wei Liu, Qifeng Chen

Despite recent advances in image-to-video generation, better controllability and local animation are less explored.

Image Animation Image to Video Generation

735

Paper
Code

Inverse-like Antagonistic Scene Text Spotting via Reading-Order Estimation and Dynamic Sampling

no code implementations • 8 Jan 2024 • Shi-Xue Zhang, Chun Yang, Xiaobin Zhu, Hongyang Zhou, Hongfa Wang, Xu-Cheng Yin

Specifically, we propose an innovative reading-order estimation module (REM) that extracts reading-order information from the initial text boundary generated by an initial boundary module (IBM).

Text Detection Text Spotting

Paper
Add Code

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

4 code implementations • 3 Oct 2023 • Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Hongfa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan

We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M.

Ranked #1 on Zero-shot Audio Classification on VGG-Sound (using extra training data)

Audio Classification Contrastive Learning +11

2,462

Paper
Code

Global and Local Semantic Completion Learning for Vision-Language Pre-training

1 code implementation • 12 Jun 2023 • Rong-Cheng Tu, Yatai Ji, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu

MGSC promotes learning more representative global features, which have a great impact on the performance of downstream tasks, while MLTC reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data.

Language Modelling Masked Language Modeling +5

Paper
Code

Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders

no code implementations • 25 Apr 2023 • Heng Pan, Chenyang Liu, Wenxiao Wang, Li Yuan, Hongfa Wang, Zhifeng Li, Wei Liu

To study which type of deep features is appropriate for MIM as a learning target, we propose a simple MIM framework with serials of well-trained self-supervised models to convert an Image to a feature Vector as the learning target of MIM, where the feature extractor is also known as a teacher model.

Attribute Vocal Bursts Intensity Prediction

Paper
Add Code

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

1 code implementation • CVPR 2023 • Yatai Ji, RongCheng Tu, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu

Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities.

Ranked #8 on Zero-Shot Video Retrieval on LSMDC

Language Modelling Masked Language Modeling +6

Paper
Code

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

1 code implementation • CVPR 2023 • Yatai Ji, Junjie Wang, Yuan Gong, Lin Zhang, Yanru Zhu, Hongfa Wang, Jiaxing Zhang, Tetsuya Sakai, Yujiu Yang

Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets.

Contrastive Learning Image-text matching +9

Paper
Code

Unsupervised Hashing with Semantic Concept Mining

1 code implementation • 23 Sep 2022 • Rong-Cheng Tu, Xian-Ling Mao, Kevin Qinghong Lin, Chengfei Cai, Weize Qin, Hongfa Wang, Wei Wei, Heyan Huang

Recently, to improve the unsupervised image retrieval performance, plenty of unsupervised hashing methods have been proposed by designing a semantic similarity matrix, which is based on the similarities between image features extracted by a pre-trained CNN model.

Image Retrieval Prompt Engineering +4

Paper
Code

Adaptive Perception Transformer for Temporal Action Localization

no code implementations • 25 Aug 2022 • Yizheng Ouyang, Tianjin Zhang, Weibo Gu, Hongfa Wang

Besides, their multi-stage designs cannot generate action boundaries and categories straightforwardly.

Temporal Action Localization

Paper
Add Code

DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection

no code implementations • 21 Aug 2022 • Jingyu Lin, Jie Jiang, Yan Yan, Chunchao Guo, Hongfa Wang, Wei Liu, Hanzi Wang

We further propose a parallel design that integrates the convolutional network with a powerful self-attention mechanism to provide complementary clues between the attention path and convolutional path.

Scene Text Detection Text Detection

Paper
Add Code

Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization

no code implementations • 15 Jul 2022 • Mengyin Liu, Chao Zhu, Hongyu Gao, Weibo Gu, Hongfa Wang, Wei Liu, Xu-Cheng Yin

2) Secondly, a text-guided information range minimization method is proposed to adaptively encode descriptive parts of each modality into an identical space with a powerful pretrained linguistic model.

Attribute Attribute Value Extraction +2

Paper
Add Code

Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022

1 code implementation • 4 Jul 2022 • Kevin Qinghong Lin, Alex Jinpeng Wang, Rui Yan, Eric Zhongcong Xu, RongCheng Tu, Yanru Zhu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Wei Liu, Mike Zheng Shou

In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge.

Language Modelling Multi-Instance Retrieval +1

205

Paper
Code

Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

1 code implementation • 4 Jul 2022 • Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, RongCheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR).

Language Modelling Object State Change Classification

205

Paper
Code

Egocentric Video-Language Pretraining

2 code implementations • 3 Jun 2022 • Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, RongCheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention.

Ranked #2 on Video Summarization on Query-Focused Video Summarization Dataset

Action Recognition Contrastive Learning +11

205

Paper
Code

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

no code implementations • 7 Apr 2022 • Jie Jiang, Shaobo Min, Weijie Kong, Dihong Gong, Hongfa Wang, Zhifeng Li, Wei Liu

With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i. e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities.

Ranked #1 on Video Retrieval on MSR-VTT-1kA (using extra training data)

Contrastive Learning Denoising +4

Paper
Add Code

Deep Unsupervised Hashing with Latent Semantic Components

no code implementations • 17 Mar 2022 • Qinghong Lin, Xiaojun Chen, Qin Zhang, Shaotian Cai, Wenzhe Zhao, Hongfa Wang

Firstly, DSCH constructs a semantic component structure by uncovering the fine-grained semantics components of images with a Gaussian Mixture Modal~(GMM), where an image is represented as a mixture of multiple components, and the semantics co-occurrence are exploited.

Common Sense Reasoning Image Retrieval +1

Paper
Add Code

Adaptive Boundary Proposal Network for Arbitrary Shape Text Detection

1 code implementation • ICCV 2021 • Shi-Xue Zhang, Xiaobin Zhu, Chun Yang, Hongfa Wang, Xu-Cheng Yin

In this work, we propose a novel adaptive boundary proposal network for arbitrary shape text detection, which can learn to directly produce accurate boundary for arbitrary shape text without any post-processing.

Decoder Text Detection

110

Paper
Code

Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection

2 code implementations • CVPR 2020 • Shi-Xue Zhang, Xiaobin Zhu, Jie-Bo Hou, Chang Liu, Chun Yang, Hongfa Wang, Xu-Cheng Yin

In this paper, we propose a novel unified relational reasoning graph network for arbitrary shape text detection.

graph construction Optical Character Recognition (OCR) +2

4,099

Paper
Code

AdaDNNs: Adaptive Ensemble of Deep Neural Networks for Scene Text Recognition

no code implementations • 10 Oct 2017 • Chun Yang, Xu-Cheng Yin, Zejun Li, Jianwei Wu, Chunchao Guo, Hongfa Wang, Lei Xiao

Recognizing text in the wild is a really challenging task because of complex backgrounds, various illuminations and diverse distortions, even with deep neural networks (convolutional neural networks and recurrent neural networks).

Scene Text Recognition

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.