Search Results for author: Kun Yao

Found 20 papers, 7 papers with code

FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition

no code implementations 5 Feb 2024 Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han

To address these issues, FROSTER employs a residual feature distillation approach to ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task.
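
The excerpt describes distilling from a frozen CLIP teacher through a residual pathway so the adapted model stays close to CLIP while specializing. A minimal sketch of that idea follows, assuming a two-layer projector and a cosine distillation loss; the names `ResidualDistillHead` and `distill_loss` are illustrative, not FROSTER's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualDistillHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # two-layer projector for the residual pathway (an assumption)
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, student_feat: torch.Tensor) -> torch.Tensor:
        # the identity shortcut preserves the generalizable CLIP-like feature;
        # the projected residual carries the task-specific adaptation
        return student_feat + self.proj(student_feat)

def distill_loss(adapted_feat: torch.Tensor, frozen_clip_feat: torch.Tensor) -> torch.Tensor:
    # pull the adapted feature toward the frozen CLIP teacher's feature
    return 1.0 - F.cosine_similarity(adapted_feat, frozen_clip_feat, dim=-1).mean()
```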

Open Vocabulary Action Recognition

HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

1 code implementation NeurIPS 2023 Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, Junyu Han, Errui Ding, Lanfen Lin, Fei Wu, Jingdong Wang

To further capture human characteristics, we propose a structure-invariant alignment loss that enforces different masked views, guided by the human part prior, to be closely aligned for the same image.
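
One plausible reading of such a structure-invariant alignment loss is sketched below, assuming features of two differently masked views of the same image are pulled together with a cosine objective; the function name is hypothetical, and the human-part-prior guidance of the masking is omitted.

```python
import torch
import torch.nn.functional as F

def structure_invariant_alignment(view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
    # view_a, view_b: (batch, dim) features of two differently masked views
    # of the same image; minimizing this loss aligns the two views.
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    return (1.0 - (a * b).sum(dim=-1)).mean()
```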

2D Pose Estimation Attribute +3

GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction

no code implementations 26 Sep 2023 Pengyuan Lyu, Weihong Ma, Hongyi Wang, Yuechen Yu, Chengquan Zhang, Kun Yao, Yang Xue, Jingdong Wang

In this representation, the vertices and edges of the grid store the localization and adjacency information of the table.
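
As a rough illustration of such a grid representation (a sketch for clarity, not GridFormer's code), vertices can carry point locations while edges carry adjacency flags:

```python
from dataclasses import dataclass, field

@dataclass
class TableGrid:
    # Vertices store localization (corner points); edges store adjacency
    # between neighboring vertices of the table grid.
    vertices: list[tuple[float, float]]
    edges: dict[tuple[int, int], bool] = field(default_factory=dict)

    def connect(self, i: int, j: int, connected: bool = True) -> None:
        # record whether vertices i and j are adjacent in the grid
        self.edges[(min(i, j), max(i, j))] = connected
```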

Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

2 code implementations ICCV 2023 Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang

State-of-the-art solutions adopt the DETR-like framework and mainly develop complex decoders, e.g., regarding pose estimation as keypoint box detection and combining it with human detection in ED-Pose, or hierarchically predicting with a pose decoder and a joint (keypoint) decoder in PETR.

Human Detection Multi-Person Pose Estimation

Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning

no code implementations 14 Aug 2023 Xugong Qin, Pengyuan Lyu, Chengquan Zhang, Yu Zhou, Kun Yao, Peng Zhang, Hailun Lin, Weiping Wang

Different from existing methods, which integrate multiple-granularity features or multiple outputs, we resort to the perspective of representation learning, in which auxiliary tasks are used so that the encoder jointly learns robust features alongside the main task of per-pixel classification during optimization.

Representation Learning Scene Text Detection +1

MataDoc: Margin and Text Aware Document Dewarping for Arbitrary Boundary

no code implementations 24 Jul 2023 Beiya Dai, Xing Li, Qunyi Xie, Yulin Li, Xiameng Qin, Chengquan Zhang, Kun Yao, Junyu Han

To produce a comprehensive evaluation of MataDoc, we propose a novel benchmark ArbDoc, mainly consisting of document images with arbitrary boundaries in four typical scenarios.

document understanding Optical Character Recognition (OCR)

Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding

no code implementations 19 May 2023 Mingliang Zhai, Yulin Li, Xiameng Qin, Chen Yi, Qunyi Xie, Chengquan Zhang, Kun Yao, Yuwei Wu, Yunde Jia

Transformers achieve promising performance in document understanding because of their high effectiveness, but they still suffer from quadratic computational complexity with respect to the sequence length.
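
As a toy illustration of why shortening the sequence helps (a simple stand-in for the paper's modality-guided dynamic token merge, whose details are not shown here), merging adjacent token pairs halves the sequence length and roughly quarters the self-attention cost:

```python
import torch

def merge_adjacent_tokens(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, seq_len, dim). Averaging adjacent pairs halves seq_len,
    # so self-attention's O(seq_len^2) cost drops by roughly 4x.
    b, n, d = tokens.shape
    n2 = n - (n % 2)  # drop a trailing odd token, if any
    return tokens[:, :n2].reshape(b, n2 // 2, 2, d).mean(dim=2)
```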

document understanding

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

1 code implementation 1 Mar 2023 Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Compared to the masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and potentially deals with more application scenarios free from OCR pre-processing.

Document Image Classification Language Modelling +3

Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

no code implementations arXiv 2022 Qiang Chen, Jian Wang, Chuchu Han, Shan Zhang, Zexian Li, Xiaokang Chen, Jiahui Chen, Xiaodi Wang, Shuming Han, Gang Zhang, Haocheng Feng, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

The training process consists of self-supervised pretraining and finetuning of a ViT-Huge encoder on ImageNet-1K, pretraining of the detector on Object365, and final finetuning on COCO.
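
The three-stage recipe can be summarized as the following outline (the stage labels paraphrase the sentence above; they are not the authors' configuration files):

```python
# Three-stage training recipe paraphrased from the abstract excerpt above.
PIPELINE = [
    ("self-supervised pretraining + finetuning of ViT-Huge encoder", "ImageNet-1K"),
    ("detector pretraining", "Object365"),
    ("detector finetuning", "COCO"),
]

for stage, dataset in PIPELINE:
    print(f"{stage}: {dataset}")
```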

Object object-detection +1

TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers

no code implementations 31 Aug 2022 Zengyuan Guo, Yuechen Yu, Pengyuan Lv, Chengquan Zhang, Haojie Li, Zhihui Wang, Kun Yao, Jingtuo Liu, Jingdong Wang

The Vertex-based Merging Module is capable of aggregating local contextual information between adjacent basic grids, providing the ability to accurately merge basic grids that belong to the same spanning cell.
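
A toy way to realize such merging (an illustration under assumptions, not TRUST's implementation) is to treat pairwise "same cell" decisions as union operations, so that basic grids predicted to belong together collapse into one spanning cell:

```python
class GridUnionFind:
    """Union-find over basic grid indices: merged grids share one root,
    and each root corresponds to one spanning cell."""

    def __init__(self, num_grids: int):
        self.parent = list(range(num_grids))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def merge(self, a: int, b: int) -> None:
        # call for every adjacent pair the module predicts as "same cell"
        self.parent[self.find(a)] = self.find(b)
```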

Table Recognition

Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment

2 code implementations ICCV 2023 Qiang Chen, Xiaokang Chen, Jian Wang, Shan Zhang, Kun Yao, Haocheng Feng, Junyu Han, Errui Ding, Gang Zeng, Jingdong Wang

Detection transformer (DETR) relies on one-to-one assignment, assigning one ground-truth object to one prediction, for end-to-end detection without NMS post-processing.
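
The group-wise one-to-many idea can be sketched as follows, assuming a precomputed matching-cost matrix and groups at least as large as the number of ground-truth objects: splitting the queries into K groups and running one-to-one Hungarian matching inside each group gives every ground-truth object K positive queries. The function name is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def group_wise_assignment(cost: np.ndarray, num_groups: int) -> list[tuple[int, int]]:
    # cost: (num_queries, num_gt) matching-cost matrix. Queries are split into
    # num_groups groups; one-to-one Hungarian matching runs within each group,
    # so each ground-truth object receives num_groups positive queries overall.
    matches = []
    for group in np.array_split(np.arange(cost.shape[0]), num_groups):
        rows, cols = linear_sum_assignment(cost[group])
        matches.extend((int(group[r]), int(c)) for r, c in zip(rows, cols))
    return matches
```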

Data Augmentation Object +2

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

no code implementations 1 Jun 2022 Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder using a proposed masked image-language modeling scheme.
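
A minimal sketch of the patch-masking step that such a masked image-language modeling scheme needs is given below; the shapes, mask ratio, and masking policy here are assumptions for illustration, not MaskOCR's released code.

```python
import torch

def mask_patches(patches: torch.Tensor, mask_ratio: float = 0.6):
    # patches: (batch, num_patches, dim) from a synthesized text-line image.
    # Randomly zero out a fraction of patches; the sequence decoder is then
    # trained to predict the characters behind the masked regions.
    b, n, _ = patches.shape
    num_mask = int(n * mask_ratio)
    idx = torch.rand(b, n).argsort(dim=1)[:, :num_mask]
    mask = torch.zeros(b, n, dtype=torch.bool).scatter_(1, idx, True)
    return patches.masked_fill(mask.unsqueeze(-1), 0.0), mask
```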

Language Modelling Optical Character Recognition (OCR) +1

ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval

no code implementations CVPR 2022 Mengjun Cheng, Yipeng Sun, Longchao Wang, Xiongwei Zhu, Kun Yao, Jie Chen, Guoli Song, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang

Visual appearance is considered the most important cue for understanding images in cross-modal retrieval, yet scene text appearing in images can sometimes provide valuable information for understanding the visual semantics.

Ranked #10 on Cross-Modal Retrieval on Flickr30k (using extra training data)

Contrastive Learning Cross-Modal Retrieval +1

Compressing physical properties of atomic species for improving predictive chemistry

2 code implementations 31 Oct 2018 John E. Herr, Kevin Koh, Kun Yao, John Parkhill

The answers to many unsolved problems lie in the intractable chemical space of molecules and materials.

BIG-bench Machine Learning

The Many-Body Expansion Combined with Neural Networks

no code implementations 22 Sep 2016 Kun Yao, John E. Herr, John Parkhill

Although they are fast, NNs suffer from their own curse of dimensionality; they must be trained on a representative sample of chemical space.
