Search Results for author: Kun Yao

Found 20 papers, 7 papers with code

FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition

no code implementations 5 Feb 2024 Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han

To address these issues, FROSTER employs a residual feature distillation approach to ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task.
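
The excerpt describes distilling from a frozen CLIP teacher through a residual pathway so the adapted model stays close to CLIP while specializing. A minimal sketch of that idea follows, assuming a two-layer projector and a cosine distillation loss; the names `ResidualDistillHead` and `distill_loss` are illustrative, not FROSTER's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualDistillHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # two-layer projector for the residual pathway (an assumption)
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, student_feat: torch.Tensor) -> torch.Tensor:
        # the identity shortcut preserves the generalizable CLIP-like feature;
        # the projected residual carries the task-specific adaptation
        return student_feat + self.proj(student_feat)

def distill_loss(adapted_feat: torch.Tensor, frozen_clip_feat: torch.Tensor) -> torch.Tensor:
    # pull the adapted feature toward the frozen CLIP teacher's feature
    return 1.0 - F.cosine_similarity(adapted_feat, frozen_clip_feat, dim=-1).mean()
```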

Open Vocabulary Action Recognition

HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

1 code implementation NeurIPS 2023 Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, Junyu Han, Errui Ding, Lanfen Lin, Fei Wu, Jingdong Wang

To further capture human characteristics, we propose a structure-invariant alignment loss that enforces different masked views, guided by the human part prior, to be closely aligned for the same image.
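
One plausible reading of such a structure-invariant alignment loss is sketched below, assuming features of two differently masked views of the same image are pulled together with a cosine objective; the function name is hypothetical, and the human-part-prior guidance of the masking is omitted.

```python
import torch
import torch.nn.functional as F

def structure_invariant_alignment(view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
    # view_a, view_b: (batch, dim) features of two differently masked views
    # of the same image; minimizing this loss aligns the two views.
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    return (1.0 - (a * b).sum(dim=-1)).mean()
```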

2D Pose Estimation Attribute +3

GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction

no code implementations 26 Sep 2023 Pengyuan Lyu, Weihong Ma, Hongyi Wang, Yuechen Yu, Chengquan Zhang, Kun Yao, Yang Xue, Jingdong Wang

In this representation, the vertices and edges of the grid store the localization and adjacency information of the table.
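
As a rough illustration of such a grid representation (a sketch for clarity, not GridFormer's code), vertices can carry point locations while edges carry adjacency flags:

```python
from dataclasses import dataclass, field

@dataclass
class TableGrid:
    # Vertices store localization (corner points); edges store adjacency
    # between neighboring vertices of the table grid.
    vertices: list[tuple[float, float]]
    edges: dict[tuple[int, int], bool] = field(default_factory=dict)

    def connect(self, i: int, j: int, connected: bool = True) -> None:
        # record whether vertices i and j are adjacent in the grid
        self.edges[(min(i, j), max(i, j))] = connected
```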

Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

2 code implementations ICCV 2023 Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang

State-of-the-art solutions adopt the DETR-like framework and mainly develop complex decoders, e.g., regarding pose estimation as keypoint box detection and combining it with human detection in ED-Pose, or hierarchically predicting with a pose decoder and a joint (keypoint) decoder in PETR.

Human Detection Multi-Person Pose Estimation

Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning

no code implementations 14 Aug 2023 Xugong Qin, Pengyuan Lyu, Chengquan Zhang, Yu Zhou, Kun Yao, Peng Zhang, Hailun Lin, Weiping Wang

Different from existing methods, which integrate multiple-granularity features or multiple outputs, we resort to the perspective of representation learning, in which auxiliary tasks are used so that the encoder jointly learns robust features alongside the main task of per-pixel classification during optimization.

Representation Learning Scene Text Detection +1

MataDoc: Margin and Text Aware Document Dewarping for Arbitrary Boundary

no code implementations 24 Jul 2023 Beiya Dai, Xing Li, Qunyi Xie, Yulin Li, Xiameng Qin, Chengquan Zhang, Kun Yao, Junyu Han

To produce a comprehensive evaluation of MataDoc, we propose a novel benchmark ArbDoc, mainly consisting of document images with arbitrary boundaries in four typical scenarios.

document understanding Optical Character Recognition (OCR)

Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding

no code implementations 19 May 2023 Mingliang Zhai, Yulin Li, Xiameng Qin, Chen Yi, Qunyi Xie, Chengquan Zhang, Kun Yao, Yuwei Wu, Yunde Jia

Transformers achieve promising performance in document understanding because of their high effectiveness, but they still suffer from quadratic computational complexity with respect to the sequence length.
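
As a toy illustration of why shortening the sequence helps (a simple stand-in for the paper's modality-guided dynamic token merge, whose details are not shown here), merging adjacent token pairs halves the sequence length and roughly quarters the self-attention cost:

```python
import torch

def merge_adjacent_tokens(tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, seq_len, dim). Averaging adjacent pairs halves seq_len,
    # so self-attention's O(seq_len^2) cost drops by roughly 4x.
    b, n, d = tokens.shape
    n2 = n - (n % 2)  # drop a trailing odd token, if any
    return tokens[:, :n2].reshape(b, n2 // 2, 2, d).mean(dim=2)
```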

document understanding

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

1 code implementation 1 Mar 2023 Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Compared to the masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and potentially deals with more application scenarios free from OCR pre-processing.

Document Image Classification Language Modelling +3

Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

no code implementations arXiv 2022 Qiang Chen, Jian Wang, Chuchu Han, Shan Zhang, Zexian Li, Xiaokang Chen, Jiahui Chen, Xiaodi Wang, Shuming Han, Gang Zhang, Haocheng Feng, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

The training process consists of self-supervised pretraining and finetuning of a ViT-Huge encoder on ImageNet-1K, pretraining of the detector on Object365, and final finetuning on COCO.
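
The three-stage recipe can be summarized as the following outline (the stage labels paraphrase the sentence above; they are not the authors' configuration files):

```python
# Three-stage training recipe paraphrased from the abstract excerpt above.
PIPELINE = [
    ("self-supervised pretraining + finetuning of ViT-Huge encoder", "ImageNet-1K"),
    ("detector pretraining", "Object365"),
    ("detector finetuning", "COCO"),
]

for stage, dataset in PIPELINE:
    print(f"{stage}: {dataset}")
```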

Object object-detection +1

TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers

no code implementations 31 Aug 2022 Zengyuan Guo, Yuechen Yu, Pengyuan Lv, Chengquan Zhang, Haojie Li, Zhihui Wang, Kun Yao, Jingtuo Liu, Jingdong Wang

The Vertex-based Merging Module is capable of aggregating local contextual information between adjacent basic grids, providing the ability to accurately merge basic grids that belong to the same spanning cell.
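
A toy way to realize such merging (an illustration under assumptions, not TRUST's implementation) is to treat pairwise "same cell" decisions as union operations, so that basic grids predicted to belong together collapse into one spanning cell:

```python
class GridUnionFind:
    """Union-find over basic grid indices: merged grids share one root,
    and each root corresponds to one spanning cell."""

    def __init__(self, num_grids: int):
        self.parent = list(range(num_grids))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def merge(self, a: int, b: int) -> None:
        # call for every adjacent pair the module predicts as "same cell"
        self.parent[self.find(a)] = self.find(b)
```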

Table Recognition

Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment

2 code implementations ICCV 2023 Qiang Chen, Xiaokang Chen, Jian Wang, Shan Zhang, Kun Yao, Haocheng Feng, Junyu Han, Errui Ding, Gang Zeng, Jingdong Wang

Detection transformer (DETR) relies on one-to-one assignment, assigning one ground-truth object to one prediction, for end-to-end detection without NMS post-processing.
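
The group-wise one-to-many idea can be sketched as follows, assuming a precomputed matching-cost matrix and groups at least as large as the number of ground-truth objects: splitting the queries into K groups and running one-to-one Hungarian matching inside each group gives every ground-truth object K positive queries. The function name is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def group_wise_assignment(cost: np.ndarray, num_groups: int) -> list[tuple[int, int]]:
    # cost: (num_queries, num_gt) matching-cost matrix. Queries are split into
    # num_groups groups; one-to-one Hungarian matching runs within each group,
    # so each ground-truth object receives num_groups positive queries overall.
    matches = []
    for group in np.array_split(np.arange(cost.shape[0]), num_groups):
        rows, cols = linear_sum_assignment(cost[group])
        matches.extend((int(group[r]), int(c)) for r, c in zip(rows, cols))
    return matches
```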

Data Augmentation Object +2

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

no code implementations 1 Jun 2022 Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder using a proposed masked image-language modeling scheme.
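
A minimal sketch of the patch-masking step that such a masked image-language modeling scheme needs is given below; the shapes, mask ratio, and masking policy here are assumptions for illustration, not MaskOCR's released code.

```python
import torch

def mask_patches(patches: torch.Tensor, mask_ratio: float = 0.6):
    # patches: (batch, num_patches, dim) from a synthesized text-line image.
    # Randomly zero out a fraction of patches; the sequence decoder is then
    # trained to predict the characters behind the masked regions.
    b, n, _ = patches.shape
    num_mask = int(n * mask_ratio)
    idx = torch.rand(b, n).argsort(dim=1)[:, :num_mask]
    mask = torch.zeros(b, n, dtype=torch.bool).scatter_(1, idx, True)
    return patches.masked_fill(mask.unsqueeze(-1), 0.0), mask
```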

Language Modelling Optical Character Recognition (OCR) +1

ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval

no code implementations CVPR 2022 Mengjun Cheng, Yipeng Sun, Longchao Wang, Xiongwei Zhu, Kun Yao, Jie Chen, Guoli Song, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang

Visual appearance is considered the most important cue for understanding images in cross-modal retrieval, yet scene text appearing in images can sometimes provide valuable information for understanding the visual semantics.

Ranked #10 on Cross-Modal Retrieval on Flickr30k (using extra training data)

Contrastive Learning Cross-Modal Retrieval +1

Compressing physical properties of atomic species for improving predictive chemistry

2 code implementations 31 Oct 2018 John E. Herr, Kevin Koh, Kun Yao, John Parkhill

The answers to many unsolved problems lie in the intractable chemical space of molecules and materials.

BIG-bench Machine Learning

The Many-Body Expansion Combined with Neural Networks

no code implementations 22 Sep 2016 Kun Yao, John E. Herr, John Parkhill

Although they are fast, NNs suffer from their own curse of dimensionality; they must be trained on a representative sample of chemical space.
