TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic entity labeling	FUNSD	StrucTexTv2 (large)	F1	91.82	# 6
Semantic entity labeling	FUNSD	StrucTexTv2 (small)	F1	89.23	# 8
Document Image Classification	RVL-CDIP	StrucTexTv2 (large)	Accuracy	94.62%	# 14
Document Image Classification	RVL-CDIP	StrucTexTv2 (large)	Parameters	238M	# 27
Document Image Classification	RVL-CDIP	StrucTexTv2 (small)	Accuracy	93.4%	# 17
Document Image Classification	RVL-CDIP	StrucTexTv2 (small)	Parameters	28M	# 13
Table Recognition	WTW	StrucTexTv2 (small)	F1	78.9%	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/structextv2-masked-visual-textual-prediction/table-recognition-on-wtw)](https://paperswithcode.com/sota/table-recognition-on-wtw?p=structextv2-masked-visual-textual-prediction)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/structextv2-masked-visual-textual-prediction/semantic-entity-labeling-on-funsd)](https://paperswithcode.com/sota/semantic-entity-labeling-on-funsd?p=structextv2-masked-visual-textual-prediction)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/structextv2-masked-visual-textual-prediction/document-image-classification-on-rvl-cdip)](https://paperswithcode.com/sota/document-image-classification-on-rvl-cdip?p=structextv2-masked-visual-textual-prediction)`

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

1 Mar 2023 · Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

In this paper, we present StrucTexTv2, an effective document image pre-training framework based on masked visual-textual prediction. It consists of two self-supervised pre-training tasks, masked image modeling and masked language modeling, both built on text region-level image masking. The proposed method randomly masks some image regions according to the bounding box coordinates of text words. The pre-training objectives are to reconstruct the pixels of the masked image regions and the corresponding masked tokens simultaneously. Hence the pre-trained encoder can capture more textual semantics than conventional masked image modeling, which usually predicts only the masked image patches. Compared to masked multi-modal modeling methods for document image understanding, which rely on both the image and text modalities, StrucTexTv2 takes image-only input and can potentially handle more application scenarios, free from OCR pre-processing. Extensive experiments on mainstream benchmarks of document image understanding demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.
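The text region-level masking described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `mask_text_regions`, the `mask_ratio`, `fill_value`, and `seed` parameters, and the toy list-of-lists image are all assumptions made for illustration. It only shows the idea of choosing word bounding boxes at random and blanking the pixels inside them, which a pre-training objective would then try to reconstruct.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical box format: (x0, y0, x1, y1), with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def mask_text_regions(
    image: List[List[int]],
    word_boxes: Sequence[Box],
    mask_ratio: float = 0.3,  # assumed value, not taken from the paper
    fill_value: int = 0,
    seed: int = 0,
) -> Tuple[List[List[int]], List[int]]:
    """Return a masked copy of `image` and the indices of the masked word boxes."""
    rng = random.Random(seed)
    # Mask at least one box whenever any boxes are available.
    num_to_mask = max(1, round(len(word_boxes) * mask_ratio)) if word_boxes else 0
    masked_ids = sorted(rng.sample(range(len(word_boxes)), num_to_mask))

    masked = [row[:] for row in image]  # copy so the input stays untouched
    height = len(masked)
    width = len(masked[0]) if height else 0
    for idx in masked_ids:
        x0, y0, x1, y1 = word_boxes[idx]
        # Clip each box to the image bounds before overwriting pixels.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                masked[y][x] = fill_value
    return masked, masked_ids


# Toy usage: a 4x8 "image" with three word boxes.
toy_image = [[255] * 8 for _ in range(4)]
toy_boxes = [(0, 0, 3, 2), (4, 0, 8, 2), (0, 2, 5, 4)]
toy_masked, toy_ids = mask_text_regions(toy_image, toy_boxes, mask_ratio=0.34)
```

In this sketch, the returned indices identify which word regions were hidden, so the matching pixels (for image reconstruction) and tokens (for language modeling) could be used as prediction targets.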

Code

PaddlePaddle/VIMER (official), 481 GitHub stars

Tasks

Document Image Classification

Image Classification

Language Modelling

Masked Language Modeling

Optical Character Recognition (OCR)

Semantic entity labeling

Datasets

FUNSD

PubLayNet

RVL-CDIP

SROIE

WTW

Results from the Paper

Ranked #1 on Table Recognition on WTW
