Text Segmentation
35 papers with code • 3 benchmarks • 7 datasets
Text segmentation deals with the correct division of a document into semantically coherent blocks.
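As a rough illustration of what "semantically coherent blocks" can mean in practice, the sketch below groups consecutive sentences and starts a new block whenever the word overlap between two adjacent sentences drops below a threshold. This is a simple lexical-cohesion heuristic and is not the method of any paper listed on this page. The function names (`bag_of_words`, `cosine`, `segment`) and the `threshold` value of 0.1 are assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Return lowercased word counts for one sentence."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def segment(sentences, threshold=0.1):
    """Split a list of sentences into blocks of consecutive sentences.

    A new block starts when the similarity between a sentence and the
    one before it falls below `threshold` (a hypothetical default).
    """
    if not sentences:
        return []
    blocks = [[sentences[0]]]
    prev = bag_of_words(sentences[0])
    for sentence in sentences[1:]:
        cur = bag_of_words(sentence)
        if cosine(prev, cur) < threshold:
            blocks.append([sentence])    # low overlap: open a new block
        else:
            blocks[-1].append(sentence)  # enough overlap: extend the block
        prev = cur
    return blocks


if __name__ == "__main__":
    doc = [
        "The cat sat on the mat.",
        "The cat chased the mouse.",
        "Stock markets fell sharply today.",
        "Markets recovered after the stock rally.",
    ]
    for i, block in enumerate(segment(doc)):
        print(i, block)
```

Raw word overlap is a weak proxy for semantic coherence; it was chosen here only because it needs nothing beyond the standard library.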
Most implemented papers
Neural Sequence Segmentation as Determining the Leftmost Segments
Prior methods for text segmentation mostly operate at the token level.
Sefamerve ARGE at SemEval-2021 Task 5: Toxic Spans Detection Using Segmentation Based 1-D Convolutional Neural Network Model
This paper describes our contribution to SemEval-2021 Task 5: Toxic Spans Detection.
WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total.
Self-supervised Implicit Glyph Attention for Text Recognition
Supervised attention can alleviate the above issue, but it is character category-specific, which requires extra laborious character-level bounding box annotations and would be memory-intensive when handling languages with larger character categories.
Toward Unifying Text Segmentation and Long Document Summarization
The problem is only exacerbated by a lack of segmentation in transcripts of audio/video recordings.
Self-supervised Character-to-Character Distillation for Text Recognition
Therefore, exploring robust text feature representations on unlabeled real images through self-supervised learning is a good solution.
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction.
Three-stage binarization of color document images based on discrete wavelet transform and generative adversarial networks
The efficient segmentation of foreground text information from the background in degraded color document images is a critical challenge in the preservation of ancient manuscripts.
CCDWT-GAN: Generative Adversarial Networks Based on Color Channel Using Discrete Wavelet Transform for Document Image Binarization
This work compares the performance of the proposed method with other state-of-the-art (SOTA) methods on DIBCO and H-DIBCO ((Handwritten) Document Image Binarization Competition) datasets.
Expanding Scope: Adapting English Adversarial Attacks to Chinese
Most existing studies focused on designing attacks to evaluate the robustness of NLP models in the English language alone.