Text Segmentation

35 papers with code • 3 benchmarks • 7 datasets

Text segmentation deals with the correct division of a document into semantically coherent blocks.
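
As a toy illustration of the task only (not the method of any paper listed on this page), the sketch below splits a list of sentences into blocks wherever the word overlap between neighbouring sentences drops. The function names `bag_of_words`, `cosine`, and `segment`, the threshold value, and the sample sentences are all assumptions made up for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts for one sentence."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, threshold=0.1):
    """Group sentences into blocks, opening a new block whenever the
    similarity to the previous sentence falls below `threshold`."""
    if not sentences:
        return []
    blocks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) < threshold:
            blocks.append([cur])      # low cohesion: start a new block
        else:
            blocks[-1].append(cur)    # same topic: extend the current block
    return blocks


# Hypothetical input: two sentences about a cat, then two about stocks.
sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stock markets fell sharply today.",
    "Investors sold stock in a panic.",
]
blocks = segment(sentences)
for block in blocks:
    print(block)
```

Word overlap is a crude stand-in for semantic coherence; it is used here only because it keeps the example dependency-free.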

Benchmarks

These leaderboards are used to track progress in Text Segmentation.

Dataset	Best Model
YTSeg	MiniSeg (pretrained on Wiki-727K)
SPMRL Hebrew segmentation data	RFTokenizer
Wiki5K Hebrew segmentation	RFTokenizer

Datasets

Most implemented papers

Neural Sequence Segmentation as Determining the Leftmost Segments

LeePleased/LeftmostSeg • NAACL 2021

Prior methods for text segmentation mostly operate at the token level.

Sefamerve ARGE at SemEval-2021 Task 5: Toxic Spans Detection Using Segmentation Based 1-D Convolutional Neural Network Model

birolkuyumcu/wawunet_for_toxicspan • SEMEVAL 2021

This paper describes our contribution to SemEval-2021 Task 5: Toxic Spans Detection.

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

wenet-e2e/wenetspeech • 7 Oct 2021

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total.

Self-supervised Implicit Glyph Attention for Text Recognition

tongkunguan/siga • CVPR 2023

Supervised attention can alleviate the alignment-drift issue of implicit attention, but it is character category-specific: it requires laborious character-level bounding box annotations and becomes memory-intensive when handling languages with larger character sets.

Toward Unifying Text Segmentation and Long Document Summarization

tencent-ailab/lodoss • 28 Oct 2022

The problem is only exacerbated by a lack of segmentation in transcripts of audio/video recordings.

Self-supervised Character-to-Character Distillation for Text Recognition

tongkunguan/ccd • ICCV 2023

Therefore, exploring the robust text feature representations on unlabeled real images by self-supervised learning is a good solution.

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

idea-research/dq-detr • 28 Nov 2022

As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction.

Three-stage binarization of color document images based on discrete wavelet transform and generative adversarial networks

abcpp12383/threestagebinarization • 29 Nov 2022

The efficient segmentation of foreground text information from the background in degraded color document images is a critical challenge in the preservation of ancient manuscripts.

CCDWT-GAN: Generative Adversarial Networks Based on Color Channel Using Discrete Wavelet Transform for Document Image Binarization

abcpp12383/threestagebinarization • 27 May 2023

This work compares the performance of the proposed method with other state-of-the-art (SOTA) methods on DIBCO and H-DIBCO ((Handwritten) Document Image Binarization Competition) datasets.

Expanding Scope: Adapting English Adversarial Attacks to Chinese

QData/TextAttack • 8 Jun 2023

Most existing studies focused on designing attacks to evaluate the robustness of NLP models in the English language alone.
