Chinese Word Segmentation

48 papers with code • 6 benchmarks • 3 datasets

Chinese word segmentation is the task of splitting Chinese text (i.e. a sequence of Chinese characters) into words (Source: www.nlpprogress.com).
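To make the task definition concrete, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline. It is not the method of any paper listed on this page. The function name `forward_max_match`, the `max_len` parameter, and the toy `LEXICON` are assumptions introduced for illustration only.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: at each position, take the
    longest substring (up to max_len characters) found in the lexicon."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is always
        # accepted as a fallback so the loop is guaranteed to advance.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"我", "喜欢", "自然", "语言", "自然语言", "处理", "自然语言处理"}

print(forward_max_match("我喜欢自然语言处理", LEXICON, max_len=4))
print(forward_max_match("我喜欢自然语言处理", LEXICON, max_len=6))
```

The two calls return different segmentations of the same sentence, because the longer window lets the matcher pick the six-character lexicon entry. This sensitivity to the lexicon and window size is one reason learned models are benchmarked on the datasets listed here.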

Benchmarks

These leaderboards track progress in Chinese Word Segmentation.

Dataset	Best Model
MSR	BABERT-LE
PKU	BABERT-LE
CTB6	LATTE (Linguistic units, lattices, PTMs, GNNs)
MSRA	BABERT-LE
CITYU	WMSeg + ZEN
AS	Glyce + BERT

Most implemented papers

ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations

sinovation/ZEN • Findings of the Association for Computational Linguistics 2020

Moreover, it is shown that reasonable performance can be obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data.

PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation

lancopku/pkuseg-python • 27 Jun 2019

Through this method, we generate synthetic data using a large amount of unlabeled data in the target domain and then obtain a word segmentation model for the target domain.

Simplifying Neural Machine Translation with Addition-Subtraction Twin-Gated Recurrent Networks

bzhangGo/zero • EMNLP 2018

Experiments on WMT14 translation tasks demonstrate that ATR-based neural machine translation can yield competitive performance on English-German and English-French language pairs in terms of both translation quality and speed.

Segmental Recurrent Neural Networks

ykrmm/TREMBA • 18 Nov 2015

Representations of the input segments (i.e., contiguous subsequences of the input) are computed by encoding their constituent tokens using bidirectional recurrent neural nets, and these "segment embeddings" are used to define compatibility scores with output labels.

LSICC: A Large Scale Informal Chinese Corpus

JaniceZhao/Douban-Dushu-Dataset • 26 Nov 2018

Deep-learning-based natural language processing models have proven powerful, but they need large-scale datasets.

Glyce: Glyph-vectors for Chinese Character Representations

ShannonAI/glyce • NeurIPS 2019

However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of standard computer vision models on character data, an effective way to utilize the glyph information remains to be found.

Investigating Self-Attention Network for Chinese Word Segmentation

gump88/SAN-CWS • 26 Jul 2019

Neural networks have become the dominant method for Chinese word segmentation.

Sub-Character Tokenization for Chinese Pretrained Language Models

thunlp/subchartokenization • 1 Jun 2021

Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos.
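The homophone property described above can be illustrated with a toy sketch. This is not the SubChar implementation: the `TOY_PINYIN` table and the `transliterate` helper are hypothetical, and a real tokenizer would use a full pronunciation dictionary and then apply subword tokenization to the transliterated sequence.

```python
# Hypothetical toy pronunciation table (pinyin with tone numbers).
# 再 and 在 are homophones (both "zai4"), a common source of typos.
TOY_PINYIN = {"再": "zai4", "在": "zai4", "见": "jian4"}


def transliterate(text, table=TOY_PINYIN):
    """Map each character to its pronunciation; unknown characters pass through."""
    return [table.get(ch, ch) for ch in text]


correct = transliterate("再见")  # intended word ("goodbye")
typo = transliterate("在见")     # homophone typo of the same word
print(correct, typo, correct == typo)
```

Because both inputs map to the same transliteration sequence, any tokenizer applied after this step sees identical input for the correct word and the typo.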

Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling

modelscope/AdaSeq • 27 Oct 2022

We apply BABERT for feature induction of Chinese sequence labeling tasks.

LATTE: Lattice ATTentive Encoding for Character-based Word Segmentation

tchayintr/latte-ptm-ws • Journal of Natural Language Processing 2023

Our model employs the lattice structure to handle segmentation alternatives and utilizes graph neural networks along with an attention mechanism to attentively extract multi-granularity representation from the lattice for complementing character representations.
