Tokenizers

SentencePiece is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting both the byte-pair-encoding (BPE) algorithm and the unigram language model, and converts raw text directly into an id sequence, which guarantees perfect reproducibility of the normalization and subword segmentation.

Source: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
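The "perfect reproducibility" claim above rests on lossless tokenization: SentencePiece treats whitespace as an ordinary symbol (conventionally the meta character ▁) before segmenting, so decoding is an exact inverse of encoding. The following is a toy sketch of that whitespace convention, not the library's actual API; the function names are illustrative.

```python
def to_symbols(text):
    # Map spaces to the meta symbol "▁" so whitespace information
    # survives segmentation as part of the symbol sequence.
    return list(text.replace(" ", "▁"))

def from_symbols(symbols):
    # Concatenate the pieces and restore spaces: an exact inverse,
    # with no language-specific detokenization heuristics needed.
    return "".join(symbols).replace("▁", " ")

text = "Hello world."
roundtrip = from_symbols(to_symbols(text))
```

Because the encoding keeps all whitespace, `roundtrip == text` holds for any input, which is what makes the normalization and segmentation reproducible end to end.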

Tasks


Task | Papers | Share
Language Modelling | 97 | 9.77%
Question Answering | 60 | 6.04%
Sentence | 48 | 4.83%
Text Generation | 40 | 4.03%
Translation | 32 | 3.22%
Retrieval | 29 | 2.92%
Machine Translation | 28 | 2.82%
Natural Language Understanding | 19 | 1.91%
Sentiment Analysis | 18 | 1.81%

Components


Component | Type
BPE | Subword Segmentation
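The BPE component listed above learns a vocabulary by repeatedly merging the most frequent adjacent symbol pair in a corpus. This is a minimal, self-contained sketch of that merge-learning loop, not the SentencePiece implementation; the corpus and function names are illustrative.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(pair, words):
    # Replace every occurrence of the pair with its concatenation.
    a, b = pair
    old, new = f"{a} {b}", f"{a}{b}"
    return {word.replace(old, new): freq for word, freq in words.items()}

def learn_bpe(corpus, num_merges):
    # Start from character-level symbols; greedily merge the most frequent pair.
    words = Counter(" ".join(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(pair, words)
    return merges

merges = learn_bpe(["low", "lower", "lowest", "low"], 3)
# → [("l", "o"), ("lo", "w"), ("low", "e")]
```

The learned merge list is the model: at encoding time, the same merges are applied to new text in order, so frequent substrings like "low" become single tokens while rare words fall back to smaller pieces.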

Categories