Tokenizers

SentencePiece is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting both the byte-pair-encoding (BPE) algorithm and the unigram language model, and converts raw text directly into an id sequence, which guarantees perfect reproducibility of the normalization and subword segmentation.

Source: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
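The "perfect reproducibility" claim above rests on lossless tokenization: SentencePiece treats whitespace as an ordinary symbol (conventionally the meta character ▁) before segmenting, so decoding is an exact inverse of encoding. The following is a toy sketch of that whitespace convention, not the library's actual API; the function names are illustrative.

```python
def to_symbols(text):
    # Map spaces to the meta symbol "▁" so whitespace information
    # survives segmentation as part of the symbol sequence.
    return list(text.replace(" ", "▁"))

def from_symbols(symbols):
    # Concatenate the pieces and restore spaces: an exact inverse,
    # with no language-specific detokenization heuristics needed.
    return "".join(symbols).replace("▁", " ")

text = "Hello world."
roundtrip = from_symbols(to_symbols(text))
```

Because the encoding keeps all whitespace, `roundtrip == text` holds for any input, which is what makes the normalization and segmentation reproducible end to end.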

Tasks


Task | Papers | Share
Language Modelling | 97 | 9.77%
Question Answering | 60 | 6.04%
Sentence | 48 | 4.83%
Text Generation | 40 | 4.03%
Translation | 32 | 3.22%
Retrieval | 29 | 2.92%
Machine Translation | 28 | 2.82%
Natural Language Understanding | 19 | 1.91%
Sentiment Analysis | 18 | 1.81%

Components


Component | Type
BPE | Subword Segmentation
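The BPE component listed above learns a vocabulary by repeatedly merging the most frequent adjacent symbol pair in a corpus. This is a minimal, self-contained sketch of that merge-learning loop, not the SentencePiece implementation; the corpus and function names are illustrative.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(pair, words):
    # Replace every occurrence of the pair with its concatenation.
    a, b = pair
    old, new = f"{a} {b}", f"{a}{b}"
    return {word.replace(old, new): freq for word, freq in words.items()}

def learn_bpe(corpus, num_merges):
    # Start from character-level symbols; greedily merge the most frequent pair.
    words = Counter(" ".join(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(pair, words)
    return merges

merges = learn_bpe(["low", "lower", "lowest", "low"], 3)
# → [("l", "o"), ("lo", "w"), ("low", "e")]
```

The learned merge list is the model: at encoding time, the same merges are applied to new text in order, so frequent substrings like "low" become single tokens while rare words fall back to smaller pieces.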

Categories