SentencePiece is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting both the byte-pair-encoding (BPE) algorithm and the unigram language model, and converts the text into an id sequence, guaranteeing perfect reproducibility of the normalization and subword segmentation.

Source: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
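A minimal usage sketch in Python, assuming the sentencepiece package is installed; the corpus file "corpus.txt", the model prefix "spm_bpe", and the vocabulary size are placeholder choices, not values from the paper:

import sentencepiece as spm

# Train a subword model on a raw-text corpus (one sentence per line).
# model_type="bpe" selects byte-pair encoding; "unigram" selects the unigram language model.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_bpe",
    vocab_size=8000,
    model_type="bpe",
)

# Load the trained model, then tokenize and detokenize.
sp = spm.SentencePieceProcessor(model_file="spm_bpe.model")
pieces = sp.encode("This is a test.", out_type=str)  # subword pieces
ids = sp.encode("This is a test.", out_type=int)     # id sequence
text = sp.decode(ids)                                # recovers the original text

Because normalization and segmentation are both driven by the single trained model file, the same input text always maps to the same id sequence, which is the reproducibility property described above.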

Latest Papers

PAPER DATE
Scientific Claim Verification with VERT5ERINI
Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, Jimmy Lin
2020-10-22
mT5: A massively multilingual pre-trained text-to-text transformer
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
2020-10-22
AutoMeTS: The Autocomplete for Medical Text Simplification
Hoang Van, David Kauchak, Gondy Leroy
2020-10-20
Parameter Norm Growth During Training of Transformers
William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, Noah Smith
2020-10-19
NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Code-Mixed Dravidian text using XLNet
Shubhanker Banerjee, Arun Jayapal, Sajeetha Thavareesan
2020-10-15
Aspect-based Document Similarity for Research Papers
Malte Ostendorff, Terry Ruas, Till Blume, Bela Gipp, Georg Rehm
2020-10-13
Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension
Ekta Sood, Simon Tannert, Diego Frassinelli, Andreas Bulling, Ngoc Thang Vu
2020-10-13
Chatbot Interaction with Artificial Intelligence: Human Data Augmentation with T5 and Language Transformer Ensemble for Text Classification
Jordan J. Bird, Anikó Ekárt, Diego R. Faria
2020-10-12
Automated Concatenation of Embeddings for Structured Prediction
Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, Kewei Tu
2020-10-10
TextSETTR: Label-Free Text Style Extraction and Tunable Targeted Restyling
Parker Riley, Noah Constant, Mandy Guo, Girish Kumar, David Uthus, Zarana Parekh
2020-10-08
Converting the Point of View of Messages Spoken to Virtual Assistants
Isabelle G. Lee, Vera Zu, Sai Srujana Buddi, Dennis Liang, Jack G. M. FitzGerald
2020-10-06
Analyzing Individual Neurons in Pre-trained Language Models
Nadir Durrani, Hassan Sajjad, Fahim Dalvi, Yonatan Belinkov
2020-10-06
How Effective is Task-Agnostic Data Augmentation for Pretrained Transformers?
Shayne Longpre, Yu Wang, Christopher DuBois
2020-10-05
PUM at SemEval-2020 Task 12: Aggregation of Transformer-based models' features for offensive language recognition
Piotr Janiszewski, Mateusz Skiba, Urszula Walińska
2020-10-05
Examining the rhetorical capacities of neural language models
Zining Zhu, Chuer Pan, Mohamed Abdalla, Frank Rudzicz
2020-10-01
Accelerating Multi-Model Inference by Merging DNNs of Different Weights
Joo Seong Jeong, Soojeong Kim, Gyeong-In Yu, Yunseong Lee, Byung-Gon Chun
2020-09-28
BET: A Backtranslation Approach for Easy Data Augmentation in Transformer-based Paraphrase Identification Context
Jean-Philippe Corbeil, Hadi Abdi Ghadivel
2020-09-25
MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems
Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, Pascale Fung
2020-09-25
Weird AI Yankovic: Generating Parody Lyrics
Mark Riedl
2020-09-25
Robustification of Segmentation Models Against Adversarial Perturbations In Medical Imaging
Hanwool Park, Amirhossein Bayat, Mohammad Sabokrou, Jan S. Kirschke, Bjoern H. Menze
2020-09-23
UCD-CS at W-NUT 2020 Shared Task-3: A Text to Text Approach for COVID-19 Event Extraction on Social Media
Congcong Wang, David Lillis
2020-09-21
Efficient Transformers: A Survey
Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
2020-09-14
Fine-tuning Pre-trained Contextual Embeddings for Citation Content Analysis in Scholarly Publication
Haihua Chen, Huyen Nguyen
2020-09-12
Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank Approximation
Murad Tukan, Alaa Maalouf, Matan Weksler, Dan Feldman
2020-09-11
UPB at SemEval-2020 Task 6: Pretrained Language Models for Definition Extraction
Andrei-Marius Avram, Dumitru-Clementin Cercel, Costin-Gabriel Chiru
2020-09-11
EdinburghNLP at WNUT-2020 Task 2: Leveraging Transformers with Generalized Augmentation for Identifying Informativeness in COVID-19 Tweets
Nickil Maveli
2020-09-06
QiaoNing at SemEval-2020 Task 4: Commonsense Validation and Explanation system based on ensemble of language model
Pai Liu
2020-09-06
A Multitask Deep Learning Approach for User Depression Detection on Sina Weibo
Yiding Wang, Zhenyi Wang, Chenghao Li, Yilin Zhang, Haizhou Wang
2020-08-26
Lite Training Strategies for Portuguese-English and English-Portuguese Translation
Alexandre Lopes, Rodrigo Nogueira, Roberto Lotufo, Helio Pedrini
2020-08-20
PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data
Diedre Carmo, Marcos Piau, Israel Campiotti, Rodrigo Nogueira, Roberto Lotufo
2020-08-20
KR-BERT: A Small-Scale Korean-Specific Language Model
Sangah Lee, Hansol Jang, Yunmee Baik, Suzi Park, Hyopil Shin
2020-08-10
Multi-node Bert-pretraining: Cost-efficient Approach
Jiahuang Lin, Xin Li, Gennady Pekhimenko
2020-08-01
Neural Machine Translation with Error Correction
Kaitao Song, Xu Tan, Jianfeng Lu
2020-07-21
Investigating Pretrained Language Models for Graph-to-Text Generation
Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Schütze, Iryna Gurevych
2020-07-16
HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections
Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, Da-Cheng Juan
2020-07-12
Integrating Multimodal Information in Large Pretrained Transformers
Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng Mao, Louis-Philippe Morency, Ehsan Hoque
2020-07-01
Detecting Sarcasm in Conversation Context Using Transformer-Based Models
Adithya Avvaru, Sanath Vobilisetty, Radhika Mamidi
2020-07-01
Metaphor Detection Using Contextual Word Embeddings From Transformers
Jerry Liu, Nathan O'Hara, Alex Rubiner, Rachel Draelos, Cynthia Rudin
2020-07-01
A Transformer Approach to Contextual Sarcasm Detection in Twitter
Hunter Gregory, Steven Li, Pouya Mohammadi, Natalie Tarn, Rachel Draelos, Cynthia Rudin
2020-07-01
Multimodal and Multiresolution Speech Recognition with Transformers
Georgios Paraskevopoulos, Srinivas Parthasarathy, Aparna Khare, Shiva Sundaram
2020-07-01
Normalizador Neural de Datas e Endereços
Gustavo Plensack, Paulo Finardi
2020-06-27
Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya
Abrhalei Tela, Abraham Woubie, Ville Hautamaki
2020-06-13
Using Large Pretrained Language Models for Answering User Queries from Product Specifications
Kalyani Roy, Smit Shah, Nithish Pai, Jaidam Ramtej, Prajit Prashant Nadkarn, Jyotirmoy Banerjee, Pawan Goyal, Surender Kumar
2020-05-29
A Comparative Study of Lexical Substitution Approaches based on Neural Language Models
Nikolay Arefyev, Boris Sheludko, Alexander Podolskiy, Alexander Panchenko
2020-05-29
Text-to-Text Pre-Training for Data-to-Text Tasks
Mihir Kale
2020-05-21
ImpactCite: An XLNet-based method for Citation Impact Analysis
Dominique Mercier, Syed Tahseen Raza Rizvi, Vikas Rajashekar, Andreas Dengel, Sheraz Ahmed
2020-05-05
DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering
Qingqing Cao, Harsh Trivedi, Aruna Balasubramanian, Niranjan Balasubramanian
2020-05-02
SiBert: Enhanced Chinese Pre-trained Language Model with Sentence Insertion
Jiahao Chen, Chenjie Cao, Xiuyan Jiang
2020-05-01
Evaluating the Impact of Sub-word Information and Cross-lingual Word Embeddings on Mi'kmaq Language Modelling
Jeremie Boudreau, Akankshya Patra, Ashima Suvarna, Paul Cook
2020-05-01
$R^3$: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge
Tuhin Chakrabarty, Debanjan Ghosh, Smaranda Muresan, Nanyun Peng
2020-04-28
Cross-lingual Information Retrieval with BERT
Zhuolin Jiang, Amro El-Jaroudi, William Hartmann, Damianos Karakos, Lingjun Zhao
2020-04-24
StereoSet: Measuring stereotypical bias in pretrained language models
Moin Nadeem, Anna Bethke, Siva Reddy
2020-04-20
MPNet: Masked and Permuted Pre-training for Language Understanding
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu
2020-04-20
Poor Man's BERT: Smaller and Faster Transformer Models
Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov
2020-04-08
Exploiting Redundancy in Pre-trained Language Models for Efficient Transfer Learning
Fahim Dalvi, Hassan Sajjad, Nadir Durrani, Yonatan Belinkov
2020-04-08
Evaluating Machines by their Real-World Language Use
Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali Farhadi, Yejin Choi
2020-04-07
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
2020-03-23
Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles
Malte Ostendorff, Terry Ruas, Moritz Schubotz, Georg Rehm, Bela Gipp
2020-03-22
TTTTTackling WinoGrande Schemas
Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin
2020-03-18
Neural Machine Translation with Joint Representation
Yanyang Li, Qiang Wang, Tong Xiao, Tongran Liu, Jingbo Zhu
2020-02-16
Stress Test Evaluation of Transformer-based Models in Natural Language Understanding Tasks
Carlos Aspillaga, Andrés Carvallo, Vladimir Araujo
2020-02-14
Reformer: The Efficient Transformer
Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
2020-01-13
Resolving the Scope of Speculation and Negation using Transformer-Based Architectures
Benita Kathleen Britto, Aditya Khandelwal
2020-01-09
BERT-AL: BERT for Arbitrarily Long Document Understanding
Ruixuan Zhang, Zhuoyu Wei, Yu Shi, Yining Chen
2020-01-01
Clinical XLNet: Modeling Sequential Clinical Notes and Predicting Prolonged Mechanical Ventilation
Kexin Huang, Abhishek Singh, Sitong Chen, Edward T. Moseley, Chih-ying Deng, Naomi George, Charlotta Lindvall
2019-12-27
Make Lead Bias in Your Favor: Zero-shot Abstractive News Summarization
Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, Xuedong Huang
2019-12-25
WaLDORf: Wasteless Language-model Distillation On Reading-comprehension
James Yi Tian, Alexander P. Kreuzer, Pai-Hung Chen, Hans-Martin Will
2019-12-13
An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering
Shayne Longpre, Yi Lu, Zhucheng Tu, Chris DuBois
2019-12-04
Evaluating Commonsense in Pre-trained Language Models
Xuhui Zhou, Yue Zhang, Leyang Cui, Dandan Huang
2019-11-27
Low Rank Factorization for Compact Multi-Head Self-Attention
Sneha Mehta, Huzefa Rangwala, Naren Ramakrishnan
2019-11-26
Attending to Entities for Better Text Understanding
Pengxiang Cheng, Katrin Erk
2019-11-11
IIT-KGP at COIN 2019: Using pre-trained Language Models for modeling Machine Comprehension
Prakhar Sharma, Sumegh Roychowdhury
2019-11-01
Pingan Smart Health and SJTU at COIN - Shared Task: utilizing Pre-trained Language Models and Common-sense Knowledge in Machine Reading Tasks
Xiepeng Li, Zhexi Zhang, Wei Zhu, Zheng Li, Yuan Ni, Peng Gao, Junchi Yan, Guotong Xie
2019-11-01
FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm
Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, Junhui Liu
2019-11-01
Generalizing Question Answering System with Pre-trained Language Model Fine-tuning
Dan Su, Yan Xu, Genta Indra Winata, Peng Xu, Hyeondey Kim, Zihan Liu, Pascale Fung
2019-11-01
Transfer Learning from Transformers to Fake News Challenge Stance Detection (FNC-1) Task
Valeriya Slovikovskaya
2019-10-31
Modeling Inter-Speaker Relationship in XLNet for Contextual Spoken Language Understanding
Jonggu Kim, Jong-Hyeok Lee
2019-10-28
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
2019-10-23
Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks
Xingchen Song, Guangsen Wang, Zhiyong Wu, Yiheng Huang, Dan Su, Dong Yu, Helen Meng
2019-10-23
XL-Editor: Post-editing Sentences with XLNet
Yong-Siang Shih, Wei-Cheng Chang, Yiming Yang
2019-10-19
Multilingual Question Answering from Formatted Text applied to Conversational Agents
Wissam Siblini, Charlotte Pasqual, Axel Lavielle, Cyril Cauchois
2019-10-10
Extreme Language Model Compression with Optimal Subwords and Shared Projections
Sanqiang Zhao, Raghav Gupta, Yang Song, Denny Zhou
2019-09-25
Language models and Automated Essay Scoring
Pedro Uria Rodriguez, Amir Jafari, Christopher M. Ormerod
2019-09-18
Frustratingly Easy Natural Question Answering
Lin Pan, Rishav Chakravarti, Anthony Ferritto, Michael Glass, Alfio Gliozzo, Salim Roukos, Radu Florian, Avirup Sil
2019-09-11
Reasoning Over Semantic-Level Graph for Fact Checking
Wanjun Zhong, Jingjing Xu, Duyu Tang, Zenan Xu, Nan Duan, Ming Zhou, Jiahai Wang, Jian Yin
2019-09-09
Transfer Learning Robustness in Multi-Class Categorization by Fine-Tuning Pre-Trained Contextualized Language Models
Xinyi Liu, Artit Wangperawong
2019-09-08
Integrating Multimodal Information in Large Pretrained Transformers
Wasifur Rahman, Md. Kamrul Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, Ehsan Hoque
2019-08-15
Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding
Oren Barkan, Noam Razin, Itzik Malkiel, Ori Katz, Avi Caciularu, Noam Koenigstein
2019-08-14
BioFLAIR: Pretrained Pooled Contextualized Embeddings for Biomedical Sequence Labeling Tasks
Shreyas Sharma, Ron Daniel Jr
2019-08-13
ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, Haifeng Wang
2019-07-29
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
2019-06-19
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo, John Richardson
2018-08-19

Components

COMPONENT TYPE
BPE
Subword Segmentation

Categories