GBST (Gradient-Based Subword Tokenization) is a soft, gradient-based subword tokenization module that learns latent subword representations from characters in a data-driven fashion. Concretely, it enumerates candidate subword blocks and scores them position-wise with a block scoring network, yielding a soft selection over the candidates at each position. In contrast to prior tokenization-free methods, GBST learns interpretable latent subwords, which makes lexical representations easy to inspect, and it is more efficient than other byte-level models.
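The block-scoring mechanism lends itself to a compact sketch. Below is a minimal PyTorch illustration of the idea, assuming mean pooling for block formation, a single linear layer as the block scoring network, and an input length divisible by every block size; `GBSTSketch` and its parameter names are illustrative, not taken from the official implementation, which additionally applies a convolution before block formation and calibrates scores across positions.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GBSTSketch(nn.Module):
    """Minimal sketch of Gradient-Based Subword Tokenization (GBST).

    Simplified illustration: for each candidate block size b, character
    embeddings are mean-pooled into non-overlapping blocks, a linear
    block scoring network scores each block, and a per-position softmax
    over block sizes softly selects among candidate subword
    representations. This is a hypothetical simplification, not the
    Charformer authors' exact module.
    """

    def __init__(self, dim, block_sizes=(1, 2, 3, 4), downsample_factor=2):
        super().__init__()
        self.block_sizes = block_sizes
        self.downsample_factor = downsample_factor
        self.score_net = nn.Linear(dim, 1)  # block scoring network

    def forward(self, x):
        # x: (batch, seq_len, dim) character embeddings; seq_len is
        # assumed divisible by every block size (pad beforehand otherwise).
        candidates = []
        for b in self.block_sizes:
            # Mean-pool non-overlapping blocks of size b -> (batch, seq_len // b, dim).
            pooled = F.avg_pool1d(x.transpose(1, 2), kernel_size=b, stride=b)
            pooled = pooled.transpose(1, 2)
            # Upsample back to character resolution by repeating each block b times.
            candidates.append(pooled.repeat_interleave(b, dim=1))

        # Stack candidates: (batch, seq_len, num_block_sizes, dim).
        cands = torch.stack(candidates, dim=2)
        # Position-wise scores per block size, softmaxed over block sizes.
        scores = self.score_net(cands).squeeze(-1)      # (batch, seq_len, num_sizes)
        weights = scores.softmax(dim=-1).unsqueeze(-1)  # soft block selection
        mixed = (cands * weights).sum(dim=2)            # (batch, seq_len, dim)

        # Downsample the soft subword sequence to shorten the input.
        out = F.avg_pool1d(mixed.transpose(1, 2),
                           kernel_size=self.downsample_factor,
                           stride=self.downsample_factor)
        return out.transpose(1, 2)


if __name__ == "__main__":
    gbst = GBSTSketch(dim=64)
    chars = torch.randn(2, 48, 64)  # 48 is divisible by block sizes 1-4
    print(gbst(chars).shape)        # torch.Size([2, 24, 64])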
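```

Because the selection is a softmax rather than a hard argmax, gradients flow through the block scores, letting the model learn which subword granularity to prefer at each position; the final pooling is what gives the efficiency gain over operating on raw characters.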
Source: Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
Task | Papers | Share
---|---|---
NMT | 2 | 18.18%
Denoising | 1 | 9.09%
Image Denoising | 1 | 9.09%
Translation | 1 | 9.09%
Toxic Comment Classification | 1 | 9.09%
Linguistic Acceptability | 1 | 9.09%
Natural Language Inference | 1 | 9.09%
Paraphrase Identification | 1 | 9.09%
Semantic Textual Similarity | 1 | 9.09%