ALBERT is a Transformer architecture based on BERT, but with far fewer parameters. It achieves this through two parameter-reduction techniques. The first is factorized embedding parameterization: by decomposing the large vocabulary-embedding matrix into two smaller matrices, the size of the hidden layers is separated from the size of the vocabulary embeddings. This makes it easier to grow the hidden size without significantly increasing the parameter count of the vocabulary embeddings. The second technique is cross-layer parameter sharing, which prevents the parameter count from growing with the depth of the network.
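As a rough sketch of the first technique, the parameter savings of the factorized embedding parameterization can be seen by counting parameters; the sizes below are illustrative, not a specific ALBERT configuration:

```python
# Parameter counts: standard (BERT-style) embedding vs. ALBERT's
# factorized embedding parameterization. Sizes are illustrative.
V = 30000  # vocabulary size
H = 4096   # hidden size
E = 128    # embedding size, chosen so that E << H

# BERT ties the embedding size to the hidden size: one V x H matrix.
bert_params = V * H

# ALBERT factorizes it into a V x E lookup followed by an E x H projection,
# so the hidden size can grow without inflating the vocabulary embeddings.
albert_params = V * E + E * H

print(f"BERT-style:   {bert_params:,}")    # 122,880,000
print(f"ALBERT-style: {albert_params:,}")  # 4,364,288
```

With these sizes, the factorization shrinks the embedding parameters by roughly 28x; the projection cost (E x H) is negligible next to the vocabulary lookup (V x E) as long as E is much smaller than H.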

Additionally, ALBERT utilises a self-supervised loss for sentence-order prediction (SOP). SOP focuses primarily on inter-sentence coherence and is designed to address the ineffectiveness of the next sentence prediction (NSP) loss proposed in the original BERT.
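A minimal sketch of how an SOP training example differs from an NSP one (the helper and label convention here are hypothetical, for illustration only): both the positive and the negative use two consecutive segments from the same document, but the negative swaps their order rather than sampling a segment from a different document, which forces the model to learn coherence rather than topic:

```python
import random

def make_sop_example(seg_a, seg_b):
    """Build one sentence-order prediction example from two consecutive
    segments of the same document. Hypothetical helper; illustrative
    label convention: 1 = original order, 0 = swapped."""
    if random.random() < 0.5:
        return (seg_a, seg_b), 1  # positive: segments in document order
    return (seg_b, seg_a), 0      # negative: same segments, order swapped
```

Because both classes share the same topical content, topic cues alone cannot solve SOP, unlike NSP's random-document negatives.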

Source: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Latest Papers

A Transformer Based Pitch Sequence Autoencoder with MIDI Augmentation
Mingshuo Ding, Yinghao Ma
An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models
Zihan Zhao, Yuncong Liu, Lu Chen, Qi Liu, Rao Ma, Kai Yu
Infusing Disease Knowledge into BERT for Health Question Answering, Medical Inference and Disease Name Recognition
Yun He, Ziwei Zhu, Yin Zhang, Qin Chen, James Caverlee
Pretrained Language Model Embryology: The Birth of ALBERT
David C. Chiang, Sung-Feng Huang, Hung-Yi Lee
On the Interplay Between Fine-tuning and Sentence-level Probing for Linguistic Knowledge in Pre-trained Transformers
Marius Mosbach, Anna Khokhlova, Michael A. Hedderich, Dietrich Klakow
BET: A Backtranslation Approach for Easy Data Augmentation in Transformer-based Paraphrase Identification Context
Jean-Philippe Corbeil, Hadi Abdi Ghadivel
BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition
Usman Naseem, Matloob Khushi, Vinay Reddy, Sakthivel Rajendran, Imran Razzak, Jinman Kim
Learning Universal Representations from Word to Sentence
Yian Li, Hai Zhao
Comparative Study of Language Models on Cross-Domain Data with Model Agnostic Explainability
Mayank Chhipa, Hrushikesh Mahesh Vazurkar, Abhijeet Kumar, Mridul Mishra
ERNIE at SemEval-2020 Task 10: Learning Word Emphasis Selection by Pre-trained Language Model
Zhengjie Huang, Shikun Feng, Weiyue Su, Xuyi Chen, Shuohuan Wang, Jiaxiang Liu, Xuan Ouyang, Yu Sun
UPB at SemEval-2020 Task 8: Joint Textual and Visual Modeling in a Multi-Task Learning Architecture for Memotion Analysis
George-Alexandru Vlad, George-Eduard Zaharia, Dumitru-Clementin Cercel, Costin-Gabriel Chiru, Stefan Trausan-Matu
Variants of BERT, Random Forests and SVM approach for Multimodal Emotion-Target Sub-challenge
Hoang Manh Hung, Hyung-Jeong Yang, Soo-Hyung Kim, Guee-Sang Lee
Deep Learning Brasil -- NLP at SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets
Manoel Veríssimo dos Santos Neto, Ayrton Denner da Silva Amaral, Nádia Félix Felipe da Silva, Anderson da Silva Soares
Mono vs Multilingual Transformer-based Models: a Comparison across Several Language Tasks
Diego de Vargas Feijo, Viviane Pereira Moreira
LMVE at SemEval-2020 Task 4: Commonsense Validation and Explanation using Pretraining Language Model
Shilei Liu, Yu Guo, Bochao Li, Feiliang Ren
A Transformer Approach to Contextual Sarcasm Detection in Twitter
Hunter Gregory, Steven Li, Pouya Mohammadi, Natalie Tarn, Rachel Draelos, Cynthia Rudin
Deep Investing in Kyle's Single Period Model
Paul Friedrich, Josef Teichmann
On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow
BERT Loses Patience: Fast and Robust Inference with Early Exit
Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, Furu Wei
BERT-based Ensembles for Modeling Disclosure and Support in Conversational Social Media Text
Tanvi Dadu, Kartikey Pant, Radhika Mamidi
Language Representation Models for Fine-Grained Sentiment Classification
Brian Cheang, Bailey Wei, David Kogan, Howey Qiu, Masud Ahmed
Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation
Po-Han Chi, Pei-Hung Chung, Tsung-Han Wu, Chun-Cheng Hsieh, Shang-Wen Li, Hung-yi Lee
ImpactCite: An XLNet-based method for Citation Impact Analysis
Dominique Mercier, Syed Tahseen Raza Rizvi, Vikas Rajashekar, Andreas Dengel, Sheraz Ahmed
TAVAT: Token-Aware Virtual Adversarial Training for Language Understanding
Linyang Li, Xipeng Qiu
Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting
Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, Xiangzhan Yu
UHH-LT at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer Networks for Offensive Language Detection
Gregor Wiedemann, Seid Muhie Yimam, Chris Biemann
Investigating the Effectiveness of Representations Based on Pretrained Transformer-based Language Models in Active Learning for Labelling Text Datasets
Jinghui Lu, Brian MacNamee
Gestalt: a Stacking Ensemble for SQuAD2.0
Mohamed El-Geish
Deep Entity Matching with Pre-Trained Language Models
Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan
Retrospective Reader for Machine Reading Comprehension
Zhuosheng Zhang, Junjie Yang, Hai Zhao
PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination
Saurabh Goyal, Anamitra R. Choudhury, Saurabh M. Raje, Venkatesan T. Chakaravarthy, Yogish Sabharwal, Ashish Verma
Perceiving the arrow of time in autoregressive motion
Kristof Meding, Dominik Janzing, Bernhard Schölkopf, Felix A. Wichmann
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut