Language Models

BERT, or Bidirectional Encoder Representations from Transformers, improves upon standard left-to-right Transformer pre-training by removing the unidirectionality constraint through a masked language model (MLM) pre-training objective. The masked language model randomly masks some of the tokens in the input, and the objective is to predict the original vocabulary id of each masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows pre-training of a deep bidirectional Transformer. In addition to the masked language model, BERT uses a next sentence prediction (NSP) task that jointly pre-trains text-pair representations.
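
As a concrete illustration of the MLM objective, the sketch below fills in a masked token with a pre-trained BERT model. It is a minimal example, assuming the Hugging Face transformers library; the bert-base-uncased checkpoint and the example sentence are illustrative choices, not prescribed by the paper.

# Minimal sketch of masked-token prediction with a pre-trained BERT model.
# Assumes the Hugging Face `transformers` library; "bert-base-uncased" and
# the example sentence are illustrative, not taken from the paper.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token using both the left and the right context.
for prediction in unmasker("The man went to the [MASK] to buy some milk."):
    print(prediction["token_str"], round(prediction["score"], 3))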

There are two steps in BERT: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.

Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
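
As a concrete sketch of the fine-tuning step, the example below initializes a classifier from the pre-trained BERT parameters and runs one training step on a toy labeled batch. It assumes the Hugging Face transformers library and PyTorch; the bert-base-uncased checkpoint, the binary sentiment task, and the two example sentences are illustrative assumptions, not part of the original description.

# Minimal fine-tuning sketch: initialize from pre-trained parameters, then
# update all parameters on labeled downstream data. Assumes Hugging Face
# `transformers` and PyTorch; the task and data below are toy placeholders.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The encoder weights come from the pre-trained checkpoint; only the
# classification head on top is newly (randomly) initialized.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy labeled examples for an illustrative binary sentiment task.
texts = ["a great movie", "a dull movie"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One fine-tuning step: all parameters (encoder and head) are updated.
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()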

Tasks

Task                    Papers    Share
Retrieval                  118   12.29%
Language Modelling         107   11.15%
Question Answering          61    6.35%
Large Language Model        39    4.06%
Sentiment Analysis          33    3.44%
Text Classification         33    3.44%
Sentence                    33    3.44%
Information Retrieval       22    2.29%
Text Generation             18    1.88%
