GPT-2

Introduced by Radford et al. in Language Models are Unsupervised Multitask Learners

GPT-2 is a Transformer-based language model that was notable at its release for its size (1.5 billion parameters). The model is pretrained on the WebText dataset, which contains text from 45 million website links. It largely follows the previous GPT architecture, with some modifications:

Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization is added after the final self-attention block.
A modified initialization is used that accounts for the accumulation on the residual path as model depth grows. Weights of residual layers are scaled at initialization by a factor of $1/\sqrt{N}$, where $N$ is the number of residual layers.
The vocabulary is expanded to 50,257 tokens. The context size is expanded from 512 to 1024 tokens, and a larger batch size of 512 is used.
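
The first two modifications can be illustrated with a minimal pure-Python sketch. It is not the original implementation: the helper names (`layer_norm`, `pre_ln_residual`, `residual_init_scale`, `init_residual_weights`) are hypothetical, the layer normalization omits the learned gain and bias, and the base standard deviation of 0.02 is an assumption rather than a value stated above.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_residual(x, sublayer):
    """Pre-LN ordering: normalize the sub-block input, then add the residual.

    `sublayer` stands in for self-attention or the feed-forward network.
    """
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def residual_init_scale(n_residual_layers):
    """Factor 1/sqrt(N) applied to residual-layer weights at initialization."""
    return 1.0 / math.sqrt(n_residual_layers)


def init_residual_weights(n_weights, n_residual_layers, base_std=0.02, seed=0):
    """Draw weights from a normal distribution whose std is scaled by 1/sqrt(N).

    base_std=0.02 is an assumed default, not taken from the text above.
    """
    rng = random.Random(seed)
    std = base_std * residual_init_scale(n_residual_layers)
    return [rng.gauss(0.0, std) for _ in range(n_weights)]


# Toy usage: a sub-block that doubles its (normalized) input.
hidden = [1.0, 2.0, 3.0, 4.0]
out = pre_ln_residual(hidden, lambda h: [2.0 * v for v in h])

# Assumption: N counts two residual sub-blocks (attention, feed-forward) per layer.
n_layer = 48
scale = residual_init_scale(2 * n_layer)
```

Under the assumption that $N$ counts two residual sub-blocks per layer, a 48-layer model gives $N = 96$ and a scale of roughly 0.102. How $N$ is counted is a choice made for this sketch; the description above only says that $N$ is the number of residual layers.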

Source: Language Models are Unsupervised Multitask Learners

Tasks

Task	Papers	Share
Language Modelling	164	19.85%
Text Generation	95	11.50%
Sentence	40	4.84%
Question Answering	25	3.03%
Retrieval	17	2.06%
Response Generation	12	1.45%
Dialogue Generation	12	1.45%
Translation	12	1.45%
Text Classification	10	1.21%

Components

Component	Type
Adam	Stochastic Optimization
Attention Dropout	Regularization
BPE	Subword Segmentation
Dense Connections	Feedforward Networks
Discriminative Fine-Tuning	Fine-Tuning
Dropout	Regularization
GELU	Activation Functions
Layer Normalization	Normalization
Linear Warmup With Cosine Annealing	Learning Rate Schedules
Multi-Head Attention	Attention Modules
Residual Connection	Skip Connections
Scaled Dot-Product Attention	Attention Mechanisms
Softmax	Output Functions
Weight Decay	Regularization

Categories

Transformers

Autoregressive Transformers