Single Headed Attention RNN: Stop Thinking With Your Head

26 Nov 2019 Stephen Merity

The leading approaches in language modeling are all obsessed with TV shows of my youth - namely Transformers and Sesame Street. Transformers this, Transformers that, and over here a bonfire worth of GPU-TPU-neuromorphic wafer scale silicon...


Results from the Paper


Task: Language Modelling    Dataset: enwik8

MODEL                                                  METRIC                    VALUE    GLOBAL RANK
SHA-LSTM (4 layers, h=1024, no attention head)         Bit per Character (BPC)   1.33     #25
                                                       Number of params          51M      #12
SHA-RNN (4 layers, h=1024, single attention head)      Bit per Character (BPC)   1.076    #14
                                                       Number of params          52M      #11
SHA-RNN (4 layers, h=1024, attention head per layer)   Bit per Character (BPC)   1.068    #13
                                                       Number of params          54M      #10
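
To make the configurations in the table concrete, here is a minimal PyTorch sketch of a single-headed-attention RNN block, not the paper's released code: the hidden size (1024), layer count (4), and single attention head follow the table, while the class names, the placement of the attention head, and the residual connections are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above), not the paper's exact implementation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleHeadAttention(nn.Module):
    """One attention head: single query/key/value projection plus causal softmax."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.q = nn.Linear(hidden_size, hidden_size)
        self.k = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, hidden_size)
        self.scale = 1.0 / math.sqrt(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = torch.bmm(q, k.transpose(1, 2)) * self.scale
        # Causal mask so each position only attends to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        return torch.bmm(F.softmax(scores, dim=-1), v)


class SHARNNSketch(nn.Module):
    """Stacked LSTM layers with one attention head (placement is illustrative)."""

    def __init__(self, vocab_size: int, hidden_size: int = 1024, num_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnns = nn.ModuleList(
            [nn.LSTM(hidden_size, hidden_size, batch_first=True) for _ in range(num_layers)]
        )
        self.attn = SingleHeadAttention(hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) of token ids
        x = self.embed(tokens)
        for i, rnn in enumerate(self.rnns):
            out, _ = rnn(x)
            x = x + out  # residual connection between layers
            if i == len(self.rnns) - 1:
                x = x + self.attn(x)  # the single attention head
        return self.decoder(x)  # logits over the vocabulary


if __name__ == "__main__":
    model = SHARNNSketch(vocab_size=256)  # byte-level vocabulary, as for enwik8
    logits = model(torch.randint(0, 256, (2, 16)))
    print(logits.shape)  # torch.Size([2, 16, 256])
```

The "no attention head" row corresponds to dropping the `self.attn` term entirely, and the "attention head per layer" row to applying one such head after every LSTM layer rather than only the last.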

Methods used in the Paper