Pay Attention when Required

Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and comparatively more expensive self-attention blocks, which capture context meaning. In this paper, we explored trade-offs in the composition and ordering of these blocks to improve upon the current Transformer architecture, and proposed the PAR Transformer...
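The idea of trading self-attention blocks for cheaper feed-forward blocks can be sketched with a toy cost model. This is not the authors' code: the block pattern, the attention ratio, and the 2x relative cost of attention are illustrative assumptions.

```python
# Illustrative sketch (not the PAR authors' implementation).
# 's' = self-attention block (context), 'f' = feed-forward block (content).

def make_interleaved(n_layers):
    """Standard Transformer stack: attention + feed-forward per layer."""
    return "sf" * n_layers

def make_par(n_layers, attention_ratio=1 / 3):
    """PAR-style ordering (illustrative): a reduced budget of attention
    blocks placed early in the stack, with feed-forward blocks after.
    The ratio and placement here are assumptions, not the paper's values."""
    total_blocks = 2 * n_layers
    n_att = max(1, round(total_blocks * attention_ratio))
    return "s" * n_att + "f" * (total_blocks - n_att)

def relative_cost(pattern, att_cost=2.0, ff_cost=1.0):
    """Toy cost model: attention assumed ~2x a feed-forward block."""
    return sum(att_cost if b == "s" else ff_cost for b in pattern)

if __name__ == "__main__":
    baseline = make_interleaved(12)
    par = make_par(12)
    print(baseline, relative_cost(baseline))
    print(par, relative_cost(par))
```

Under this toy model, the PAR-style stack keeps the same total block count but spends less of its budget on attention, which is the source of the compute savings the abstract alludes to.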

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Language Modelling | enwiki8 | PAR Transformer 24B | Bit per Character (BPC) | 1.11 | #1 |
| Paraphrase Identification | MRPC | PAR BERT Base | Accuracy | 89.2 | #1 |
| Question Answering | SQuAD1.1 | PAR BERT Base | F1 score | 0.874 | #1 |
| Sentiment Analysis | SST-2 Binary classification | PAR BERT Base | Accuracy | 91.6 | #18 |
| Language Modelling | Text8 | PAR Transformer 24B | Bit per Character (BPC) | 1.18 | #8 |
| Language Modelling | WikiText-103 | PAR Transformer Base | Test perplexity | 22.7 | #19 |
| Language Modelling | WikiText-103 | PAR Transformer Large | Test perplexity | 18.4 | #12 |

Methods used in the Paper