Memory-efficient Stochastic methods for Memory-based Transformers

14 Nov 2023 · Vishwajit Kumar Vishnu, C. Chandra Sekhar

Training memory-based transformers can require a large amount of memory and can be quite inefficient. We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers, which are often used for long-range context problems. For our experiments, we use Transformer-XL, a memory-based transformer model, as our baseline. We show that the resulting model, Skip Cross-Head Transformer-XL, outperforms the baseline on a character-level language modelling task with a similar number of parameters, and outperforms it on a word-level language modelling task with almost 20% fewer parameters. Our proposed methods do not require any additional memory. We also demonstrate the effectiveness of our regularization mechanism on BERT, which achieves similar performance with a reduction of around 30% in the standard deviation of scores across multiple GLUE tasks.
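The abstract does not describe the Skip Cross-Head attention design, the two-phase training mechanism, or the regularization technique in detail, so the snippet below is only an illustrative sketch of the general setting: a Transformer-XL-style attention sub-layer that attends over cached memory states from previous segments and carries a skip (residual) connection. The class name `SkipAttentionBlock`, the use of `torch.nn.MultiheadAttention`, and all hyperparameters are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: names, structure, and hyperparameters are assumptions,
# not the paper's Skip Cross-Head Transformer-XL implementation.
import torch
import torch.nn as nn

class SkipAttentionBlock(nn.Module):
    """Hypothetical attention sub-layer with segment memory and a skip connection."""
    def __init__(self, d_model=512, n_heads=8, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, memory=None):
        # Transformer-XL style: prepend cached hidden states (memory) so the
        # current segment can attend to a longer context without extra gradients.
        kv = x if memory is None else torch.cat([memory, x], dim=1)
        q, kv = self.norm(x), self.norm(kv)
        attn_out, _ = self.attn(q, kv, kv, need_weights=False)
        # Skip (residual) connection around the attention sub-layer.
        return x + self.drop(attn_out)

x = torch.randn(2, 16, 512)    # (batch, segment_len, d_model)
mem = torch.randn(2, 32, 512)  # cached hidden states from previous segments
block = SkipAttentionBlock()
print(block(x, mem).shape)     # torch.Size([2, 16, 512])
```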

Results from the Paper


Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Language Modelling | enwik8 | Skip Cross-Head Transformer-XL | Bit per Character (BPC) | 1.033 | #24
Language Modelling | enwik8 | Skip Cross-Head Transformer-XL | Number of params | 41M | #27
Paraphrase Identification | Quora Question Pairs Dev | BERT + SCH attn | Val F1 Score | 88.436 | #1
Paraphrase Identification | Quora Question Pairs Dev | BERT + SCH attn | Val Accuracy | 91.422 | #1
Language Modelling | WikiText-103 | Skip Cross-Head Transformer-XL | Validation perplexity | 21.87 | #22
Language Modelling | WikiText-103 | Skip Cross-Head Transformer-XL | Test perplexity | 22.91 | #49
Language Modelling | WikiText-103 | Skip Cross-Head Transformer-XL | Number of params | 122M | #40