Not all layers are equally as important: Every Layer Counts BERT

3 Nov 2023 · Lucas Georges Gabriel Charpentier, David Samuel

This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models. This aspect is evaluated by participating in the BabyLM challenge, where our solution won both the strict and strict-small tracks. Our approach allows each transformer layer to select which outputs of previous layers to process. The empirical results verify the potential of this simple modification and show that not all layers are equally important.
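Concretely, the layer-selection idea can be pictured as each transformer block consuming a learned, normalized weighted combination of all earlier outputs (including the embedding output) rather than only the output of the immediately preceding block. The sketch below is a minimal PyTorch illustration under that assumption; the module name, the softmax normalization, and the `zero_init` option (intended to mirror the "(zero init)" variant listed in the results, by starting all weights at zero except the one on the directly preceding layer) are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class LayerSelection(nn.Module):
    """Sketch of per-layer selection over previous outputs (not the official ELC-BERT code)."""

    def __init__(self, layer_index: int, zero_init: bool = False):
        super().__init__()
        # One learnable weight per earlier output; index 0 is the embedding output.
        init = torch.zeros(layer_index + 1) if zero_init else torch.ones(layer_index + 1)
        init[-1] = 1.0  # keep the standard residual path dominant at initialization
        self.weights = nn.Parameter(init)

    def forward(self, previous_outputs: list[torch.Tensor]) -> torch.Tensor:
        # previous_outputs: [embeddings, layer_1, ..., layer_i], each of shape (batch, seq, hidden)
        w = torch.softmax(self.weights, dim=0)
        stacked = torch.stack(previous_outputs, dim=0)     # (i + 1, batch, seq, hidden)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # weighted mix fed into the next block
```

A toy usage under the same assumptions: the input to block 3 becomes a learned mix of the three earlier outputs.

```python
select = LayerSelection(layer_index=2, zero_init=True)
outputs = [torch.randn(2, 16, 64) for _ in range(3)]  # embeddings, block 1, block 2
mixed = select(outputs)                                # shape (2, 16, 64), passed to block 3
```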


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | ELC-BERT-base 98M | Accuracy | 82.6 | #7 |
| Linguistic Acceptability | CoLA | ELC-BERT-small 24M | Accuracy | 76.1 | #11 |
| Linguistic Acceptability | CoLA | LTG-BERT-small 24M | Accuracy | 77.6 | #10 |
| Linguistic Acceptability | CoLA | LTG-BERT-base 98M | Accuracy | 82.7 | #6 |
| Natural Language Inference | MultiNLI | ELC-BERT-small 24M | Matched | 79.2 | #41 |
| Natural Language Inference | MultiNLI | ELC-BERT-small 24M | Mismatched | 79.9 | #31 |
| Natural Language Inference | MultiNLI | LTG-BERT-small 24M | Matched | 78 | #42 |
| Natural Language Inference | MultiNLI | LTG-BERT-small 24M | Mismatched | 78.8 | #32 |
| Natural Language Inference | MultiNLI | LTG-BERT-base 98M | Matched | 83 | #34 |
| Natural Language Inference | MultiNLI | LTG-BERT-base 98M | Mismatched | 83.4 | #22 |
| Natural Language Inference | MultiNLI | ELC-BERT-base 98M (zero init) | Matched | 84.4 | #30 |
| Natural Language Inference | MultiNLI | ELC-BERT-base 98M (zero init) | Mismatched | 84.5 | #19 |
| Natural Language Inference | RTE | ELC-BERT-small 24M | Accuracy | 55.4 | #82 |
| Natural Language Inference | RTE | ELC-BERT-base 98M (zero init) | Accuracy | 63 | #67 |
| Natural Language Inference | RTE | LTG-BERT-base 98M | Accuracy | 54.7 | #84 |
| Natural Language Inference | RTE | LTG-BERT-small 24M | Accuracy | 53.7 | #87 |
