On Position Embeddings in BERT

Various Position Embeddings (PEs) have been proposed in Transformer-based architectures (e.g., BERT) to model word order. These are empirically driven and perform well, but no formal framework exists to systematically study them. To address this, we present three expected properties of PEs that capture word distance in vector space: translation invariance, monotonicity, and symmetry. These properties formally capture the behaviour of PEs and allow us to reinterpret sinusoidal PEs in a principled way. An empirical evaluation of seven PEs (and their combinations) for classification and span prediction shows that fully-learnable absolute PEs perform better in classification, while relative PEs perform better in span prediction. We contribute the first formal analysis of the desired properties of PEs, together with a principled discussion of their connection to typical downstream tasks.
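
The three properties are naturally stated in terms of a similarity between pairs of position embeddings. The sketch below is an illustration only (not the paper's code): it assumes the standard sinusoidal PEs from the original Transformer and uses the inner product as the similarity measure, then checks symmetry, translation invariance, and (approximate) monotonicity numerically.

```python
# Minimal sketch, assuming sinusoidal PEs and the inner product as similarity.
import numpy as np

def sinusoidal_pe(num_positions: int, dim: int) -> np.ndarray:
    """Standard sinusoidal position embeddings: one row per position."""
    positions = np.arange(num_positions)[:, None]                    # (P, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim/2,)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe

pe = sinusoidal_pe(num_positions=128, dim=64)
sim = pe @ pe.T  # pairwise inner products between position embeddings

# Symmetry: sim[i, j] == sim[j, i] (holds trivially for inner products).
assert np.allclose(sim, sim.T)

# Translation invariance: sim[i, j] depends only on the offset i - j.
# For sinusoidal PEs, pe_i . pe_j = sum_k cos((i - j) * freq_k).
offsets = np.arange(1, 32)
assert np.allclose(sim[0, offsets], sim[32, 32 + offsets], atol=1e-8)

# Monotonicity: similarity should decrease as |i - j| grows. Sinusoidal PEs
# satisfy this only approximately (the cosine sum oscillates at large offsets).
d = sim[0, :32]
print("non-increasing over first 32 offsets:", bool(np.all(np.diff(d) <= 1e-8)))
```

The same checks can be run on any learnable absolute or relative PE matrix to see which of the three properties it satisfies empirically.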
