On Position Embeddings in BERT

Various Position Embeddings (PEs) have been proposed in Transformer-based architectures (e.g., BERT) to model word order. These are empirically driven and perform well, but no formal framework exists to systematically study them. To address this, we present three expected properties of PEs that capture word distance in vector space: translation invariance, monotonicity, and symmetry. These properties formally capture the behaviour of PEs and allow us to reinterpret sinusoidal PEs in a principled way. An empirical evaluation of seven PEs (and their combinations) for classification and span prediction shows that fully-learnable absolute PEs perform better in classification, while relative PEs perform better in span prediction. We contribute the first formal analysis of the desired properties of PEs, together with a principled discussion of their connection to typical downstream tasks.
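
The three properties are naturally stated in terms of a similarity between pairs of position embeddings. The sketch below is an illustration only (not the paper's code): it assumes the standard sinusoidal PEs from the original Transformer and uses the inner product as the similarity measure, then checks symmetry, translation invariance, and (approximate) monotonicity numerically.

```python
# Minimal sketch, assuming sinusoidal PEs and the inner product as similarity.
import numpy as np

def sinusoidal_pe(num_positions: int, dim: int) -> np.ndarray:
    """Standard sinusoidal position embeddings: one row per position."""
    positions = np.arange(num_positions)[:, None]                    # (P, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim/2,)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe

pe = sinusoidal_pe(num_positions=128, dim=64)
sim = pe @ pe.T  # pairwise inner products between position embeddings

# Symmetry: sim[i, j] == sim[j, i] (holds trivially for inner products).
assert np.allclose(sim, sim.T)

# Translation invariance: sim[i, j] depends only on the offset i - j.
# For sinusoidal PEs, pe_i . pe_j = sum_k cos((i - j) * freq_k).
offsets = np.arange(1, 32)
assert np.allclose(sim[0, offsets], sim[32, 32 + offsets], atol=1e-8)

# Monotonicity: similarity should decrease as |i - j| grows. Sinusoidal PEs
# satisfy this only approximately (the cosine sum oscillates at large offsets).
d = sim[0, :32]
print("non-increasing over first 32 offsets:", bool(np.all(np.diff(d) <= 1e-8)))
```

The same checks can be run on any learnable absolute or relative PE matrix to see which of the three properties it satisfies empirically.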
