Crossformer: Transformer with Alternated Cross-Layer Guidance

29 Sep 2021 · Shujian Zhang, Zhibin Duan, Huangjie Zheng, Pengcheng He, Bo Chen, Weizhu Chen, Mingyuan Zhou

Transformers with stacked attention layers have achieved state-of-the-art results on a wide range of tasks related to discrete sequences. Significant work has been done to better understand or interpret the capabilities of the Transformer, which is often massively over-parameterized and prone to overfitting. There exist intensive interactions between Transformer layers, where the information from higher layers can and does distill the information from lower layers. This motivates us to inject a cross-layer inductive bias that not only uses higher layers, which are closer to the training objective, to guide lower ones, but also provides regularization customized to the stacked structure of the Transformer. To this end, we propose Crossformer, which either regularizes the differences between specific states of two adjacent layers or directly imposes alternated states sharing between all adjacent layers. Crossformer with states sharing not only provides the desired cross-layer guidance and regularization but also reduces the memory requirement. It is simple to convert a Transformer-based model to a Crossformer-based one. On a variety of neural machine translation tasks, we show that our method outperforms Transformer models while being more memory-efficient. We further demonstrate the general applicability and stability of Crossformer on visual question answering, graph node classification, and significantly deeper models, showing the great potential of incorporating our method into various Transformer-related tasks.
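To make the cross-layer guidance idea concrete, below is a minimal, hypothetical PyTorch sketch of the regularization variant: a standard stack of encoder layers with an auxiliary L2 penalty that pulls each layer's hidden states toward those of the layer above it. The class name `CrossLayerRegularizedEncoder`, the choice of hidden states as the regularized quantity, and the `reg_weight` hyperparameter are illustrative assumptions, not the paper's exact formulation; the alternated states-sharing variant would instead reuse states across adjacent layers rather than add a penalty.

```python
# Hypothetical sketch (not the authors' exact method): Transformer encoder
# stack with an auxiliary penalty tying each lower layer's hidden states to
# the (detached) states of the layer above, so higher layers guide lower ones.
import torch
import torch.nn as nn


class CrossLayerRegularizedEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=6, reg_weight=0.1):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.reg_weight = reg_weight  # assumed hyperparameter, not from the paper

    def forward(self, x):
        prev_state = None
        reg_loss = x.new_zeros(())
        for layer in self.layers:
            x = layer(x)
            if prev_state is not None:
                # Penalize the difference between adjacent layers' states.
                # Detaching the higher layer's output means gradients push the
                # lower layer toward the higher one, not the other way around.
                reg_loss = reg_loss + (x.detach() - prev_state).pow(2).mean()
            prev_state = x
        return x, self.reg_weight * reg_loss


# Usage: add the auxiliary term to the task loss during training.
encoder = CrossLayerRegularizedEncoder()
tokens = torch.randn(2, 16, 512)          # (batch, seq_len, d_model)
output, cross_layer_penalty = encoder(tokens)
total_loss = output.mean() + cross_layer_penalty  # placeholder task loss
```

The design choice of regularizing toward a detached copy of the higher layer is one simple way to realize "higher layers guide lower ones"; the paper's states-sharing variant additionally saves memory because shared states need not be stored separately per layer.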
