Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

The Transformer architecture is widely used in natural language processing. Despite its success, the design principle of the Transformer remains elusive. In this paper, we provide a novel perspective towards understanding the architecture: we show that the Transformer can be mathematically interpreted as a \emph{numerical Ordinary Differential Equation (ODE) solver for a convection-diffusion equation in a multi-particle dynamic system}. In particular, how words in a sentence are abstracted into contexts by passing through the layers of the Transformer can be interpreted as approximating the movement of multiple particles in space using the Lie-Trotter splitting scheme and Euler's method. Inspired by this relationship, we propose to replace the Lie-Trotter splitting scheme with the more accurate Strang-Marchuk splitting scheme and design a new network architecture called Macaron Net. Through extensive experiments, we show that the Macaron Net is superior to the Transformer on both supervised and unsupervised learning tasks.
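The accuracy gap between the two splitting schemes can be illustrated on a toy ODE. The sketch below integrates dx/dt = F(x) + G(x) with F(x) = x and G(x) = -x², chosen purely for illustration because each sub-flow has a closed-form solution and the combined equation (the logistic ODE) has a known exact answer; the function names and the test problem are assumptions, not the paper's actual Transformer update. Lie-Trotter applies the two sub-flows one after the other per step (first-order accurate), while Strang-Marchuk sandwiches one sub-flow in two half steps around the other (second-order accurate).

```python
import math

# Toy ODE dx/dt = F(x) + G(x) with F(x) = x (growth) and G(x) = -x^2 (decay).
# Both sub-flows are solved exactly, so all remaining error comes from splitting.
# Illustrative sketch only -- not the Transformer's actual update rule.

def flow_F(x, h):
    # Exact solution of dx/dt = x over a step of size h.
    return x * math.exp(h)

def flow_G(x, h):
    # Exact solution of dx/dt = -x^2 over a step of size h.
    return x / (1.0 + h * x)

def lie_trotter(x, h, steps):
    # First-order splitting: full F step, then full G step.
    for _ in range(steps):
        x = flow_G(flow_F(x, h), h)
    return x

def strang(x, h, steps):
    # Second-order (Strang-Marchuk) splitting: half F, full G, half F.
    for _ in range(steps):
        x = flow_F(flow_G(flow_F(x, h / 2), h), h / 2)
    return x

if __name__ == "__main__":
    x0, T, steps = 0.5, 1.0, 10
    h = T / steps
    # dx/dt = x - x^2 with x(0) = 0.5 has exact solution x(t) = 1 / (1 + e^{-t}).
    exact = 1.0 / (1.0 + math.exp(-T))
    err_lt = abs(lie_trotter(x0, h, steps) - exact)
    err_st = abs(strang(x0, h, steps) - exact)
    print(err_lt, err_st)  # the Strang error should be markedly smaller
```

Halving the step size h roughly halves the Lie-Trotter error but cuts the Strang error by about a factor of four, which is the sense in which the paper's proposed scheme is "more accurate" per layer.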

