Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

The Transformer architecture is widely used in natural language processing. Despite its success, the design principle of the Transformer remains elusive. In this paper, we provide a novel perspective towards understanding the architecture: we show that the Transformer can be mathematically interpreted as a \emph{numerical Ordinary Differential Equation (ODE) solver for a convection-diffusion equation in a multi-particle dynamic system}. In particular, how words in a sentence are abstracted into contexts by passing through the layers of the Transformer can be interpreted as approximating the movement of multiple particles in space using the Lie-Trotter splitting scheme and Euler's method. Inspired by this relationship, we propose to replace the Lie-Trotter splitting scheme with the more accurate Strang-Marchuk splitting scheme and design a new network architecture called Macaron Net. Through extensive experiments, we show that the Macaron Net is superior to the Transformer on both supervised and unsupervised learning tasks.
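The accuracy gap between the two splitting schemes can be illustrated on a toy ODE. The sketch below integrates dx/dt = F(x) + G(x) with F(x) = x and G(x) = -x², chosen purely for illustration because each sub-flow has a closed-form solution and the combined equation (the logistic ODE) has a known exact answer; the function names and the test problem are assumptions, not the paper's actual Transformer update. Lie-Trotter applies the two sub-flows one after the other per step (first-order accurate), while Strang-Marchuk sandwiches one sub-flow in two half steps around the other (second-order accurate).

```python
import math

# Toy ODE dx/dt = F(x) + G(x) with F(x) = x (growth) and G(x) = -x^2 (decay).
# Both sub-flows are solved exactly, so all remaining error comes from splitting.
# Illustrative sketch only -- not the Transformer's actual update rule.

def flow_F(x, h):
    # Exact solution of dx/dt = x over a step of size h.
    return x * math.exp(h)

def flow_G(x, h):
    # Exact solution of dx/dt = -x^2 over a step of size h.
    return x / (1.0 + h * x)

def lie_trotter(x, h, steps):
    # First-order splitting: full F step, then full G step.
    for _ in range(steps):
        x = flow_G(flow_F(x, h), h)
    return x

def strang(x, h, steps):
    # Second-order (Strang-Marchuk) splitting: half F, full G, half F.
    for _ in range(steps):
        x = flow_F(flow_G(flow_F(x, h / 2), h), h / 2)
    return x

if __name__ == "__main__":
    x0, T, steps = 0.5, 1.0, 10
    h = T / steps
    # dx/dt = x - x^2 with x(0) = 0.5 has exact solution x(t) = 1 / (1 + e^{-t}).
    exact = 1.0 / (1.0 + math.exp(-T))
    err_lt = abs(lie_trotter(x0, h, steps) - exact)
    err_st = abs(strang(x0, h, steps) - exact)
    print(err_lt, err_st)  # the Strang error should be markedly smaller
```

Halving the step size h roughly halves the Lie-Trotter error but cuts the Strang error by about a factor of four, which is the sense in which the paper's proposed scheme is "more accurate" per layer.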

