Non-iterative Parallel Text Generation via Glancing Transformer
Although non-autoregressive models with one-iteration generation achieve remarkable inference speed-ups, they still fall behind their autoregressive counterparts in prediction accuracy. The most accurate non-autoregressive models currently rely on multiple decoding iterations, which largely sacrifice the inference-speed advantage of non-autoregressive generation. Inspired by the way word dependencies are learned in autoregressive and iterative-decoding models, we propose the Glancing Transformer (GLAT) with a glancing language model, which learns to capture word dependencies in a gradual fashion. Experiments on three benchmarks demonstrate that our approach significantly improves the accuracy of non-autoregressive models without multiple decoding iterations. In particular, GLAT achieves state-of-the-art results among non-iterative models and even outperforms top iterative counterparts on some benchmarks.
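To make the glancing idea concrete, here is a minimal PyTorch sketch of one training step, assuming a hypothetical `model` object exposing `encode`/`decode` methods and hypothetical `pad_id`/`mask_id` token ids; it illustrates the two-pass scheme (a parallel first pass, then revealing reference tokens in proportion to the prediction error) rather than GLAT's official implementation, which replaces decoder input embeddings rather than token ids.

```python
import torch
import torch.nn.functional as F

def glancing_training_step(model, src, tgt, pad_id, mask_id, ratio=0.5):
    """One glancing training step (sketch, hypothetical model API).

    First pass: decode all positions in parallel from [MASK] inputs and
    measure the Hamming distance to the reference. Glancing: reveal a
    number of reference tokens proportional to that distance. Second pass:
    train the model to predict the tokens that were not revealed.
    """
    enc = model.encode(src)  # encoder states, shared by both passes

    # First pass: fully parallel prediction from all-[MASK] decoder inputs.
    dec_in = torch.full_like(tgt, mask_id)
    with torch.no_grad():
        pred = model.decode(dec_in, enc).argmax(-1)

    # Number of tokens to reveal, proportional to per-sentence errors.
    tgt_mask = tgt.ne(pad_id)
    distance = (pred.ne(tgt) & tgt_mask).sum(-1)
    n_glance = (distance.float() * ratio).long()

    # Pick n_glance random non-pad positions per sentence to reveal.
    scores = torch.rand_like(tgt, dtype=torch.float).masked_fill(~tgt_mask, -1.0)
    ranks = scores.argsort(-1, descending=True).argsort(-1)
    glance = ranks < n_glance.unsqueeze(-1)
    dec_in = torch.where(glance, tgt, dec_in)

    # Second pass: predict the remaining (non-glanced) reference tokens.
    logits = model.decode(dec_in, enc)
    loss_mask = tgt_mask & ~glance
    return F.cross_entropy(logits[loss_mask], tgt[loss_mask])
```

As training progresses and the first-pass predictions improve, the Hamming distance shrinks, so fewer reference tokens are revealed; this is what gives the "gradual" curriculum over word dependencies while keeping inference a single parallel pass.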