Compressive Transformer

Introduced by Rae et al. in Compressive Transformers for Long-Range Sequence Modelling

The Compressive Transformer is an extension to the Transformer which maps past hidden activations (memories) to a smaller set of compressed representations (compressed memories). The Compressive Transformer uses the same attention mechanism over its set of memories and compressed memories, learning to query both its short-term granular memory and longer-term coarse memory. It builds on the ideas of Transformer-XL which maintains a memory of past activations at each layer to preserve a longer history of context. The Transformer-XL discards past activations when they become sufficiently old (controlled by the size of the memory). The key principle of the Compressive Transformer is to compress these old memories, instead of discarding them, and store them in an additional compressed memory.
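To make the dual-memory attention concrete, here is a minimal single-head sketch (in PyTorch) of a layer attending over the concatenation of compressed memory, memory, and the current segment. The function name and tensor shapes are illustrative, and multiple heads and relative positional encodings are omitted for brevity; this is not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def attend_with_memories(h, mem, comp_mem, w_q, w_k, w_v):
    """Single-head attention over the current segment plus both memory types.

    h:        (seq, d)    current hidden activations
    mem:      (n_mem, d)  short-term granular memory
    comp_mem: (n_cmem, d) long-term compressed memory
    w_q, w_k, w_v: (d, d) projection matrices
    """
    # Keys and values are drawn from compressed memory, memory, and the
    # current segment; queries come only from the current segment.
    context = torch.cat([comp_mem, mem, h], dim=0)   # (n_cmem + n_mem + seq, d)
    q = h @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.t() / k.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)
    return attn @ v                                   # (seq, d)
```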

At each time step $t$, the oldest compressed memories are discarded (FIFO) and the oldest $n$ states in ordinary memory are compressed and shifted into the freed slots of compressed memory. During training, the compression function is optimized separately from the main language model, with a local auxiliary loss (such as attention reconstruction) rather than by backpropagating the language-modelling loss through the compression step.
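As a rough sketch of this per-layer update, assuming a PyTorch setting and using mean-pooling as the compression function (one of the options considered in the paper): the function name, the assumption that the segment length divides evenly by the compression rate, and the `detach` used to stop gradients through stored memories are illustrative simplifications.

```python
import torch

def update_memories(mem, comp_mem, new_states, compression_rate=3):
    """One per-layer memory update after processing a segment (sketch).

    mem:        (n_mem, d)  FIFO memory of past activations
    comp_mem:   (n_cmem, d) FIFO compressed memory
    new_states: (seq, d)    activations of the segment just processed;
                            seq is assumed divisible by compression_rate.
    """
    n_mem, n_cmem = mem.shape[0], comp_mem.shape[0]

    # Push the new activations into memory; the oldest `seq` states overflow.
    mem = torch.cat([mem, new_states.detach()], dim=0)
    overflow, mem = mem[:-n_mem], mem[-n_mem:]

    # Compress the overflowing states, here by mean-pooling groups of
    # `compression_rate` states into a single compressed representation.
    seq, d = overflow.shape
    compressed = overflow.view(seq // compression_rate, compression_rate, d).mean(dim=1)

    # Push the compressed states into compressed memory, evicting its
    # oldest entries (FIFO).
    comp_mem = torch.cat([comp_mem, compressed], dim=0)[-n_cmem:]
    return mem, comp_mem
```

A learned compression function (such as a strided 1D convolution) would replace the mean-pooling step and be trained with the auxiliary loss described above.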

Source: Compressive Transformers for Long-Range Sequence Modelling

Tasks


Task Papers Share
Data Visualization 1 20.00%
Dimensionality Reduction 1 20.00%
Embeddings Evaluation 1 20.00%
Sentence 1 20.00%
Language Modelling 1 20.00%

Categories

Transformers