Multi-Query Attention

Introduced by Shazeer in Fast Transformer Decoding: One Write-Head is All You Need

Multi-head attention consists of multiple attention layers (heads) in parallel with different linear transformations on the queries, keys, values and outputs. Multi-query attention is identical except that the different heads share a single set of keys and values.

Source: Fast Transformer Decoding: One Write-Head is All You Need
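Below is a minimal sketch of the idea, assuming PyTorch; the class and parameter names (MultiQueryAttention, d_model, n_heads) are illustrative, not from the paper. Queries and the output projection remain per-head as in standard multi-head attention, while a single key/value projection of head dimension is shared across all heads and applied to every query head via broadcasting.

```python
import torch
import torch.nn.functional as F
from torch import nn


class MultiQueryAttention(nn.Module):
    """Illustrative multi-query attention (no masking or KV caching)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Per-head query projections, as in standard multi-head attention.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        # A single key and a single value projection shared by all heads
        # (the defining change of multi-query attention).
        self.w_k = nn.Linear(d_model, self.d_head, bias=False)
        self.w_v = nn.Linear(d_model, self.d_head, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Queries: one set per head -> (b, n_heads, t, d_head)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Keys/values: a single shared head -> (b, 1, t, d_head)
        k = self.w_k(x).unsqueeze(1)
        v = self.w_v(x).unsqueeze(1)
        # Broadcasting over the head dimension applies the shared K/V
        # to every query head.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (b, n_heads, t, t)
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                                          # (b, n_heads, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, -1)             # (b, t, d_model)
        return self.w_o(out)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)                  # (batch, seq_len, d_model)
    mqa = MultiQueryAttention(d_model=64, n_heads=8)
    print(mqa(x).shape)                          # torch.Size([2, 16, 64])
```

Because keys and values are shared, the key/value tensors (and any decode-time key/value cache) are smaller by a factor of the number of heads, which is the memory-bandwidth saving that motivates the technique for fast incremental decoding.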
