Semantic Cross Attention (SCA) is based on cross attention, restricted according to a semantic mask.
The goal of SCA is two-fold, depending on which input serves as the query and which as the key: it either provides the feature map with information from a semantically restricted set of latents or, conversely, lets a set of latents retrieve information from a semantically restricted region of the feature map.
SCA is defined as:
\begin{equation} \text{SCA}(I_{1}, I_{2}, I_{3}) = \sigma\left(\frac{QK^T\odot I_{3} +\tau \left(1-I_{3}\right)}{\sqrt{d_{in}}}\right)V \quad , \end{equation}
where $I_{1},I_{2},I_{3}$ are the inputs, with $I_{1}$ attending $I_{2}$, and $I_{3}$ the mask that forces tokens from $I_1$ to attend only specific tokens from $I_2$. Attention values requiring masking are filled with a large negative value $\tau$ before the softmax (in practice $\tau{=}-10^9$, approximating $-\infty$). $Q {=} W_QI_{1}$, $K {=} W_KI_{2}$ and $V {=} W_VI_{2}$ are the queries, keys and values, $d_{in}$ is the internal attention dimension, and $\sigma(\cdot)$ is the softmax operation.
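The masked-attention equation above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under assumed conventions (row-wise tokens, weight matrices applied on the right, no output projection or multi-head logic); the function and variable names are ours, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sca(I1, I2, I3, W_Q, W_K, W_V, tau=-1e9):
    """Semantic Cross Attention: tokens of I1 attend tokens of I2,
    restricted by the binary mask I3 of shape (len(I1), len(I2))."""
    Q = I1 @ W_Q                      # queries from I1
    K = I2 @ W_K                      # keys from I2
    V = I2 @ W_V                      # values from I2
    d_in = Q.shape[-1]
    # Masked positions (I3 == 0) receive tau, a large negative logit,
    # so their post-softmax weight is effectively zero.
    logits = (Q @ K.T * I3 + tau * (1 - I3)) / np.sqrt(d_in)
    return softmax(logits, axis=-1) @ V
```

With a mask row that keeps a single key, the corresponding output row reduces to that key's value vector, which is the intended restriction behavior.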
Let $X\in\mathbb{R}^{n\times C}$ be the feature map, with $n$ the number of pixels and $C$ the number of channels. Let $Z\in\mathbb{R}^{m\times d}$ be a set of $m$ latents of dimension $d$, and $s$ the number of semantic labels. Each semantic label is assigned $k$ latents, such that $m=k\times s$. The mask of each semantic label is replicated $k$ times (once per latent) to form $S{\in}\{0,1\}^{n \times m}$.
We can differentiate 3 types of SCA:
(a) SCA with pixels $X$ attending latents $Z$: $\text{SCA}(X, Z, S)$, where $W_{Q} {\in} \mathbb{R}^{n\times d_{in}}$ and $W_{K}, W_{V} {\in} \mathbb{R}^{m\times d_{in}}$. The idea is to force the pixels from a semantic region to attend latents that are associated with the same label.
(b) SCA with latents $Z$ attending pixels $X$: $\text{SCA}(Z, X, S)$, where $W_{Q}{\in} \mathbb{R}^{m\times d_{in}}$, $W_{K}, W_{V} {\in} \mathbb{R}^{n\times d_{in}}$. The idea is to semantically mask attention values to enforce latents to attend semantically corresponding pixels.
(c) SCA with latents $Z$ attending themselves: $\text{SCA}(Z, Z, M)$, where $W_{Q}, W_{K}, W_{V} {\in} \mathbb{R}^{m\times d_{in}}$. We denote $M\in\{0,1\}^{m\times m}$ this mask, with $M(i,j) {=} 1$ if the semantic label of latent $i$ is the same as that of latent $j$, and $0$ otherwise. The idea is to let the latents only attend latents that share the same semantic label.
Source: SCAM! Transferring humans between images with Semantic Cross Attention Modulation