A NonLocal Operation is a component for capturing longrange dependencies with deep neural networks. It is a generalization of the classical nonlocal mean operation in computer vision. Intuitively a nonlocal operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. The set of positions can be in space, time, or spacetime, implying that these operations are applicable for image, sequence, and video problems.
Following the nonlocal mean operation, a generic nonlocal operation for deep neural networks is defined as:
$$ \mathbb{y}_{i} = \frac{1}{\mathcal{C}\left(\mathbb{x}\right)}\sum_{\forall{j}}f\left(\mathbb{x}_{i}, \mathbb{x}_{j}\right)g\left(\mathbb{x}_{j}\right) $$
Here $i$ is the index of an output position (in space, time, or spacetime) whose response is to be computed and $j$ is the index that enumerates all possible positions. x is the input signal (image, sequence, video; often their features) and $y$ is the output signal of the same size as $x$. A pairwise function $f$ computes a scalar (representing relationship such as affinity) between $i$ and all $j$. The unary function $g$ computes a representation of the input signal at the position $j$. The response is normalized by a factor $C\left(x\right)$.
The nonlocal behavior is due to the fact that all positions ($\forall{j}$) are considered in the operation. As a comparison, a convolutional operation sums up the weighted input in a local neighborhood (e.g., $i − 1 \leq j \leq i + 1$ in a 1D case with kernel size 3), and a recurrent operation at time $i$ is often based only on the current and the latest time steps (e.g., $j = i$ or $i − 1$).
The nonlocal operation is also different from a fullyconnected (fc) layer. The equation above computes responses based on relationships between different locations, whereas fc uses learned weights. In other words, the relationship between $x_{j}$ and $x_{i}$ is not a function of the input data in fc, unlike in nonlocal layers. Furthermore, the formulation in the equation above supports inputs of variable sizes, and maintains the corresponding size in the output. On the contrary, an fc layer requires a fixedsize input/output and loses positional correspondence (e.g., that from $x_{i}$ to $y_{i}$ at the position $i$).
A nonlocal operation is a flexible building block and can be easily used together with convolutional/recurrent layers. It can be added into the earlier part of deep neural networks, unlike fc layers that are often used in the end. This allows us to build a richer hierarchy that combines both nonlocal and local information.
In terms of parameterisation, we usually parameterise $g$ as a linear embedding of the form $g\left(x_{j}\right) = W_{g}\mathbb{x}_{j}$ , where $W_{g}$ is a weight matrix to be learned. This is implemented as, e.g., 1×1 convolution in space or 1×1×1 convolution in spacetime. For $f$ we use an affinity function, a list of which can be found here.
Source:TASK  PAPERS  SHARE 

Image Generation  15  14.71% 
Semantic Segmentation  8  7.84% 
Conditional Image Generation  7  6.86% 
Object Detection  6  5.88% 
Action Recognition  5  4.90% 
Instance Segmentation  5  4.90% 
SuperResolution  3  2.94% 
Image Inpainting  2  1.96% 
Video Classification  2  1.96% 
COMPONENT  TYPE 


1x1 Convolution

Convolutions 