Self-Supervised Learning

MoBY

Introduced by Xie et al. in Self-Supervised Learning with Swin Transformers

MoBY is a self-supervised learning approach for Vision Transformers. The approach combines MoCo v2 and BYOL: it inherits the momentum design, the key queue, and the contrastive loss from MoCo v2, and the asymmetric encoders, asymmetric data augmentations, and the momentum scheduler from BYOL. The name MoBY takes the first two letters of each method.

The MoBY approach is illustrated in the Figure. There are two encoders: an online encoder and a target encoder. Both encoders consist of a backbone and a projector head (2-layer MLP), and the online encoder adds a prediction head (2-layer MLP), which makes the two encoders asymmetric. The online encoder is updated by gradients, while the target encoder is a moving average of the online encoder, updated by momentum at each training iteration. A gradually increasing momentum update strategy is applied to the target encoder: the momentum value is gradually increased to 1 over the course of training. The default starting value is $0.99$.
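A minimal PyTorch sketch of this asymmetric encoder pair and the momentum update is shown below. The module and function names are illustrative, and the MLP hidden/output dimensions and cosine schedule are assumptions; only the backbone + projector (+ predictor) structure, the EMA update, and the starting momentum of 0.99 ramping to 1 come from the text.

```python
import copy
import math
import torch
import torch.nn as nn

def mlp(in_dim, hidden_dim=4096, out_dim=256):
    # 2-layer MLP used for both the projector and the predictor
    # (hidden/output dimensions here are illustrative).
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

class MoBYEncoders(nn.Module):
    def __init__(self, backbone, feat_dim, base_momentum=0.99, total_steps=100_000):
        super().__init__()
        # Online encoder: backbone + projector + predictor, updated by gradients.
        self.online_backbone = backbone
        self.online_projector = mlp(feat_dim)
        self.predictor = mlp(256)  # the extra head makes the two encoders asymmetric
        # Target encoder: backbone + projector only, updated by EMA (no gradients).
        self.target_backbone = copy.deepcopy(backbone)
        self.target_projector = copy.deepcopy(self.online_projector)
        for p in list(self.target_backbone.parameters()) + list(self.target_projector.parameters()):
            p.requires_grad = False
        self.base_momentum = base_momentum
        self.total_steps = total_steps

    @torch.no_grad()
    def momentum_update(self, step):
        # Momentum increases from base_momentum towards 1 over training
        # (a cosine ramp is assumed here for illustration).
        m = 1.0 - (1.0 - self.base_momentum) * (math.cos(math.pi * step / self.total_steps) + 1) / 2
        for online, target in [
            (self.online_backbone, self.target_backbone),
            (self.online_projector, self.target_projector),
        ]:
            for p_o, p_t in zip(online.parameters(), target.parameters()):
                p_t.data = p_t.data * m + p_o.data * (1.0 - m)
```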

A contrastive loss is applied to learn the representations. Specifically, for an online view $q$, its contrastive loss is computed as

$$ \mathcal{L}_{q}=-\log \frac{\exp \left(q \cdot k_{+} / \tau\right)}{\sum_{i=0}^{K} \exp \left(q \cdot k_{i} / \tau\right)} $$

where $k_{+}$ is the target feature for the other view of the same image; $k_{i}$ is a target feature in the key queue; $\tau$ is a temperature term; $K$ is the size of the key queue (4096 by default).
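The sketch below shows one way to compute this loss in PyTorch, assuming `q` and `k_plus` are L2-normalized online/target features for two views of the same batch and `queue` holds the $K$ previously enqueued target features. Function and argument names are illustrative, and the temperature value is an assumption, not a figure from the text.

```python
import torch
import torch.nn.functional as F

def moby_contrastive_loss(q, k_plus, queue, tau=0.2):
    """q: (B, D) online features; k_plus: (B, D) paired target features; queue: (K, D)."""
    # Positive logits: similarity between each online view and its paired target view.
    l_pos = torch.einsum('bd,bd->b', q, k_plus).unsqueeze(1)   # (B, 1)
    # Negative logits: similarity against all keys currently in the queue.
    l_neg = torch.einsum('bd,kd->bk', q, queue)                # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau            # (B, 1+K)
    # The positive key sits at index 0, so cross-entropy with label 0
    # is exactly the -log softmax term in the equation above.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```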

In training, like most Transformer-based methods, the AdamW optimizer is used, in contrast to previous self-supervised learning approaches built on ResNet backbones, which usually use SGD or LARS $[4,8,19]$. The authors also use asymmetric drop path rates as a regularizer, which proves important for the final performance.

In the experiments, the authors adopt a fixed learning rate of $0.001$ and a fixed weight decay of $0.05$, which perform stably well. The tuned hyper-parameters are the key queue size $K$, the starting momentum value of the target branch, the temperature $\tau$, and the drop path rates.
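A minimal configuration sketch under these settings is given below. Only the optimizer (AdamW with lr 0.001, weight decay 0.05), the queue size of 4096, and the starting momentum of 0.99 come from the text; the temperature and drop path values, and the stand-in model, are illustrative placeholders.

```python
import torch
import torch.nn as nn

config = {
    "optimizer": "AdamW",
    "lr": 1e-3,               # fixed learning rate (from the text)
    "weight_decay": 0.05,     # fixed weight decay (from the text)
    "queue_size": 4096,       # K, default key queue size (from the text)
    "temperature": 0.2,       # tau (illustrative value, tuned in practice)
    "start_momentum": 0.99,   # starting momentum of the target branch (from the text)
    "drop_path_online": 0.1,  # asymmetric drop path: stronger on the online encoder (illustrative)
    "drop_path_target": 0.0,  # weaker or no drop path on the target encoder (illustrative)
}

# Stand-in model for demonstration; in practice this would be the online
# encoder (Swin backbone + projector + predictor).
model = nn.Sequential(nn.Linear(768, 256))
optimizer = torch.optim.AdamW(
    model.parameters(), lr=config["lr"], weight_decay=config["weight_decay"]
)
```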

Source: Self-Supervised Learning with Swin Transformers
