Self-Supervised Learning

MoBY

Introduced by Xie et al. in Self-Supervised Learning with Swin Transformers

MoBY is a self-supervised learning approach for Vision Transformers. The approach combines MoCo v2 and BYOL: it inherits the momentum design, the key queue, and the contrastive loss from MoCo v2, and the asymmetric encoders, asymmetric data augmentations, and the momentum scheduler from BYOL. The name MoBY takes the first two letters of each method.

The MoBY approach is illustrated in the Figure. There are two encoders: an online encoder and a target encoder. Both encoders consist of a backbone and a projector head (2-layer MLP), and the online encoder adds a prediction head (2-layer MLP), which makes the two encoders asymmetric. The online encoder is updated by gradients, while the target encoder is a moving average of the online encoder, updated by momentum at each training iteration. A gradually increasing momentum update strategy is applied to the target encoder: the momentum value is gradually increased to 1 over the course of training. The default starting value is $0.99$.
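A minimal PyTorch sketch of this asymmetric encoder pair and the momentum update is shown below. The module and function names are illustrative, and the MLP hidden/output dimensions and cosine schedule are assumptions; only the backbone + projector (+ predictor) structure, the EMA update, and the starting momentum of 0.99 ramping to 1 come from the text.

```python
import copy
import math
import torch
import torch.nn as nn

def mlp(in_dim, hidden_dim=4096, out_dim=256):
    # 2-layer MLP used for both the projector and the predictor
    # (hidden/output dimensions here are illustrative).
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

class MoBYEncoders(nn.Module):
    def __init__(self, backbone, feat_dim, base_momentum=0.99, total_steps=100_000):
        super().__init__()
        # Online encoder: backbone + projector + predictor, updated by gradients.
        self.online_backbone = backbone
        self.online_projector = mlp(feat_dim)
        self.predictor = mlp(256)  # the extra head makes the two encoders asymmetric
        # Target encoder: backbone + projector only, updated by EMA (no gradients).
        self.target_backbone = copy.deepcopy(backbone)
        self.target_projector = copy.deepcopy(self.online_projector)
        for p in list(self.target_backbone.parameters()) + list(self.target_projector.parameters()):
            p.requires_grad = False
        self.base_momentum = base_momentum
        self.total_steps = total_steps

    @torch.no_grad()
    def momentum_update(self, step):
        # Momentum increases from base_momentum towards 1 over training
        # (a cosine ramp is assumed here for illustration).
        m = 1.0 - (1.0 - self.base_momentum) * (math.cos(math.pi * step / self.total_steps) + 1) / 2
        for online, target in [
            (self.online_backbone, self.target_backbone),
            (self.online_projector, self.target_projector),
        ]:
            for p_o, p_t in zip(online.parameters(), target.parameters()):
                p_t.data = p_t.data * m + p_o.data * (1.0 - m)
```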

A contrastive loss is applied to learn the representations. Specifically, for an online view $q$, its contrastive loss is computed as

$$ \mathcal{L}_{q}=-\log \frac{\exp \left(q \cdot k_{+} / \tau\right)}{\sum_{i=0}^{K} \exp \left(q \cdot k_{i} / \tau\right)} $$

where $k_{+}$ is the target feature for the other view of the same image; $k_{i}$ is a target feature in the key queue; $\tau$ is a temperature term; $K$ is the size of the key queue (4096 by default).
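The sketch below shows one way to compute this loss in PyTorch, assuming `q` and `k_plus` are L2-normalized online/target features for two views of the same batch and `queue` holds the $K$ previously enqueued target features. Function and argument names are illustrative, and the temperature value is an assumption, not a figure from the text.

```python
import torch
import torch.nn.functional as F

def moby_contrastive_loss(q, k_plus, queue, tau=0.2):
    """q: (B, D) online features; k_plus: (B, D) paired target features; queue: (K, D)."""
    # Positive logits: similarity between each online view and its paired target view.
    l_pos = torch.einsum('bd,bd->b', q, k_plus).unsqueeze(1)   # (B, 1)
    # Negative logits: similarity against all keys currently in the queue.
    l_neg = torch.einsum('bd,kd->bk', q, queue)                # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau            # (B, 1+K)
    # The positive key sits at index 0, so cross-entropy with label 0
    # is exactly the -log softmax term in the equation above.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```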

In training, like most Transformer-based methods, the AdamW optimizer is used, in contrast to previous self-supervised learning approaches built on ResNet backbones, which usually use SGD or LARS $[4,8,19]$. The authors also use asymmetric drop path rates as a regularizer, which proves important for the final performance.

In the experiments, the authors adopt a fixed learning rate of $0.001$ and a fixed weight decay of $0.05$, which perform stably well. The tuned hyper-parameters are the key queue size $K$, the starting momentum value of the target branch, the temperature $\tau$, and the drop path rates.
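A minimal configuration sketch under these settings is given below. Only the optimizer (AdamW with lr 0.001, weight decay 0.05), the queue size of 4096, and the starting momentum of 0.99 come from the text; the temperature and drop path values, and the stand-in model, are illustrative placeholders.

```python
import torch
import torch.nn as nn

config = {
    "optimizer": "AdamW",
    "lr": 1e-3,               # fixed learning rate (from the text)
    "weight_decay": 0.05,     # fixed weight decay (from the text)
    "queue_size": 4096,       # K, default key queue size (from the text)
    "temperature": 0.2,       # tau (illustrative value, tuned in practice)
    "start_momentum": 0.99,   # starting momentum of the target branch (from the text)
    "drop_path_online": 0.1,  # asymmetric drop path: stronger on the online encoder (illustrative)
    "drop_path_target": 0.0,  # weaker or no drop path on the target encoder (illustrative)
}

# Stand-in model for demonstration; in practice this would be the online
# encoder (Swin backbone + projector + predictor).
model = nn.Sequential(nn.Linear(768, 256))
optimizer = torch.optim.AdamW(
    model.parameters(), lr=config["lr"], weight_decay=config["weight_decay"]
)
```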

Source: Self-Supervised Learning with Swin Transformers
