Vision Transformers

EsViT proposes two techniques for developing efficient self-supervised vision transformers for visual representation learning: a multi-stage architecture with sparse self-attention and a new pre-training task of region matching. The multi-stage architecture reduces modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. The region-matching pre-training task restores this ability, allowing the model to capture fine-grained region dependencies and thereby significantly improving the quality of the learned visual representations.

Source: Efficient Self-supervised Vision Transformers for Representation Learning
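
To make the region-matching task concrete, below is a minimal sketch of a DINO-style region-level matching loss, assuming the student and teacher networks emit per-region prototype scores for two augmented views. All names, shapes, and the choice to match regions by score similarity are illustrative assumptions (the paper matches regions in feature space), not EsViT's exact implementation.

```python
import torch
import torch.nn.functional as F

def region_matching_loss(student_regions: torch.Tensor,
                         teacher_regions: torch.Tensor,
                         temp_s: float = 0.1,
                         temp_t: float = 0.04) -> torch.Tensor:
    """Hypothetical region-matching loss sketch.

    student_regions, teacher_regions: [num_regions, num_prototypes]
    scores for each spatial region of the two augmented views.
    """
    # Per-region probability distributions over prototypes; the teacher
    # uses a sharper temperature and is never backpropagated through.
    log_p_student = F.log_softmax(student_regions / temp_s, dim=-1)
    p_teacher = F.softmax(teacher_regions / temp_t, dim=-1).detach()

    # Match every student region to its most similar teacher region
    # (cosine similarity here; a stand-in for the paper's feature-space match).
    s_norm = F.normalize(student_regions, dim=-1)
    t_norm = F.normalize(teacher_regions, dim=-1)
    match = (s_norm @ t_norm.T).argmax(dim=-1)  # [num_regions]

    # Cross-entropy between each matched teacher distribution and the
    # corresponding student distribution, averaged over regions.
    return -(p_teacher[match] * log_p_student).sum(dim=-1).mean()

# Usage with dummy data: a 7x7 grid of regions, 4096 prototypes.
student = torch.randn(49, 4096)
teacher = torch.randn(49, 4096)
print(region_matching_loss(student, teacher))
```

In the paper, this region-level term complements the usual view-level self-distillation loss rather than replacing it.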

Tasks

Task                                  Papers  Share
Self-Supervised Image Classification  1       100.00%