VirTex, or Visual Representations from Textual Annotations, is a pretraining approach that uses semantically dense captions to learn visual representations. First, a ConvNet and a Transformer are jointly trained from scratch to generate natural language captions for images; the learned visual features are then transferred to downstream visual recognition tasks.
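The pipeline described above can be sketched in a few lines of PyTorch. This is a hypothetical minimal illustration, not the authors' implementation: the toy ConvNet stands in for the paper's ResNet-50 backbone, and the model, layer sizes, and vocabulary size are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class VirTexSketch(nn.Module):
    """Toy sketch of the VirTex idea: a conv backbone plus a Transformer
    decoder, jointly trained to caption images. Not the original code."""

    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        # Visual backbone: a tiny ConvNet standing in for ResNet-50.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # 4x4 grid of visual features
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, caption_tokens):
        # Image -> grid of features -> sequence of "memory" vectors
        # attended to by the caption decoder.
        feats = self.backbone(images)              # (B, D, 4, 4)
        memory = feats.flatten(2).transpose(1, 2)  # (B, 16, D)
        tgt = self.embed(caption_tokens)           # (B, T, D)
        # Causal mask so each token only sees earlier caption tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)                      # (B, T, vocab) logits

model = VirTexSketch()
images = torch.randn(2, 3, 64, 64)
tokens = torch.randint(0, 1000, (2, 5))
logits = model(images, tokens)
print(logits.shape)  # torch.Size([2, 5, 1000])
```

After captioning pretraining, the decoder would be discarded and `model.backbone` reused as the feature extractor for downstream tasks such as classification, detection, or segmentation.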
Source: VirTex: Learning Visual Representations from Textual Annotations
Task | Papers | Share
---|---|---
Image Reconstruction | 1 | 12.50%
Quantization | 1 | 12.50%
General Classification | 1 | 12.50%
Image Captioning | 1 | 12.50%
Image Classification | 1 | 12.50%
Instance Segmentation | 1 | 12.50%
Object Detection | 1 | 12.50%
Semantic Segmentation | 1 | 12.50%