The Dense Prediction Transformer (DPT) is a vision transformer architecture for dense prediction tasks such as depth estimation and semantic segmentation.
The input image is transformed into tokens (orange) either by extracting non-overlapping patches followed by a linear projection of their flattened representations (DPT-Base and DPT-Large), or by applying a ResNet-50 feature extractor (DPT-Hybrid). The image embedding is augmented with a positional embedding, and a patch-independent readout token (red) is added. The tokens are passed through multiple transformer stages, then reassembled from different stages into image-like representations at multiple resolutions (green). Fusion modules (purple) progressively fuse and upsample these representations to generate a fine-grained prediction.
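The tokenization step described above can be sketched in NumPy. This is a minimal illustration, not the DPT implementation: the patch size, embedding dimension, and random stand-in weights are assumptions chosen to mirror typical ViT defaults.

```python
import numpy as np

def tokenize(image, patch=16, dim=768, seed=0):
    """Split an image into non-overlapping patches, flatten and project
    them, prepend a readout token, and add positional embeddings.
    Weights here are random stand-ins for learned parameters."""
    rng = np.random.default_rng(seed)
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    n_h, n_w = H // patch, W // patch
    # Extract non-overlapping patches and flatten each one.
    patches = image.reshape(n_h, patch, n_w, patch, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(n_h * n_w, patch * patch * C)
    # Linear projection of the flattened patch representations.
    W_proj = rng.standard_normal((patch * patch * C, dim)) * 0.02
    tokens = patches @ W_proj
    # Prepend the patch-independent readout token.
    readout = rng.standard_normal((1, dim)) * 0.02
    tokens = np.concatenate([readout, tokens], axis=0)
    # Add a positional embedding (random stand-in for a learned one).
    pos = rng.standard_normal(tokens.shape) * 0.02
    return tokens + pos

img = np.zeros((224, 224, 3))
print(tokenize(img).shape)  # (197, 768): 14*14 patches + 1 readout token
```

A 224×224 image with 16×16 patches yields 196 patch tokens; the readout token brings the sequence length to 197, matching the token count of a standard ViT-Base encoder.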
Source: Vision Transformers for Dense Prediction
Task | Papers | Share |
---|---|---|
Depth Estimation | 4 | 8.51% |
Monocular Depth Estimation | 4 | 8.51% |
Language Modelling | 4 | 8.51% |
Classification | 2 | 4.26% |
Image Classification | 2 | 4.26% |
Decision Making | 2 | 4.26% |
Question Answering | 2 | 4.26% |
Semantic Segmentation | 2 | 4.26% |
Object Recognition | 1 | 2.13% |