Multiscale Attention ViT with Late fusion (MAVL) is a multi-modal network, trained on aligned image-text pairs, that performs targeted detection from human-understandable natural language text queries. It uses multi-scale image features and deformable convolutions with late multi-modal fusion. The authors demonstrate that MAVL performs strongly as a class-agnostic object detector when queried with generic natural language commands such as "all objects" or "all entities".
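The idea of late fusion over multi-scale features can be illustrated with a toy sketch. This is not the MAVL implementation: plain dot-product attention stands in for the deformable operations, random numbers stand in for learned image and text features, and every name (`flatten_multiscale`, `attention_weights`, `late_fusion`, `random_map`) is hypothetical. It only shows that each modality is encoded separately and the two are combined in a single step at the end.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_multiscale(feature_maps):
    """Concatenate the cells of every H x W x D map into one list of D-dim tokens."""
    return [cell for fmap in feature_maps for row in fmap for cell in row]


def attention_weights(image_tokens, text_tokens):
    """Scaled dot-product weights of each image token over the text tokens."""
    dim = len(text_tokens[0])
    weights = []
    for img in image_tokens:
        scores = [
            sum(a * b for a, b in zip(img, txt)) / math.sqrt(dim)
            for txt in text_tokens
        ]
        weights.append(softmax(scores))
    return weights


def late_fusion(image_tokens, text_tokens):
    """Fuse only after both modalities are encoded (assumed, simplified scheme):
    each image token adds a text context vector weighted by attention."""
    fused = []
    for img, w in zip(image_tokens, attention_weights(image_tokens, text_tokens)):
        context = [
            sum(wj * txt[k] for wj, txt in zip(w, text_tokens))
            for k in range(len(img))
        ]
        fused.append([a + c for a, c in zip(img, context)])
    return fused


def random_map(size, dim, rng):
    """Stand-in for one scale of backbone features: a size x size x dim grid."""
    return [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(size)]
            for _ in range(size)]


rng = random.Random(0)
DIM = 8
# Three scales of image features (4x4, 2x2, 1x1); values are random placeholders.
feature_maps = [random_map(s, DIM, rng) for s in (4, 2, 1)]
# Placeholder embeddings for a two-word query such as "all objects".
text_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(2)]

image_tokens = flatten_multiscale(feature_maps)
fused = late_fusion(image_tokens, text_tokens)
print(len(image_tokens), len(fused[0]))
```

In this sketch the text query never touches the image features until `late_fusion` is called, which is the property the term "late fusion" refers to; a real detector would follow the fused tokens with a decoder that predicts boxes.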
Source: Class-agnostic Object Detection with Multi-modal Transformer
Task | Papers | Share |
---|---|---|
Open Vocabulary Attribute Detection | 1 | 14.29% |
Open Vocabulary Object Detection | 1 | 14.29% |
Zero-Shot Object Detection | 1 | 14.29% |
Class-agnostic Object Detection | 1 | 14.29% |
Object Detection | 1 | 14.29% |
Object Proposal Generation | 1 | 14.29% |
Open World Object Detection | 1 | 14.29% |