no code implementations • 17 Nov 2023 • Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra
We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image.
1 code implementation • CVPR 2023 • Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
We show that not all combinations of paired data are necessary to train such a joint embedding, and that image-paired data alone is sufficient to bind the modalities together.
Ranked #2 on Zero-shot Classification (unified classes) on LLVIP
1 code implementation • ICCV 2023 • Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well.
Ranked #1 on Few-Shot Image Classification on ImageNet - 10-shot (using extra training data)
1 code implementation • CVPR 2023 • Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures.
2 code implementations • CVPR 2022 • Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra
Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data.
Ranked #1 on Scene Recognition on SUN-RGBD (using extra training data)
2 code implementations • CVPR 2022 • Mannat Singh, Laura Gustafson, Aaron Adcock, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick, Piotr Dollár, Laurens van der Maaten
Model pre-training is a cornerstone of modern visual recognition systems.
Ranked #1 on Out-of-Distribution Generalization on ImageNet-W (using extra training data)
Fine-Grained Image Classification, Out-of-Distribution Generalization, +3
1 code implementation • NeurIPS 2021 • Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick
To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3×3 convolutions.
3 code implementations • 26 Apr 2021 • Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Ranked #1 on Visual Question Answering (VQA) on CLEVR-Humans
Generalized Referring Expression Comprehension, Phrase Grounding, +9
4 code implementations • CVPR 2021 • Piotr Dollár, Mannat Singh, Ross Girshick
This leads us to propose a simple fast compound scaling strategy that encourages primarily scaling model width, while scaling depth and resolution to a lesser extent.
1 code implementation • 2 Mar 2021 • Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, Piotr Bojanowski
Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods.
Ranked #6 on Image Classification on Places205
Self-Supervised Image Classification, Self-Supervised Learning, +1
1 code implementation • ICCV 2021 • Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion
We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.
Ranked #2 on Referring Expression Comprehension on Talk2Car (using extra training data)