Search Results for author: Mannat Singh

Found 11 papers, 10 papers with code

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

no code implementations • 17 Nov 2023 • Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image.

Text-to-Video Generation • Video Generation

ImageBind: One Embedding Space To Bind Them All

1 code implementation • CVPR 2023 • Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together.

Ranked #2 on Zero-shot Classification (unified classes) on LLVIP

Cross-Modal Retrieval • Retrieval • +7

7,882

The effectiveness of MAE pre-pretraining for billion-scale pretraining

1 code implementation • ICCV 2023 • Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well.

Ranked #1 on Few-Shot Image Classification on ImageNet - 10-shot (using extra training data)

Action Classification • Action Recognition • +6

OmniMAE: Single Model Masked Pretraining on Images and Videos

1 code implementation • CVPR 2023 • Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures.

543

Omnivore: A Single Model for Many Visual Modalities

2 code implementations • CVPR 2022 • Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data.

Ranked #1 on Scene Recognition on SUN-RGBD (using extra training data)

Action Classification • Action Recognition • +3

2,993

Revisiting Weakly Supervised Pre-Training of Visual Perception Models

2 code implementations • CVPR 2022 • Mannat Singh, Laura Gustafson, Aaron Adcock, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick, Piotr Dollár, Laurens van der Maaten

Model pre-training is a cornerstone of modern visual recognition systems.

Ranked #1 on Out-of-Distribution Generalization on ImageNet-W (using extra training data)

Fine-Grained Image Classification • Out-of-Distribution Generalization • +3

165

Early Convolutions Help Transformers See Better

1 code implementation • NeurIPS 2021 • Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick

To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3×3 convolutions.

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

3 code implementations • 26 Apr 2021 • Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion

We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.

Ranked #1 on Visual Question Answering (VQA) on CLEVR-Humans

Generalized Referring Expression Comprehension • Phrase Grounding • +9

1,294

Fast and Accurate Model Scaling

4 code implementations • CVPR 2021 • Piotr Dollár, Mannat Singh, Ross Girshick

This leads us to propose a simple fast compound scaling strategy that encourages primarily scaling model width, while scaling depth and resolution to a lesser extent.

29,774

Self-supervised Pretraining of Visual Features in the Wild

1 code implementation • 2 Mar 2021 • Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, Piotr Bojanowski

Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods.

Ranked #6 on Image Classification on Places205

Self-Supervised Image Classification • Self-Supervised Learning • +1

3,229

MDETR - Modulated Detection for End-to-End Multi-Modal Understanding

1 code implementation • ICCV 2021 • Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion

We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.

Ranked #2 on Referring Expression Comprehension on Talk2Car (using extra training data)

Phrase Grounding • Question Answering • +3

939
