no code implementations • 28 Mar 2024 • Bo Wan, Michael Tschannen, Yongqin Xian, Filip Pavetic, Ibrahim Alabdulmohsin, Xiao Wang, André Susano Pinto, Andreas Steiner, Lucas Beyer, Xiaohua Zhai
In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa).
no code implementations • 7 Mar 2024 • Ibrahim Alabdulmohsin, Xiao Wang, Andreas Steiner, Priya Goyal, Alexander D'Amour, Xiaohua Zhai
Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e. g. applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77. 5%!
no code implementations • 20 Oct 2023 • Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc van Gool, Federico Tombari
However, the contrastive objective used by these models only focuses on image-text alignment and does not incentivise image feature learning for dense prediction tasks.
1 code implementation • 13 Oct 2023 • Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger.
Ranked #2 on Temporal/Casual QA on NExT-QA (using extra training data)
1 code implementation • NeurIPS 2023 • Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer
We further analyze the effect of the model architecture and scale, as well as the pretraining data on the representation quality, and find that captioning exhibits the same or better scaling behavior along these axes.
2 code implementations • 29 May 2023 • Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
Ranked #1 on Fine-Grained Image Recognition on OVEN
1 code implementation • NeurIPS 2023 • Jannik Kossen, Mark Collier, Basil Mustafa, Xiao Wang, Xiaohua Zhai, Lucas Beyer, Andreas Steiner, Jesse Berent, Rodolphe Jenatton, Efi Kokiopoulou
With 3T, we propose a more flexible strategy that allows the image tower to benefit from both pretrained embeddings and contrastive training.
1 code implementation • NeurIPS 2023 • Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration.
1 code implementation • 30 Mar 2023 • Lucas Beyer, Bo Wan, Gagan Madan, Filip Pavetic, Andreas Steiner, Alexander Kolesnikov, André Susano Pinto, Emanuele Bugliarello, Xiao Wang, Qihang Yu, Liang-Chieh Chen, Xiaohua Zhai
A key finding is that a small decoder learned on top of a frozen pretrained encoder works surprisingly well.
8 code implementations • ICCV 2023 • Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer
We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP).
1 code implementation • 16 Feb 2023 • André Susano Pinto, Alexander Kolesnikov, Yuge Shi, Lucas Beyer, Xiaohua Zhai
Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models.
1 code implementation • 10 Feb 2023 • Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby
The scaling of Transformers has driven breakthrough capabilities for language models.
Ranked #1 on Zero-Shot Transfer Image Classification on ObjectNet
4 code implementations • CVPR 2023 • Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, Filip Pavetic
Vision Transformers convert images to sequences by slicing them into patches.
1 code implementation • 14 Sep 2022 • Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages.
1 code implementation • 13 Sep 2022 • Ibrahim Alabdulmohsin, Behnam Neyshabur, Xiaohua Zhai
The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules.
1 code implementation • 20 May 2022 • Alexander Kolesnikov, André Susano Pinto, Lucas Beyer, Xiaohua Zhai, Jeremiah Harmsen, Neil Houlsby
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
2 code implementations • 12 May 2022 • Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification.
Ranked #1 on One-Shot Object Detection on MS COCO
5 code implementations • 3 May 2022 • Lucas Beyer, Xiaohua Zhai, Alexander Kolesnikov
It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data.
3 code implementations • 17 Dec 2021 • Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou
In this paper, we comprehensively study three architecture design choices on ViT -- spatial reduction, doubled channels, and multiscale features -- and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy.
4 code implementations • CVPR 2022 • Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer
This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training.
16 code implementations • 18 Jun 2021 • Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, Lucas Beyer
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation.
1 code implementation • NeurIPS 2021 • Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, Mario Lucic
Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks.
3 code implementations • CVPR 2022 • Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, Alexander Kolesnikov
In particular, we uncover that there are certain implicit design choices, which may drastically affect the effectiveness of distillation.
Ranked #453 on Image Classification on ImageNet
1 code implementation • CVPR 2022 • Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer
As a result, we successfully train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90. 45% top-1 accuracy.
Ranked #3 on Image Classification on VTAB-1k (using extra training data)
46 code implementations • NeurIPS 2021 • Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
Convolutional Neural Networks (CNNs) are the go-to model for computer vision.
Ranked #17 on Image Classification on OmniBenchmark
1 code implementation • 9 Apr 2021 • Jessica Yung, Rob Romijnders, Alexander Kolesnikov, Lucas Beyer, Josip Djolonga, Neil Houlsby, Sylvain Gelly, Mario Lucic, Xiaohua Zhai
Before deploying machine learning models it is critical to assess their robustness.
1 code implementation • 6 Apr 2021 • Vincent Dumoulin, Neil Houlsby, Utku Evci, Xiaohua Zhai, Ross Goroshin, Sylvain Gelly, Hugo Larochelle
To bridge this gap, we perform a cross-family study of the best transfer and meta learners on both a large-scale meta-learning benchmark (Meta-Dataset, MD), and a transfer learning benchmark (Visual Task Adaptation Benchmark, VTAB).
no code implementations • 6 Nov 2020 • Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, D. Sculley
Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains.
144 code implementations • ICLR 2021 • Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited.
Ranked #1 on Image Classification on CIFAR-10 (using extra training data)
no code implementations • 30 Sep 2020 • Maxim Neumann, André Susano Pinto, Xiaohua Zhai, Neil Houlsby
Automatically finding good and general remote sensing representations allows to perform transfer learning on a wide range of applications - improving the accuracy and reducing the required number of training samples.
1 code implementation • CVPR 2021 • Josip Djolonga, Jessica Yung, Michael Tschannen, Rob Romijnders, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Matthias Minderer, Alexander D'Amour, Dan Moldovan, Sylvain Gelly, Neil Houlsby, Xiaohua Zhai, Mario Lucic
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts.
2 code implementations • 12 Jun 2020 • Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, Aäron van den Oord
Yes, and no.
8 code implementations • ECCV 2020 • Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby
We conduct detailed analysis of the main components that lead to high transfer performance.
Ranked #1 on Out-of-Distribution Generalization on ImageNet-W (using extra training data)
no code implementations • CVPR 2020 • Michael Tschannen, Josip Djolonga, Marvin Ritter, Aravindh Mahendran, Xiaohua Zhai, Neil Houlsby, Sylvain Gelly, Mario Lucic
We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI).
Ranked #15 on Image Classification on VTAB-1k (using extra training data)
1 code implementation • 15 Nov 2019 • Maxim Neumann, Andre Susano Pinto, Xiaohua Zhai, Neil Houlsby
Given the importance of remote sensing, surprisingly little attention has been paid to it by the representation learning community.
Ranked #1 on Multi-Label Image Classification on BigEarthNet (mAP (macro) metric)
2 code implementations • arXiv 2020 • Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, Neil Houlsby
And, how close are we to general visual representations?
Ranked #10 on Image Classification on VTAB-1k (using extra training data)
no code implementations • 25 Sep 2019 • Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, Neil Houlsby
Representation learning promises to unlock deep learning for the long tail of vision tasks without expansive labelled datasets.
1 code implementation • ICCV 2019 • Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, Lucas Beyer
This work tackles the problem of semi-supervised learning of image classifiers.
Ranked #15 on Semi-Supervised Image Classification on ImageNet - 10% labeled data (Top 5 Accuracy metric)
no code implementations • ICLR 2019 • Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, Sylvain Gelly
Generative adversarial networks (GANs) are a class of deep generative models which aim to learn a target distribution in an unsupervised fashion.
1 code implementation • 6 Mar 2019 • Mario Lucic, Michael Tschannen, Marvin Ritter, Xiaohua Zhai, Olivier Bachem, Sylvain Gelly
Deep generative models are becoming a cornerstone of modern machine learning.
Ranked #10 on Conditional Image Generation on ImageNet 128x128
Conditional Image Generation Vocal Bursts Intensity Prediction
6 code implementations • CVPR 2019 • Alexander Kolesnikov, Xiaohua Zhai, Lucas Beyer
Unsupervised visual representation learning remains a largely unsolved problem in computer vision research.
Ranked #119 on Self-Supervised Image Classification on ImageNet
Representation Learning Self-Supervised Image Classification +1
4 code implementations • CVPR 2019 • Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, Neil Houlsby
In this work we exploit two popular unsupervised learning techniques, adversarial training and self-supervision, and take a step towards bridging the gap between conditional and unconditional GANs.
Ranked #6 on Image Generation on CelebA-HQ 128x128
no code implementations • 27 Oct 2018 • Ting Chen, Xiaohua Zhai, Neil Houlsby
To counter forgetting, we encourage the discriminator to maintain useful representations by adding a self-supervision.
5 code implementations • ICLR 2019 • Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, Sylvain Gelly
Generative adversarial networks (GANs) are a class of deep generative models which aim to learn a target distribution in an unsupervised fashion.
no code implementations • 29 Mar 2018 • Sylvain Gelly, Karol Kurach, Marcin Michalski, Xiaohua Zhai
We propose a new learning paradigm called Deep Memory.