no code implementations • 7 Oct 2022 • Dara Bahri, Heinrich Jiang, Tal Schuster, Afshin Rostamizadeh
Given a labeled training set and a collection of unlabeled data, the goal of active learning (AL) is to identify the best unlabeled points to label.
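A minimal sketch of one standard selection rule from this literature, margin sampling: pick the pool points whose top-two class probabilities are closest. Everything below, including the toy probabilities, is illustrative rather than the paper's exact setup.

    import numpy as np

    def margin_select(probs, k):
        # Smallest gap between the top-two predicted probabilities =
        # most ambiguous = most informative to label next.
        srt = np.sort(probs, axis=1)          # ascending per row
        margin = srt[:, -1] - srt[:, -2]      # top-1 minus top-2
        return np.argsort(margin)[:k]         # smallest margins first

    # Toy class probabilities over the unlabeled pool from any classifier:
    probs = np.array([[0.90, 0.05, 0.05],
                      [0.40, 0.35, 0.25],
                      [0.50, 0.30, 0.20]])
    print(margin_select(probs, k=2))          # -> [1 2], the most ambiguous points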
no code implementations • 14 Jul 2022 • Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, Donald Metzler
Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks.
1 code implementation • 10 May 2022 • Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, Donald Metzler
Our model also achieves strong in-context learning results, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
Ranked #1 on Long-range modeling on SCROLLS (CNLI metric)
no code implementations • Findings (ACL) 2022 • Kai Hui, Honglei Zhuang, Tao Chen, Zhen Qin, Jing Lu, Dara Bahri, Ji Ma, Jai Prakash Gupta, Cicero Nogueira dos Santos, Yi Tay, Don Metzler
This results in significant inference time speedups since the decoder-only architecture only needs to learn to interpret static encoder embeddings during inference.
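A rough sketch of why that is fast: document representations can be produced once by the encoder and cached offline, leaving only a lightweight decoder pass per query. The functions encode and decoder_score below are illustrative stand-ins, not the paper's model.

    import numpy as np

    rng = np.random.default_rng(0)

    def encode(doc_tokens):
        # Stand-in for the expensive encoder (illustrative only).
        return rng.standard_normal((len(doc_tokens), 16))

    # Offline: run the encoder once per document and cache the output.
    corpus = {"d1": ["a", "b", "c"], "d2": ["d", "e"]}
    cache = {doc_id: encode(toks) for doc_id, toks in corpus.items()}

    def decoder_score(query_vec, doc_emb):
        # Stand-in for the light decoder that only reads static embeddings.
        return float(doc_emb.mean(axis=0) @ query_vec)

    # Online: only the cheap decoder pass runs per (query, document) pair.
    q = rng.standard_normal(16)
    scores = {d: decoder_score(q, emb) for d, emb in cache.items()}
    print(max(scores, key=scores.get))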
1 code implementation • 14 Feb 2022 • Yi Tay, Vinh Q. Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, Tal Schuster, William W. Cohen, Donald Metzler
In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model.
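A deliberately toy illustration of that interface, not the paper's Transformer: "indexing" writes text-to-docid pairs into the model's only parameters, and "retrieval" decodes a docid directly from the query. All names and data here are hypothetical.

    # "Indexing as training": the parameters memorize text -> docid pairs.
    train_pairs = [("climate change effects", "doc_17"),
                   ("transformer retrieval", "doc_42")]

    params = {}                         # the "model parameters" hold the corpus
    for text, docid in train_pairs:     # "training" = indexing
        for tok in text.split():
            params.setdefault(tok, {}).setdefault(docid, 0)
            params[tok][docid] += 1

    def generate_docid(query):          # "retrieval" = decoding a docid
        votes = {}
        for tok in query.split():
            for docid, c in params.get(tok, {}).items():
                votes[docid] = votes.get(docid, 0) + c
        return max(votes, key=votes.get) if votes else None

    print(generate_docid("retrieval with transformers"))  # -> doc_42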
3 code implementations • ICLR 2022 • Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, Donald Metzler
Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training.
no code implementations • ACL 2022 • Dara Bahri, Hossein Mobahi, Yi Tay
The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size.
no code implementations • 29 Sep 2021 • Yaodong Yu, Heinrich Jiang, Dara Bahri, Hossein Mobahi, Seungyeon Kim, Ankit Singh Rawat, Andreas Veit, Yi Ma
Concretely, we show that larger models and larger datasets need to be simultaneously leveraged to improve OOD performance.
1 code implementation • ACL 2021 • Yi Tay, Mostafa Dehghani, Jai Prakash Gupta, Vamsi Aribandi, Dara Bahri, Zhen Qin, Donald Metzler
In the context of language models, are convolutional models competitive with Transformers when pre-trained?
no code implementations • ICLR 2022 • Dara Bahri, Heinrich Jiang, Yi Tay, Donald Metzler
Self-supervised contrastive representation learning has proved incredibly successful in the vision and natural language domains, enabling state-of-the-art performance with orders of magnitude less labeled data.
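A minimal numpy sketch of the corrupt-then-contrast recipe this work builds on for tabular data: make a second view of each row by resampling a random subset of its features from their marginals, then pull matched views together with an InfoNCE-style loss. The identity map stands in for a real encoder here.

    import numpy as np

    rng = np.random.default_rng(0)

    def corrupt(x, rate=0.6):
        # Second view: resample a random subset of features from the
        # per-column marginal (approximated by shuffling within the batch).
        mask = rng.random(x.shape) < rate
        shuffled = rng.permuted(x, axis=0)
        return np.where(mask, shuffled, x)

    def info_nce(z1, z2, tau=0.5):
        # Contrastive loss: matching rows of z1/z2 are the positives.
        z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
        z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
        logits = z1 @ z2.T / tau
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    x = rng.standard_normal((8, 5))     # a batch of tabular rows
    print(info_nce(x, corrupt(x)))      # encoder = identity in this sketch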
2 code implementations • ICLR 2022 • Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler
In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model; a rough sketch of the idea follows after this entry.
Ranked #3 on Paraphrase Identification on Quora Question Pairs
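A loose sketch of the idea referenced above: pool candidate character blocks of several sizes, score each candidate at every position, and mix them with a softmax so the "tokenization" stays differentiable. Block sizes, pooling, and the scorer are illustrative, not the paper's exact formulation.

    import numpy as np

    rng = np.random.default_rng(0)

    def soft_block_tokenize(char_emb, block_sizes=(1, 2, 4)):
        n, d = char_emb.shape
        w = rng.standard_normal(d)                    # learned scorer (stand-in)
        cands = []
        for b in block_sizes:
            pooled = np.stack([char_emb[max(0, i - b + 1):i + 1].mean(axis=0)
                               for i in range(n)])    # mean-pool each block
            cands.append(pooled)
        cands = np.stack(cands)                       # (num_sizes, n, d)
        scores = cands @ w                            # (num_sizes, n)
        p = np.exp(scores) / np.exp(scores).sum(axis=0)
        return (p[..., None] * cands).sum(axis=0)     # soft mixture, (n, d)

    print(soft_block_tokenize(rng.standard_normal((10, 8))).shape)  # (10, 8)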
no code implementations • ICLR 2022 • Heinrich Jiang, Harikrishna Narasimhan, Dara Bahri, Andrew Cotter, Afshin Rostamizadeh
In real-world systems, models are frequently updated as more data becomes available, and in addition to achieving high accuracy, the goal is also to maintain a low difference in predictions compared to the base model (i.e., predictive "churn").
1 code implementation • 7 May 2021 • Yi Tay, Mostafa Dehghani, Jai Gupta, Dara Bahri, Vamsi Aribandi, Zhen Qin, Donald Metzler
In the context of language models, are convolutional models competitive with Transformers when pre-trained?
no code implementations • 5 May 2021 • Donald Metzler, Yi Tay, Dara Bahri, Marc Najork
When experiencing an information need, users want to engage with a domain expert, but often turn to an information retrieval system, such as a search engine, instead.
1 code implementation • 1 Mar 2021 • Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler
In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network; a small sketch follows after this entry.
Ranked #1 on Machine Translation on WMT2017 Russian-English
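A small numpy sketch of that receptive field, under simplifying assumptions: queries from the top layer attend over a pool of token representations gathered from every layer, rather than only the layer below.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def omnidirectional_attend(layer_outputs):
        # The receptive field spans the whole network: pool tokens from
        # all layers and let the top layer's tokens attend over the pool.
        pool = np.concatenate(layer_outputs, axis=0)   # (L*n, d)
        q = layer_outputs[-1]                          # (n, d) queries
        attn = softmax(q @ pool.T / np.sqrt(q.shape[1]))
        return attn @ pool                             # (n, d)

    rng = np.random.default_rng(0)
    layers = [rng.standard_normal((6, 16)) for _ in range(3)]  # 3 layers
    print(omnidirectional_attend(layers).shape)                # (6, 16)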
no code implementations • 9 Feb 2021 • Dara Bahri, Heinrich Jiang
Training modern neural networks is an inherently noisy process that can lead to high "prediction churn" -- disagreements between re-trainings of the same model due to factors such as randomization in the parameter initialization and mini-batches -- even when the trained models all attain similar accuracies.
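Churn itself is easy to state in code; a minimal sketch of the quantity being reduced, with toy predictions:

    import numpy as np

    def churn(preds_a, preds_b):
        # Fraction of examples on which two trainings of the same
        # model disagree -- the quantity this work aims to reduce.
        preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
        return float((preds_a != preds_b).mean())

    # Two re-trainings with similar accuracy can still disagree a lot:
    run1 = [0, 1, 1, 0, 2, 1]
    run2 = [0, 1, 2, 0, 1, 1]
    print(churn(run1, run2))   # 0.333...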
no code implementations • 9 Feb 2021 • Dara Bahri, Heinrich Jiang, Yi Tay, Donald Metzler
Detecting out-of-distribution (OOD) examples is critical in many applications.
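One common family of detectors in this area scores a test point by its distance to the training set in an embedding space; the sketch below uses a generic k-nearest-neighbor distance, a standard baseline rather than necessarily the paper's exact method.

    import numpy as np

    def knn_ood_score(train_emb, test_emb, k=5):
        # Distance to the k-th nearest training embedding;
        # larger = more likely out-of-distribution.
        d = np.linalg.norm(test_emb[:, None, :] - train_emb[None, :, :], axis=-1)
        return np.sort(d, axis=1)[:, k - 1]

    rng = np.random.default_rng(0)
    train = rng.standard_normal((100, 8))
    in_dist = rng.standard_normal((3, 8))
    far_out = in_dist + 10.0
    print(knn_ood_score(train, in_dist).round(2))   # small scores
    print(knn_ood_score(train, far_out).round(2))   # much larger scores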
no code implementations • ICLR 2021 • Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler
Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity.
no code implementations • ICLR 2021 • Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, Da-Cheng Juan
Specifically, we propose a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks.
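A rough sketch of the grid-wise idea, with illustrative shapes and parameters: a tiny hypernetwork maps a task embedding to a coarse gating grid, which is upsampled and multiplied into a shared weight matrix so different regions of it specialize per task.

    import numpy as np

    rng = np.random.default_rng(0)

    def gridwise_weight(base_w, task_emb, grid=(4, 4)):
        gr, gc = grid
        hyper = rng.standard_normal((task_emb.size, gr * gc))  # stand-in params
        gates = 1.0 / (1.0 + np.exp(-(task_emb @ hyper)))      # sigmoid gates
        gates = gates.reshape(gr, gc)
        r, c = base_w.shape
        up = np.kron(gates, np.ones((r // gr, c // gc)))       # upsample to (r, c)
        return base_w * up          # regions of the shared weights specialize

    W = rng.standard_normal((16, 16))
    task = rng.standard_normal(8)
    print(gridwise_weight(W, task).shape)   # (16, 16)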
no code implementations • 1 Jan 2021 • Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models.
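For contrast with the dot product, here is a hedged sketch of a "dense synthetic" alternative in the spirit of this paper: attention logits are synthesized from each token alone, with no query-key interaction. Weights are random stand-ins, not trained parameters.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def dense_synthetic_attention(x, n_max=16):
        # Each token's attention row comes from a small MLP on that
        # token only -- no query-key dot product anywhere.
        n, d = x.shape
        w1 = rng.standard_normal((d, d))
        w2 = rng.standard_normal((d, n_max))
        logits = np.maximum(x @ w1, 0.0) @ w2      # (n, n_max)
        attn = softmax(logits[:, :n])              # crop to sequence length
        return attn @ x

    print(dense_synthetic_attention(rng.standard_normal((6, 8))).shape)  # (6, 8)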
no code implementations • 1 Jan 2021 • Dara Bahri, Heinrich Jiang
Training modern neural networks is an inherently noisy process that can lead to high "prediction churn" -- disagreements between re-trainings of the same model due to factors such as randomization in the parameter initialization and mini-batches -- even when the trained models all attain high accuracies.
2 code implementations • ACL 2021 • Yikang Shen, Yi Tay, Che Zheng, Dara Bahri, Donald Metzler, Aaron Courville
There are two major classes of natural language grammar -- the dependency grammar that models one-to-one correspondences between words and the constituency grammar that models the assembly of one or several corresponding words.
5 code implementations • 8 Nov 2020 • Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler
In recent months, a wide spectrum of efficient, fast Transformers has been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models.
Ranked #18 on Long-range modeling on LRA (Pathfinder metric)
no code implementations • 19 Oct 2020 • Dara Bahri, Che Zheng, Yi Tay, Donald Metzler, Andrew Tomkins
Work in information retrieval has largely been centered around ranking and relevance: given a query, return some number of results ordered by relevance to the user.
no code implementations • 14 Sep 2020 • Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning.
no code implementations • 17 Aug 2020 • Dara Bahri, Yi Tay, Che Zheng, Donald Metzler, Cliff Brunk, Andrew Tomkins
Large generative language models such as GPT-2 are well-known for their ability to generate text as well as their utility in supervised downstream tasks via fine-tuning.
no code implementations • 12 Jul 2020 • Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, Da-Cheng Juan
The proposed approach is based on a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks.
1 code implementation • 2 May 2020 • Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models.
Ranked #1 on Dialogue Generation on Persona-Chat (BLEU-1 metric, using extra training data)
no code implementations • ICML 2020 • Dara Bahri, Heinrich Jiang, Maya Gupta
Modern machine learning models are often trained on examples with noisy labels that hurt performance and are hard to identify.
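A compact sketch of the k-NN intuition used in this line of work: flag an example as possibly mislabeled when its label disagrees with most of its nearest neighbors in the model's embedding space. The clustered toy data below plants one flipped label.

    import numpy as np

    def knn_label_disagreement(emb, labels, k=5):
        # Higher score = label disagrees more with its k nearest
        # neighbors in embedding space = more suspicious.
        d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)                  # exclude self
        nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbors
        agree = (labels[nn] == labels[:, None]).mean(axis=1)
        return 1.0 - agree

    rng = np.random.default_rng(0)
    emb = np.concatenate([rng.normal(0, 1, (20, 4)), rng.normal(5, 1, (20, 4))])
    labels = np.array([0] * 20 + [1] * 20)
    labels[3] = 1                                    # inject one flipped label
    print(np.argmax(knn_label_disagreement(emb, labels)))  # -> 3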
no code implementations • 26 Apr 2020 • Dara Bahri, Yi Tay, Che Zheng, Donald Metzler, Andrew Tomkins
Work in information retrieval has traditionally focused on ranking and relevance: given a query, return some number of results ordered by relevance to the user.
no code implementations • ACL 2020 • Yi Tay, Dara Bahri, Che Zheng, Clifford Brunk, Donald Metzler, Andrew Tomkins
This paper seeks to develop a deeper understanding of the fundamental properties of neural text generation models.
1 code implementation • ICML 2020 • Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan
We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend.
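The core primitive is easy to sketch: Sinkhorn iterations alternately normalize rows and columns so a score matrix approaches a doubly stochastic (relaxed permutation) matrix, which can then reorder blocks before local attention. The snippet shows only the normalization.

    import numpy as np

    def sinkhorn(logits, n_iters=10):
        # Alternate row/column normalization of exp(logits) converges
        # toward a doubly stochastic matrix (a soft permutation).
        p = np.exp(logits)
        for _ in range(n_iters):
            p = p / p.sum(axis=1, keepdims=True)   # row normalize
            p = p / p.sum(axis=0, keepdims=True)   # column normalize
        return p

    rng = np.random.default_rng(0)
    p = sinkhorn(rng.standard_normal((4, 4)))
    print(p.sum(axis=0).round(3), p.sum(axis=1).round(3))  # both ~ all ones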
no code implementations • NeurIPS 2018 • Maya Gupta, Dara Bahri, Andrew Cotter, Kevin Canini
We investigate machine learning models that can provide diminishing-returns and accelerating-returns guarantees to capture prior knowledge or policies about how outputs should depend on inputs.
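A toy one-dimensional illustration of such a guarantee (not the paper's construction): a piecewise-linear model whose slopes are constrained to be non-negative and non-increasing is monotone with diminishing returns by design.

    import numpy as np

    def diminishing_returns_fn(knots, slopes):
        # Enforce the shape constraint: slopes non-negative and
        # non-increasing, so f rises with x but ever more slowly.
        slopes = np.sort(np.abs(slopes))[::-1]
        starts = np.concatenate(([0.0], knots))
        widths = np.concatenate((knots, [np.inf])) - starts
        def f(x):
            covered = np.clip(x - starts, 0.0, widths)  # length in each segment
            return float(covered @ slopes)
        return f

    f = diminishing_returns_fn(np.array([1.0, 2.0]), np.array([0.5, 2.0, 1.0]))
    print([f(v) for v in (0.5, 1.5, 2.5, 3.5)])  # successive increments shrink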