no code implementations • 7 Oct 2022 • Dara Bahri, Heinrich Jiang, Tal Schuster, Afshin Rostamizadeh
Given a labeled training set and a collection of unlabeled data, the goal of active learning (AL) is to identify the best unlabeled points to label.
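A minimal sketch of one standard selection rule from this literature, margin sampling: pick the pool points whose top-two class probabilities are closest. Everything below, including the toy probabilities, is illustrative rather than the paper's exact setup.

    import numpy as np

    def margin_select(probs, k):
        # Smallest gap between the top-two predicted probabilities =
        # most ambiguous = most informative to label next.
        srt = np.sort(probs, axis=1)          # ascending per row
        margin = srt[:, -1] - srt[:, -2]      # top-1 minus top-2
        return np.argsort(margin)[:k]         # smallest margins first

    # Toy class probabilities over the unlabeled pool from any classifier:
    probs = np.array([[0.90, 0.05, 0.05],
                      [0.40, 0.35, 0.25],
                      [0.50, 0.30, 0.20]])
    print(margin_select(probs, k=2))          # -> [1 2], the most ambiguous points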
no code implementations • 14 Jul 2022 • Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, Donald Metzler
Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks.
1 code implementation • 10 May 2022 • Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, Donald Metzler
Our model also achieves strong in-context learning results, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
Ranked #1 on Long-range modeling on SCROLLS (CNLI metric)
no code implementations • Findings (ACL) 2022 • Kai Hui, Honglei Zhuang, Tao Chen, Zhen Qin, Jing Lu, Dara Bahri, Ji Ma, Jai Prakash Gupta, Cicero Nogueira dos Santos, Yi Tay, Don Metzler
This results in significant inference time speedups since the decoder-only architecture only needs to learn to interpret static encoder embeddings during inference.
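A rough sketch of why that is fast: document representations can be produced once by the encoder and cached offline, leaving only a lightweight decoder pass per query. The functions encode and decoder_score below are illustrative stand-ins, not the paper's model.

    import numpy as np

    rng = np.random.default_rng(0)

    def encode(doc_tokens):
        # Stand-in for the expensive encoder (illustrative only).
        return rng.standard_normal((len(doc_tokens), 16))

    # Offline: run the encoder once per document and cache the output.
    corpus = {"d1": ["a", "b", "c"], "d2": ["d", "e"]}
    cache = {doc_id: encode(toks) for doc_id, toks in corpus.items()}

    def decoder_score(query_vec, doc_emb):
        # Stand-in for the light decoder that only reads static embeddings.
        return float(doc_emb.mean(axis=0) @ query_vec)

    # Online: only the cheap decoder pass runs per (query, document) pair.
    q = rng.standard_normal(16)
    scores = {d: decoder_score(q, emb) for d, emb in cache.items()}
    print(max(scores, key=scores.get))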
1 code implementation • 14 Feb 2022 • Yi Tay, Vinh Q. Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, Tal Schuster, William W. Cohen, Donald Metzler
In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model.
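A deliberately toy illustration of that interface, not the paper's Transformer: "indexing" writes text-to-docid pairs into the model's only parameters, and "retrieval" decodes a docid directly from the query. All names and data here are hypothetical.

    # "Indexing as training": the parameters memorize text -> docid pairs.
    train_pairs = [("climate change effects", "doc_17"),
                   ("transformer retrieval", "doc_42")]

    params = {}                         # the "model parameters" hold the corpus
    for text, docid in train_pairs:     # "training" = indexing
        for tok in text.split():
            params.setdefault(tok, {}).setdefault(docid, 0)
            params[tok][docid] += 1

    def generate_docid(query):          # "retrieval" = decoding a docid
        votes = {}
        for tok in query.split():
            for docid, c in params.get(tok, {}).items():
                votes[docid] = votes.get(docid, 0) + c
        return max(votes, key=votes.get) if votes else None

    print(generate_docid("retrieval with transformers"))  # -> doc_42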
3 code implementations • ICLR 2022 • Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, Donald Metzler
Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training.
no code implementations • ACL 2022 • Dara Bahri, Hossein Mobahi, Yi Tay
The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size.
no code implementations • 29 Sep 2021 • Yaodong Yu, Heinrich Jiang, Dara Bahri, Hossein Mobahi, Seungyeon Kim, Ankit Singh Rawat, Andreas Veit, Yi Ma
Concretely, we show that larger models and larger datasets need to be simultaneously leveraged to improve OOD performance.
1 code implementation • ACL 2021 • Yi Tay, Mostafa Dehghani, Jai Prakash Gupta, Vamsi Aribandi, Dara Bahri, Zhen Qin, Donald Metzler
In the context of language models, are convolutional models competitive with Transformers when pre-trained?
no code implementations • ICLR 2022 • Dara Bahri, Heinrich Jiang, Yi Tay, Donald Metzler
Self-supervised contrastive representation learning has proved incredibly successful in the vision and natural language domains, enabling state-of-the-art performance with orders of magnitude less labeled data.
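A minimal numpy sketch of the corrupt-then-contrast recipe this work builds on for tabular data: make a second view of each row by resampling a random subset of its features from their marginals, then pull matched views together with an InfoNCE-style loss. The identity map stands in for a real encoder here.

    import numpy as np

    rng = np.random.default_rng(0)

    def corrupt(x, rate=0.6):
        # Second view: resample a random subset of features from the
        # per-column marginal (approximated by shuffling within the batch).
        mask = rng.random(x.shape) < rate
        shuffled = rng.permuted(x, axis=0)
        return np.where(mask, shuffled, x)

    def info_nce(z1, z2, tau=0.5):
        # Contrastive loss: matching rows of z1/z2 are the positives.
        z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
        z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
        logits = z1 @ z2.T / tau
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    x = rng.standard_normal((8, 5))     # a batch of tabular rows
    print(info_nce(x, corrupt(x)))      # encoder = identity in this sketch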
2 code implementations • ICLR 2022 • Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler
In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model; a rough sketch of the idea follows after this entry.
Ranked #3 on Paraphrase Identification on Quora Question Pairs
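A loose sketch of the idea referenced above: pool candidate character blocks of several sizes, score each candidate at every position, and mix them with a softmax so the "tokenization" stays differentiable. Block sizes, pooling, and the scorer are illustrative, not the paper's exact formulation.

    import numpy as np

    rng = np.random.default_rng(0)

    def soft_block_tokenize(char_emb, block_sizes=(1, 2, 4)):
        n, d = char_emb.shape
        w = rng.standard_normal(d)                    # learned scorer (stand-in)
        cands = []
        for b in block_sizes:
            pooled = np.stack([char_emb[max(0, i - b + 1):i + 1].mean(axis=0)
                               for i in range(n)])    # mean-pool each block
            cands.append(pooled)
        cands = np.stack(cands)                       # (num_sizes, n, d)
        scores = cands @ w                            # (num_sizes, n)
        p = np.exp(scores) / np.exp(scores).sum(axis=0)
        return (p[..., None] * cands).sum(axis=0)     # soft mixture, (n, d)

    print(soft_block_tokenize(rng.standard_normal((10, 8))).shape)  # (10, 8)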
no code implementations • ICLR 2022 • Heinrich Jiang, Harikrishna Narasimhan, Dara Bahri, Andrew Cotter, Afshin Rostamizadeh
In real-world systems, models are frequently updated as more data becomes available, and in addition to achieving high accuracy, the goal is also to maintain a low difference in predictions compared to the base model (i.e., predictive "churn").
1 code implementation • 7 May 2021 • Yi Tay, Mostafa Dehghani, Jai Gupta, Dara Bahri, Vamsi Aribandi, Zhen Qin, Donald Metzler
In the context of language models, are convolutional models competitive with Transformers when pre-trained?
no code implementations • 5 May 2021 • Donald Metzler, Yi Tay, Dara Bahri, Marc Najork
When experiencing an information need, users want to engage with a domain expert, but often turn to an information retrieval system, such as a search engine, instead.
1 code implementation • 1 Mar 2021 • Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler
In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network; a small sketch follows after this entry.
Ranked #1 on Machine Translation on WMT2017 Russian-English
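A small numpy sketch of that receptive field, under simplifying assumptions: queries from the top layer attend over a pool of token representations gathered from every layer, rather than only the layer below.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def omnidirectional_attend(layer_outputs):
        # The receptive field spans the whole network: pool tokens from
        # all layers and let the top layer's tokens attend over the pool.
        pool = np.concatenate(layer_outputs, axis=0)   # (L*n, d)
        q = layer_outputs[-1]                          # (n, d) queries
        attn = softmax(q @ pool.T / np.sqrt(q.shape[1]))
        return attn @ pool                             # (n, d)

    rng = np.random.default_rng(0)
    layers = [rng.standard_normal((6, 16)) for _ in range(3)]  # 3 layers
    print(omnidirectional_attend(layers).shape)                # (6, 16)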
no code implementations • 9 Feb 2021 • Dara Bahri, Heinrich Jiang
Training modern neural networks is an inherently noisy process that can lead to high "prediction churn" -- disagreements between re-trainings of the same model due to factors such as randomization in the parameter initialization and mini-batches -- even when the trained models all attain similar accuracies.
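Churn itself is easy to state in code; a minimal sketch of the quantity being reduced, with toy predictions:

    import numpy as np

    def churn(preds_a, preds_b):
        # Fraction of examples on which two trainings of the same
        # model disagree -- the quantity this work aims to reduce.
        preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
        return float((preds_a != preds_b).mean())

    # Two re-trainings with similar accuracy can still disagree a lot:
    run1 = [0, 1, 1, 0, 2, 1]
    run2 = [0, 1, 2, 0, 1, 1]
    print(churn(run1, run2))   # 0.333...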
no code implementations • 9 Feb 2021 • Dara Bahri, Heinrich Jiang, Yi Tay, Donald Metzler
Detecting out-of-distribution (OOD) examples is critical in many applications.
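One common family of detectors in this area scores a test point by its distance to the training set in an embedding space; the sketch below uses a generic k-nearest-neighbor distance, a standard baseline rather than necessarily the paper's exact method.

    import numpy as np

    def knn_ood_score(train_emb, test_emb, k=5):
        # Distance to the k-th nearest training embedding;
        # larger = more likely out-of-distribution.
        d = np.linalg.norm(test_emb[:, None, :] - train_emb[None, :, :], axis=-1)
        return np.sort(d, axis=1)[:, k - 1]

    rng = np.random.default_rng(0)
    train = rng.standard_normal((100, 8))
    in_dist = rng.standard_normal((3, 8))
    far_out = in_dist + 10.0
    print(knn_ood_score(train, in_dist).round(2))   # small scores
    print(knn_ood_score(train, far_out).round(2))   # much larger scores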
no code implementations • ICLR 2021 • Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler
Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity.
no code implementations • ICLR 2021 • Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, Da-Cheng Juan
Specifically, we propose a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks.
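A rough sketch of the grid-wise idea, with illustrative shapes and parameters: a tiny hypernetwork maps a task embedding to a coarse gating grid, which is upsampled and multiplied into a shared weight matrix so different regions of it specialize per task.

    import numpy as np

    rng = np.random.default_rng(0)

    def gridwise_weight(base_w, task_emb, grid=(4, 4)):
        gr, gc = grid
        hyper = rng.standard_normal((task_emb.size, gr * gc))  # stand-in params
        gates = 1.0 / (1.0 + np.exp(-(task_emb @ hyper)))      # sigmoid gates
        gates = gates.reshape(gr, gc)
        r, c = base_w.shape
        up = np.kron(gates, np.ones((r // gr, c // gc)))       # upsample to (r, c)
        return base_w * up          # regions of the shared weights specialize

    W = rng.standard_normal((16, 16))
    task = rng.standard_normal(8)
    print(gridwise_weight(W, task).shape)   # (16, 16)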
no code implementations • 1 Jan 2021 • Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models.
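For contrast with the dot product, here is a hedged sketch of a "dense synthetic" alternative in the spirit of this paper: attention logits are synthesized from each token alone, with no query-key interaction. Weights are random stand-ins, not trained parameters.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def dense_synthetic_attention(x, n_max=16):
        # Each token's attention row comes from a small MLP on that
        # token only -- no query-key dot product anywhere.
        n, d = x.shape
        w1 = rng.standard_normal((d, d))
        w2 = rng.standard_normal((d, n_max))
        logits = np.maximum(x @ w1, 0.0) @ w2      # (n, n_max)
        attn = softmax(logits[:, :n])              # crop to sequence length
        return attn @ x

    print(dense_synthetic_attention(rng.standard_normal((6, 8))).shape)  # (6, 8)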
no code implementations • 1 Jan 2021 • Dara Bahri, Heinrich Jiang
Training modern neural networks is an inherently noisy process that can lead to high "prediction churn" -- disagreements between re-trainings of the same model due to factors such as randomization in the parameter initialization and mini-batches -- even when the trained models all attain high accuracies.
2 code implementations • ACL 2021 • Yikang Shen, Yi Tay, Che Zheng, Dara Bahri, Donald Metzler, Aaron Courville
There are two major classes of natural language grammar -- the dependency grammar that models one-to-one correspondences between words and the constituency grammar that models the assembly of one or several corresponding words.
5 code implementations • 8 Nov 2020 • Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler
In recent months, a wide spectrum of efficient, fast Transformers has been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models.
Ranked #18 on Long-range modeling on LRA (Pathfinder metric)
no code implementations • 19 Oct 2020 • Dara Bahri, Che Zheng, Yi Tay, Donald Metzler, Andrew Tomkins
Work in information retrieval has largely been centered around ranking and relevance: given a query, return some number of results ordered by relevance to the user.
no code implementations • 14 Sep 2020 • Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning.
no code implementations • 17 Aug 2020 • Dara Bahri, Yi Tay, Che Zheng, Donald Metzler, Cliff Brunk, Andrew Tomkins
Large generative language models such as GPT-2 are well-known for their ability to generate text as well as their utility in supervised downstream tasks via fine-tuning.
no code implementations • 12 Jul 2020 • Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, Da-Cheng Juan
The proposed approach is based on a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks.
1 code implementation • 2 May 2020 • Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models.
Ranked #1 on Dialogue Generation on Persona-Chat (BLEU-1 metric, using extra training data)
no code implementations • ICML 2020 • Dara Bahri, Heinrich Jiang, Maya Gupta
Modern machine learning models are often trained on examples with noisy labels that hurt performance and are hard to identify.
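A compact sketch of the k-NN intuition used in this line of work: flag an example as possibly mislabeled when its label disagrees with most of its nearest neighbors in the model's embedding space. The clustered toy data below plants one flipped label.

    import numpy as np

    def knn_label_disagreement(emb, labels, k=5):
        # Higher score = label disagrees more with its k nearest
        # neighbors in embedding space = more suspicious.
        d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)                  # exclude self
        nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbors
        agree = (labels[nn] == labels[:, None]).mean(axis=1)
        return 1.0 - agree

    rng = np.random.default_rng(0)
    emb = np.concatenate([rng.normal(0, 1, (20, 4)), rng.normal(5, 1, (20, 4))])
    labels = np.array([0] * 20 + [1] * 20)
    labels[3] = 1                                    # inject one flipped label
    print(np.argmax(knn_label_disagreement(emb, labels)))  # -> 3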
no code implementations • 26 Apr 2020 • Dara Bahri, Yi Tay, Che Zheng, Donald Metzler, Andrew Tomkins
Work in information retrieval has traditionally focused on ranking and relevance: given a query, return some number of results ordered by relevance to the user.
no code implementations • ACL 2020 • Yi Tay, Dara Bahri, Che Zheng, Clifford Brunk, Donald Metzler, Andrew Tomkins
This paper seeks to develop a deeper understanding of the fundamental properties of neural text generation models.
1 code implementation • ICML 2020 • Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan
We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend.
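The core primitive is easy to sketch: Sinkhorn iterations alternately normalize rows and columns so a score matrix approaches a doubly stochastic (relaxed permutation) matrix, which can then reorder blocks before local attention. The snippet shows only the normalization.

    import numpy as np

    def sinkhorn(logits, n_iters=10):
        # Alternate row/column normalization of exp(logits) converges
        # toward a doubly stochastic matrix (a soft permutation).
        p = np.exp(logits)
        for _ in range(n_iters):
            p = p / p.sum(axis=1, keepdims=True)   # row normalize
            p = p / p.sum(axis=0, keepdims=True)   # column normalize
        return p

    rng = np.random.default_rng(0)
    p = sinkhorn(rng.standard_normal((4, 4)))
    print(p.sum(axis=0).round(3), p.sum(axis=1).round(3))  # both ~ all ones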
no code implementations • NeurIPS 2018 • Maya Gupta, Dara Bahri, Andrew Cotter, Kevin Canini
We investigate machine learning models that can provide diminishing-returns and accelerating-returns guarantees to capture prior knowledge or policies about how outputs should depend on inputs.
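A toy one-dimensional illustration of such a guarantee (not the paper's construction): a piecewise-linear model whose slopes are constrained to be non-negative and non-increasing is monotone with diminishing returns by design.

    import numpy as np

    def diminishing_returns_fn(knots, slopes):
        # Enforce the shape constraint: slopes non-negative and
        # non-increasing, so f rises with x but ever more slowly.
        slopes = np.sort(np.abs(slopes))[::-1]
        starts = np.concatenate(([0.0], knots))
        widths = np.concatenate((knots, [np.inf])) - starts
        def f(x):
            covered = np.clip(x - starts, 0.0, widths)  # length in each segment
            return float(covered @ slopes)
        return f

    f = diminishing_returns_fn(np.array([1.0, 2.0]), np.array([0.5, 2.0, 1.0]))
    print([f(v) for v in (0.5, 1.5, 2.5, 3.5)])  # successive increments shrink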