Search Results for author: Samuel R. Bowman

Found 100 papers, 52 papers with code

Crowdsourcing Beyond Annotation: Case Studies in Benchmark Data Collection

no code implementations EMNLP (ACL) 2021 Alane Suhr, Clara Vania, Nikita Nangia, Maarten Sap, Mark Yatskar, Samuel R. Bowman, Yoav Artzi

Even though it is such a fundamental tool in NLP, crowdsourcing use is largely guided by common practices and the personal experience of researchers.

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

1 code implementation 20 Nov 2023 David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.

Multiple-choice

Debate Helps Supervise Unreliable Experts

1 code implementation 15 Nov 2023 Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, Samuel R. Bowman

Comparing debate to a baseline we call consultancy, where a single expert argues for only one answer which is correct half of the time, we find that debate performs significantly better, with 84% judge accuracy compared to consultancy's 74%.

Reading Comprehension

Towards Understanding Sycophancy in Language Models

1 code implementation 20 Oct 2023 Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.

Text Generation

Studying Large Language Model Generalization with Influence Functions

1 code implementation 7 Aug 2023 Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman

When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior?

counterfactual Language Modelling +2

Measuring Faithfulness in Chain-of-Thought Reasoning

no code implementations 17 Jul 2023 Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, Ethan Perez

Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question).

ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning

1 code implementation 30 May 2023 Jingyuan Selena She, Christopher Potts, Samuel R. Bowman, Atticus Geiger

For in-context learning, we test InstructGPT models and find that most prompt strategies are not successful, including those using step-by-step reasoning.

Benchmarking In-Context Learning +3

Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

no code implementations 23 May 2023 Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, Kyunghyun Cho

Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency.

valid

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

1 code implementation NeurIPS 2023 Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman

We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs--e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)"--which models systematically fail to mention in their explanations.

Multiple-choice

Eight Things to Know about Large Language Models

no code implementations 2 Apr 2023 Samuel R. Bowman

Experts are not yet able to interpret the inner workings of LLMs.

Improving Code Generation by Training with Natural Language Feedback

1 code implementation 28 Mar 2023 Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, Ethan Perez

The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development.

Code Generation Imitation Learning +1

Pretraining Language Models with Human Preferences

1 code implementation 16 Feb 2023 Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez

Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more.

Imitation Learning Language Modelling

(QA)$^2$: Question Answering with Questionable Assumptions

1 code implementation 20 Dec 2022 Najoung Kim, Phu Mon Htut, Samuel R. Bowman, Jackson Petty

Naturally occurring information-seeking questions often contain questionable assumptions -- assumptions that are false or unverifiable.

Question Answering

What Artificial Neural Networks Can Tell Us About Human Language Acquisition

no code implementations 17 Aug 2022 Alex Warstadt, Samuel R. Bowman

Rapid progress in machine learning for natural language processing has the potential to transform debates about how humans learn language.

Language Acquisition

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

3 code implementations 9 Jun 2022 Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S. Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, ZiRui Wang, Ziyi Wu

BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models.

Common Sense Reasoning Math +1

SQuALITY: Building a Long-Document Summarization Dataset the Hard Way

1 code implementation 23 May 2022 Alex Wang, Richard Yuanzhe Pang, Angelica Chen, Jason Phang, Samuel R. Bowman

Summarization datasets are often assembled either by scraping naturally occurring public-domain summaries -- which are nearly always in difficult-to-work-with technical domains -- or by using approximate heuristics to extract them from everyday text -- which frequently yields unfaithful summaries.

Document Summarization Multiple-choice

Instruction Induction: From Few Examples to Natural Language Task Descriptions

1 code implementation 22 May 2022 Or Honovich, Uri Shaham, Samuel R. Bowman, Omer Levy

Large language models are able to perform a task by conditioning on a few input-output demonstrations - a paradigm known as in-context learning.

In-Context Learning

Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions

no code implementations LNLS (ACL) 2022 Alicia Parrish, Harsh Trivedi, Ethan Perez, Angelica Chen, Nikita Nangia, Jason Phang, Samuel R. Bowman

We use long contexts -- humans familiar with the context write convincing explanations for pre-selected correct and incorrect answers, and we test if those explanations allow humans who have not read the full context to more accurately determine the correct answer.

Multiple-choice Reading Comprehension

What Makes Reading Comprehension Questions Difficult?

1 code implementation ACL 2022 Saku Sugawara, Nikita Nangia, Alex Warstadt, Samuel R. Bowman

For a natural language understanding benchmark to be useful in research, it has to consist of examples that are diverse and difficult enough to discriminate among current and near-future state-of-the-art systems.

Logical Reasoning Multiple-choice +2

QuALITY: Question Answering with Long Input Texts, Yes!

2 code implementations NAACL 2022 Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Samuel R. Bowman

To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process.

Multiple-choice Multiple Choice Question Answering (MCQA)

Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair

no code implementations NAACL (DADC) 2022 Jason Phang, Angelica Chen, William Huang, Samuel R. Bowman

We find that AFLite indeed selects more challenging examples, lowering the performance of evaluated models more as stronger adversary models are used.

Clean or Annotate: How to Spend a Limited Data Collection Budget

no code implementations DeepLo 2022 Derek Chen, Zhou Yu, Samuel R. Bowman

Crowdsourcing platforms are often used to collect datasets for training machine learning models, despite higher levels of inaccurate labeling compared to expert labeling.

Denoising Learning with noisy labels

BBQ: A Hand-Built Bias Benchmark for Question Answering

1 code implementation Findings (ACL) 2022 Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, Samuel R. Bowman

It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA).

Question Answering

The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail

no code implementations 15 Oct 2021 Samuel R. Bowman

Researchers in NLP often frame and discuss research results in ways that serve to deemphasize the field's successes, often in response to the field's widespread hype.

Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers

no code implementations EMNLP (BlackboxNLP) 2021 Jason Phang, Haokun Liu, Samuel R. Bowman

Despite the success of fine-tuning pretrained language encoders like BERT for downstream natural language understanding (NLU) tasks, it is still poorly understood how neural networks change after fine-tuning.

Natural Language Understanding

NOPE: A Corpus of Naturally-Occurring Presuppositions in English

1 code implementation CoNLL (EMNLP) 2021 Alicia Parrish, Sebastian Schuster, Alex Warstadt, Omar Agha, Soo-Hwan Lee, Zhuoye Zhao, Samuel R. Bowman, Tal Linzen

Understanding language requires grasping not only the overtly stated content, but also making inferences about things that were left unsaid.

What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

1 code implementation ACL 2021 Nikita Nangia, Saku Sugawara, Harsh Trivedi, Alex Warstadt, Clara Vania, Samuel R. Bowman

However, we find that training crowdworkers, and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments is an effective means of collecting challenging data.

Multiple-choice Natural Language Understanding +1

Comparing Test Sets with Item Response Theory

no code implementations ACL 2021 Clara Vania, Phu Mon Htut, William Huang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, Samuel R. Bowman

Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks.

Natural Language Understanding

Does Putting a Linguist in the Loop Improve NLU Data Collection?

no code implementations Findings (EMNLP) 2021 Alicia Parrish, William Huang, Omar Agha, Soo-Hwan Lee, Nikita Nangia, Alex Warstadt, Karmanya Aggarwal, Emily Allaway, Tal Linzen, Samuel R. Bowman

We take natural language inference as a test case and ask whether it is beneficial to put a linguist 'in the loop' during data collection to dynamically identify and address gaps in the data by introducing novel constraints on the task.

Natural Language Inference

What Will it Take to Fix Benchmarking in Natural Language Understanding?

no code implementations NAACL 2021 Samuel R. Bowman, George E. Dahl

Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements.

Benchmarking Natural Language Understanding +1

When Do You Need Billions of Words of Pretraining Data?

1 code implementation ACL 2021 Yian Zhang, Alex Warstadt, Haau-Sing Li, Samuel R. Bowman

We adopt four probing methods---classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks---and draw learning curves that track the growth of these different measures of linguistic ability with respect to pretraining data volume using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B words.

Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options

1 code implementation Asian Chapter of the Association for Computational Linguistics 2020 Clara Vania, Ruijie Chen, Samuel R. Bowman

Using these protocols and a writing-based baseline, we collect several new English NLI datasets of over 3k examples each, each using a fixed amount of annotator time, but a varying number of examples to fit that time budget.

Natural Language Inference Transfer Learning

Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)

1 code implementation EMNLP 2020 Alex Warstadt, Yian Zhang, Haau-Sing Li, Haokun Liu, Samuel R. Bowman

One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding.

Binary Classification

Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data

1 code implementation EMNLP (insights) 2020 William Huang, Haokun Liu, Samuel R. Bowman

A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks---datasets collected from crowdworkers to create an evaluation task---while still failing on out-of-domain examples for the same task.

counterfactual Natural Language Inference +2

Precise Task Formalization Matters in Winograd Schema Evaluations

1 code implementation EMNLP 2020 Haokun Liu, William Huang, Dhara A. Mungra, Samuel R. Bowman

Performance on the Winograd Schema Challenge (WSC), a respected English commonsense reasoning benchmark, recently rocketed from chance accuracy to 89% on the SuperGLUE leaderboard, with relatively little corroborating evidence of a correspondingly large improvement in reasoning ability.

Language Modelling Multiple-choice

CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

1 code implementation EMNLP 2020 Nikita Nangia, Clara Vania, Rasika Bhalerao, Samuel R. Bowman

To measure some forms of social bias in language models against protected demographic groups in the US, we introduce the Crowdsourced Stereotype Pairs benchmark (CrowS-Pairs).

Can neural networks acquire a structural bias from raw linguistic data?

no code implementations 14 Jul 2020 Alex Warstadt, Samuel R. Bowman

We argue that these results are the strongest evidence so far from artificial learners supporting the proposition that a structural bias can be acquired from raw data.

Inductive Bias Language Acquisition +1

Self-Training for Unsupervised Parsing with PRPN

no code implementations WS 2020 Anhad Mohananey, Katharina Kann, Samuel R. Bowman

To be able to use our model's predictions during training, we extend a recent neural UP architecture, the PRPN (Shen et al., 2018a) such that it can be trained in a semi-supervised fashion.

Language Modelling

English Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too

no code implementations Asian Chapter of the Association for Computational Linguistics 2020 Jason Phang, Iacer Calixto, Phu Mon Htut, Yada Pruksachatkun, Haokun Liu, Clara Vania, Katharina Kann, Samuel R. Bowman

Intermediate-task training---fine-tuning a pretrained model on an intermediate task before fine-tuning again on the target task---often improves model performance substantially on language understanding tasks in monolingual English settings.

Question Answering Retrieval +3

Learning to Learn Morphological Inflection for Resource-Poor Languages

no code implementations 28 Apr 2020 Katharina Kann, Samuel R. Bowman, Kyunghyun Cho

We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem.

Cross-Lingual Transfer LEMMA +2

New Protocols and Negative Results for Textual Entailment Data Collection

1 code implementation EMNLP 2020 Samuel R. Bowman, Jennimaria Palomaki, Livio Baldini Soares, Emily Pitler

Natural language inference (NLI) data has proven useful in benchmarking and, especially, as pretraining data for tasks requiring language understanding.

Benchmarking Natural Language Inference +1

BLiMP: The Benchmark of Linguistic Minimal Pairs for English

4 code implementations TACL 2020 Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, Samuel R. Bowman

We introduce The Benchmark of Linguistic Minimal Pairs (shortened to BLiMP), a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English.

Do Attention Heads in BERT Track Syntactic Dependencies?

1 code implementation 27 Nov 2019 Phu Mon Htut, Jason Phang, Shikha Bordia, Samuel R. Bowman

We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations.

CoLA

Neural Unsupervised Parsing Beyond English

no code implementations WS 2019 Katharina Kann, Anhad Mohananey, Samuel R. Bowman, Kyunghyun Cho

Recently, neural network models which automatically infer syntactic structure from raw text have started to achieve promising results.

Inducing Constituency Trees through Neural Machine Translation

no code implementations 22 Sep 2019 Phu Mon Htut, Kyunghyun Cho, Samuel R. Bowman

Latent tree learning (LTL) methods learn to parse sentences using only indirect supervision from a downstream task.

Language Modelling Machine Translation +1

Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set

no code implementations IJCNLP 2019 Katharina Kann, Kyunghyun Cho, Samuel R. Bowman

Here, we aim to answer the following questions: Does using a development set for early stopping in the low-resource setting influence results as compared to a more realistic alternative, where the number of training epochs is tuned on development languages?

Can Unconditional Language Models Recover Arbitrary Sentences?

no code implementations NeurIPS 2019 Nishant Subramani, Samuel R. Bowman, Kyunghyun Cho

We then investigate the conditions under which a language model can be made to generate a sentence through the identification of a point in such a space and find that it is possible to recover arbitrary sentences nearly perfectly with language models and representations of moderate size without modifying any model parameters.

Language Modelling Sentence +2

Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark

no code implementations ACL 2019 Nikita Nangia, Samuel R. Bowman

The GLUE benchmark (Wang et al., 2019b) is a suite of language understanding tasks which has seen dramatic progress in the past year, with average performance moving from 70.0 at launch to 83.9, state of the art at the time of writing (May 24, 2019).

Sentence Sentence Classification

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

5 code implementations NeurIPS 2019 Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman

In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks.

Transfer Learning

Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling

no code implementations ICLR 2019 Samuel R. Bowman, Ellie Pavlick, Edouard Grave, Benjamin Van Durme, Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, Berlin Chen

Work on the problem of contextualized word representation—the development of reusable neural network components for sentence understanding—has recently seen a surge of progress centered on the unsupervised pretraining task of language modeling with methods like ELMo (Peters et al., 2018).

Language Modelling Sentence

Probing What Different NLP Tasks Teach Machines about Function Word Comprehension

no code implementations SEMEVAL 2019 Najoung Kim, Roma Patel, Adam Poliak, Alex Wang, Patrick Xia, R. Thomas McCoy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, Samuel R. Bowman, Ellie Pavlick

Our results show that pretraining on language modeling performs the best on average across our probing tasks, supporting its widespread use for pretraining state-of-the-art NLP models, and CCG supertagging and NLI pretraining perform comparably.

CCG Supertagging Language Modelling +3

Identifying and Reducing Gender Bias in Word-Level Language Models

no code implementations NAACL 2019 Shikha Bordia, Samuel R. Bowman

Many text corpora exhibit socially problematic biases, which can be propagated or amplified in the models trained on such data.

Language Modelling

On Measuring Social Biases in Sentence Encoders

1 code implementation NAACL 2019 Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, Rachel Rudinger

The Word Embedding Association Test shows that GloVe and word2vec word embeddings exhibit human-like implicit biases based on gender, race, and other social constructs (Caliskan et al., 2017).

Sentence Word Embeddings

Linguistic Analysis of Pretrained Sentence Encoders with Acceptability Judgments

no code implementations 11 Jan 2019 Alex Warstadt, Samuel R. Bowman

We use this analysis set to investigate the grammatical knowledge of three pretrained encoders: BERT (Devlin et al., 2018), GPT (Radford et al., 2018), and the BiLSTM baseline from Warstadt et al. We find that these models have a strong command of complex or non-canonical argument structures like ditransitives (Sue gave Dan a book) and passives (The book was read).

CoLA General Classification +2

Verb Argument Structure Alternations in Word and Sentence Embeddings

no code implementations WS 2019 Katharina Kann, Alex Warstadt, Adina Williams, Samuel R. Bowman

For converging evidence, we further construct LaVA, a corresponding word-level dataset, and investigate whether the same syntactic features can be extracted from word embeddings.

Sentence Sentence Embedding +2

Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

2 code implementations 2 Nov 2018 Jason Phang, Thibault Févry, Samuel R. Bowman

Pretraining sentence encoders with language modeling and related unsupervised tasks has recently been shown to be very effective for language understanding tasks.

Language Modelling Natural Language Inference +2

Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis

no code implementations 26 Sep 2018 Kelly W. Zhang, Samuel R. Bowman

We find that representations from language models consistently perform best on our syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data.

Language Modelling Transfer Learning +1

Grammar Induction with Neural Language Models: An Unusual Replication

1 code implementation EMNLP (ACL) 2018 Phu Mon Htut, Kyunghyun Cho, Samuel R. Bowman

A substantial thread of recent work on latent tree learning has attempted to develop neural network models with parse-valued latent variables and train them on non-parsing tasks, in the hope of having them discover interpretable tree structure.

Constituency Parsing Language Modelling

Neural Network Acceptability Judgments

2 code implementations TACL 2019 Alex Warstadt, Amanpreet Singh, Samuel R. Bowman

This paper investigates the ability of artificial neural networks to judge the grammatical acceptability of a sentence, with the goal of testing their linguistic competence.

CoLA General Classification +3

Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Task Analysis

no code implementations 24 May 2018 Kelly W. Zhang, Samuel R. Bowman

There is mounting evidence that pretraining can be valuable for neural network language understanding models, but we do not yet have a clear understanding of how the choice of pretraining objective affects the type of linguistic information that models learn.

Language Modelling Transfer Learning +1

A Stable and Effective Learning Strategy for Trainable Greedy Decoding

1 code implementation EMNLP 2018 Yun Chen, Victor O. K. Li, Kyunghyun Cho, Samuel R. Bowman

Beam search is a widely used approximate search strategy for neural network decoders, and it generally outperforms simple greedy decoding on tasks like machine translation.

Machine Translation Translation

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

11 code implementations WS 2018 Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman

For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset.

Natural Language Inference Natural Language Understanding +2

ListOps: A Diagnostic Dataset for Latent Tree Learning

2 code implementations NAACL 2018 Nikita Nangia, Samuel R. Bowman

In this paper we introduce ListOps, a toy dataset created to study the parsing ability of latent tree models.

ListOps Sentence +2

Annotation Artifacts in Natural Language Inference Data

no code implementations NAACL 2018 Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, Noah A. Smith

Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to.

Natural Language Inference Negation +2

The Lifted Matrix-Space Model for Semantic Composition

2 code implementations CONLL 2018 WooJin Chung, Sheng-Fu Wang, Samuel R. Bowman

Tree-structured neural network architectures for sentence encoding draw inspiration from the approach to semantic composition generally seen in formal linguistics, and have shown empirical improvements over comparable sequence models by doing so.

Semantic Composition Sentence +1

Do latent tree learning models identify meaningful structure in sentences?

1 code implementation TACL 2018 Adina Williams, Andrew Drozdov, Samuel R. Bowman

Recent work on the problem of latent tree learning has made it possible to train neural networks that learn to both parse a sentence and use the resulting parse to interpret the sentence, all without exposure to ground-truth parse trees at training time.

Sentence Sentence Classification

The RepEval 2017 Shared Task: Multi-Genre Natural Language Inference with Sentence Representations

no code implementations WS 2017 Nikita Nangia, Adina Williams, Angeliki Lazaridou, Samuel R. Bowman

This paper presents the results of the RepEval 2017 Shared Task, which evaluated neural network sentence representation learning models on the Multi-Genre Natural Language Inference corpus (MultiNLI) recently introduced by Williams et al. (2017).

Natural Language Inference Representation Learning +1

Sequential Attention: A Context-Aware Alignment Function for Machine Reading

no code implementations WS 2017 Sebastian Brarda, Philip Yeres, Samuel R. Bowman

In this paper we propose a neural network model with a novel Sequential Attention layer that extends soft attention by assigning weights to words in an input sequence in a way that takes into account not just how well that word matches a query, but how well surrounding words match.

Reading Comprehension

Ruminating Reader: Reasoning with Gated Multi-Hop Attention

no code implementations WS 2018 Yichen Gong, Samuel R. Bowman

To answer a question in a machine comprehension (MC) task, a model needs to establish the interaction between the question and the context.

Question Answering Reading Comprehension

Discourse-Based Objectives for Fast Unsupervised Sentence Representation Learning

no code implementations 23 Apr 2017 Yacine Jernite, Samuel R. Bowman, David Sontag

This work presents a novel objective function for the unsupervised training of neural network sentence encoders.

Representation Learning Sentence

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

3 code implementations NAACL 2018 Adina Williams, Nikita Nangia, Samuel R. Bowman

This paper introduces the Multi-Genre Natural Language Inference (MultiNLI) corpus, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding.

BIG-bench Machine Learning Domain Adaptation +2

Generating Sentences from a Continuous Space

17 code implementations CONLL 2016 Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, Samy Bengio

The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation.

Language Modelling Sentence

A large annotated corpus for learning natural language inference

3 code implementations EMNLP 2015 Samuel R. Bowman, Gabor Angeli, Christopher Potts, Christopher D. Manning

Understanding entailment and contradiction is fundamental to understanding natural language, and inference about entailment and contradiction is a valuable testing ground for the development of semantic representations.

Image Captioning Natural Language Inference +1

Tree-structured composition in neural networks without tree-structured architectures

1 code implementation 16 Jun 2015 Samuel R. Bowman, Christopher D. Manning, Christopher Potts

We hypothesize that neural sequence models like LSTMs are in fact able to discover and implicitly use recursive compositional structure, at least for tasks with clear cues to that structure in the data.

Sentence

Learning Distributed Word Representations for Natural Logic Reasoning

no code implementations 15 Oct 2014 Samuel R. Bowman, Christopher Potts, Christopher D. Manning

Natural logic offers a powerful relational conception of meaning that is a natural counterpart to distributed semantic representations, which have proven valuable in a wide range of sophisticated language tasks.

Logical Reasoning Open-Ended Question Answering +1

Recursive Neural Networks Can Learn Logical Semantics

no code implementations WS 2015 Samuel R. Bowman, Christopher Potts, Christopher D. Manning

Tree-structured recursive neural networks (TreeRNNs) for sentence meaning have been successful for many applications, but it remains an open question whether the fixed-length representations that they learn can support tasks as demanding as logical deduction.

Open-Ended Question Answering Relational Reasoning +2

Can recursive neural tensor networks learn logical reasoning?

1 code implementation 21 Dec 2013 Samuel R. Bowman

Recursive neural network models and their accompanying vector representations for words have seen success in an array of increasingly semantically sophisticated tasks, but almost nothing is known about their ability to accurately capture the aspects of linguistic meaning that are necessary for interpretation or reasoning.

Logical Reasoning Tensor Networks
