no code implementations • 5 Apr 2024 • João Coelho, Bruno Martins, João Magalhães, Jamie Callan, Chenyan Xiong
This study investigates the existence of positional biases in Transformer-based models for text representation learning, particularly in the context of web document retrieval.
no code implementations • 6 Feb 2024 • Harshit Mehrotra, Jamie Callan, Zhen Fan
The ClueWeb22 dataset, containing nearly 10 billion documents, was released in 2022 to support academic and industry research.
1 code implementation • 11 May 2023 • Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, Graham Neubig
In this work, we provide a generalized view of active retrieval-augmented generation: methods that actively decide when and what to retrieve over the course of generation.
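The decide-when-to-retrieve loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `retrieve` and `generate_sentence` are hypothetical callables standing in for a search API and a language model that reports a confidence score for each tentative sentence.

```python
def active_rag_generate(question, retrieve, generate_sentence, threshold=0.6):
    # Active-RAG loop (sketch): generate sentence by sentence; when the
    # model's confidence in a tentative sentence is low, retrieve with
    # that sentence as the query and regenerate it with the new context.
    answer, context = [], []
    while True:
        sentence, confidence, done = generate_sentence(question, answer, context)
        if confidence < threshold:
            context = retrieve(sentence)  # retrieve only when needed
            sentence, confidence, done = generate_sentence(question, answer, context)
        answer.append(sentence)
        if done:
            return " ".join(answer)
```

The key design point is that retrieval is triggered by low generation confidence rather than happening once up front.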
2 code implementations • 20 Dec 2022 • Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document.
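The HyDE pipeline (generate a hypothetical document, embed it, retrieve real documents near that embedding) can be sketched with toy stand-ins. Here `fake_generate` replaces the instruction-following LM and a bag-of-words `embed` replaces the dense encoder; both are assumptions for illustration only.

```python
import math
import re
from collections import Counter

def fake_generate(query):
    # Stand-in for an instruction-following LM (e.g. InstructGPT): in HyDE,
    # the model writes a *hypothetical* answer document for the query.
    return f"A passage answering the question: {query}. It discusses relevant details."

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a dense encoder.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_search(query, corpus):
    # 1) generate a hypothetical document, 2) embed it,
    # 3) return the real document nearest to that embedding.
    hypo_vec = embed(fake_generate(query))
    return max(corpus, key=lambda doc: cosine(hypo_vec, embed(doc)))
```

The hypothetical document may contain errors; retrieval only relies on it landing near relevant real documents in embedding space.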
1 code implementation • 5 Dec 2022 • Zhengbao Jiang, Luyu Gao, Jun Araki, Haibo Ding, Zhiruo Wang, Jamie Callan, Graham Neubig
Systems for knowledge-intensive tasks such as open-domain question answering (QA) usually consist of two stages: efficient retrieval of relevant documents from a large corpus and detailed reading of the selected documents to generate answers.
Ranked #1 on Passage Retrieval on Natural Questions
no code implementations • 29 Nov 2022 • Arnold Overwijk, Chenyan Xiong, Xiao Liu, Cameron VandenBerg, Jamie Callan
ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages affiliated with rich information.
2 code implementations • 18 Nov 2022 • Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig
Much of this success can be attributed to prompting methods such as "chain-of-thought", which employ LLMs both to understand the problem description by decomposing it into steps and to solve each step of the problem.
Ranked #17 on Arithmetic Reasoning on GSM8K
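A program-aided variant of this idea can be sketched as follows: the LM emits Python statements as its reasoning chain, and the interpreter does the arithmetic. `fake_llm_program` is a canned stand-in for the model's output on a GSM8K-style word problem.

```python
def fake_llm_program(question):
    # Stand-in for an LLM prompted to write its reasoning steps as code.
    # The canned program below corresponds to a GSM8K-style word problem.
    return (
        "eggs_per_day = 16\n"
        "eaten = 3\n"
        "baked = 4\n"
        "price = 2\n"
        "answer = (eggs_per_day - eaten - baked) * price\n"
    )

def program_aided_solve(question):
    # Offload computation to the Python interpreter, so the LM only has
    # to decompose the problem into steps, not carry out the arithmetic.
    namespace = {}
    exec(fake_llm_program(question), {}, namespace)
    return namespace["answer"]
```

The division of labor is the point: decomposition stays with the LM, execution moves to the interpreter, eliminating arithmetic slips in the reasoning chain.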
1 code implementation • 9 May 2022 • Luyu Gao, Jamie Callan
In this paper, we propose instead to model full query-to-document interaction, leveraging the attention operation and modular Transformer re-ranker framework.
1 code implementation • 11 Mar 2022 • Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
In this paper, we present Tevatron, a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity.
2 code implementations • 30 Aug 2021 • HongChien Yu, Chenyan Xiong, Jamie Callan
This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval.
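ANCE-PRF learns to fold feedback documents into the query representation with a Transformer encoder; as a rough stand-in, the classic Rocchio update illustrates the underlying idea of moving the query vector toward the pseudo-relevant documents. The weights `alpha` and `beta` are illustrative, not the paper's.

```python
def prf_update(query_vec, feedback_vecs, alpha=1.0, beta=0.5):
    # Rocchio-style pseudo relevance feedback in embedding space:
    # shift the query vector toward the centroid of the top-ranked
    # (pseudo-relevant) documents. ANCE-PRF instead *learns* this
    # combination with a Transformer query encoder.
    dim = len(query_vec)
    centroid = [sum(v[i] for v in feedback_vecs) / len(feedback_vecs)
                for i in range(dim)]
    return [alpha * query_vec[i] + beta * centroid[i] for i in range(dim)]
```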
1 code implementation • ACL 2022 • Luyu Gao, Jamie Callan
Recent research demonstrates the effectiveness of using fine-tuned language models (LMs) for dense retrieval.
1 code implementation • EMNLP 2021 • Luyu Gao, Jamie Callan
Pre-trained Transformer language models (LM) have become go-to text representation encoders.
1 code implementation • NAACL 2021 • Luyu Gao, Zhuyun Dai, Jamie Callan
Classical information retrieval systems such as BM25 rely on exact lexical match and carry out search efficiently with an inverted list index.
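The classical setup the abstract contrasts against can be sketched in a few lines: an inverted list index mapping each term to its postings, scored with the standard BM25 formula. The toy corpus and parameter values (`k1=1.2`, `b=0.75`) are illustrative defaults.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    # Inverted list index: term -> postings list of (doc_id, term_frequency).
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term, tf in Counter(doc.lower().split()).items():
            index[term].append((doc_id, tf))
    return index

def bm25_search(query, docs, k1=1.2, b=0.75):
    index = build_index(docs)
    n = len(docs)
    avgdl = sum(len(d.split()) for d in docs) / n
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, [])
        if not postings:
            continue  # exact lexical match only: unseen terms contribute nothing
        idf = math.log(1 + (n - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings:
            dl = len(docs[doc_id].split())
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return sorted(scores.items(), key=lambda x: -x[1])
```

Only documents sharing a literal query term are scored at all, which is exactly the vocabulary-mismatch limitation that soft-match models target.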
1 code implementation • 21 Jan 2021 • Luyu Gao, Zhuyun Dai, Jamie Callan
Pre-trained deep language models (LMs) have advanced the state-of-the-art of text retrieval.
no code implementations • 21 Jan 2021 • Luís Borges, Bruno Martins, Jamie Callan
Our work experimentally assesses the benefits of model ensembling within the context of neural methods for passage reranking.
1 code implementation • 20 Jan 2021 • HongChien Yu, Zhuyun Dai, Jamie Callan
Most research on pseudo relevance feedback (PRF) has been done in vector space and probabilistic retrieval models.
5 code implementations • ACL (RepL4NLP) 2021 • Luyu Gao, Yunyi Zhang, Jiawei Han, Jamie Callan
Contrastive learning has been applied successfully to learn vector representations of text.
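The contrastive objective the abstract refers to is typically an in-batch InfoNCE loss; a minimal pure-Python sketch is below. The vectors and temperature are illustrative; a real implementation computes this over encoder outputs with automatic differentiation.

```python
import math

def info_nce_loss(query_vecs, pos_vecs, temperature=1.0):
    # In-batch contrastive (InfoNCE) loss: each query's positive is its
    # paired vector; every other positive in the batch acts as a negative.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    loss = 0.0
    for i, q in enumerate(query_vecs):
        logits = [dot(q, p) / temperature for p in pos_vecs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += log_denom - logits[i]  # -log softmax of the true pair
    return loss / len(query_vecs)
```

Since the negatives come from the batch itself, the quality of this loss grows with batch size, which is why memory-efficient large-batch training matters for contrastive text representation learning.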
no code implementations • Findings of the Association for Computational Linguistics 2020 • Vaibhav Kumar, Jamie Callan
Given an input question, it uses a BERT-based classifier (trained with weak supervision) to de-contextualize the input by selecting relevant terms from the dialog history.
no code implementations • 19 Aug 2020 • Shuo Zhang, Krisztian Balog, Jamie Callan
Category systems are central components of knowledge bases, as they provide a hierarchical grouping of semantically related concepts and entities.
no code implementations • 18 Aug 2020 • Vaibhav Kumar, Vikas Raunak, Jamie Callan
Given a natural language query, teaching machines to ask clarifying questions is of immense utility in practical natural language processing systems.
no code implementations • 21 Jul 2020 • Luyu Gao, Zhuyun Dai, Jamie Callan
Deep language models such as BERT, pre-trained on large corpora, have given a huge performance boost to state-of-the-art information retrieval ranking systems.
1 code implementation • 23 May 2020 • Shuo Zhang, Zhuyun Dai, Krisztian Balog, Jamie Callan
We propose to generate natural language summaries as answers to describe the complex information contained in a table.
no code implementations • 29 Apr 2020 • Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, Jamie Callan
This paper presents CLEAR, a retrieval model that seeks to complement classical lexical exact-match models such as BM25 with semantic matching signals from a neural embedding matching model.
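At scoring time, a CLEAR-style hybrid reduces to fusing the two signals over the union of candidates; a sketch is below. The interpolation `weight` and score dictionaries are illustrative, and the paper's distinctive part (training the embedding model on the lexical model's residual errors) is only noted in comments, not implemented.

```python
def hybrid_rank(lexical_scores, semantic_scores, weight=0.5):
    # Fuse an exact-match (BM25-like) score with a neural embedding
    # match score; documents surfaced by either signal are considered.
    # CLEAR trains the embedding model on the lexical model's residual
    # errors so the two signals complement each other; here we only
    # sketch the score combination.
    doc_ids = set(lexical_scores) | set(semantic_scores)
    fused = {d: lexical_scores.get(d, 0.0) + weight * semantic_scores.get(d, 0.0)
             for d in doc_ids}
    return sorted(fused, key=fused.get, reverse=True)
```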
no code implementations • EMNLP 2020 • Luyu Gao, Zhuyun Dai, Jamie Callan
Recent innovations in Transformer-based ranking models have advanced the state-of-the-art in information retrieval.
1 code implementation • 30 Mar 2020 • Jeffrey Dalton, Chenyan Xiong, Jamie Callan
A common theme through the runs is the use of BERT-based neural reranking methods.
2 code implementations • 23 Oct 2019 • Zhuyun Dai, Jamie Callan
When applied to passages, DeepCT-Index produces term weights that can be stored in an ordinary inverted index for passage retrieval.
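The trick that makes the learned weights compatible with an ordinary inverted index is quantizing each predicted importance into an integer that takes the place of term frequency. A sketch, with hand-picked stand-in weights where a real system would use the model's predictions:

```python
def pseudo_tf_from_weights(term_weights, scale=100):
    # DeepCT-Index idea (sketch): replace raw term frequency with a
    # learned importance weight, quantized to an integer so it fits in
    # the tf field of an unmodified inverted index. The input weights
    # here are stand-ins for the model-predicted values.
    return {term: max(1, round(w * scale))
            for term, w in term_weights.items() if w > 0}
```

Existing retrieval infrastructure (BM25 scoring, posting-list traversal) then works unchanged on the re-weighted index.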
1 code implementation • 22 May 2019 • Zhuyun Dai, Jamie Callan
Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations.
Ranked #5 on Ad-Hoc Information Retrieval on TREC Robust04
no code implementations • 27 Sep 2018 • Mary Arpita Pyreddy, Varshini Ramaseshan, Narendra Nath Joshi, Zhuyun Dai, Chenyan Xiong, Jamie Callan, Zhiyuan Liu
This paper studies the consistency of the kernel-based neural ranking model K-NRM, a recent state-of-the-art neural IR model, which is important for reproducible research and for deployment in industry.
no code implementations • 3 May 2018 • Chenyan Xiong, Zhengzhong Liu, Jamie Callan, Tie-Yan Liu
The salience model also improves ad hoc search accuracy, providing effective ranking features by modeling the salience of query entities in candidate documents.
no code implementations • WSDM 2018 • Zhuyun Dai, Chenyan Xiong, Jamie Callan, Zhiyuan Liu
This paper presents Conv-KNRM, a Convolutional Kernel-based Neural Ranking Model that models n-gram soft matches for ad-hoc search.
no code implementations • 20 Jun 2017 • Chenyan Xiong, Jamie Callan, Tie-Yan Liu
This paper presents a word-entity duet framework for utilizing knowledge bases in ad-hoc retrieval.
1 code implementation • 20 Jun 2017 • Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, Russell Power
Given a query and a set of documents, K-NRM uses a translation matrix that models word-level similarities via word embeddings, a new kernel-pooling technique that uses kernels to extract multi-level soft match features, and a learning-to-rank layer that combines those features into the final ranking score.
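The kernel-pooling step can be sketched directly from that description: given the translation (similarity) matrix, each RBF kernel softly counts matches at one similarity level, and the log soft-TFs are summed over query terms to give one feature per kernel. The kernel means and width below are illustrative values, not the trained ones.

```python
import math

def kernel_pooling(sim_matrix, mus=(-0.9, -0.3, 0.3, 0.9, 1.0), sigma=0.1):
    # sim_matrix[i][j] is the embedding similarity between query term i
    # and document term j. Each RBF kernel (mean mu) softly counts the
    # matches at its similarity level; summing log soft-TF over query
    # terms yields one soft-match feature per kernel, which a
    # learning-to-rank layer would then combine into a score.
    features = []
    for mu in mus:
        total = 0.0
        for row in sim_matrix:
            soft_tf = sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row)
            total += math.log(max(soft_tf, 1e-10))
        features.append(total)
    return features
```

A document with an exact match activates the mu=1.0 kernel strongly, while near-synonyms activate the intermediate kernels, which is what makes the match "soft".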
no code implementations • 1 Jul 2013 • Bhavana Dalvi, William W. Cohen, Jamie Callan
We describe an open-domain information extraction method for extracting concept-instance pairs from an HTML corpus.
no code implementations • 1 Jul 2013 • Bhavana Dalvi, William W. Cohen, Jamie Callan
In multiclass semi-supervised learning (SSL), it is sometimes the case that the number of classes present in the data is not known, and hence no labeled examples are provided for some classes.