The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
586 PAPERS • 13 BENCHMARKS
The Multi-domain Wizard-of-Oz (MultiWOZ) dataset is a large-scale human-human conversational corpus spanning over seven domains, containing 8438 multi-turn dialogues, with each dialogue averaging 14 turns. Different from existing standard datasets like WOZ and DSTC2, which contain less than 10 slots and only a few hundred values, MultiWOZ has 30 (domain, slot) pairs and over 4,500 possible values. The dialogues span seven domains: restaurant, hotel, attraction, taxi, train, hospital and police.
316 PAPERS • 11 BENCHMARKS
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.
38 PAPERS • 2 BENCHMARKS
For goal-oriented document-grounded dialogs, it often involves complex contexts for identifying the most relevant information, which requires better understanding of the inter-relations between conversations and documents. Meanwhile, many online user-oriented documents use both semi-structured and unstructured contents for guiding users to access information of different contexts. Thus, we create a new goal-oriented document-grounded dialogue dataset that captures more diverse scenarios derived from various document contents from multiple domains such ssa.gov and studentaid.gov. For data collection, we propose a novel pipeline approach for dialogue data construction, which has been adapted and evaluated for several domains.
34 PAPERS • NO BENCHMARKS YET
MultiDoc2Dial is a new task and dataset on modeling goal-oriented dialogues grounded in multiple documents. Most previous works treat document-grounded dialogue modeling as a machine reading comprehension task based on a single given document or passage. We aim to address more realistic scenarios where a goal-oriented information-seeking conversation involves multiple topics, and hence is grounded on different documents.
20 PAPERS • NO BENCHMARKS YET
Most existing dialogue systems fail to respond properly to potentially unsafe user utterances by either ignoring or passively agreeing with them.
13 PAPERS • 1 BENCHMARK
FaithDial is a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark.
12 PAPERS • NO BENCHMARKS YET
SODA is a high-quality social dialogue dataset. In contrast to most existing crowdsourced, small-scale dialogue corpora, Soda distills 1.5M socially-grounded dialogues from a pre-trained language model (InstructGPT; Ouyang et al., ). Dialogues are distilled by contextualizing social commonsense knowledge from a knowledge graph (Atomic10x).
11 PAPERS • NO BENCHMARKS YET
FusedChat is an inter-mode dialogue dataset. It contains dialogue sessions fusing task-oriented dialogues (TOD) and open-domain dialogues (ODD). Based on MultiWOZ, FusedChat appends or prepends an ODD to every existing TOD. See more details in the paper.
6 PAPERS • 1 BENCHMARK
OTTers is a dataset of human one-turn topic transitions. In this task, models must connect two topics in a cooperative and coherent manner, by generating a "bridging" utterance connecting the new topic tot he topic of the previous conversation turn.
6 PAPERS • NO BENCHMARKS YET
Harry Potter Dialogue is the first dialogue dataset that integrates with scene, attributes and relations which are dynamically changed as the storyline goes on. Our work can facilitate research to construct more human-like conversational systems in practice. For example, virtual assistant, NPC in games, etc. Moreover, HPD can both support dialogue generation and retrieval tasks.
1 PAPER • 2 BENCHMARKS
MultiRefKGC is a dataset created from conversations from Reddit designed for Knowledge-Grounded Dialogue Generation tasks.
1 PAPER • NO BENCHMARKS YET
OpenViDial 2.0 is a larger-scale open-domain multi-modal dialogue dataset compared to the previous version OpenViDial 1.0. OpenViDial 2.0 contains a total number of 5.6 million dialogue turns extracted from either movies or TV series from different resources, and each dialogue turn is paired with its corresponding visual context.
1 PAPER • 1 BENCHMARK
The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents