RuOpenBookQA

Introduced by Taktasheva et al. in TAPE: Assessing Few-shot Russian Language Understanding

RuOpenBookQA is a question answering (QA) dataset of multiple-choice, elementary-level science questions that probe the understanding of core science facts.

Motivation

RuOpenBookQA is mainly based on the work of Mihaylov et al. (2018): it is a QA dataset of multiple-choice, elementary-level science questions that probe the understanding of more than 1,000 core science facts.

Built with a pipeline very similar to that of RuWorldTree, the dataset includes a corpus of factoids, factoid questions, and their correct answers. A single fact is enough to find the correct answer, so this task can be considered easier than RuWorldTree.

```
{
    'ID': '7-674',
    'question': 'If a person walks in the direction opposite to the compass needle, they are going (A) west (B) north (C) east (D) south',
    'answer': 'D',
    'episode': [11],
    'perturbation': 'ru_openbook'
}
```

Data Fields

ID: a string containing a unique question id
question: a string containing question text with answer options
answer: a string containing the correct answer key (A, B, C or D)
perturbation: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
episode: a list of episodes in which the instance is used. Only used for the train set
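
Because the question field packs the question text and the answer options into one string, a consumer usually has to split them apart before scoring. The snippet below is a minimal sketch of one way to do that, assuming the options are always marked as "(A)" to "(D)" as in the example instance; `split_question` is a hypothetical helper written for illustration, not part of the dataset tooling.

```python
import re

# Assumption: options are introduced by "(A)", "(B)", "(C)", "(D)",
# matching the example instance shown in this card.
OPTION_MARKER = re.compile(r"\(([A-D])\)")


def split_question(question: str) -> tuple[str, dict[str, str]]:
    """Hypothetical helper: split a question string into stem and options."""
    # re.split with one capturing group returns
    # [stem, key1, text1, key2, text2, ...]
    parts = OPTION_MARKER.split(question)
    stem = parts[0].strip()
    options = {key: text.strip() for key, text in zip(parts[1::2], parts[2::2])}
    return stem, options


# The instance from this card, reproduced as a Python dict.
instance = {
    "ID": "7-674",
    "question": (
        "If a person walks in the direction opposite to the compass needle, "
        "they are going (A) west (B) north (C) east (D) south"
    ),
    "answer": "D",
    "episode": [11],
    "perturbation": "ru_openbook",
}

stem, options = split_question(instance["question"])
correct_text = options[instance["answer"]]  # text of the correct option
```

Keeping the option keys in a dictionary makes it straightforward to look up the correct option text from the answer field.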

Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

raw data: includes the original data with no additional sampling
episodes: data is split into evaluation episodes and includes several perturbations of test for robustness evaluation
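
In the episodes configuration, each training instance lists the episodes it belongs to, so few-shot training sets can be recovered by grouping on that field. The following is a minimal sketch under that assumption; the records and the `group_by_episode` helper are made up for illustration and do not come from the dataset.

```python
from collections import defaultdict


def group_by_episode(train_records):
    """Hypothetical helper: map each episode id to the IDs of its instances."""
    episodes = defaultdict(list)
    for record in train_records:
        # An instance may appear in several episodes.
        for episode_id in record["episode"]:
            episodes[episode_id].append(record["ID"])
    return dict(episodes)


# Toy records (invented for this example) with only the fields needed here.
train_records = [
    {"ID": "q1", "episode": [11]},
    {"ID": "q2", "episode": [11, 12]},
    {"ID": "q3", "episode": [12]},
]

by_episode = group_by_episode(train_records)
```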

Test Perturbations

Each training episode in the dataset corresponds to seven test variations: the original test data and six adversarial test sets, obtained by modifying the original test set with the following text perturbations:

ButterFingers: randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
Emojify: replaces the input words with the corresponding emojis, preserving their original meaning
EDAdelete: randomly deletes tokens in the text
EDAswap: randomly swaps tokens in the text
BackTranslation: generates variations of the context through back-translation (ru -> en -> ru)
AddSent: replaces one or more choice options with a generated one
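
To make the token-level perturbations more concrete, here is a minimal sketch of EDAdelete-style and EDAswap-style transformations. It is not the implementation used to build the dataset: whitespace tokenization, the deletion probability `p`, the number of swaps `n_swaps`, and the `seed` parameter are all assumptions made for illustration.

```python
import random


def eda_delete(text: str, p: float = 0.1, seed: int = 0) -> str:
    """Sketch of random deletion: drop each token with probability p."""
    rng = random.Random(seed)
    tokens = text.split()  # assumption: simple whitespace tokenization
    if len(tokens) <= 1:
        return text
    kept = [tok for tok in tokens if rng.random() >= p]
    # Never return an empty string: fall back to one random token.
    if not kept:
        kept = [rng.choice(tokens)]
    return " ".join(kept)


def eda_swap(text: str, n_swaps: int = 1, seed: int = 0) -> str:
    """Sketch of random swap: exchange two distinct token positions n_swaps times."""
    rng = random.Random(seed)
    tokens = text.split()
    if len(tokens) < 2:
        return text
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)


sentence = "a person walks opposite to the compass needle"
deleted = eda_delete(sentence, p=0.3, seed=1)
swapped = eda_swap(sentence, n_swaps=1, seed=1)
```

A real perturbation pipeline would likely need extra care around the answer option markers so that the answer key stays valid; this sketch ignores that.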
