MuSeRC Dataset | Papers With Code

We present a reading comprehension challenge in which questions can only be answered by taking into account information from multiple sentences. The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that requires reasoning skills.

### Task Type
Binary classification of each candidate answer: True/False.

### Example
```
    {
        "id": 397,
        "text": "(1) Мужская сборная команда Норвегии по биатлону в рамках этапа Кубка мира в немецком Оберхофе выиграла эстафетную гонку. (2) Вторыми стали французы, а бронзу получила немецкая команда. (3) Российские биатлонисты не смогли побороться даже за четвертое место, отстав от норвежцев более чем на две минуты. (4) Это худший результат сборной России в текущем сезоне. (5) Четвёртыми в Оберхофе стали австрийцы. (6) В составе сборной Норвегии на четвёртый этап вышел легендарный Уле-Эйнар Бьорндален. (7) Впрочем, Норвегия с самого начала гонки была в числе лидеров, успешно проведя все четыре этапа. (8) За сборную России в Оберхофе выступали Иван Черезов, Антон Шипулин, Евгений Устюгов и Максим Чудов. (9) Гонка не задалась уже с самого начала: если на стрельбе из положения лежа Черезов был точен, то из положения стоя он допустил несколько промахов, в результате чего ему пришлось бежать один дополнительный круг. (10) После этого отставание российской команды от соперников только увеличивалось. (11) Напомним, что днем ранее российские биатлонистки выиграли свою эстафету. (12) В составе сборной России выступали Анна Богалий-Титовец, Анна Булыгина, Ольга Медведцева и Светлана Слепцова. (13) Они опередили своих основных соперниц - немок - всего на 0,3 секунды.",
        "questions": [
            {
                "question": "На сколько секунд женская команда опередила своих соперниц?",
                "answers": [
                    {
                        "text": "Всего на 0,3 секунды.",
                        "label": 1
                    },
                    {
                        "text": "На 0,3 секунды.",
                        "label": 1
                    },
                    {
                        "text": "На секунду.",
                        "label": 0
                    },
                    {
                        "text": "На 0.5 секунд.",
                        "label": 0
                    }
                ],
                "idx": 0
            }]
    }
 ```
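
Because each candidate answer is labeled on its own, a record like the one above can be flattened into one binary classification instance per (paragraph, question, answer) triple. The following is a minimal sketch that assumes the field names shown in the example; the `flatten_record` helper and the shortened placeholder record are illustrative and are not part of any official dataset tooling.

```python
from typing import Dict, Iterator, Tuple

# (paragraph, question, answer, label)
Instance = Tuple[str, str, str, int]


def flatten_record(record: Dict) -> Iterator[Instance]:
    """Yield one labeled instance per candidate answer in the record."""
    paragraph = record["text"]
    for question in record["questions"]:
        for answer in question["answers"]:
            yield paragraph, question["question"], answer["text"], answer["label"]


# Shortened placeholder record with the same structure as the example above;
# the paragraph text is deliberately left out.
record = {
    "id": 397,
    "text": "(1) ... (13) ...",
    "questions": [
        {
            "question": "На сколько секунд женская команда опередила своих соперниц?",
            "answers": [
                {"text": "На 0,3 секунды.", "label": 1},
                {"text": "На секунду.", "label": 0},
            ],
            "idx": 0,
        }
    ],
}

instances = list(flatten_record(record))
```

Each resulting tuple can then be fed to any binary classifier, with `label` equal to 1 for a correct answer and 0 otherwise.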

### How did we collect data? 
Our challenge dataset contains about 6k questions for more than 800 paragraphs across 5 different domains:

* elementary school texts
* news
* fiction stories
* fairy tales
* series summaries

First, we collected all data from open sources, preprocessed it automatically, and kept only those paragraphs that satisfied criteria on the following parameters: 1) paragraph length, 2) number of named entities (NER), 3) number of coreference relations. Afterwards, we checked that each paragraph was split into sentences correctly and numbered each sentence.
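
The filtering and numbering steps above could be sketched roughly as follows. This is only an illustration: the threshold values are hypothetical (the description names the parameters but not their values), paragraph length is assumed to be measured in sentences, the regex splitter is a naive stand-in for whatever sentence splitter was actually used, and the entity and coreference counts are assumed to come from external tools.

```python
import re
from typing import List

# Hypothetical thresholds; the source does not state the actual values.
MIN_SENTENCES = 3
MIN_ENTITIES = 2
MIN_COREF_LINKS = 1


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def number_sentences(sentences: List[str]) -> str:
    """Prefix each sentence with its 1-based index: '(1) ... (2) ...'."""
    return " ".join(f"({i}) {s}" for i, s in enumerate(sentences, start=1))


def keep_paragraph(sentences: List[str], n_entities: int, n_coref_links: int) -> bool:
    """Apply the three filters; the entity and coreference counts are
    assumed to be precomputed by external NER and coreference tools."""
    return (
        len(sentences) >= MIN_SENTENCES
        and n_entities >= MIN_ENTITIES
        and n_coref_links >= MIN_COREF_LINKS
    )


sentences = split_sentences("Anna won the race. She trained hard! Was it luck?")
numbered = number_sentences(sentences)
```

The numbered form matches the `(1) ... (2) ...` convention visible in the `text` field of the example record.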

Next, we set up a crowdsourcing task in Yandex.Toloka in which Toloka workers were asked to: 1) write questions, 2) write answers, 3) check that answering each question requires more than one sentence of the text.

### Principles
 * We exclude any question that can be answered based on a single sentence from a paragraph.
 * Answers are not verbatim copies of text spans from the paragraph.
 * Answers to a question are independent of each other, and the number of answers can differ from question to question.
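
As a rough illustration of the second principle, a simple heuristic can flag candidate answers that occur verbatim in the paragraph. The `is_verbatim` helper and the English toy paragraph below are hypothetical; this is not the validation procedure used by the dataset authors, and a case-insensitive substring test is only one possible reading of "verbatim".

```python
def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def is_verbatim(answer: str, paragraph: str) -> bool:
    """Return True if the answer, without trailing sentence punctuation,
    occurs as a contiguous substring of the paragraph (case-insensitive)."""
    core = normalize(answer).rstrip(".!?")
    return bool(core) and core in normalize(paragraph)


# Toy paragraph in the numbered-sentence format (not taken from the dataset).
paragraph = "(1) The team finished second. (2) They were two minutes behind."

copied = is_verbatim("Two minutes behind.", paragraph)
rephrased = is_verbatim("They lost by two minutes.", paragraph)
```

Here `copied` is `True` and `rephrased` is `False`, so only the first answer would be flagged for review.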

MuSeRC (Russian Multi-Sentence Reading Comprehension)

Similar Datasets

RWSD

TERRa

LiDiRus

RuCoS
