SumTitles: a Summarization Dataset with Low Extractiveness

COLING 2020 · Valentin Malykh, Konstantin Chernis, Ekaterina Artemova, Irina Piontkovskaya

Existing dialogue summarization corpora are significantly extractive. We introduce a methodology for evaluating dataset extractiveness and present a new low-extractive corpus of movie dialogues for abstractive text summarization, along with a baseline evaluation. The corpus contains 153k dialogues and consists of three parts: 1) automatically aligned subtitles, 2) automatically aligned scenes from scripts, and 3) manually aligned scenes from scripts. We also present the alignment algorithm that we use to construct the corpus.
