OrangeSum is a single-document extreme summarization dataset with two tasks: title and abstract. Ground-truth summaries average 11.42 words for the title task and 32.12 words for the abstract task, while the corresponding documents average 315 and 350 words, respectively.
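
To put those averages in perspective, the sketch below computes a rough compression ratio (document words per summary word) for each task. It uses only the four figures quoted above; the `STATS` dictionary and the `compression_ratio` helper are illustrative names, not part of the dataset or its tooling, and a ratio of averages is only an approximation of the average per-article ratio.

```python
# Average lengths in words, copied from the dataset description.
# The dictionary layout and key names are assumptions made for this example.
STATS = {
    "title": {"summary_words": 11.42, "document_words": 315},
    "abstract": {"summary_words": 32.12, "document_words": 350},
}


def compression_ratio(summary_words: float, document_words: float) -> float:
    """Return the number of document words per summary word."""
    if summary_words <= 0:
        raise ValueError("summary_words must be positive")
    return document_words / summary_words


for task, stats in STATS.items():
    ratio = compression_ratio(stats["summary_words"], stats["document_words"])
    print(f"{task}: about {ratio:.1f} document words per summary word")
```

Under these figures, the title task compresses documents far more aggressively than the abstract task, which is consistent with describing both as extreme summarization.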

The motivation for OrangeSum was to put together a French equivalent of the XSum dataset.

Unlike the historical CNN, DailyMail, and NY Times datasets, OrangeSum requires the models to display a high degree of abstractivity to perform well. OrangeSum was created by scraping articles and their titles and abstracts from the Orange Actu website.

Scraped pages cover almost a decade, from Feb 2011 to Sep 2020, and belong to five main categories: France, world, politics, automotive, and society. The society category is itself divided into 8 subcategories: health, environment, people, culture, media, high-tech, unusual ("insolite" in French), and miscellaneous.

The dataset is publicly available at: https://github.com/Tixierae/OrangeSum.

Source: BARThez: a Skilled Pretrained French Sequence-to-Sequence Model

License


  • Unknown
