DEplain-web-doc: A German Parallel Corpus for Document Simplification on Web Texts

DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German (“plain DE”, or in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.

DEplain-web-doc consists of approximately 150 aligned documents. The data is publicly available (see licenses). The corpus covers the following domains: fictional texts (literature and fairy tales), bible texts, health-related texts, texts for language learners, texts for accessibility, and public administration texts. It can be used for German text simplification, and more specifically for document-level simplification. The corpus is also available on Hugging Face: see https://huggingface.co/datasets/DEplain/DEplain-web-doc.
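
As a rough illustration of how the Hugging Face copy might be used for document simplification, the sketch below turns records into (complex, simple) document pairs. The dataset identifier comes from the URL above; the column names `original` and `simplification`, the split name `train`, and the helper names `to_pairs` and `load_pairs` are assumptions made for this example and should be checked against the dataset card.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed column names; verify them against the dataset card before use.
SRC_KEY = "original"
TGT_KEY = "simplification"


def to_pairs(
    records: Iterable[Dict[str, str]],
    src_key: str = SRC_KEY,
    tgt_key: str = TGT_KEY,
) -> List[Tuple[str, str]]:
    """Collect (complex, simple) document pairs, skipping incomplete records."""
    pairs = []
    for rec in records:
        src = (rec.get(src_key) or "").strip()
        tgt = (rec.get(tgt_key) or "").strip()
        if src and tgt:  # keep only records where both sides are present
            pairs.append((src, tgt))
    return pairs


def load_pairs(split: str = "train") -> List[Tuple[str, str]]:
    """Hypothetical loader: needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so to_pairs works offline

    ds = load_dataset("DEplain/DEplain-web-doc", split=split)
    return to_pairs(ds)
```

Calling `load_pairs()` would download the corpus, whereas `to_pairs` can be tried on any list of dictionaries with the same two keys. Keeping the pairing logic separate from the download makes it easy to adjust the column names once the actual schema is confirmed.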
