🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task

Filter by Language (clear)

169 dataset results for German

DEplain-APA-sent: A German Parallel Corpus for Sentence Simplification on News Texts DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German “plain DE” (or in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.

2 PAPERS • 1 BENCHMARK

DEplain-web-sent

DEplain-web-sent: A German Parallel Corpus for Sentence Simplification on Web Texts DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German “plain DE” (or in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.

2 PAPERS • 1 BENCHMARK

DeCOCO

DeCOCO is a bilingual (English-German) corpus of image descriptions, where the English part is extracted from the COCO dataset, and the German part are translations by a native German speaker.

2 PAPERS • NO BENCHMARKS YET

GermEval

The GermEval dataset is a valuable resource for natural language processing (NLP) tasks, specifically named entity recognition (NER), conducted in the German language. Here are some key details about this dataset:

2 PAPERS • 1 BENCHMARK

GermanDPR

GermanDPR is a dataset for passage retrieval in German. GermanDPR comprises 8,245 question/answer pairs in the training set, 1,030 pairs in the development set, and 1,025 pairs in the test set. For each pair, there are one positive context and three hard negative contexts.

2 PAPERS • NO BENCHMARKS YET

MobIE

MobIE is a German-language dataset which is human-annotated with 20 coarse- and fine-grained entity types and entity linking information for geographically linkable entities. The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities, 13.1K of which are linked to a knowledge base. A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types, while the remaining documents are annotated using a weakly-supervised labeling approach implemented with the Snorkel framework.

2 PAPERS • NO BENCHMARKS YET

Morph Call

Morph Call is a suite of 46 probing tasks for four Indo-European languages that fall under different morphology: Russian, French, English, and German. The tasks are designed to explore the morphosyntactic content of multilingual transformers which is a less studied aspect at the moment.

2 PAPERS • NO BENCHMARKS YET

MuCo-VQA

MuCo-VQA consist of large-scale (3.7M) multilingual and code-mixed VQA datasets in multiple languages: Hindi (hi), Bengali (bn), Spanish (es), German (de), French (fr) and code-mixed language pairs: en-hi, en-bn, en-fr, en-de and en-es.

2 PAPERS • NO BENCHMARKS YET

MultiSpider

MultiSpider is a large multilingual text-to-SQL dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese).

2 PAPERS • NO BENCHMARKS YET

PCC

PCC (Potsdam Commentary Corpus)

The Potsdam Commentary Corpus (PCC) is a corpus of 220 German newspaper commentaries (2.900 sentences, 44.000 tokens) taken from the online issues of the Märkische Allgemeine Zeitung (MAZ subcorpus) and Tagesspiegel (ProCon subcorpus) and is annotated with a range of different types of linguistic information.

2 PAPERS • NO BENCHMARKS YET

SAF

SAF (Short Answer Feedback Dataset)

This dataset can be found on HuggingFace:

2 PAPERS • NO BENCHMARKS YET

SubEdits

SubEdits is a human-annnoated post-editing dataset of neural machine translation outputs, compiled from in-house NMT outputs and human post-edits of subtitles form Rakuten Viki. It is collected from English-German annotations and contains 160k triplets.

2 PAPERS • NO BENCHMARKS YET

Tilde MODEL Corpus

Tilde MODEL Corpus (Tilde Multilingual Open Data for European Languages)

Tilde MODEL Corpus is a multilingual corpora for European languages – particularly focused on the smaller languages. The collected resources have been cleaned, aligned, and formatted into a corpora standard TMX format useable for developing new Language technology products and services.

2 PAPERS • NO BENCHMARKS YET

WikiCaps

WikiCaps is a large-scale multilingual but non-parallel data set for multimodal machine translation and retrieval. The image-caption data was extracted from Wikimedia Commons and is thus a representative of the collection of largely available non-descriptive image-caption pairs in the web. The current version of the dataset contains 3,816,940 images with 3,825,132 English captions and additional 1,000 image-caption pairs in German, French, and Russian together with their English counterparts.

2 PAPERS • NO BENCHMARKS YET

X-WikiRE

X-WikiRE is a new, large-scale multilingual relation extraction dataset in which relation extraction is framed as a problem of reading comprehension to allow for generalization to unseen relations.

2 PAPERS • NO BENCHMARKS YET

APE (Automatic Post-Editing)

APE is useful to evaluate Machine Translation automatic post-editing (APE), which is the task of improving the output of a blackbox MT system by automatically fixing its mistakes. The act of post-editing text can be fully specified as a sequence of delete and insert actions in given positions.

1 PAPER • NO BENCHMARKS YET

AQL-22

AQL-22 (Archive Query Log)

The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. The AQL is the first publicly available query log that combines size, scope, and diversity, enabling research on new retrieval models and search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.

1 PAPER • NO BENCHMARKS YET

Analysing state-backed propaganda websites: a new dataset and linguistic study

This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish.

1 PAPER • NO BENCHMARKS YET

Arendt

Digital Edition: Essays from Hannah Arendt We have created a NER dataset from the digital edition "Sechs Essays" by Hannah Arendt. It consists of 23 documents from the period 1932-1976, which are available as TEI files online (see https://hannah-arendt-edition.net/3p.html?lang=de).

1 PAPER • NO BENCHMARKS YET

BioVid (BioVid Heat Pain Database)

To advance methods for pain assessment, in particular automatic assessment methods, the BioVid Heat Pain Database was collected in a collaboration of the Neuro-Information Technology group of the University of Magdeburg and the Medical Psychology group of the University of Ulm. In our study, 90 participants were subjected to experimentally induced heat pain in four intensities. To compensate for varying heat pain sensitivities, the stimulation temperatures were adjusted based on the subject-specific pain threshold and pain tolerance. Each of the four pain levels was stimulated 20 times in randomized order. For each stimulus, the maximum temperature was held for 4 seconds. The pauses between the stimuli were randomized between 8-12 seconds. The pain stimulation experiment was conducted twice: once with un-occluded face and once with facial EMG sensors.

1 PAPER • NO BENCHMARKS YET

CGHD1152 (Circuit Graph Hand Drawn 1152)

1152 Images 144 Circuits 12 Drafter 48,563 Object (Symbol, Structural, Text) Annotations

1 PAPER • NO BENCHMARKS YET

CareerCoach 2022

The CareerCoach 2022 gold standard is available for download in the NIF and JSON format, and draws upon documents from a corpus of over 99,000 education courses which have been retrieved from 488 different education providers.

1 PAPER • NO BENCHMARKS YET

DEplain-APA-doc

DEplain-APA-doc: A German Parallel Corpus for Document Simplification on News Texts DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German “plain DE” (or in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.

1 PAPER • 1 BENCHMARK

DEplain-web-doc

DEplain-web-doc: A German Parallel Corpus for Document Simplification on Web Texts DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German “plain DE” (or in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.

1 PAPER • 1 BENCHMARK

Dataset of Legal Documents

Dataset of Legal Documents consists of court decisions from 2017 and 2018 were selected for the dataset, published online by the Federal Ministry of Justice and Consumer Protection. The documents originate from seven federal courts: Federal Labour Court (BAG), Federal Fiscal Court (BFH), Federal Court of Justice (BGH), Federal Patent Court (BPatG), Federal Social Court (BSG), Federal Constitutional Court (BVerfG) and Federal Administrative Court (BVerwG).

1 PAPER • NO BENCHMARKS YET

Dubbing Test Set

Dubbing Test Set consists of two subsets extracted from the En→De test set of COVOST-2, a large-scale multilingual speech translation corpus based on Common Voice. Specifically, the first subset is created by randomly sampling 91 sentences (test91), while the second is randomly sampled 101 sentences from the longest 10% of the De part of the test set (test101).

1 PAPER • NO BENCHMARKS YET

Energy Consumption Curves of 499 Customers from Spain

Predictions of energy consumption are crucial for energy retailers to minimize deviations from energy acquired in the day-ahead market and the actual consumption of their customers. The increasing spread of smartmeters means that retailers have access to hourly consumption values of all their contracted customers in realtime. Using machine learning algorithms, these hourly values can be used to calculate predictions for the future energy consumption of the customers. The present data set allows the training and validation of AI-based prediction models.

1 PAPER • NO BENCHMARKS YET

Event-QA

Contains 1000 semantic queries and the corresponding English, German and Portuguese verbalizations for EventKG - an event-centric knowledge graph with more than 970 thousand events.

1 PAPER • NO BENCHMARKS YET

Event-focused Emotion Corpora for German and English

A corpus designed in analogy to the well-established English ISEAR emotion dataset.

1 PAPER • NO BENCHMARKS YET

FEIDEGGER

FEIDEGGER (FEIDEGGER: A Multi-modal Corpus of Fashion Images and Descriptions in German)

The FEIDEGGER (fashion images and descriptions in German) dataset is a new multi-modal corpus that focuses specifically on the domain of fashion items and their visual descriptions in German. The dataset was created as part of ongoing research at Zalando into text-image multi-modality in the area of fashion.

1 PAPER • NO BENCHMARKS YET

Fallout New Vegas Dialog

Fallout New Vegas Dialog is a multilingual sentiment annotated dialog dataset from Fallout New Vegas. The game developers have preannotated every line of dialog in the game in one of the 8 different sentiments: anger, disgust, fear, happy, neutral, pained, sad and surprised and they have been translated into 5 different languages: English, Spanish, German, French and Italian.

1 PAPER • NO BENCHMARKS YET

Food Recall Incidents Dataset

The Food Recall Incidents dataset consists of 7,546 short texts (from 5 to 360 characters each), which are the titles of food recall announcements (therefore referred to as title), crawled from 24 public food safety authority websites by Agroknow. The texts are written in 6 languages, with English (6,644) and German (888) being the most common, followed by French (8), Greek (4), Italian (1) and Danish (1). Most of the texts have been authored after 2010 and they describe recalls of specific food products due to specific hazards. Experts manually classified each text to four groups of classes describing hazards and products on two levels of granularity:

1 PAPER • NO BENCHMARKS YET

GePaDe

This dataset encompasses 265 speeches (over 200,000 tokens) from the German Bundestag, primarily from the 19th legislative term (2017-2021), given by 195 distinct speakers representing 6 political parties.

1 PAPER • 2 BENCHMARKS

Harry Potter Dialogue Dataset

Harry Potter Dialogue is the first dialogue dataset that integrates with scene, attributes and relations which are dynamically changed as the storyline goes on. Our work can facilitate research to construct more human-like conversational systems in practice. For example, virtual assistant, NPC in games, etc. Moreover, HPD can both support dialogue generation and retrieval tasks.

1 PAPER • 2 BENCHMARKS

HumanEval-XL

We introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises of a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.

1 PAPER • NO BENCHMARKS YET

HumanMT

HumanMT is a collection of human ratings and corrections of machine translations. It consists of two parts: The first part contains five-point and pairwise sentence-level ratings, the second part contains error markings and corrections. Details are described in the following.

1 PAPER • NO BENCHMARKS YET

IAPR TC-12 (IAPR TC-12 Benchmark)

The image collection of the IAPR TC-12 Benchmark consists of 20,000 still natural images taken from locations around the world and comprising an assorted cross-section of still natural images. This includes pictures of different sports and actions, photographs of people, animals, cities, landscapes, and many other aspects of contemporary life. Each image is associated with a text caption in up to three different languages (English, German and Spanish).

1 PAPER • NO BENCHMARKS YET

Jam-ALT

Jam-ALT (JamALT: A Formatting-Aware Lyrics Transcription Benchmark)

JamALT is a revision of the JamendoLyrics dataset (80 songs in 4 languages), adapted for use as an automatic lyrics transcription (ALT) benchmark.

1 PAPER • 5 BENCHMARKS

LibriS2S

LibriS2S is a Speech to Speech Translation (S2ST) dataset build further upon existing resources. The dataset provides English-German speech and text quadruplets ranging just over 50 hours for both languages.

1 PAPER • NO BENCHMARKS YET

M-AILabs speech dataset

The M-AILABS Speech Dataset is the first large dataset that we are providing free-of-charge, freely usable as training data for speech recognition and speech synthesis. Most of the data is based on LibriVox and Project Gutenberg. The training data consist of nearly thousand hours of audio and the text-files in prepared format. A transcription is provided for each clip. Clips vary in length from 1 to 20 seconds and have a total length of approximately shown in the list (and in the respective info.txt-files) below. The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded by the LibriVox project and is also in the public domain

1 PAPER • 1 BENCHMARK

M-Phasis

M-Phasis (A Feature-Based Corpus of Hate Online)

A corpus of 9k German and French user comments collected from migration-related news articles. It goes beyond the hate-neutral dichotomy and is instead annotated with 23 features, which in combination become descriptors of various types of speech, ranging from critical comments to implicit and explicit expressions of hate. The annotations are performed by 4 native speakers per language and achieve high (0.77) inter-annotator agreements.

1 PAPER • NO BENCHMARKS YET

Mega-COV

Mega-COV is a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 234 countries), longitudinal (goes as back as 2007), multilingual (comes in 65 languages), and has a significant number of location-tagged tweets (~32M tweets).

1 PAPER • NO BENCHMARKS YET

MetaCLIR

This data adds textual meta-infomation data to two existing corpora for cross language information retrieval: BoostCLIR, and the Large Scale CLIR Dataset (wiki-clir).

1 PAPER • NO BENCHMARKS YET

MultiTACRED

MultiTACRED is a multilingual version of the large-scale TAC Relation Extraction Dataset. It covers 12 typologically diverse languages from 9 language families, and was created by the Speech & Language Technology group of DFKI by machine-translating the instances of the original TACRED dataset and automatically projecting their entity annotations. For details of the original TACRED's data collection and annotation process, see the Stanford paper. Translations are syntactically validated by checking the correctness of the XML tag markup. Any translations with an invalid tag structure, e.g. missing or invalid head or tail tag pairs, are discarded (on average, 2.3% of the instances).

1 PAPER • NO BENCHMARKS YET

Multilingual Persuasion Detection

This dataset contains dialogue lines from the games Knights of the Old Republic 1 & 2 and Neverwinter Nights 1. Some of the dialogue lines are marked as persuasive (which is when the player character is attempting a Persuade skill check.)

1 PAPER • NO BENCHMARKS YET

Multilingual Terms of Service

The first annotated corpus for multilingual analysis of potentially unfair clauses in online Terms of Service. The data set comprises a total of 100 contracts, obtained from 25 documents annotated in four different languages: English, German, Italian, and Polish. For each contract, potentially unfair clauses for the consumer are annotated, for nine different unfairness categories.

1 PAPER • NO BENCHMARKS YET

NISQA Speech Quality Corpus

The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet-loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions. Each file is labelled with subjective ratings of the overall quality and the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness. In total, it contains more than 97,000 human ratings for each of the dimensions and the overall MOS.

1 PAPER • NO BENCHMARKS YET

Names pairs dataset

Includes co-referent name string pairs along with their similarities.

1 PAPER • NO BENCHMARKS YET

Datasets

169 dataset results for German