
GYAFC (Grammarly’s Yahoo Answers Formality Corpus)

Introduced by Rao et al. in "Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus, Benchmarks and Metrics for Formality Style Transfer"

Grammarly’s Yahoo Answers Formality Corpus (GYAFC) is a formality style transfer corpus containing a total of 110K informal / formal sentence pairs. Its authors describe it as the largest dataset for any style.

Yahoo Answers is a question-answering forum that contains a large number of informal sentences and allows redistribution of its data. The authors used the Yahoo Answers L6 corpus to create the GYAFC dataset of informal and formal sentence pairs. To ensure a uniform distribution of data, they removed sentences that are questions, contain URLs, or are shorter than 5 words or longer than 25 words. After these preprocessing steps, 40 million sentences remain.
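
The preprocessing filters described above can be sketched as a single predicate. This is only an illustration, not the authors' code: the function name keep_sentence, the URL pattern, the "ends with a question mark" test for questions, and whitespace tokenization are all assumptions made here, since the description does not say how questions, URLs, or words were detected.

```python
import re

# Assumed URL pattern; the original filtering rule is not specified.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def keep_sentence(sentence, min_words=5, max_words=25):
    """Return True if the sentence passes all three filters.

    Hypothetical helper: drops questions, sentences containing URLs,
    and sentences shorter than min_words or longer than max_words.
    """
    text = sentence.strip()
    # Assumption: a sentence counts as a question if it ends with "?".
    if text.endswith("?"):
        return False
    if URL_PATTERN.search(text):
        return False
    # Assumption: words are whitespace-separated tokens.
    n_words = len(text.split())
    return min_words <= n_words <= max_words
```

A sentence is kept only when every check passes, so the three removal criteria combine with "or": failing any one of them is enough to drop the sentence.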

The Yahoo Answers corpus spans several domains, such as Business, Entertainment & Music, Travel, and Food. Pavlick and Tetreault (PT16) show that formality level varies significantly across genres. To control for this variation, the authors work with the two domains that contain the most informal sentences and report results on training and testing within those domains. To identify informal sentences, they use the formality classifier from PT16, trained on the Answers genre of the PT16 corpus, which consists of nearly 5,000 randomly selected sentences from Yahoo Answers manually annotated on a scale from -3 (very informal) to 3 (very formal). They find that Entertainment & Music and Family & Relationships contain the most informal sentences, and they create the GYAFC dataset from these two domains.
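
The domain selection step can be sketched as counting informal sentences per domain and ranking the domains by that count. This is a minimal sketch under stated assumptions: rank_domains_by_informality is a hypothetical helper, the input is assumed to be (domain, score) pairs already produced by some formality classifier on the -3 to 3 scale, and the threshold of 0 for "informal" is an assumption of this example rather than something stated in the description above.

```python
from collections import defaultdict


def rank_domains_by_informality(scored_sentences, threshold=0.0):
    """Rank domains by how many informal sentences they contain.

    scored_sentences: iterable of (domain, formality_score) pairs, with
    scores on the -3 (very informal) to 3 (very formal) scale.
    threshold: assumed cut-off; scores below it count as informal.
    Returns a list of (domain, informal_count), most informal first.
    """
    counts = defaultdict(int)
    for domain, score in scored_sentences:
        if score < threshold:
            counts[domain] += 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

Taking the first two entries of the returned list corresponds to picking the two domains with the most informal sentences. Domains with no informal sentences do not appear in the result.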

Source: Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus, Benchmarks and Metrics for Formality Style Transfer
