This is a benchmark for neural paraphrase detection, i.e. for distinguishing original content from machine-generated paraphrases.
It contains 1,474,230 aligned paragraphs (98,282 original and 1,375,948 paraphrased; the paraphrases were produced with 3 models and 5 hyperparameter configurations each, in sets of 98,282 paragraphs) extracted from 4,012 English Wikipedia articles.
BERT-large (cased):
arXiv - Original - 20,966; Paraphrased - 20,966;
Theses - Original - 5,226; Paraphrased - 5,226;
Wikipedia - Original - 39,241; Paraphrased - 39,241;
RoBERTa-large (cased):
arXiv - Original - 20,966; Paraphrased - 20,966;
Theses - Original - 5,226; Paraphrased - 5,226;
Wikipedia - Original - 39,241; Paraphrased - 39,241;
Longformer-large (uncased):
arXiv - Original - 20,966; Paraphrased - 20,966;
Theses - Original - 5,226; Paraphrased - 5,226;
Wikipedia - Original - 39,241; Paraphrased - 39,241;
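As a quick sanity check on the listing above, the per-source counts can be tallied in a few lines of Python. This is only a sketch: the dictionary layout and the names `SPLIT_COUNTS`, `totals_per_model`, and `is_balanced` are hypothetical and are not part of the benchmark's own tooling; the numbers are copied from the listing.

```python
# Paragraph counts per source, copied from the listing above.
# The listing gives the same counts for each of the three models.
# The names used here are illustrative assumptions, not an official API.
SPLIT_COUNTS = {
    "arXiv": {"original": 20_966, "paraphrased": 20_966},
    "Theses": {"original": 5_226, "paraphrased": 5_226},
    "Wikipedia": {"original": 39_241, "paraphrased": 39_241},
}


def totals_per_model(counts):
    """Sum original and paraphrased paragraphs over all sources."""
    original = sum(c["original"] for c in counts.values())
    paraphrased = sum(c["paraphrased"] for c in counts.values())
    return {
        "original": original,
        "paraphrased": paraphrased,
        "total": original + paraphrased,
    }


def is_balanced(counts):
    """Return True if every source has as many paraphrased as original paragraphs."""
    return all(c["original"] == c["paraphrased"] for c in counts.values())


totals = totals_per_model(SPLIT_COUNTS)
print(totals)                     # per-model totals across the three sources
print(is_balanced(SPLIT_COUNTS))  # whether the two classes are equally sized
```

With the counts as listed, each model's split holds 65,433 original and 65,433 paraphrased paragraphs, so the two classes are the same size in every source.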