KP20k is a large-scale dataset of scholarly articles, with 528K articles for training, 20K for validation, and 20K for testing.
79 PAPERS • 3 BENCHMARKS
We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding which publications describe which processes, tasks and materials. Although this was a new task, we had a total of 26 submissions across 3 evaluation scenarios. We expect the task and the findings reported in this paper to be relevant for researchers working on understanding scientific content, as well as the broader knowledge base population and information extraction communities.
33 PAPERS • 2 BENCHMARKS
KPTimes is a large-scale dataset of news texts paired with editor-curated keyphrases.
25 PAPERS • 3 BENCHMARKS
Paper: Improved automatic keyword extraction given more linguistic knowledge. DOI: 10.3115/1119355.1119383
6 PAPERS • 2 BENCHMARKS
A dataset for benchmarking keyphrase extraction and generation techniques on long English scientific documents. The dataset is of high quality and consists of 2,000 scientific papers from the Computer Science domain published by ACM. Each paper has keyphrases assigned by its authors and verified by reviewers. The parts of each paper, such as the title and abstract, are separated, which enables extraction from a specific part of an article's text. The content of each paper was converted from PDF to plain text, and formulae, tables, figures, and LaTeX markup were removed automatically. Link: https://huggingface.co/datasets/midas/krapivin
1 PAPER • 1 BENCHMARK
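Because the ACM long-document dataset above keeps the parts of each paper separate, one can check which author-assigned keyphrases occur verbatim in which part. The following is a minimal sketch of that idea, not the dataset's own tooling: the function name `keyphrases_by_part`, the dictionary layout, and the sample paper are all hypothetical, and matching is a plain case-insensitive substring test with no stemming.

```python
from typing import Dict, List


def keyphrases_by_part(parts: Dict[str, str], keyphrases: List[str]) -> Dict[str, List[str]]:
    """Map each part name to the keyphrases that occur verbatim in that part.

    Matching is a case-insensitive substring test (an assumption made for
    illustration; real evaluations often stem both sides first).
    """
    result: Dict[str, List[str]] = {}
    for name, text in parts.items():
        lowered = text.lower()
        result[name] = [kp for kp in keyphrases if kp.lower() in lowered]
    return result


# Hypothetical paper with two separated parts and three keyphrases.
paper = {
    "title": "Graph-based keyphrase extraction",
    "abstract": "We rank candidate phrases with a random walk over a word graph.",
}
keyphrases = ["keyphrase extraction", "random walk", "topic model"]

found = keyphrases_by_part(paper, keyphrases)

# Keyphrases that appear in no part at all ("absent" keyphrases).
absent = [kp for kp in keyphrases if all(kp not in hits for hits in found.values())]
```

In this toy input, "topic model" appears in neither part, so it ends up in `absent`; such phrases can only be produced by generation, not by extraction.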
The dataset was constructed by first finding suitable publications and then collecting keyphrases from manual annotators. The Google SOAP API was used to find documents using variants of the query “keywords general terms filetype:pdf”. Over 250 of these PDF documents were downloaded for further processing. The documents were then manually restricted to scientific conference papers with a length of 4-12 pages. The PDF documents were converted to plain text using the PDF995 software suite, which handled two-column text better than the other programs tried. At the end of this process, 211 documents that had converted to plain text without problems were selected. The authors then recruited student volunteers from their department to take part in manual keyphrase assignment. Each volunteer was given three PDF files (with author-assigned keyphrases hidden) to assign keyphrases to.