An open corpus of scientific research papers with a representative sample from across scientific disciplines. The corpus includes not only the full text of each article but also the document metadata and the bibliographic information for each reference.
2 PAPERS • NO BENCHMARKS YET
We present YTSeg, a topically and structurally diverse benchmark for the text segmentation task based on YouTube transcriptions. The dataset comprises 19,299 videos from 393 channels, amounting to 6,533 content hours. The topics are wide-ranging, covering domains such as science, lifestyle, politics, health, economy, and technology. The videos span various content formats, such as podcasts, lectures, news, corporate events and promotional content, and, more broadly, videos from individual content creators. We refer to the paper for further information.
1 PAPER • 2 BENCHMARKS