LEPISZCZE is an open-source, comprehensive benchmark for Polish NLP with a continuous-submission leaderboard. It gathers public Polish datasets, both existing and new, and groups them by task.
2 PAPERS • NO BENCHMARKS YET
We have prepared a dataset using publicly available TED Talks transcripts [27] and selected the Turkish corpus. The resulting Turkish punctuation restoration dataset currently consists of 146K sentences and 1.8M tokens. The train, validation, and test splits make up 0.8, 0.1, and 0.1 of the data, respectively. Each data file contains two columns: the first holds the tokens, separated by white space, and the second holds the tag for each token.
1 PAPER • NO BENCHMARKS YET
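The two-column layout described for the Turkish punctuation restoration files could be read along the following lines. This is only a sketch under assumptions the description does not confirm: one sentence per line, a tab between the two columns, and as many whitespace-separated tags as tokens. The function name `parse_line` and the tag names `O` and `PERIOD` are hypothetical, chosen for illustration.

```python
from typing import List, Tuple


def parse_line(line: str, column_sep: str = "\t") -> List[Tuple[str, str]]:
    """Pair each token with its tag for one line of a data file.

    Assumed layout (not confirmed by the dataset description):
    "<tokens separated by spaces><TAB><tags separated by spaces>"
    """
    # Split into exactly two columns; a line without the separator
    # fails the unpacking and raises ValueError.
    tokens_col, tags_col = line.rstrip("\n").split(column_sep, 1)
    tokens = tokens_col.split()
    tags = tags_col.split()
    if len(tokens) != len(tags):
        raise ValueError(
            f"{len(tokens)} tokens but {len(tags)} tags in line: {line!r}"
        )
    return list(zip(tokens, tags))


if __name__ == "__main__":
    # Hypothetical example line; real tag names may differ.
    example = "merhaba dünya\tO PERIOD\n"
    for token, tag in parse_line(example):
        print(token, tag)
```

If the actual files use a different separator, pass it through `column_sep`; if they store one token per line instead, this sketch does not apply as written.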