We prepared a dataset from publicly available TED Talks transcripts [27], selecting the Turkish corpus. The resulting Turkish punctuation restoration dataset currently consists of 146K sentences and 1.8M tokens. The train, validation, and test splits have ratios of 0.8, 0.1, and 0.1, respectively. Data files contain two columns: the first holds the tokens, separated by white space, and the second holds the tag for each token.
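
The two-column layout can be read with a few lines of Python. The sketch below is illustrative only and rests on assumptions not stated in the description: that the two columns are separated by a tab, that each row holds one sentence, and that the tags in the second column are themselves separated by white space. The function name `read_rows` and the tag names `O` and `PERIOD` in the sample are hypothetical placeholders, not the dataset's actual label set.

```python
import csv
import io


def read_rows(handle):
    """Yield (tokens, tags) pairs from a two-column, tab-delimited file object.

    Assumption: column 1 holds whitespace-separated tokens and column 2 holds
    one whitespace-separated tag per token.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank lines and rows without exactly two columns
        tokens, tags = row[0].split(), row[1].split()
        if len(tokens) != len(tags):
            raise ValueError(f"token/tag count mismatch in row: {row!r}")
        yield tokens, tags


# Hypothetical sample row; the real tag inventory may differ.
sample = "merhaba dünya\tO PERIOD\n"
rows = list(read_rows(io.StringIO(sample)))
```

For a real file, the same function could be called on `open(path, encoding="utf-8", newline="")`; if the files turn out to use a different delimiter or one token per line, the `delimiter` argument and the splitting logic would need to be adjusted accordingly.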