Turkish Punctuation Restoration

We prepared a dataset from publicly available TED Talks transcripts [27], selecting the Turkish corpus. The resulting Turkish punctuation restoration dataset currently consists of 146K sentences and 1.8M tokens. The train, validation, and test splits are in the ratio 0.8, 0.1, and 0.1, respectively. Each data file has two columns: the first holds the tokens, separated by white space, and the second holds the tag for each token.
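As a rough illustration of the two-column layout and the 0.8/0.1/0.1 split, the sketch below parses token/tag lines into sentences and computes split sizes. Several details are assumptions that the description does not confirm: one token and one tag per line, a blank line between sentences, and the tag names (`O`, `PERIOD`, `QUESTION`) in the sample. The function names `parse_two_column` and `split_counts` are hypothetical helpers, not part of any released loader.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def parse_two_column(lines: Iterable[str]) -> List[Sentence]:
    """Group (token, tag) pairs into sentences.

    Assumes one whitespace-separated token/tag pair per line and a
    blank line between sentences; the real files may differ.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # First field is the token, last field is its tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def split_counts(n: int, ratios=(0.8, 0.1, 0.1)) -> Tuple[int, int, int]:
    """Return (train, validation, test) sizes for n sentences."""
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    # The test split takes whatever remains so the sizes sum to n.
    return n_train, n_val, n - n_train - n_val


# Hypothetical sample; the tag names are illustrative only.
sample = "merhaba O\ndünya PERIOD\n\nnasılsın QUESTION\n"
parsed = parse_two_column(sample.splitlines())
sizes = split_counts(len(parsed))
```

Giving the remainder to the test split keeps the three sizes summing to the total even when the ratios do not divide the sentence count evenly.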

License


  • Unknown
