Search Results for author: Kyubyong Park

Found 12 papers, 10 papers with code

K-HATERS: A Hate Speech Detection Corpus in Korean with Target-Specific Ratings

1 code implementation • 24 Oct 2023 • Chaewon Park, Soohwan Kim, Kyubyong Park, Kunwoo Park

This resource is the largest offensive language corpus in Korean and is the first to offer target-specific ratings on a three-point Likert scale, enabling the detection of hate expressions in Korean across varying degrees of offensiveness.

Hate Speech Detection

A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models

no code implementations • 4 Jun 2023 • Hyunwoong Ko, Kichang Yang, Minho Ryu, Taekyoon Choi, Seungmu Yang, Jiwung Hyun, Sungho Park, Kyubyong Park

This paper presents our work in developing the Polyglot Korean models, which propose some steps towards addressing the non-English language performance gap in multilingual language models.

An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks

1 code implementation • Asian Chapter of the Association for Computational Linguistics 2020 • Kyubyong Park, Joohong Lee, Seongbo Jang, Dawoon Jung

Typically, tokenization is the very first step in most text processing works.

Machine Translation • Natural Language Understanding +2

KoParadigm: A Korean Conjugation Paradigm Generator

1 code implementation • 28 Apr 2020 • Kyubyong Park

Korean is a morphologically rich language.

An Empirical Study of Invariant Risk Minimization

1 code implementation • 10 Apr 2020 • Yo Joong Choe, Jiyeon Ham, Kyubyong Park

Invariant risk minimization (IRM) (Arjovsky et al., 2019) is a recently proposed framework designed for learning predictors that are invariant to spurious correlations across different training environments.

Text Classification

KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding

3 code implementations • Findings of the Association for Computational Linguistics 2020 • Jiyeon Ham, Yo Joong Choe, Kyubyong Park, Ilji Choi, Hyungjoon Soh

Although several benchmark datasets for those tasks have been released in English and a few other languages, there are no publicly available NLI or STS datasets in the Korean language.

Natural Language Inference • Natural Language Understanding +2

g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset

1 code implementation • 7 Apr 2020 • Kyubyong Park, Seanie Lee

Conversion of Chinese graphemes to phonemes (G2P) is an essential component in Mandarin Chinese Text-To-Speech (TTS) systems.

Ranked #2 on Polyphone disambiguation on CPP

Polyphone disambiguation

Jejueo Datasets for Machine Translation and Speech Synthesis

1 code implementation • LREC 2020 • Kyubyong Park, Yo Joong Choe, Jiyeon Ham

Jejueo was classified as critically endangered by UNESCO in 2010.

Machine Translation • Speech Synthesis +1

word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs

2 code implementations • LREC 2020 • Yo Joong Choe, Kyubyong Park, Dongwoo Kim

We wrap our dataset and model in an easy-to-use Python library, which supports downloading and retrieving top-k word translations in any of the supported language pairs as well as computing top-k word translations for custom parallel corpora.

Sentence Translation
