no code implementations • 30 Mar 2024 • Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter
We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords.
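The core operation can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the merge-table representation, and the recursive expansion below are illustrative assumptions, showing only the idea of removing subwords below a frequency threshold and re-splitting them into the components they were merged from.

```python
from collections import Counter

def trim_vocabulary(corpus_tokens, merges, threshold):
    """Hypothetical sketch of threshold vocabulary trimming: subwords
    occurring fewer than `threshold` times are dropped from the
    vocabulary, and each occurrence is re-split into the pair of
    component subwords it was originally merged from."""
    # merges maps a merged subword to the (left, right) pair that formed it
    freq = Counter(corpus_tokens)
    rare = {t for t, c in freq.items() if c < threshold and t in merges}

    def expand(token):
        # Recursively replace a trimmed subword with its components,
        # since a component may itself have been trimmed.
        if token in rare:
            left, right = merges[token]
            return expand(left) + expand(right)
        return [token]

    out = []
    for tok in corpus_tokens:
        out.extend(expand(tok))
    return out

# Toy merge table: "l" + "o" -> "lo", "lo" + "w" -> "low"
merges = {"lo": ("l", "o"), "low": ("lo", "w")}
tokens = ["low", "lo", "l", "o"]
print(trim_vocabulary(tokens, merges, threshold=2))
```

With threshold 2, both "low" and "lo" occur only once and are recursively expanded back to characters, while base symbols without a merge entry are always kept.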
no code implementations • 22 Feb 2024 • Marco Cognetta, Vilém Zouhar, Sangwhan Moon, Naoaki Okazaki
In Tokenization and the Noiseless Channel (Zouhar et al., 2023a), Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest Rényi efficiency of the unigram distribution should be chosen.
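The metric itself is straightforward to compute. Below is a minimal sketch (the function name and the toy input are my own): the Rényi entropy of order α of the unigram token distribution, normalized by log |V| so that a uniform distribution scores 1.

```python
import math
from collections import Counter

def renyi_efficiency(tokens, alpha=2.5):
    """Rényi efficiency of the unigram token distribution:
    H_alpha(p) / log|V|, where H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha).
    alpha is a free parameter; alpha -> 1 recovers Shannon entropy."""
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if len(probs) < 2:
        return 1.0  # degenerate one-token vocabulary
    if alpha == 1.0:
        h = -sum(p * math.log(p) for p in probs)  # Shannon limit
    else:
        h = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return h / math.log(len(probs))

print(renyi_efficiency(["the", "cat", "sat", "on", "the", "mat"], alpha=2.5))
```

A uniform unigram distribution attains efficiency 1; the more skewed the token frequencies, the lower the score.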
no code implementations • IJCNLP 2019 • Jun-U Park, Sang-Ki Ko, Marco Cognetta, Yo-Sub Han
We continue the study of generating semantically correct regular expressions from natural language descriptions (NL).
no code implementations • WS 2019 • Marco Cognetta, Cyril Allauzen, Michael Riley
Indeed, a delicate balance between comprehensiveness, speed, and memory must be struck to conform to device requirements while providing a good user experience. In this paper, we describe a compression scheme for lexicons when represented as finite-state transducers.
no code implementations • ACL 2019 • Marco Cognetta, Yo-Sub Han, Soon Chan Kwon
Probabilistic finite automata (PFAs) are common statistical language models in natural language and speech processing.
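As background for how a PFA assigns probabilities to strings, here is a minimal sketch of the standard forward computation (the data-structure layout and the toy automaton are assumptions for illustration, not taken from the paper): the probability of a string is the total weight of all accepting paths labelled with it.

```python
def string_probability(initial, transitions, final, s):
    """Forward algorithm on a probabilistic finite automaton (PFA).
    initial: state -> start probability
    transitions: (state, symbol) -> list of (next_state, prob)
    final: state -> stopping probability
    Runs in O(|s| * E) for E transition entries per symbol."""
    fwd = dict(initial)  # probability mass currently in each state
    for sym in s:
        nxt = {}
        for q, w in fwd.items():
            for r, p in transitions.get((q, sym), []):
                nxt[r] = nxt.get(r, 0.0) + w * p
        fwd = nxt
    # Weight paths by the stopping probability of their end state.
    return sum(w * final.get(q, 0.0) for q, w in fwd.items())

# Toy two-state PFA over {a, b}: from state 0, emit "a" and stay (0.5),
# emit "b" and move to state 1 (0.3), or stop (0.2); state 1 always stops.
initial = {0: 1.0}
transitions = {(0, "a"): [(0, 0.5)], (0, "b"): [(1, 0.3)]}
final = {0: 0.2, 1: 1.0}
print(string_probability(initial, transitions, final, "ab"))  # 0.5 * 0.3 * 1.0
```

Each state's outgoing transition weights plus its stopping weight sum to 1, so the PFA defines a proper distribution over finite strings.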
no code implementations • EMNLP 2018 • Marco Cognetta, Yo-Sub Han, Soon Chan Kwon
The problem of computing infix probabilities of strings when the pattern distribution is given by a probabilistic context-free grammar or by a probabilistic finite automaton is already solved, yet it was open to compute the infix probabilities in an incremental manner.