CoNFET: An English Sentence to Emojis Translation Algorithm

6 Jan 2021 · Alex Day, Chris Mankos, Soo Kim, Jody Strausser

Emojis are a collection of emoticons that have been standardized by the Unicode Consortium. Currently, there are over 3,000 emojis in the Unicode standard. These small pictographs can represent anything from a concept as vague as laughter (🤣) to something as specific as passport control (🛂). Because of their high information density and sheer number, emojis have become prevalent in common communication media such as SMS and Twitter. There is a need to improve natural language understanding in the emoji domain. To this end, we present the CoNFET (Composition of N-grams for Emoji Translation) algorithm, which translates an English sentence into a sequence of emojis. The algorithm consists of three main parts: n-gram sequence generation, n-gram to emoji translation, and translation scoring. First, the input sentence is split into its constituent n-grams, either exhaustively or using dependency relations. Second, each n-gram of the sentence is translated into an emoji by finding its nearest neighbor in a vectorized linguistic space. Finally, each candidate translation is scored using either a simple average or an average weighted by the Term Frequency-Inverse Document Frequency (TF-IDF) score of each n-gram. The sequence of emojis with the highest score is then selected as the output, which serves as a summary of the sentence.
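
The three stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a tiny, hypothetical emoji lexicon (`EMOJI_DESCRIPTIONS`), replaces the vectorized linguistic space with a toy bag-of-words vector and cosine similarity, and covers only the exhaustive segmentation mode with simple-average scoring. Dependency-based splitting and TF-IDF weighting are omitted. All function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical lexicon (assumption): emoji -> short English description.
EMOJI_DESCRIPTIONS = {
    "🐶": "dog",
    "🌭": "hot dog",
    "🔥": "hot fire",
    "🍴": "eat food",
}


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segmentations(tokens):
    """Step 1 (exhaustive mode): yield every split into contiguous n-grams."""
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        head = " ".join(tokens[:i])
        for rest in segmentations(tokens[i:]):
            yield [head] + rest


def translate_ngram(ngram):
    """Step 2: nearest emoji description, returned as (emoji, similarity)."""
    vec = embed(ngram)
    candidates = ((e, cosine(vec, embed(d))) for e, d in EMOJI_DESCRIPTIONS.items())
    return max(candidates, key=lambda pair: pair[1])


def translate(sentence):
    """Step 3: score each segmentation by simple average and keep the best."""
    best_score, best_emojis = -1.0, []
    for ngrams in segmentations(sentence.split()):
        pairs = [translate_ngram(g) for g in ngrams]
        score = sum(s for _, s in pairs) / len(pairs)
        if score > best_score:
            best_score, best_emojis = score, [e for e, _ in pairs]
    return "".join(best_emojis), best_score


if __name__ == "__main__":
    print(translate("eat hot dog"))
```

With this toy lexicon, the input "eat hot dog" is best covered by the segmentation ["eat", "hot dog"], so the sketch returns 🍴🌭. Note that exhaustive segmentation produces 2^(n-1) candidates for a sentence of n tokens, which may be one reason to restrict the candidates, for example with dependency relations.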
