TD-ConE: An Information-Theoretic Approach to Assessing Parallel Text Generation Data

ACL ARR January 2022 · Anonymous

Existing data assessment methods are designed mainly for classification datasets and are of limited use for natural language generation (NLG) datasets. In this work, we focus on parallel NLG datasets and address this problem with an information-theoretic approach, TD-ConE, which assesses data uncertainty from input-output sequence mappings. Our experiments on text style transfer datasets show that this simple method measures data uncertainty more faithfully than several more complex alternatives and correlates strongly with downstream model performance. As an extension of TD-ConE, we introduce TD-ConE_Rel to compute the relative uncertainty between two datasets. Our experiments on paraphrase generation datasets show that selecting data with lower TD-ConE_Rel scores leads to better model performance and lower validation perplexity.
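To make the underlying idea concrete, the sketch below estimates the conditional uncertainty of outputs given inputs for a parallel dataset by scoring each target sequence under a pretrained seq2seq model and averaging the per-token negative log-likelihood. This is only an illustration of conditional-entropy-style scoring under stated assumptions, not the paper's TD-ConE implementation: the choice of t5-small as the scorer, the function name, and the toy data are all assumptions.

```python
# Illustrative sketch (not the paper's TD-ConE): estimate H(Y | X) for a
# parallel dataset as the average per-token negative log-likelihood (nats)
# of the target given the source, using a pretrained seq2seq scorer.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # assumed stand-in scorer, not specified by the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()

def conditional_uncertainty(pairs):
    """Average per-token cross-entropy (nats) of target given source."""
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for src, tgt in pairs:
            enc = tokenizer(src, return_tensors="pt")
            labels = tokenizer(tgt, return_tensors="pt").input_ids
            out = model(**enc, labels=labels)
            # out.loss is the mean cross-entropy over target tokens
            total_nll += out.loss.item() * labels.numel()
            total_tokens += labels.numel()
    return total_nll / total_tokens

# Relative comparison between two parallel datasets: a lower score suggests
# more predictable input-output mappings (toy examples, for illustration only).
dataset_a = [("she is very happy", "she is delighted")]
dataset_b = [("she is very happy", "the weather turned cold")]
print(conditional_uncertainty(dataset_a), conditional_uncertainty(dataset_b))
```

In this sketch, comparing the two scores plays the role of a relative-uncertainty measure between datasets; the actual TD-ConE_Rel definition is given in the paper.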
