Rethinking Data Augmentation in Text-to-text Paradigm

COLING 2022 · Yanan Chen, Yang Liu ·

As manually labelling data can be costly, some recent studies tend to augment the training data for improving the generalization power of machine learning models, known as data augmentation (DA). With the arise of pre-trained language models (PLMs), some recent works on DA try to synthesize new samples benefiting from the knowledge learned from PLM’s pre-training. Along the same direction, we in this paper propose to integrate text-to-text language models and construct a new two-phase framework for augmentation: 1) a fine-tuning phase where PLMs are well adapted to downstream classification with the help of two novel schemes, and 2) a generation phase where the fine-tuned models are leveraged to create new samples for performance lifting. This paradigm opens up a new way of designing fine-tuning scheme to better serve DA in an easy-to-implement manner, and can be easily extended to other desired tasks. We evaluate our proposal on two public classification datasets and demonstrate its effectiveness with remarkable gains.

PDF Abstract