Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

ACL 2020. Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models...
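
To make the idea of behavioral testing concrete, here is a minimal sketch of a black-box test in the software-engineering sense: the model is treated as a function from text to label, and each test case pairs an input with the behavior we expect. Everything in it is an assumption made for illustration: `toy_sentiment` is a hypothetical keyword classifier standing in for a real model, `run_behavioral_test` is a hypothetical helper, and none of this is the CheckList library's API or the authors' implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a real NLP model: a tiny keyword-based
# sentiment classifier. It exists only so the test below has something to probe.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}
NEGATORS = {"not", "never"}


def toy_sentiment(text: str) -> str:
    words = text.lower().replace(".", "").replace(",", "").split()
    score = 0
    negated = False
    for word in words:
        if word in NEGATORS:
            negated = True
        elif word in POSITIVE:
            score += -1 if negated else 1
            negated = False
        elif word in NEGATIVE:
            score += 1 if negated else -1
            negated = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def run_behavioral_test(
    model: Callable[[str], str],
    cases: List[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (text, expected, predicted) for every case the model gets wrong."""
    failures = []
    for text, expected in cases:
        predicted = model(text)
        if predicted != expected:
            failures.append((text, expected, predicted))
    return failures


# Each case targets one behavior rather than sampling from a held-out set.
cases = [
    ("The food was great.", "positive"),                      # plain vocabulary
    ("The food was not great.", "negative"),                  # negation
    ("I hate the service.", "negative"),                      # plain vocabulary
    ("The flight was on time.", "neutral"),                   # no sentiment words
    ("I thought I would hate it, but I love it.", "positive"),  # contrast
]

failures = run_behavioral_test(toy_sentiment, cases)
failure_rate = len(failures) / len(cases)
print(f"failure rate: {failure_rate:.0%}")
for text, expected, predicted in failures:
    print(f"FAIL: {text!r} expected={expected} predicted={predicted}")
```

The toy model passes the simple vocabulary and negation cases but fails the contrast case, which is the kind of capability-specific weakness that an aggregate accuracy number can hide.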


