DIME: AN INFORMATION-THEORETIC DIFFICULTY MEASURE FOR AI DATASETS

Evaluating the relative difficulty of widely-used benchmark datasets across time and across data modalities is important for accurately measuring progress in machine learning. To help tackle this problem, we propose DIME, an information-theoretic DIfficulty MEasure for datasets, based on conditional entropy estimation of the sample-label distribution. Theoretically, we prove a model-agnostic and modality-agnostic lower bound on the 0-1 error by extending Fano's inequality to the common supervised learning scenario where labels are discrete and features are continuous. Empirically, we estimate this lower bound using a neural network to compute DIME. DIME can be decomposed into components attributable to the data distribution and the number of samples, and it also yields per-class difficulty scores. Through extensive experiments on both vision and language datasets, we show that DIME aligns well with the empirically observed performance of state-of-the-art machine learning models. We hope that DIME can aid future dataset design and model-training strategies.
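The two quantities at the heart of the abstract, a conditional-entropy estimate and a Fano-style error floor, lend themselves to a short sketch. The snippet below is not the authors' released code (no implementation is linked yet); the helper names `estimate_conditional_entropy_bits` and `fano_error_lower_bound` are hypothetical. It estimates H(Y|X) in bits via a trained classifier's average cross-entropy, which upper-bounds the conditional entropy, and then applies the standard weakened form of Fano's inequality, P_e >= (H(Y|X) - 1) / log2(K - 1), to get a model-agnostic lower bound on 0-1 error in the spirit of DIME.

```python
import numpy as np
import torch
import torch.nn as nn


def estimate_conditional_entropy_bits(model, loader, device="cpu"):
    """Plug-in estimate of H(Y|X) in bits from a classifier's cross-entropy.

    Any classifier's average cross-entropy upper-bounds H(Y|X); a
    well-trained model tightens the estimate. Assumes `model` returns
    raw logits and `loader` yields (inputs, integer labels) batches.
    """
    ce = nn.CrossEntropyLoss(reduction="sum")
    total_nats, n = 0.0, 0
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total_nats += ce(model(x), y).item()
            n += y.numel()
    return (total_nats / n) / np.log(2.0)  # convert nats to bits


def fano_error_lower_bound(cond_entropy_bits, num_classes):
    """Weakened Fano bound: P_e >= (H(Y|X) - 1) / log2(K - 1).

    Follows from H(E) + P_e * log2(K - 1) >= H(Y|X) and the fact that
    the binary entropy H(E) of the error event is at most 1 bit.
    Informative only for K >= 3; the result is clipped to [0, 1].
    """
    if num_classes < 3:
        return 0.0  # log2(K - 1) = 0 for K = 2; only the trivial bound holds
    bound = (cond_entropy_bits - 1.0) / np.log2(num_classes - 1)
    return float(np.clip(bound, 0.0, 1.0))
```

As a usage sketch, plugging a trained classifier's test-set cross-entropy into `fano_error_lower_bound(h_bits, 10)` for a 10-class dataset gives a floor on achievable 0-1 error. The paper additionally decomposes its estimate into data-distribution and sample-size components and reports per-class scores, which this sketch omits.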

NeurIPS Workshop 2020
