Data Valuation

27 papers with code • 0 benchmarks • 0 datasets

Data valuation in machine learning aims to determine the worth of data, or of entire datasets, for downstream tasks. Some methods are task-agnostic and consider datasets as a whole, typically by looking at distributional distances between samples; these are mostly used for decision-making in data markets. More often, methods examine how individual points affect the performance of specific machine learning models, assigning each element of a training set a scalar that reflects its contribution to the final performance of a model trained on it. Some notions of value depend on a specific model of interest, while others are model-agnostic.
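
As a concrete illustration of the per-point view, the sketch below computes leave-one-out (LOO) values: the change in validation accuracy caused by removing each training point and retraining. The model, synthetic dataset, and accuracy metric are arbitrary choices for the example, not those of any particular method listed here.

```python
# Minimal leave-one-out (LOO) valuation sketch; all modeling choices are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def val_accuracy(train_idx):
    """Validation accuracy of a model trained on the given subset of training indices."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr[train_idx], y_tr[train_idx])
    return model.score(X_val, y_val)

full_idx = np.arange(len(X_tr))
base_acc = val_accuracy(full_idx)

# LOO value of point i: how much validation accuracy drops when i is removed.
loo_values = np.array([
    base_acc - val_accuracy(np.delete(full_idx, i)) for i in range(len(X_tr))
])
print(loo_values[:10])  # positive value -> removing the point hurts the model
```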

Concepts of the usefulness of a datum, or of its influence on the outcome of a prediction, have a long history in statistics and ML, in particular through the notion of the influence function. Only recently, however, have rigorous and practical notions of value for data, and in particular for datasets, appeared in the ML literature, often based on concepts from cooperative game theory, but also on generalization estimates for neural networks or on optimal transport theory, among others.

Most implemented papers

ModelPred: A Framework for Predicting Trained Model from Training Data

yyzeng43/ModelPred 24 Nov 2021

In this work, we propose ModelPred, a framework that helps to understand the impact of changes in training data on a trained model.

Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards

XinyiYS/CML-RewardDistribution 17 Dec 2021

This paper presents a novel collaborative generative modeling (CGM) framework that incentivizes collaboration among self-interested parties to contribute data to a pool for training a generative model (e.g., a GAN), from which synthetic data are drawn and distributed to the parties as rewards commensurate with their contributions.

Probably Approximate Shapley Fairness with Applications in Machine Learning

BobbyZhouZijian/ProbablyApproximateShapleyFairness 1 Dec 2022

We observe that the fairness guarantees of exact Shapley values (SVs) are too restrictive for SV estimates.

Data Valuation Without Training of a Model

jjchy/cg_score 3 Jan 2023

Many recent works on understanding deep learning try to quantify how much individual data instances influence the optimization and generalization of a model.

FairShap: A Data Re-weighting Approach for Algorithmic Fairness based on Shapley Values

AdrianArnaiz/fair-shap 3 Mar 2023

Algorithmic fairness is of utmost societal importance, yet the current trend toward large-scale machine learning models requires training on massive datasets that are frequently biased.

A Note on "Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms"

jiachen-t-wang/softlabel-knnsv 9 Apr 2023

In this note, we revisit the work of Jia et al. (2019) and propose a more natural and interpretable utility function that better reflects the performance of KNN models.
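
For context, a common hard-label utility in this line of work (often attributed to Jia et al., 2019) scores a training subset, for a single validation point, by the fraction of its K nearest neighbors in the subset that share the validation label. The sketch below illustrates that background utility with assumed names; the soft-label utility proposed in the note differs and is not reproduced here.

```python
# Illustrative hard-label KNN utility for one validation point; the note
# proposes a soft-label alternative, which is not reproduced here.
import numpy as np

def knn_utility(X_subset, y_subset, x_val, y_val, K=5):
    """Number of the (up to) K nearest neighbors of x_val in the subset
    whose label matches y_val, divided by K."""
    if len(X_subset) == 0:
        return 0.0
    order = np.argsort(np.linalg.norm(X_subset - x_val, axis=1))
    nearest = order[: min(K, len(X_subset))]
    return float(np.sum(y_subset[nearest] == y_val)) / K
```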

LAVA: Data Valuation without Pre-Specified Learning Algorithms

ruoxi-jia-group/lava 28 Apr 2023

We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
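
As a loose illustration of the idea rather than the LAVA method itself, the sketch below scores how well a training set matches a validation set by averaging simple per-class, per-feature 1-D Wasserstein distances between the two; the function name and this simplified distance are assumptions, since LAVA relies on a class-wise optimal-transport formulation rather than per-feature marginals.

```python
# Rough sketch only: a simplified class-wise distance between a training and a
# validation set. The per-feature 1-D Wasserstein average is an illustrative
# stand-in for LAVA's optimal-transport-based proxy.
import numpy as np
from scipy.stats import wasserstein_distance

def classwise_distance(X_tr, y_tr, X_val, y_val):
    per_class = []
    for c in np.unique(y_val):
        A, B = X_tr[y_tr == c], X_val[y_val == c]
        if len(A) == 0 or len(B) == 0:
            continue  # class missing from one of the sets
        # average 1-D Wasserstein distance across feature dimensions
        per_class.append(np.mean([
            wasserstein_distance(A[:, j], B[:, j]) for j in range(A.shape[1])
        ]))
    # lower distance -> training set distribution better matches validation
    return float(np.mean(per_class))
```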

Scalable Data Point Valuation in Decentralized Learning

kpandl/Scalable-Data-Point-Valuation-in-Decentralized-Learning 1 May 2023

The valuation of data points through DDVal also allows drawing hierarchical conclusions about the contribution of institutions, and we empirically show that DDVal estimates institutional contributions more accurately than existing Shapley value approximation methods for federated learning.
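
The hierarchical step described above amounts to aggregating per-point values by the institution that contributed each point. A minimal sketch with placeholder values and names, not DDVal's actual interface:

```python
# Aggregate (already computed) per-point values into institution-level
# contributions; the dictionaries below are illustrative placeholders.
from collections import defaultdict

point_values = {"p1": 0.12, "p2": -0.03, "p3": 0.08, "p4": 0.05}
point_owner = {"p1": "hospital_A", "p2": "hospital_A",
               "p3": "hospital_B", "p4": "hospital_B"}

institution_value = defaultdict(float)
for point, value in point_values.items():
    institution_value[point_owner[point]] += value

print(dict(institution_value))  # roughly {'hospital_A': 0.09, 'hospital_B': 0.13}
```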

Data valuation: The partial ordinal Shapley value for machine learning

peizhengwang/partialordinalshapley 2 May 2023

Data valuation using the Shapley value has emerged as a prevalent research area in machine learning applications.

Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values

stephanieschoch/ts-dshapley 16 Jun 2023

Although Shapley values have been shown to be highly effective for identifying harmful training instances, dataset size and model complexity constraints limit the ability to apply Shapley-based data valuation to fine-tuning large pre-trained language models.