Data Valuation

29 papers with code • 0 benchmarks • 0 datasets

Data valuation in machine learning aims to determine the worth of data, or datasets, for downstream tasks. Some methods are task-agnostic and consider datasets as a whole, mostly to support decision making in data markets; these typically rely on distributional distances between samples. More often, methods measure how individual points affect the performance of a specific machine learning model: they assign a scalar to each element of a training set that reflects its contribution to the final performance of a model trained on it. Some notions of value depend on a specific model of interest, while others are model-agnostic.
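As a rough illustration of assigning a scalar to each training point, the sketch below computes leave-one-out values: the value of a point is the drop in validation accuracy when that point is removed. Leave-one-out is only one simple notion of value, chosen here for brevity. The nearest-centroid classifier, the toy data, and all function names are assumptions made for this example and do not come from any of the papers listed on this page.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[Tuple[float, ...], int]  # (features, label)
Utility = Callable[[Sequence[Point], Sequence[Point]], float]


def nearest_centroid_accuracy(train: Sequence[Point], valid: Sequence[Point]) -> float:
    """Hypothetical utility: validation accuracy of a nearest-centroid classifier."""
    if not train:
        return 0.0
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in train:
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x: Tuple[float, ...]) -> int:
        # Pick the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    correct = sum(1 for x, y in valid if predict(x) == y)
    return correct / len(valid)


def leave_one_out_values(
    train: Sequence[Point], valid: Sequence[Point], utility: Utility
) -> List[float]:
    """Value of point i = utility(full training set) - utility(training set without i)."""
    train = list(train)
    full = utility(train, valid)
    return [full - utility(train[:i] + train[i + 1:], valid) for i in range(len(train))]


# Toy 1-D data; the last training point sits near class 0 but is labeled 1.
train = [((0.0,), 0), ((1.0,), 0), ((9.0,), 1), ((10.0,), 1), ((2.0,), 1)]
valid = [((0.5,), 0), ((1.5,), 0), ((4.5,), 0), ((9.5,), 1), ((8.0,), 1)]

values = leave_one_out_values(train, valid, nearest_centroid_accuracy)
for (x, y), v in zip(train, values):
    print(f"x={x[0]:5.1f} label={y} value={v:+.2f}")
```

On this toy data the mislabeled point is the only one with a negative value, because removing it raises validation accuracy, while the other four points have value zero. Leave-one-out needs one retraining per training point and tends to give near-zero values on large datasets, which is part of the motivation for the other notions of value mentioned on this page.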

Notions of the usefulness of a datum, or of its influence on the outcome of a prediction, have a long history in statistics and ML, in particular through the influence function. Only recently, however, have rigorous and practical notions of value for data, and in particular for datasets, appeared in the ML literature. These are often based on concepts from cooperative game theory, but also on generalization estimates of neural networks or optimal transport theory, among others.
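As one example of the game-theoretic notions mentioned above, the Shapley value is a common starting point. The notation is an assumption introduced here for illustration: $D$ is a training set of $n$ points, $u(S)$ is a utility such as the validation performance of a model trained on the subset $S$, and $\phi_i$ is the value assigned to point $i$.

```latex
\phi_i \;=\; \frac{1}{n} \sum_{S \subseteq D \setminus \{i\}}
  \binom{n-1}{|S|}^{-1} \bigl[\, u(S \cup \{i\}) - u(S) \,\bigr]
```

Each term is the marginal contribution of point $i$ to a subset $S$, averaged first over subsets of equal size and then over sizes. The sum ranges over $2^{n-1}$ subsets, so in practice it is approximated, for example by sampling.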

Most implemented papers

2D-Shapley: A Framework for Fragmented Data Valuation

ruoxi-jia-group/2dshapley 18 Jun 2023

Data valuation -- quantifying the contribution of individual data sources to certain predictive behaviors of a model -- is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing.

Exploring Data Redundancy in Real-world Image Classification through Data Selection

zhenyutang2023/data_selection 25 Jun 2023

Deep learning models often require large amounts of data for training, leading to increased costs.

Data Valuation and Detections in Federated Learning

muz1lee/motdata 9 Nov 2023

In scenarios involving numerous data clients within FL, it is often the case that only a subset of clients and datasets are pertinent to a specific learning task, while others might have either a negative or negligible impact on the model training process.

DeRDaVa: Deletion-Robust Data Valuation for Machine Learning

snoidetx/derdava 18 Dec 2023

Data valuation is concerned with determining a fair valuation of data from data sources to compensate them or to identify training examples that are the most or least useful for predictions.

Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution

iancovert/amortized-valuation 29 Jan 2024

Many tasks in explainable machine learning, such as data valuation and feature attribution, perform expensive computation for each data point and can be intractable for large datasets.

Interpretable Machine Learning for TabPFN

david-rundel/tabpfn_iml 16 Mar 2024

The recently developed Prior-Data Fitted Networks (PFNs) have shown very promising results for applications in low-data regimes.

Neural Dynamic Data Valuation

liangzhangyong/nddv 30 Apr 2024

Data constitute the foundational component of the data economy and its marketplaces.

Data Valuation with Gradient Similarity

nathanieljevans/DVGS 13 May 2024

High-quality data is crucial for accurate machine learning and actionable analytics; however, mislabeled or noisy data is a common problem in many domains.

What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions

logix-project/logix 22 May 2024

Large language models (LLMs) are trained on a vast amount of human-written data, but data providers often remain uncredited.