Data Valuation
Data valuation in machine learning aims to determine the worth of data, or of whole datasets, for downstream tasks. Some methods are task-agnostic and consider datasets as a whole, mostly to support decision making in data markets; these typically rely on distributional distances between samples. More often, methods examine how individual points affect the performance of specific machine learning models: they assign a scalar to each element of a training set that reflects its contribution to the final performance of a model trained on it. Some concepts of value depend on a specific model of interest, while others are model-agnostic.
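One of the simplest ways to assign a scalar to each training point is leave-one-out valuation: the value of a point is the drop in validation performance when that point is removed. The sketch below is illustrative only. The nearest-centroid classifier, the toy data, and the names `utility` and `leave_one_out_values` are assumptions made for this example and do not come from any particular paper or library.

```python
import numpy as np

def utility(X_train, y_train, X_val, y_val):
    """Validation accuracy of a nearest-centroid classifier (a stand-in model).

    Assumes every class still has at least one training point.
    """
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Distance from every validation point to every class centroid.
    dists = np.linalg.norm(X_val[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y_val).mean())

def leave_one_out_values(X_train, y_train, X_val, y_val):
    """Value of point i = utility(all data) - utility(all data without i)."""
    full = utility(X_train, y_train, X_val, y_val)
    values = np.empty(len(X_train))
    for i in range(len(X_train)):
        keep = np.arange(len(X_train)) != i
        values[i] = full - utility(X_train[keep], y_train[keep], X_val, y_val)
    return values

# Toy data (hypothetical): two clusters plus one mislabeled point at index 6.
X_train = np.array([[0, 0], [0, 1], [1, 0],      # class 0
                    [4, 4], [4, 5], [5, 4],      # class 1
                    [5, 5]], dtype=float)        # lies in cluster 1, labeled 0
y_train = np.array([0, 0, 0, 1, 1, 1, 0])
X_val = np.array([[0.5, 0.5], [4.5, 4.5], [2.6, 2.6]])
y_val = np.array([0, 1, 1])

values = leave_one_out_values(X_train, y_train, X_val, y_val)
print(values)  # the mislabeled point receives the lowest (negative) value
```

In this toy setup, removing the mislabeled point improves validation accuracy, so its value is negative, while the clean points have value zero. Leave-one-out needs one retraining per point and often yields many near-zero values on real datasets, which is one motivation for the more elaborate notions of value.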
Concepts of the usefulness of a datum, or of its influence on the outcome of a prediction, have a long history in statistics and ML, in particular through the notion of the influence function. Only recently, however, have rigorous and practical notions of value for data, and in particular for datasets, appeared in the ML literature. These are often based on concepts from cooperative game theory, but also on generalization estimates of neural networks or optimal transport theory, among others.
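The best-known game-theoretic notion of value is the Shapley value, which averages the marginal contribution of a point over all orders in which the training set could be assembled. The following is a minimal sketch that computes it exactly by enumerating permutations. The coverage-based utility and the names `shapley_values` and `coverage_utility` are hypothetical and chosen only to keep the example small. Exact enumeration has factorial cost, so practical methods approximate it, for example by sampling permutations.

```python
from itertools import permutations

def shapley_values(n, utility):
    """Exact Shapley values for n points; utility maps a frozenset of indices to a float."""
    values = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        seen = set()
        prev = utility(frozenset())
        for i in order:
            seen.add(i)
            cur = utility(frozenset(seen))
            values[i] += cur - prev   # marginal contribution of i in this order
            prev = cur
    return [v / len(orders) for v in values]

# Hypothetical utility: each training point "covers" some validation examples,
# and the utility of a subset is the fraction of validation examples covered.
COVERAGE = {0: {0, 1}, 1: {0, 1}, 2: {2, 3}}   # points 0 and 1 are redundant
N_VAL = 4

def coverage_utility(subset):
    covered = set()
    for i in subset:
        covered |= COVERAGE[i]
    return len(covered) / N_VAL

phi = shapley_values(3, coverage_utility)
print(phi)
```

Here the two redundant points split the credit for the examples they both cover, while the unique point keeps all of its credit. The values also sum to the utility of the full set minus that of the empty set, which is the efficiency property of the Shapley value.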
Most implemented papers
2D-Shapley: A Framework for Fragmented Data Valuation
Data valuation -- quantifying the contribution of individual data sources to certain predictive behaviors of a model -- is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing.
Exploring Data Redundancy in Real-world Image Classification through Data Selection
Deep learning models often require large amounts of data for training, leading to increased costs.
Data Valuation and Detections in Federated Learning
In scenarios involving numerous data clients within FL, it is often the case that only a subset of clients and datasets are pertinent to a specific learning task, while others might have either a negative or negligible impact on the model training process.
DeRDaVa: Deletion-Robust Data Valuation for Machine Learning
Data valuation is concerned with determining a fair valuation of data from data sources to compensate them or to identify training examples that are the most or least useful for predictions.
Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution
Many tasks in explainable machine learning, such as data valuation and feature attribution, perform expensive computation for each data point and can be intractable for large datasets.
Interpretable Machine Learning for TabPFN
The recently developed Prior-Data Fitted Networks (PFNs) have shown very promising results for applications in low-data regimes.
Neural Dynamic Data Valuation
Data constitute the foundational component of the data economy and its marketplaces.
Data Valuation with Gradient Similarity
High-quality data is crucial for accurate machine learning and actionable analytics; however, mislabeled or noisy data is a common problem in many domains.
What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
Large language models (LLMs) are trained on a vast amount of human-written data, but data providers often remain uncredited.