Data Integration
73 papers with code • 0 benchmarks • 7 datasets
Data integration (also called information integration) is the process of consolidating data from a set of heterogeneous data sources into a single uniform dataset (materialized integration) or into a unified view of the data (virtual integration). Data integration pipelines involve subtasks such as schema matching, table annotation, entity resolution, value normalization, data cleansing, and data fusion. Application domains of data integration include data warehousing, data lakes, and knowledge base consolidation.
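The pipeline subtasks above can be illustrated with a minimal sketch. The records, field names, normalization rules, and similarity threshold below are all illustrative assumptions, not a reference implementation; real systems use dedicated entity resolution and fusion frameworks.

```python
from difflib import SequenceMatcher

# Two hypothetical heterogeneous sources describing overlapping entities.
# Field names differ ("company" vs. "name") -- a toy schema matching problem.
source_a = [{"company": "Acme Corp.", "city": "Berlin"},
            {"company": "Globex GmbH", "city": "Munich"}]
source_b = [{"name": "ACME Corporation", "country": "DE"},
            {"name": "Initech", "country": "US"}]

def normalize(s):
    # Value normalization: lowercase, strip punctuation and legal suffixes.
    s = s.lower().replace(".", "").replace(",", "")
    for suffix in (" corp", " corporation", " gmbh", " inc"):
        s = s.removesuffix(suffix)
    return s.strip()

def is_match(a, b, threshold=0.8):
    # Entity resolution: string similarity on normalized names.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# Data fusion: merge each matched pair into one uniform record
# (a materialized integration of the two sources).
integrated = []
for rec_a in source_a:
    fused = dict(rec_a)
    for rec_b in source_b:
        if is_match(rec_a["company"], rec_b["name"]):
            fused.update({k: v for k, v in rec_b.items() if k != "name"})
    integrated.append(fused)

print(integrated)
```

Here "Acme Corp." and "ACME Corporation" resolve to the same entity after normalization, so the fused record carries fields from both sources, while the unmatched records pass through unchanged.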
Libraries
Use these libraries to find Data Integration models and implementations.
Most implemented papers
Leveraging Legacy Data to Accelerate Materials Design via Preference Learning
Machine learning applications in materials science are often hampered by shortage of experimental data.
Elastic Coupled Co-clustering for Single-Cell Genomic Data
Recent advances in single-cell technologies have enabled us to profile genomic features at unprecedented resolution, and datasets from multiple domains are now available, including datasets that profile different types of genomic features and datasets that profile the same type of genomic features across different species.
A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning
By combining various cancer cell line (CCL) drug screening panels, the size of the available data has grown significantly, making it possible to begin studying how advances in deep learning can improve drug response predictions.
The scalable Birth-Death MCMC Algorithm for Mixed Graphical Model Learning with Application to Genomic Data Integration
Recent advances in biological research have seen the emergence of high-throughput technologies with numerous applications that allow the study of biological mechanisms at an unprecedented depth and scale.
Consistent and Flexible Selectivity Estimation for High-Dimensional Data
Selectivity estimation aims at estimating the number of database objects that satisfy a selection criterion.
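The definition above can be made concrete with a small sketch: compare the exact selectivity of a predicate against a sampling-based estimate. The synthetic table, predicate, and sample size are illustrative assumptions; production systems typically rely on histograms or learned estimators rather than direct sampling.

```python
import random

random.seed(0)
# Hypothetical table of 10,000 integer attribute values.
table = [random.randint(0, 999) for _ in range(10_000)]

# Selection criterion: value < 100.
def predicate(v):
    return v < 100

# Exact selectivity: fraction of rows satisfying the predicate.
exact = sum(predicate(v) for v in table) / len(table)

# Sampling-based estimate: evaluate the predicate on a uniform sample only.
sample = random.sample(table, 500)
estimate = sum(predicate(v) for v in sample) / len(sample)

print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

Multiplying the estimated selectivity by the table cardinality yields the estimated number of qualifying objects, which is what a query optimizer consumes.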
An Empirical Meta-analysis of the Life Sciences (Linked?) Open Data on the Web
While the biomedical community has published several "open data" sources in the last decade, most researchers still endure severe logistical and technical challenges to discover, query, and integrate heterogeneous data and knowledge from multiple sources.
Kernel learning approaches for summarising and combining posterior similarity matrices
Here we build upon the notion of the posterior similarity matrix (PSM) in order to suggest new approaches for summarising the output of MCMC algorithms for Bayesian clustering models.
SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization
Most previous works focus on binary DDI prediction, whereas multi-typed DDI pharmacological effect prediction is a more meaningful but harder task.
BayReL: Bayesian Relational Learning for Multi-omics Data Integration
High-throughput molecular profiling technologies have produced high-dimensional multi-omics data, enabling systematic understanding of living systems at the genome scale.
Profiling Entity Matching Benchmark Tasks
In order to enable the exact reproducibility of evaluation results, matching tasks need to contain exactly defined sets of matching and non-matching record pairs, as well as a fixed development and test split.