PubChem18 (PubChem 2018)

A.2.1 AN OPEN, LARGE-SCALE DATASET FOR ZERO-SHOT DRUG DISCOVERY DERIVED FROM PUBCHEM

We constructed a large public dataset extracted from PubChem (Kim et al., 2019; Preuer et al., 2018), an open chemistry database and the largest collection of readily available chemical data. We take assays ranging from 2004 to 2018-05. The raw extract comprises 224,290,250 molecule-bioassay activity records, corresponding to 2,120,854 unique molecules and 21,003 unique bioassays. Some molecule-bioassay pairs have multiple activity records, which may not all agree. We reduce every molecule-bioassay pair to exactly one activity measurement by majority voting and discard pairs whose votes tie. This step yields our final bioactivity dataset, which features 223,219,241 molecule-bioassay activity records, corresponding to 2,120,811 unique molecules and 21,002 unique bioassays ranging from AID 1 to AID 1259411. Molecule identifiers range up to CID 132472079. The dataset provides three different splitting schemes.
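The majority-voting step described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `majority_vote`, the `(cid, aid, label)` tuple layout, and the binary 0/1 labels are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def majority_vote(records):
    """Collapse repeated (molecule, assay) records to one label each.

    `records` is an iterable of (cid, aid, label) tuples (hypothetical
    layout). Pairs whose most frequent labels tie are dropped.
    """
    # Count how often each label occurs for every molecule-bioassay pair.
    votes = defaultdict(Counter)
    for cid, aid, label in records:
        votes[(cid, aid)][label] += 1

    resolved = {}
    for pair, counts in votes.items():
        top = counts.most_common(2)
        # A tie exists when the two most frequent labels have equal counts.
        if len(top) == 2 and top[0][1] == top[1][1]:
            continue  # discard tied pairs
        resolved[pair] = top[0][0]
    return resolved


# Toy records: pair (1, 10) has a 2:1 majority, pair (2, 10) ties,
# and pair (3, 11) has a single record.
example = [
    (1, 10, 1), (1, 10, 1), (1, 10, 0),
    (2, 10, 0), (2, 10, 1),
    (3, 11, 0),
]
result = majority_vote(example)
```

In this toy input the tied pair is removed entirely, which mirrors how the record count in the description drops after deduplication.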

License


  • Unknown
