All SMILES Variational Autoencoder
Variational autoencoders (VAEs) defined over SMILES-string and graph-based representations of molecules promise to improve the optimization of molecular properties, thereby revolutionizing the pharmaceutical and materials industries. However, these VAEs are hindered by the non-unique nature of SMILES strings and the computational cost of graph convolutions. To efficiently pass messages along all paths through the molecular graph, we encode multiple SMILES strings of a single molecule using a set of stacked recurrent neural networks, pooling hidden representations of each atom between SMILES representations, and use attentional pooling to build a final fixed-length latent representation. By then decoding to a disjoint set of SMILES strings of the molecule, our All SMILES VAE learns a nearly bijective mapping between molecules and latent representations near the high-probability-mass subspace of the prior. Our SMILES-derived but molecule-based latent representations significantly surpass the state of the art in a variety of fully and semi-supervised property regression and molecular property optimization tasks.
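The encoding pipeline described above (per-atom hidden states from several SMILES variants of one molecule, pooled between variants, then attentionally pooled into a fixed-length latent) can be sketched in numpy. This is a minimal illustration, not the paper's implementation: the function names, the use of mean pooling across SMILES variants, and the single learned query vector `w` are all assumptions for clarity.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def pool_across_smiles(hidden_per_smiles):
    # hidden_per_smiles: list of (n_atoms, d) arrays, one per SMILES string
    # of the same molecule, with atoms aligned to a common ordering.
    # Mean pooling is an assumption; it merges information from the
    # different SMILES traversals of the molecular graph.
    return np.mean(hidden_per_smiles, axis=0)

def attention_pool(H, w):
    # H: (n_atoms, d) pooled atom representations; w: (d,) query vector
    # (hypothetical stand-in for learned attention parameters).
    a = softmax(H @ w)   # one attention weight per atom, summing to 1
    return a @ H         # fixed-length (d,) summary, independent of n_atoms

# Toy example: 5 atoms, hidden size 8, two SMILES variants of one molecule.
rng = np.random.default_rng(0)
h1 = rng.normal(size=(5, 8))   # atom states from SMILES variant 1
h2 = rng.normal(size=(5, 8))   # atom states from SMILES variant 2
H = pool_across_smiles([h1, h2])
z = attention_pool(H, rng.normal(size=8))
print(z.shape)  # (8,) — fixed-length regardless of molecule size
```

The key property this sketch demonstrates is that the output length is fixed by the hidden dimension, not the number of atoms or the length of any particular SMILES string.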
Results from the Paper
Ranked #1 on Molecular Graph Generation on ZINC (QED Top-3 metric)
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Drug Discovery | Tox21 | SSVAE with multiple SMILES | AUC | 0.871 | #3 |
| Molecular Graph Generation | ZINC | All SMILES VAE | Validity | 98.5 | #9 |
| Molecular Graph Generation | ZINC | All SMILES VAE | QED Top-3 | 0.948, 0.948, 0.948 | #1 |
| Molecular Graph Generation | ZINC | All SMILES VAE | PlogP Top-3 | 29.80, 29.76, 29.11 | #1 |
| Molecular Graph Generation | ZINC | All SMILES VAE | Function evaluations | 250500 | #14 |
| Molecular Graph Generation | ZINC | All SMILES VAE | Uniqueness | 100 | #1 |
| Molecular Graph Generation | ZINC | All SMILES VAE | Novelty | 99.96 | #3 |