Unsupervised Learning of Molecular Embeddings for Enhanced Clustering and Emergent Properties for Chemical Compounds

The detailed analysis of molecular structures and properties holds great potential for drug development discovery through machine learning. Developing an emergent property in the model to understand molecules would broaden the horizons for development with a new computational tool. We introduce various methods to detect and cluster chemical compounds based on their SMILES data. Our first method, analyzing the graphical structures of chemical compounds using embedding data, employs vector search to meet our threshold value. The results yielded pronounced, concentrated clusters, and the method produced favorable results in querying and understanding the compounds. We also used natural language description embeddings stored in a vector database with GPT3.5, which outperforms the base model. Thus, we introduce a similarity search and clustering algorithm to aid in searching for and interacting with molecules, enhancing efficiency in chemical exploration and enabling future development of emergent properties in molecular property prediction models.

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods