BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

11 Oct 2023 · Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, Rui Yan

Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose BioT5, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. BioT5 uses SELFIES to guarantee 100% valid molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, BioT5 distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing the underlying relations and properties of bio-entities. Our code is available at https://github.com/QizhiPei/BioT5.
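To make the validity claim concrete, here is a minimal sketch (not taken from the BioT5 codebase) using the open-source `selfies` and `rdkit` Python packages: every well-formed SELFIES string decodes to a chemically valid molecule, so a generator that emits SELFIES tokens cannot produce an invalid structure.

```python
# Minimal sketch (assumption: illustrative, not BioT5's actual code) of why
# SELFIES yields 100% valid molecules.
import selfies as sf
from rdkit import Chem

# Round-trip a molecule: SMILES -> SELFIES -> SMILES.
smiles = "C1=CC=CC=C1"        # benzene
tokens = sf.encoder(smiles)   # e.g. "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
decoded = sf.decoder(tokens)  # decoding always produces a valid SMILES string

# Any syntactically well-formed SELFIES string decodes to a valid molecule,
# unlike raw SMILES, where a single misplaced token breaks parsing.
assert Chem.MolFromSmiles(decoded) is not None
print(decoded)
```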

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Text-based de novo Molecule Generation | ChEBI-20 | BioT5 | Text2Mol | 57.6 | #8 |
| | | | BLEU | 86.7 | #2 |
| | | | Exact Match | 41.3 | #2 |
| | | | Levenshtein | 15.097 | #15 |
| | | | MACCS FTS | 88.6 | #4 |
| | | | RDK FTS | 80.1 | #4 |
| | | | Morgan FTS | 73.4 | #5 |
| | | | Frechet ChemNet Distance (FCD) | 0.43 | #7 |
| | | | Validity | 100 | #1 |
| | | | Parameter Count | 252M | #13 |
| Molecule Captioning | ChEBI-20 | BioT5 | BLEU-2 | 63.5 | #2 |
| | | | BLEU-4 | 55.6 | #2 |
| | | | ROUGE-1 | 69.2 | #2 |
| | | | ROUGE-2 | 55.9 | #2 |
| | | | ROUGE-L | 63.3 | #2 |
| | | | METEOR | 65.6 | #2 |
| | | | Text2Mol | 60.3 | #1 |
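For reference, the fingerprint Tanimoto similarity (FTS) metrics in the table compare each generated molecule against the ground truth under three fingerprint types. Below is a minimal RDKit sketch (the example SMILES are hypothetical placeholders, and this is not the paper's evaluation script):

```python
# Minimal sketch of the MACCS / RDK / Morgan FTS metrics: Tanimoto
# similarity between fingerprints of a reference and a generated molecule.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys

ref = Chem.MolFromSmiles("CCO")  # ground-truth molecule (placeholder)
gen = Chem.MolFromSmiles("CCN")  # model output (placeholder)

fps = {
    "MACCS FTS":  (MACCSkeys.GenMACCSKeys(ref), MACCSkeys.GenMACCSKeys(gen)),
    "RDK FTS":    (Chem.RDKFingerprint(ref), Chem.RDKFingerprint(gen)),
    "Morgan FTS": (AllChem.GetMorganFingerprintAsBitVect(ref, 2, nBits=2048),
                   AllChem.GetMorganFingerprintAsBitVect(gen, 2, nBits=2048)),
}
for name, (a, b) in fps.items():
    print(name, DataStructs.TanimotoSimilarity(a, b))
```

In the benchmark these similarities are averaged over all test pairs, so higher values mean the generated molecules are structurally closer to the references.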
