Unifying Molecular and Textual Representations via Multi-task Language Modelling

Recent advances in neural language models have been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, complicating and limiting human-machine interaction. Here, we propose the first multi-domain, multi-task language model that can solve a wide range of tasks in both the chemical and natural language domains. Our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increases with scale, as measured by more than a dozen relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.
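
One common way to let a single sequence-to-sequence model serve several tasks across two domains is to cast every task as text in, text out, and to mark the task with a short prompt prefix. The sketch below illustrates only that general idea. The task names, prefix strings, and helper functions are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One text-to-text training pair for a shared encoder-decoder model."""
    source: str
    target: str


# Hypothetical task prefixes; the prompts used by the authors are not given here.
TASK_PROMPTS = {
    "mol2text": "Caption the following SMILES: ",
    "text2mol": "Write in SMILES the described molecule: ",
}


def make_example(task: str, source: str, target: str) -> Example:
    """Prepend the task prefix so one model can tell the tasks apart."""
    if task not in TASK_PROMPTS:
        raise KeyError(f"unknown task: {task}")
    return Example(source=TASK_PROMPTS[task] + source, target=target)


def build_mixture(records):
    """Turn (task, source, target) triples into one multi-task training list."""
    return [make_example(task, src, tgt) for task, src, tgt in records]


if __name__ == "__main__":
    mixture = build_mixture([
        ("mol2text", "CCO", "A primary alcohol."),
        ("text2mol", "A primary alcohol.", "CCO"),
    ])
    for example in mixture:
        print(example.source, "->", example.target)
```

Because both directions share one vocabulary and one set of weights in such a setup, a pair like the one above can be used twice, once per direction.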

Results from the Paper


Each cell gives the metric value followed by its global rank in parentheses.

Task: Molecule Captioning, Dataset: ChEBI-20

Metric    Text+Chem T5-augm-Base  Text+Chem T5-Base  Text+Chem T5-augm-Small  Text+Chem T5-Small
BLEU-2    62.5 (#3)               58 (#11)           56.0 (#13)               55.3 (#14)
BLEU-4    54.2 (#3)               49 (#11)           47.0 (#13)               46.2 (#14)
ROUGE-1   68.2 (#3)               64.7 (#9)          63.8 (#10)               63.3 (#13)
ROUGE-2   54.3 (#3)               49.8 (#10)         48.8 (#11)               48.1 (#13)
ROUGE-L   62.2 (#3)               58.6 (#9)          58 (#11)                 57.4 (#13)
METEOR    64.8 (#4)               60.4 (#11)         58.8 (#13)               58.3 (#14)

Task: Text-based de novo Molecule Generation, Dataset: ChEBI-20

Metric                          Text+Chem T5-augm base  Text+Chem T5 base  Text+Chem T5-augm small  Text+Chem T5 small
BLEU                            85.3 (#5)               75 (#16)           81.5 (#9)                73.9 (#17)
Exact Match                     32.2 (#3)               21.2 (#10)         19.1 (#12)               15.7 (#14)
Levenshtein                     16.87 (#12)             27.39 (#2)         21.78 (#6)               28.54 (#1)
MACCS FTS                       90.1 (#3)               87.4 (#5)          86.4 (#8)                85.9 (#9)
RDK FTS                         81.6 (#2)               76.7 (#7)          74.4 (#10)               73.6 (#12)
Morgan FTS                      75.7 (#3)               69.7 (#9)          67.2 (#12)               66 (#14)
Frechet ChemNet Distance (FCD)  0.05 (#1)               0.061 (#3)         0.06 (#2)                0.066 (#4)
Validity                        94.3 (#6)               79.2 (#14)         95.1 (#5)                77.6 (#16)
Parameter Count                 220000000 (#10)         220000000 (#10)    60000000 (#5)            60000000 (#5)
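
The two string-level metrics reported for molecule generation, Exact Match and Levenshtein, can be sketched in plain Python as follows. This is a minimal illustration, not the benchmark's evaluation script: it assumes predictions and references are compared as raw SMILES strings (an evaluation pipeline may canonicalize them first), and the function names are made up for this example. The fingerprint similarities (MACCS, RDK, Morgan FTS), validity, and FCD need a cheminformatics toolkit and are not covered here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def _check_lengths(predictions, references):
    if not predictions or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")


def exact_match_rate(predictions, references) -> float:
    """Percentage of predictions identical to their reference string."""
    _check_lengths(predictions, references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


def mean_levenshtein(predictions, references) -> float:
    """Average edit distance between each prediction and its reference."""
    _check_lengths(predictions, references)
    total = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return total / len(predictions)


if __name__ == "__main__":
    predicted = ["CCO", "CC"]
    reference = ["CCO", "CN"]
    print(exact_match_rate(predicted, reference))
    print(mean_levenshtein(predicted, reference))
```

Under these definitions a higher Exact Match and a lower mean Levenshtein distance both indicate generated strings closer to the references.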
