ChEBI-20

Introduced by Edwards et al. in Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries

The dataset contains 33,010 molecule-description pairs, split 80%/10%/10% into train/val/test sets. The goal of the task is to retrieve the relevant molecule for a natural language description. It is defined as follows:

To push the boundaries of multimodal models, we present a new IR task: Text2Mol.

Given a text query and a list of molecules without any reference textual information (represented, for example, as SMILES strings, graphs, or other equivalent representations), retrieve the molecule corresponding to the query. From a text description of a molecule, the model must incorporate the information in the description into a semantic representation which can be used to directly retrieve the molecule. This requires the integration of two very different types of information: the structured knowledge represented by text and the chemical properties present in molecular graphs. We assume there is only one correct (relevant) molecule for each description, so we consider two measures for this task: Hits@1 and mean reciprocal rank (MRR).
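The two measures can be illustrated with a minimal sketch. It assumes each query has already been reduced to the 1-based rank of its single correct molecule; the function names and the example ranks are hypothetical and are not taken from the Text2Mol code.

```python
def hits_at_k(ranks, k=1):
    """Fraction of queries whose correct molecule is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the correct molecule for four text queries.
ranks = [1, 2, 1, 4]

hits1 = hits_at_k(ranks, k=1)       # 2 of 4 queries rank the target first
mrr = mean_reciprocal_rank(ranks)   # (1 + 1/2 + 1 + 1/4) / 4
```

Because each description has exactly one relevant molecule, Hits@1 counts only exact top-ranked matches, while MRR also gives partial credit when the target appears just below the top.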

80% of the data is used for training. Retrieval is done against the entire corpus of molecules (train, val, test).
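Ranking against the entire corpus means that a test query competes with molecules from all three splits, not only the test molecules. The sketch below shows one way this could look, assuming text and molecule embeddings are already computed and scored by cosine similarity; the scoring function, identifiers, and toy vectors are illustrative assumptions, not the dataset's reference implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_of_target(query_vec, corpus_vecs, target_id):
    """1-based rank of target_id when the whole corpus is sorted by similarity."""
    scores = {mol_id: cosine(query_vec, vec) for mol_id, vec in corpus_vecs.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target_id) + 1


# Toy corpus mixing molecules from all three splits (hypothetical embeddings).
corpus = {
    "train_mol": [1.0, 0.0],
    "val_mol": [0.0, 1.0],
    "test_mol": [0.6, 0.8],
}

# Hypothetical embedding of a test-set description whose target is "test_mol".
query = [0.5, 0.9]
rank = rank_of_target(query, corpus, "test_mol")
```

The resulting ranks, one per query, are the input that Hits@1 and MRR are computed from.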

Benchmarks

Task	Dataset Variant	Best Model
Molecule Captioning	ChEBI-20	BioT5+
Text-based de novo Molecule Generation	ChEBI-20	BioT5+
Cross-Modal Retrieval	ChEBI-20	All-Ensemble
Image Captioning	ChEBI-20	GIT-Mol