We introduce USPTO-30K, a large-scale benchmark dataset of annotated molecule images, which overcomes these limitations. It is built from the pairs of images and MolFiles provided by the United States Patent and Trademark Office (USPTO). Each molecule was independently selected from all the available documents from 2001 to 2020. The dataset consists of three subsets, which decouple the study of clean molecules, molecules with abbreviations, and large molecules.

USPTO-10K contains 10,000 clean molecules, i.e. without any abbreviated groups. USPTO-10K-abb contains 10,000 molecules with superatom groups. USPTO-10K-L contains 10,000 clean molecules with more than 70 atoms.
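
The subset definitions above can be read as a simple routing rule over a MolFile. The sketch below is illustrative only and is not the procedure used to build the dataset. It assumes V2000 MolFiles, assumes that an abbreviated group appears as an Sgroup of type SUP on an "M  STY" line, and assumes that the atom count is the one stored in the counts line (whether hydrogens count toward the 70-atom threshold is not stated in the description). The function names and the sample records are hypothetical.

```python
# Illustrative sketch: route a V2000 MolFile to one of the three subsets.
# Assumptions (not confirmed by the dataset description):
#   - abbreviations are stored as SUP Sgroups ("M  STY" lines),
#   - "more than 70 atoms" refers to the atom count in the counts line.

def count_atoms(molfile: str) -> int:
    """Read the atom count from the counts line (4th line) of a V2000 MolFile."""
    lines = molfile.splitlines()
    if len(lines) < 4 or "V2000" not in lines[3]:
        raise ValueError("expected a V2000 MolFile")
    return int(lines[3][0:3])  # first 3 columns hold the number of atoms


def has_superatom(molfile: str) -> bool:
    """Return True if any Sgroup is declared with type SUP (superatom)."""
    for line in molfile.splitlines():
        if line.startswith("M  STY"):
            tokens = line.split()
            # tokens: M, STY, <count>, then (index, type) pairs
            if "SUP" in tokens[4::2]:
                return True
    return False


def assign_subset(molfile: str, large_threshold: int = 70) -> str:
    """Hypothetical routing rule mirroring the three subset definitions."""
    if has_superatom(molfile):
        return "USPTO-10K-abb"
    if count_atoms(molfile) > large_threshold:
        return "USPTO-10K-L"
    return "USPTO-10K"


# Minimal hand-written records, used only to exercise the functions.
ETHANOL = "\n".join([
    "ethanol",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])

# Same record with a superatom Sgroup declared before the terminator.
ETHANOL_ABB = ETHANOL.replace("M  END", "M  STY  1   1 SUP\nM  END")

if __name__ == "__main__":
    print(assign_subset(ETHANOL))
    print(assign_subset(ETHANOL_ABB))
```

Checking for superatoms first reflects the description of USPTO-10K-L as containing only clean molecules. Whether USPTO-10K itself excludes molecules above the threshold is not stated, so the last branch is an assumption as well.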

License


  • CDLA-Permissive-1.0
