Overview

This dataset was created from one of Maybridge's catalogues for drug design and discovery. Pages containing molecule diagrams were scanned at 600x600 dpi. A bespoke tool automatically clipped each structure together with its corresponding CAS number. The CAS numbers were used to look up InChI identifiers in online databases, and the InChI identifiers were converted into their corresponding MOL files using OpenBabel.
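The last step of this pipeline, converting an InChI identifier into a MOL file, could be reproduced roughly as follows. This is only a sketch and not the authors' actual script: it assumes the `obabel` command-line program is installed and on the PATH, and the function names `build_obabel_command` and `convert_inchi_file` are hypothetical helpers introduced here for illustration.

```python
import shutil
import subprocess


def build_obabel_command(inchi_path, mol_path):
    """Build an obabel command that reads an InChI file and writes a MOL file.

    -iinchi selects the input format, -omol the output format,
    and -O names the output file.
    """
    return ["obabel", "-iinchi", str(inchi_path), "-omol", "-O", str(mol_path)]


def convert_inchi_file(inchi_path, mol_path):
    """Run the conversion (assumes obabel is installed; hypothetical helper)."""
    if shutil.which("obabel") is None:
        raise RuntimeError("obabel executable not found on PATH")
    subprocess.run(build_obabel_command(inchi_path, mol_path), check=True)
```

An InChI identifier carries no atom coordinates, so a conversion like this yields connectivity only unless coordinates are generated as well (for example with obabel's `--gen2d` option). The description above does not say whether that was done for this dataset.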

  • Copyright: The dataset can be downloaded free of copyright restrictions (permission has been obtained from Maybridge).
  • We ask only, as a professional courtesy, that you cite this paper and acknowledge the source (http://www.cs.bham.ac.uk/research/groupings/reasoning/sdag/chemical.php) if you use the dataset in published work.
  • Click here to download the dataset.
  • The pages were scanned as RGB images and thresholded using Otsu's method.
  • The pages were scanned at a resolution of 600x600 dpi.
  • Very small connected components have not been removed.
  • The dataset has 5740 tif images of molecule structures and 5740 corresponding MOL files.
  • An image file and its corresponding MOL file have identical names (with different extensions).
  • Every file name looks like this: maybridge-xxxx-xxxxxxxxx.ext
    • The first group of x's contains the catalogue's page number
    • The second group of x's contains a random unique identifier
    • The extension is either .tif (for an image file) or .mol (for a MOL file)
  • A sample catalogue page and its extracted structures can be found below.
  • Click on any thumbnail to view the original image. Note: the catalogue page image is ~72 MB.
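The file naming convention described in the list above can be used to parse names and to pair each image with its MOL file. The sketch below is one possible way to do this. It assumes that the page number consists of digits and that the identifier contains no hyphen or dot; the names `FILENAME_RE`, `parse_name` and `pair_files` are hypothetical and not part of the dataset.

```python
import re
from pathlib import Path

# Assumed pattern: maybridge-<page digits>-<identifier>.<tif|mol>
FILENAME_RE = re.compile(
    r"^maybridge-(?P<page>\d+)-(?P<uid>[^-.]+)\.(?P<ext>tif|mol)$"
)


def parse_name(name):
    """Return (page, identifier, extension), or None if the name does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return int(match["page"]), match["uid"], match["ext"]


def pair_files(directory):
    """Pair .tif and .mol files that share the same base name.

    Returns (paired, unpaired): a dict mapping base name to
    (image path, MOL path), and a sorted list of base names
    that are missing one of the two files.
    """
    directory = Path(directory)
    tifs = {p.stem: p for p in directory.glob("maybridge-*.tif")}
    mols = {p.stem: p for p in directory.glob("maybridge-*.mol")}
    paired = {
        stem: (tifs[stem], mols[stem])
        for stem in sorted(tifs.keys() & mols.keys())
    }
    unpaired = sorted(tifs.keys() ^ mols.keys())
    return paired, unpaired
```

On a complete copy of the dataset, `pair_files` would be expected to return 5740 pairs and an empty list of unpaired names, in line with the counts given above.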
Sample Catalogue Page
[Images: thumbnail of the sample catalogue page, followed by thumbnails of its extracted structures]
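For readers unfamiliar with Otsu's method, which the list above names as the thresholding step, the following is a minimal pure-Python sketch of the standard algorithm. It is not the code used to build the dataset. It assumes the RGB scan has already been reduced to 8-bit grey levels (how that was done is not stated here), and the function names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(histogram):
    """Return the grey level t that maximises the between-class variance.

    histogram[i] is the number of pixels with grey level i.
    Pixels with value <= t form one class, pixels with value > t the other.
    """
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram is empty")
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_t, best_score = 0, -1.0
    weight_low = 0  # number of pixels with value <= t
    sum_low = 0     # sum of grey levels of those pixels
    for t, count in enumerate(histogram):
        weight_low += count
        sum_low += t * count
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue  # one class is empty, the variance is undefined
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2
        score = weight_low * weight_high * (mean_low - mean_high) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(pixels, threshold):
    """Mark dark pixels (value <= threshold) as True, all others as False."""
    return [[value <= threshold for value in row] for row in pixels]
```

A typical use would build a 256-bin histogram from the grey-level image, call `otsu_threshold` on it, and pass the result to `binarize`.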