Noureddin Sadawi - Molecule Recognition

The molecule recognition task, also called image to MOL, is the process of turning an image of a chemical structure diagram into a computer-processable format. Molecule diagrams are widely used to illustrate how atoms and bonds are connected in real molecular structures. They can be drawn in complicated ways, and in many cases a molecule image is not straightforward to interpret, even by eye.

For example, the figure below shows a molecule image with its corresponding MOL file. The shape between the O, N and Br atoms is actually a hexagon with two pentagons. Each pentagon shares 3 sides with the hexagon and 2 sides with the other pentagon. This may be slightly confusing, especially once you discover that the vertical line in the middle and the diagonal line going towards the O atom are not actually connected. That is called a bridge bond, and it makes this pattern sort of 2 1/2 dimensional!

Image to MOL file
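To make "computer processable" concrete, here is a minimal sketch of reading the atom and bond tables out of a MOL file (V2000 layout, which uses fixed-width columns). The molecule below is a hand-written illustration (the heavy atoms of acetic acid), not the molecule in the figure, and the names `MOL_TEXT` and `parse_molfile` are hypothetical.

```python
# Hypothetical example: acetic acid, heavy atoms only. NOT the molecule in the figure.
MOL_TEXT = "\n".join([
    "acetic acid (illustrative)",   # header line 1: molecule name
    "  hand-written sketch",        # header line 2: program/timestamp info
    "",                             # header line 3: comment
    "  4  3  0  0  0  0  0  0  0  0999 V2000",  # counts line: 4 atoms, 3 bonds
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500   -1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",                 # bond: atom 1, atom 2, type 1 (single)
    "  2  3  2  0",                 # bond: atom 2, atom 3, type 2 (double)
    "  2  4  1  0",
    "M  END",
])


def parse_molfile(text):
    """Return (atom symbols, bonds) from a V2000 MOL block.

    Each bond is a tuple (first atom index, second atom index, bond type),
    with 1-based atom indices as stored in the file.
    """
    lines = text.splitlines()
    counts = lines[3]                      # fourth line holds the counts
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atom_lines = lines[4:4 + n_atoms]
    # x, y, z occupy 10 columns each; the element symbol follows one space later.
    atoms = [ln[31:34].strip() for ln in atom_lines]

    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    bonds = [(int(ln[0:3]), int(ln[3:6]), int(ln[6:9])) for ln in bond_lines]
    return atoms, bonds


if __name__ == "__main__":
    atoms, bonds = parse_molfile(MOL_TEXT)
    for first, second, bond_type in bonds:
        print(atoms[first - 1], atoms[second - 1], "bond type", bond_type)
```

A recognition system has to produce exactly this kind of table from pixels: which atoms exist, which pairs are connected, and with what bond type.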

Molecule images:

Notice that there is a huge number of existing molecules, and the number keeps increasing by thousands every single day. The complexity of molecule images can therefore range from simple, such as this, to moderate, such as this, or complex, such as this and this.

Benchmark Dataset:

I have recently created a benchmark dataset of molecule structure images and their corresponding MOL files. This is, as far as I am aware, the largest freely available dataset of its kind. It contains 5740 images and can be downloaded from here.


Bond Types
Single Bond
Double Bond
Triple Bond
Wedge Bond
Bold Bond
Hollow Wedge Bond
Dashed Wedge Bond
Dashed Bold Bond
Dashed Bond
Wavy Bond
Dative Bond
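
As a rough illustration of how these drawing styles relate to the MOL format, the sketch below maps some of them to the bond type and stereo fields of a V2000 bond line. The enum `DrawnBond`, the table `V2000_FIELDS` and the function `bond_line` are hypothetical names. Only the styles with a well-known V2000 encoding are mapped; the rest are deliberately left unmapped here, which is an assumption of this sketch and not a statement about how any particular tool handles them.

```python
from enum import Enum


class DrawnBond(Enum):
    """The drawing styles listed above."""
    SINGLE = "single bond"
    DOUBLE = "double bond"
    TRIPLE = "triple bond"
    WEDGE = "wedge bond"
    BOLD = "bold bond"
    HOLLOW_WEDGE = "hollow wedge bond"
    DASHED_WEDGE = "dashed wedge bond"
    DASHED_BOLD = "dashed bold bond"
    DASHED = "dashed bond"
    WAVY = "wavy bond"
    DATIVE = "dative bond"


# (bond type, stereo flag) as written in a V2000 bond line.
# Bond type: 1 = single, 2 = double, 3 = triple.
# Stereo flag on a single bond: 0 = none, 1 = up, 4 = either, 6 = down.
V2000_FIELDS = {
    DrawnBond.SINGLE: (1, 0),
    DrawnBond.DOUBLE: (2, 0),
    DrawnBond.TRIPLE: (3, 0),
    DrawnBond.WEDGE: (1, 1),         # solid wedge: pointing up
    DrawnBond.DASHED_WEDGE: (1, 6),  # dashed wedge: pointing down
    DrawnBond.WAVY: (1, 4),          # wavy: either direction
}


def bond_line(first_atom, second_atom, style):
    """Format a V2000 bond line, or return None for styles left unmapped here."""
    fields = V2000_FIELDS.get(style)
    if fields is None:
        return None
    bond_type, stereo = fields
    # Each field is a right-aligned integer in a 3-character column.
    return f"{first_atom:3d}{second_atom:3d}{bond_type:3d}{stereo:3d}"


if __name__ == "__main__":
    for style in DrawnBond:
        print(style.value, "->", bond_line(1, 2, style))
```

The point is that a recognizer must tell these styles apart in the image, because several of them end up as different values in the output file.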