Chemical Document Analysis

Our current work in the field of Chemical Document Analysis is Chemical Structure Recognition. This process involves reading in a bitmap image of a chemical molecule and generating an equivalent textual, or tabular, representation. We have recently designed and developed a strict rule-based approach to identify different bond patterns and formations. The output we generate is a MOL file.

Molecule diagrams are widely used to illustrate the connectivity of atoms and bonds in real-life molecular structures. The way they are drawn can be complicated, and in many cases it is not straightforward to interpret molecule images even visually.

For example, the figure below shows a molecule image with its corresponding MOL file. The shape between the O, N and Br atoms is actually a hexagon with two pentagons. Each pentagon shares 3 sides with the hexagon and 2 sides with the other pentagon. This may be slightly confusing, especially once you notice that the vertical line in the middle and the diagonal line going towards the O atom are not actually connected. That is called a bridge bond, and it makes this pattern, in a sense, 2½-dimensional!

Image to MOL file

The number of known chemical structures is huge, and it keeps increasing by thousands every single day. As a result, molecule images range in complexity from simple through moderate to very complex. Also, because chemists themselves have no consensus on how certain bond formations should be depicted or on how some depictions should be interpreted, correctly interpreting a given chemical structure diagram can be challenging, and this is why we do it!

Benchmark Dataset:

We have recently created a benchmark dataset of molecule structure images and their corresponding MOL files. As far as we are aware, this is the largest freely available molecule dataset, with 5740 images. More information about this dataset can be found here.