My research is concerned with recognising and parsing mathematics within scientific documents, with the aim of automatically producing markup in a language such as MathML.

I mainly work with the ubiquitous PDF format, for which I have written a tool in OCaml that reads the files and extracts the font and character information they contain. This bypasses the need for Optical Character Recognition, which is notoriously difficult when dealing with mathematics. The results are used to produce perfect input for our formula parser, which is based upon a coordinate grammar and has various plugins for producing different output, such as LaTeX and MathML.
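
As a rough illustration of how character and font information with coordinates can drive formula parsing, the sketch below classifies each glyph as inline, superscript or subscript from its baseline shift and font size, then emits LaTeX. It is written in Python for brevity and is not the OCaml tool or the coordinate grammar described above; the `Glyph` record, the `relation` and `to_latex` functions, and the 0.2 tolerance are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical record for one character extracted from a PDF."""
    char: str        # character as given by the font encoding
    x: float         # left edge of the glyph
    baseline: float  # baseline y coordinate (y grows upwards, as in PDF)
    size: float      # font size in points


def relation(base: Glyph, nxt: Glyph, tol: float = 0.2) -> str:
    """Classify nxt relative to base from baseline shift and font size.

    The tolerance is an illustrative value, not a tuned parameter.
    """
    shift = (nxt.baseline - base.baseline) / base.size
    smaller = nxt.size < base.size
    if smaller and shift > tol:
        return "sup"
    if smaller and shift < -tol:
        return "sub"
    return "inline"


def to_latex(glyphs: List[Glyph]) -> str:
    """Linearise glyphs, read left to right, into a LaTeX string."""
    ordered = sorted(glyphs, key=lambda g: g.x)
    if not ordered:
        return ""
    base = ordered[0]
    out = [base.char]
    for g in ordered[1:]:
        rel = relation(base, g)
        if rel == "sup":
            out.append("^{" + g.char + "}")
        elif rel == "sub":
            out.append("_{" + g.char + "}")
        else:
            out.append(g.char)
            base = g  # an inline glyph becomes the new reference point
    return "".join(out)


# Made-up coordinates for the formula x^2 + y_i
example = [
    Glyph("x", 0.0, 100.0, 10.0),
    Glyph("2", 6.0, 104.0, 7.0),
    Glyph("+", 12.0, 100.0, 10.0),
    Glyph("y", 20.0, 100.0, 10.0),
    Glyph("i", 26.0, 97.5, 7.0),
]
print(to_latex(example))
```

A real coordinate grammar handles far more than this (fractions, radicals, matrices, nested scripts); the point here is only that exact glyph positions and font sizes, once read directly from the PDF, are enough to recover two-dimensional structure without OCR.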

More detailed information can be found in the following thesis, papers and other materials. You can also find me on Google Scholar.

Ph.D. Thesis


Other Work

  • A poster about my earlier work, presented at DML 2008
  • Slides from a school seminar given in 2010
  • Mark Lee, Petr Sojka, Volker Sorge, Josef Baker, Wojtek Hury, and Lukasz Bolikowski. Association Analyzer Implementation: State of the Art, November 2010. Deliverable D8.1 of EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library.
  • Petr Sojka, Josef Baker, Alan Sexton, and Volker Sorge. A State of the Art Report on Augmenting Metadata Techniques and Technology, November 2010. Deliverable D7.1 of EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library.


  • Multivalent: An excellent tool for compressing and decompressing PDF files, extracting text, producing metrics and much more.
  • Mathdex: A search engine for mathematical notation. This appears to be currently unavailable.
  • PDF Reference Manual: An exhaustive (1200+ pages) description of the PDF file format.
  • OCRopus: A modular OCR and document analysis system.
  • INFTY Project: A research group concentrating on scientific and mathematical document analysis.