Research
My research is concerned with recognising and parsing mathematics within scientific documents, with the aim of automatically producing markup in a language such as MathML.
I mainly work with the ubiquitous PDF format, for which I have written a tool in OCaml that reads the files and extracts the font and character information they contain. This bypasses the need for Optical Character Recognition, which is notoriously difficult when dealing with mathematics. The extracted data provides precise input to our formula parser, which is based upon a coordinate grammar and has various plugins for producing different output formats, such as LaTeX and MathML.
More detailed information can be found in the following thesis, papers and other materials. You can also find me on Google Scholar.
Ph.D. Thesis
- Josef B. Baker "A Linear Grammar Approach for the Analysis of Mathematical Documents", University of Birmingham 2012
Papers
- Josef B. Baker, Alan P. Sexton and Volker Sorge "Extracting Precise Data on the Mathematical Content of PDF Documents", Towards Digital Mathematics Library 2008
- Josef B. Baker, Alan P. Sexton and Volker Sorge "Extracting Precise Data from PDF Documents for Mathematical Formula Recognition", Document Analysis Systems 2008
- Josef B. Baker, Alan P. Sexton and Volker Sorge "A Linear Grammar Approach to Mathematical Formula Recognition from PDF", Mathematical Knowledge Management 2009 (Best Paper Award)
- Josef B. Baker, Alan P. Sexton and Volker Sorge "An Online Repository of Mathematical Samples", Towards Digital Mathematics Library 2009
- Josef B. Baker, Alan P. Sexton and Volker Sorge "Using Fonts Within PDF Files to Improve Formula Recognition", Workshop for E-Inclusion in Mathematics 2009
- Josef B. Baker, Alan P. Sexton and Volker Sorge "Faithful Mathematical Formula Recognition from PDF Documents", Document Analysis Systems 2010
- Josef B. Baker, Alan P. Sexton, Volker Sorge and Masakazu Suzuki "Comparing Approaches to Mathematical Document Analysis from PDF", The International Conference on Document Analysis and Recognition 2011
- Josef B. Baker, Alan P. Sexton and Volker Sorge "Towards Reverse Engineering of PDF Documents", Towards Digital Mathematics Library 2011
- Josef B. Baker, Alan P. Sexton and Volker Sorge "MaxTract: Converting PDF to LaTeX, MathML and Text", Conferences on Intelligent Computer Mathematics 2012
Other Work
- A poster about my earlier work, presented at DML 2008
- Slides from a school seminar given in 2010
- Mark Lee, Petr Sojka, Volker Sorge, Josef Baker, Wojtek Hury, and Lukasz Bolikowski. Association Analyzer Implementation: State of the Art, November 2010. Deliverable D8.1 of EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, http://eudml.eu/.
- Petr Sojka, Josef Baker, Alan Sexton, and Volker Sorge. A State of the Art Report on Augmenting Metadata Techniques and Technology, November 2010. Deliverable D7.1 of EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library, http://eudml.eu/.
Links
- Multivalent An excellent tool for compressing and decompressing PDF files, extracting text, producing metrics and much more.
- Mathdex A search engine for mathematical notation. This appears to be currently unavailable.
- PDF Reference Manual An exhaustive (1200+ pages) description of the PDF file format.
- OCRopus A modular OCR and document analysis system.
- INFTY Project A research group concentrating on scientific and mathematical document analysis.