on Mathematical Formula Identification and Recognition from Born-digital Sources
The goal of the competition is to evaluate mathematical formula recognition systems with respect to
We want to run the competition on "clean data" taken from born-digital PDF documents rather than from retro-digitised documents containing scanned images. This will allow us to concentrate the evaluation on formula identification and structural recognition, and to avoid distortion by noise and by differences in preprocessing methods.
As data sets we intend to build on those we have used in our previous work [1, 2]. The data will be made available in both PDF and 600dpi TIFF format, to allow for the participation of recognisers that use PDF information directly as well as those that rely solely on image analysis for their recognition process.
Task 1: Dataset for formula identification
The dataset for formula identification contains 400 document pages, selected from 194 documents, with 1,575 isolated formulae and 7,907 embedded formulae. The documents in this dataset were obtained by crawling PDF documents from CiteSeerX. 200 document pages are used as the training dataset and the remaining 200 as the testing dataset.
Task 2: Dataset for formula recognition
The data used for the formula recognition evaluation will be based upon a ground-truth set first used to compare two formula recognition systems in [1]. It is modelled on the ground truth constructed for the Infty system [3], but exclusively uses documents in PDF format. The ground truth identifies the names and types of characters along with their spatial relationships, sizes and co-ordinates.
Task 1: Performance measures for formula identification
To gain more in-depth insight into the overall performance of a system, we adopt the evaluation metric proposed in our previous paper [2], which distinguishes different error types and quantifies their severity. The metric defines eight result types: Correct, Missed, False, Partial, Expanded, Partial&Expanded, Merged and Split. For each result type, a result type score is calculated according to the contribution of that result type to the system; in other words, the severity of each error type is quantified. In addition to identifying error types, the metric computes an overall performance Score as the weighted sum of the result type scores. The Score lies in the range [-1, 1], and a larger value indicates better formula identification performance. Further details about the evaluation metric can be found in [2].
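As a rough illustration of how a weighted sum over the eight result types can yield a single value in [-1, 1], the following Python sketch tallies the result types assigned to a system's output and combines them. It is not the metric of [2]: the weight values, the `RESULT_WEIGHTS` table and the `overall_score` function are hypothetical, and the sketch assumes that each result type score is simply the fraction of results of that type. The actual definitions are given in [2].

```python
from collections import Counter

# Hypothetical weights in [-1, 1], chosen only for illustration.
# They are NOT the weights defined in [2].
RESULT_WEIGHTS = {
    "Correct": 1.0,
    "Partial": 0.25,
    "Expanded": 0.25,
    "Partial&Expanded": 0.0,
    "Merged": -0.25,
    "Split": -0.25,
    "Missed": -1.0,
    "False": -1.0,
}


def overall_score(result_types):
    """Combine per-formula result types into one score in [-1, 1].

    Assumption: the score of a result type is the fraction of results
    of that type, and the overall score is the weighted sum of these
    fractions. Every weight lies in [-1, 1], so the sum does as well.
    """
    counts = Counter(result_types)
    unknown = set(counts) - set(RESULT_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown result types: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no results to score
    return sum(RESULT_WEIGHTS[t] * n for t, n in counts.items()) / total


if __name__ == "__main__":
    sample = ["Correct", "Correct", "Partial", "Missed"]
    print(overall_score(sample))
```

Under these assumed weights, a system whose results are all Correct scores 1, and one that produces only Missed or False results scores -1.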
Task 2: Performance measures for formula recognition
The evaluation metric for formula recognition will be released soon.
The participants will be supplied with a training dataset containing PDF/image documents with ground truth, and a testing dataset containing only PDF/image documents. The participants can use the training dataset and its ground truth for training or tuning their systems. The results on the testing dataset returned by the participants will be evaluated for the actual competition.
The format of the ground truth is described in the dataset.
The training/testing datasets for Task 1:
The training/testing datasets for Task 2:
Please send a registration e-mail to Josef Baker (j.baker@cs.bham.ac.uk) or Xiaoyan Lin (linxiaoyan@pku.edu.cn). In the registration e-mail, please specify your affiliation, contact details and the competition tasks in which you intend to participate.
Deadline of registration: April 12, 2013
Deadline of submission: April 23, 2013 (Extended)
[1] J. Baker, A.P. Sexton, V. Sorge, and M. Suzuki. Comparing approaches to mathematical document analysis from PDF. In Eleventh International Conference on Document Analysis and Recognition (ICDAR 2011), pages 463-467, September 2011.
[2] X. Lin, L. Gao, Z. Tang, X. Lin, and X. Hu. Performance evaluation of mathematical formula identification. In 2012 10th IAPR International Workshop on Document Analysis Systems, pages 287-291. IEEE, 2012.
[3] M. Suzuki, S. Uchida, and A. Nomura. A ground-truthed mathematical character and symbol image database. In Proc. of ICDAR, pages 675-679. IEEE Computer Society, 2005.