A Resource for Scientific Document Analysis:
Abramowitz and Stegun

Features


Connected Components

A connected component is a maximal connected set of foreground (i.e., black) pixels. Thus the upper case "L" and "M" characters each contain one connected component, while the "=" and the lower case "i" and "j" characters each have two. A standard step in document analysis is to extract the connected components and compute features from them, which classification algorithms then use for character recognition.
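The definition above can be illustrated with a minimal sketch that counts components in a tiny bitmap using a breadth-first flood fill. It assumes 8-connectivity (diagonal neighbours count as connected), which this page does not specify, and the function name and the toy glyphs are hypothetical.

```python
from collections import deque


def count_components(rows):
    """Count connected components of '#' pixels in equal-length strings.

    Assumption: 8-connectivity; the dataset's actual rule is not stated here.
    """
    height, width = len(rows), len(rows[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if rows[y][x] != "#" or seen[y][x]:
                continue
            # An unvisited foreground pixel starts a new component.
            count += 1
            seen[y][x] = True
            queue = deque([(x, y)])
            while queue:
                cx, cy = queue.popleft()
                # Visit all eight neighbours (and the pixel itself, already seen).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and rows[ny][nx] == "#" and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
    return count


# Toy glyphs: an "L" shape and an "=" shape.
LETTER_L = ["#...", "#...", "#...", "####"]
EQUALS = ["####", "....", "####"]
print(count_components(LETTER_L), count_components(EQUALS))  # prints: 1 2
```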

All connected components have been extracted from the 600 dots per inch version of the scan, which has been binarised and deskewed and has had small connected components removed. In the download files below, each connected component is stored in a folder that identifies the page it was extracted from. Each component is a TIFF RGBA image with deflate compression, with the background set to 100% transparent and the foreground set to black. This allows simple reconstruction of the original pages by drawing each connected component image at the appropriate location on the page. The x and y coordinates at which the top left pixel of the component image should be drawn are encoded in the name of each component image file. Note that further information about each component is available from the geometric moment features files below.
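The page reconstruction described above amounts to pasting each component at its top left coordinates while leaving transparent pixels untouched. The following is a minimal sketch of that idea on in-memory bitmaps rather than the actual TIFF files. The function name and the toy data are hypothetical, and decoding the images and parsing coordinates from file names are left out because the naming scheme is not spelled out here.

```python
def reconstruct_page(page_width, page_height, components):
    """Draw components onto a blank page.

    components: iterable of (x, y, bitmap), where (x, y) is the top left
    pixel position on the page and bitmap is a list of rows holding
    1 for foreground (black) and 0 for transparent background.
    """
    page = [[0] * page_width for _ in range(page_height)]
    for x, y, bitmap in components:
        for row_offset, row in enumerate(bitmap):
            for col_offset, pixel in enumerate(row):
                if pixel:
                    # Only foreground pixels are drawn, so overlapping
                    # bounding boxes do not erase each other.
                    page[y + row_offset][x + col_offset] = 1
    return page


# Toy example: the two bars of an "=" sign placed on a 6 x 5 page.
bar = [[1, 1, 1, 1]]
page = reconstruct_page(6, 5, [(1, 1, bar), (1, 3, bar)])
for row in page:
    print("".join("#" if pixel else "." for pixel in row))
```

With real data, the same loop would be driven by the decoded component images and by the coordinates from the file names or from the x and y fields of the moments file.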

Connected Component Images
Description | Size | Download
Pages 0001 to 0099 | 40 MBytes | AandS-mono600_ccs_00xx.tar.bz2
Pages 0100 to 0199 | 45 MBytes | AandS-mono600_ccs_01xx.tar.bz2
Pages 0200 to 0299 | 41 MBytes | AandS-mono600_ccs_02xx.tar.bz2
Pages 0300 to 0399 | 35 MBytes | AandS-mono600_ccs_03xx.tar.bz2
Pages 0400 to 0499 | 38 MBytes | AandS-mono600_ccs_04xx.tar.bz2
Pages 0500 to 0599 | 34 MBytes | AandS-mono600_ccs_05xx.tar.bz2
Pages 0600 to 0699 | 33 MBytes | AandS-mono600_ccs_06xx.tar.bz2
Pages 0700 to 0799 | 34 MBytes | AandS-mono600_ccs_07xx.tar.bz2
Pages 0800 to 0899 | 38 MBytes | AandS-mono600_ccs_08xx.tar.bz2
Pages 0900 to 0999 | 38 MBytes | AandS-mono600_ccs_09xx.tar.bz2
Pages 1000 to 1060 | 20 MBytes | AandS-mono600_ccs_10xx.tar.bz2

Geometric Moment Features

The download below contains a single comma separated value file with the following fields for each connected component in the book:

src_image
The file name of the multi-page TIFF image file that this component was extracted from.
page
The number of the page from the file that the component was extracted from.
page_width, page_height
The width and height in pixels of the page that the component was extracted from.
cc_image
The file name of the component image.
x, y, w, h
The bounding box of the component image on the page that it was extracted from. The (x,y) coordinates are those of the top left corner of the connected component on the page.
aspect
A feature based on the aspect ratio of the connected component.
m00
The number of foreground pixels in the connected component.
n20, n11, n02
The second order normalised central geometric moments of the connected component.
n30, n21, n12, n03
The third order normalised central geometric moments of the connected component.
i1, i2, i3, i4, i5, i6, i7, i8
The extended set of Hu rotational invariant moments for the connected components.

See the paper for details.
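As a rough guide to the moment fields listed above, the sketch below computes m00, the second order normalised central moments, and the first two Hu invariants using the common textbook definitions. It is an assumption that the file uses exactly these definitions; the aspect feature and the extended invariants are defined in the paper and are not reproduced here. The function name is hypothetical.

```python
def moment_features(bitmap):
    """Compute a few geometric moment features of a binary bitmap.

    bitmap: list of rows holding 1 for foreground and 0 for background,
    with at least one foreground pixel.
    Assumption: textbook definitions, eta_pq = mu_pq / m00 ** (1 + (p + q) / 2).
    """
    pixels = [(x, y) for y, row in enumerate(bitmap)
              for x, value in enumerate(row) if value]
    m00 = len(pixels)  # number of foreground pixels
    cx = sum(x for x, _ in pixels) / m00  # centroid
    cy = sum(y for _, y in pixels) / m00

    def eta(p, q):
        # Central moment about the centroid, then scale normalisation.
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in pixels)
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n11, n02 = eta(2, 0), eta(1, 1), eta(0, 2)
    return {
        "m00": m00,
        "n20": n20, "n11": n11, "n02": n02,
        "i1": n20 + n02,                         # first Hu invariant
        "i2": (n20 - n02) ** 2 + 4 * n11 ** 2,   # second Hu invariant
    }


# A horizontal bar and the same bar rotated by 90 degrees.
horizontal = moment_features([[1, 1, 1]])
vertical = moment_features([[1], [1], [1]])
print(horizontal["i1"], vertical["i1"])
print(horizontal["i2"], vertical["i2"])
```

The two bars give the same i1 and i2 even though n20 and n02 swap, which is the sense in which these invariants are rotational.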

Geometric Moment Features for the Connected Components
Description | Size | Download
Geometric moment features in comma separated value format | 255 MBytes | AandS_mono600_ccs_moments.csv.gz
Script for creating and populating a Postgres table from the moments file (the full path name of the moments file must be changed to the correct local value) | 3 KBytes | moments-create.sql
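For readers who do not want to load the moments file into Postgres, it can also be streamed directly. The sketch below assumes the file begins with a header row whose column names match the field names listed above; that is not confirmed here, so check the first line of the file before relying on it. The function name and the sample rows are hypothetical.

```python
import csv
import gzip
import io


def iter_components(path_or_file, page=None):
    """Yield one dict per component, optionally restricted to one page.

    Assumption: the first row of the file is a header naming the columns
    (page, cc_image, x, y, ...). Rows are streamed, so the large file is
    never held in memory as a whole.
    """
    with gzip.open(path_or_file, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            if page is None or int(row["page"]) == page:
                yield row


# Self-contained demonstration on made-up rows compressed in memory.
sample = (
    "page,cc_image,x,y,w,h,m00\n"
    "1,a.tiff,10,20,3,4,9\n"
    "2,b.tiff,5,6,2,2,4\n"
)
buffer = io.BytesIO(gzip.compress(sample.encode("utf-8")))
rows = list(iter_components(buffer, page=2))
print(rows[0]["cc_image"], rows[0]["m00"])  # prints: b.tiff 4
```

With the real download, the call would be something like `iter_components("AandS_mono600_ccs_moments.csv.gz", page=100)`, with the path adjusted to wherever the file was saved.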