Features
Connected Components
A connected component is a maximal connected set of foreground (i.e. black) pixels. Thus the upper case "L" and "M" characters each contain one connected component, while the "=" and lower case "i" and "j" characters each have two. A standard step in document analysis is to extract the connected components and compute features from them for use in character recognition classifiers.
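The definition above can be illustrated with a small flood-fill sketch. It is not the extraction code used for this data set: the function name, the toy glyph bitmaps and the choice of 8-connectivity (diagonal neighbours count as connected) are all assumptions made for illustration.

```python
from collections import deque


def count_components(rows, connectivity=8):
    """Count maximal connected sets of foreground ('#') pixels.

    `rows` is a list of equal-length strings; any other character is
    background. The connectivity rule is an assumption of this sketch.
    """
    height = len(rows)
    width = len(rows[0]) if rows else 0
    if connectivity == 8:
        steps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if (dy, dx) != (0, 0)]
    else:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    seen = set()
    count = 0
    for y in range(height):
        for x in range(width):
            if rows[y][x] != "#" or (y, x) in seen:
                continue
            # An unvisited foreground pixel starts a new component;
            # flood fill marks every pixel reachable from it.
            count += 1
            seen.add((y, x))
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in steps:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and rows[ny][nx] == "#"
                            and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
    return count


# Toy bitmaps (hypothetical, not taken from the scan).
LETTER_L = ["#...", "#...", "#...", "####"]
EQUALS = ["####", "....", "####"]
LETTER_I = [".#.", "...", ".#.", ".#.", ".#."]
```

On these toy bitmaps the "L" yields one component, while the "=" and the "i" each yield two, matching the description above.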
All connected components have been extracted from the version of the scan that is 600 dots per inch, binarised, deskewed, and has had small connected components removed. In the download files below, the connected components are grouped into folders, each of which identifies the page its components were extracted from. Each component is stored as a TIFF RGBA image with deflate compression, with the background set to 100% transparent and the foreground set to black. This allows simple reconstruction of the original pages by drawing the connected component images at the appropriate locations on the page. The x and y coordinates at which (the top left pixel of) the component image should be drawn are encoded in the name of each component image file. Note that further information about each component is available from the geometric moment features file below.
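The reconstruction idea can be sketched without any imaging library by treating a page and its components as grids of 0 (transparent background) and 1 (black foreground). The function names and the tiny example shapes below are hypothetical, and decoding x and y from a file name is left out because the naming pattern is not described here. With real TIFF files, an imaging library would do the same job by pasting each image at (x, y) using its alpha channel as the mask.

```python
def blank_page(width, height):
    """Return an empty page: a height x width grid of background (0)."""
    return [[0] * width for _ in range(height)]


def draw_component(page, component, x, y):
    """Draw `component` with its top left pixel at column x, row y.

    Only foreground pixels (1) are written, so the transparent
    background of one component never erases another component.
    """
    for dy, row in enumerate(component):
        for dx, pixel in enumerate(row):
            if pixel:
                page[y + dy][x + dx] = 1
    return page


# Rebuild a toy "i" from its two components: the dot and the stem.
page = blank_page(6, 4)
dot = [[1]]
stem = [[1], [1]]
draw_component(page, dot, 2, 0)
draw_component(page, stem, 2, 2)
```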
Connected Component Images

| Description | Size | Download |
|---|---|---|
| Pages 0001 to 0099 | 40 MBytes | AandS-mono600_ccs_00xx.tar.bz2 |
| Pages 0100 to 0199 | 45 MBytes | AandS-mono600_ccs_01xx.tar.bz2 |
| Pages 0200 to 0299 | 41 MBytes | AandS-mono600_ccs_02xx.tar.bz2 |
| Pages 0300 to 0399 | 35 MBytes | AandS-mono600_ccs_03xx.tar.bz2 |
| Pages 0400 to 0499 | 38 MBytes | AandS-mono600_ccs_04xx.tar.bz2 |
| Pages 0500 to 0599 | 34 MBytes | AandS-mono600_ccs_05xx.tar.bz2 |
| Pages 0600 to 0699 | 33 MBytes | AandS-mono600_ccs_06xx.tar.bz2 |
| Pages 0700 to 0799 | 34 MBytes | AandS-mono600_ccs_07xx.tar.bz2 |
| Pages 0800 to 0899 | 38 MBytes | AandS-mono600_ccs_08xx.tar.bz2 |
| Pages 0900 to 0999 | 38 MBytes | AandS-mono600_ccs_09xx.tar.bz2 |
| Pages 1000 to 1060 | 20 MBytes | AandS-mono600_ccs_10xx.tar.bz2 |
Geometric Moment Features
The download below is a single comma-separated value (CSV) file that holds the following information about each connected component in the book:
- src_image
  - The file name of the multi-page TIFF image file that this component was extracted from.
- page
  - The number of the page from the file that the component was extracted from.
- page_width, page_height
  - The width and height in pixels of the page that the component was extracted from.
- cc_image
  - The file name of the component image.
- x, y, w, h
  - The bounding box of the component image in the page that it was extracted from. The (x,y) coordinates are those of the top left corner of the connected component on the page.
- aspect
  - A feature based on the aspect ratio of the connected component.
- m00
  - The number of foreground pixels in the connected component.
- n20, n11, n02
  - The second order normalised central geometric moments of the connected component.
- n30, n21, n12, n03
  - The third order normalised central geometric moments of the connected component.
- i1, i2, i3, i4, i5, i6, i7, i8
  - The extended set of Hu rotational invariant moments for the connected component.
See the paper for details.
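As a rough guide to what these features measure, the sketch below computes m00, the second order normalised central moments and the first two Hu invariants from a list of foreground pixel coordinates. It uses the standard textbook definitions; whether the file uses exactly the same normalisation, and how the extended invariants i3 to i8 are defined, is an assumption to be checked against the paper. The function names are hypothetical.

```python
def normalised_central_moment(pixels, p, q):
    """Textbook normalised central moment of order (p, q).

    `pixels` is a list of (x, y) foreground coordinates. The value is
    mu_pq / m00 ** (1 + (p + q) / 2), where mu_pq is taken about the
    centroid, making it invariant to translation and scale.
    """
    m00 = len(pixels)  # number of foreground pixels
    xc = sum(x for x, _ in pixels) / m00
    yc = sum(y for _, y in pixels) / m00
    mu = sum((x - xc) ** p * (y - yc) ** q for x, y in pixels)
    return mu / m00 ** (1 + (p + q) / 2)


def second_order_features(pixels):
    """Return m00, n20, n11, n02 and the first two Hu invariants."""
    n20 = normalised_central_moment(pixels, 2, 0)
    n11 = normalised_central_moment(pixels, 1, 1)
    n02 = normalised_central_moment(pixels, 0, 2)
    return {
        "m00": len(pixels),
        "n20": n20,
        "n11": n11,
        "n02": n02,
        "i1": n20 + n02,
        "i2": (n20 - n02) ** 2 + 4 * n11 ** 2,
    }


# A three-pixel bar and the same bar rotated by 90 degrees.
horizontal_bar = [(0, 0), (1, 0), (2, 0)]
vertical_bar = [(0, 0), (0, 1), (0, 2)]
features_h = second_order_features(horizontal_bar)
features_v = second_order_features(vertical_bar)
```

The two bars swap their n20 and n02 values but share the same i1 and i2, which is the rotational invariance that makes the Hu moments useful as classification features.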
Geometric Moment Features for the Connected Components

| Description | Size | Download |
|---|---|---|
| Geometric moment features in comma-separated value (CSV) format | 255 MBytes | AandS_mono600_ccs_moments.csv.gz |
| Script for creating and populating a Postgres table from the moments file (requires the full path name of the moments file to be changed to the correct local value) | 3 KBytes | moments-create.sql |
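For readers who prefer not to load the data into Postgres, the moments file can also be streamed directly from the compressed download. The sketch below uses only the Python standard library. It assumes that the first row of the file is a header carrying the field names listed above and that the page field is an integer; neither is confirmed here, so check the file before relying on it. The function names are hypothetical.

```python
import csv
import gzip


def iter_components(path):
    """Yield one dict per connected component from a gzipped CSV file.

    Assumes the first row is a header naming the columns
    (src_image, page, cc_image, x, y, w, h, ...).
    """
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            yield row


def components_on_page(path, page):
    """Return (cc_image, x, y) for every component on one page."""
    return [
        (row["cc_image"], int(row["x"]), int(row["y"]))
        for row in iter_components(path)
        if int(row["page"]) == page
    ]
```

For example, `components_on_page("AandS_mono600_ccs_moments.csv.gz", 12)` would list the component images on page 12 together with the positions at which they should be drawn. Because the file is read row by row, the full 255 MByte download never needs to be decompressed into memory.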