Custom OCR for mono-spaced line-printer listings

QUICK START

If you came to this page looking for an OCR program to use with your lineprinter listings, let me set your expectations up front: this code is new, it's under development, and it hasn't yet been packaged for simple use by any random person. You're welcome to use it, but you should expect to have to work on a few shell scripts, and it would help if you're a C programmer. This code is not for the sort of person who needs a GUI to make their breakfast in the morning. In fact there's not even a graphics display in this code — we use ascii art when we want to examine a bitmap, and anything more complex relies on you having access to your own web server to view whole page images. So, before downloading the software (all currently stored in OCR.zip - github will come later when it's more usable), I strongly recommend reading the rest of this document.

Custom OCR for mono-spaced line-printer listings

One of my hobbies is recovering old computer programs from the early days of programming — and often the only remaining source of some program is a line-printer listing. Although sometimes just a scan is enough for historical interest, in many cases we want to recreate the file as text so that it can be resurrected as a running program.

It turns out that most OCR software just does not cope well with old listings, and usually the work of manually correcting the OCR of source code is as laborious as re-keying the source by hand! (Which is something that one of the net groups I'm in has done a few times — having multiple people re-key a listing and finding errors by comparing the multiple copies is a hellish and time-consuming task).

Fortunately, one of the features of old line-printer listings can be made to work in our favour — they're almost always mono-spaced (i.e. fixed pitch) fonts, which makes identifying each character on the page a lot easier than the same problem for proportional fonts.

Way back around 1982, the head of department at my University asked around for suggestions for final-year student projects. I supplied a few, but the one I was hoping most to be implemented was an idea I had for a custom OCR program specifically for line-printer listings. No-one took it up, and 40 years later I'm not aware of any projects out there that have done anything similar. (Well, with a small caveat: https://github.com/eighttails/ProgramListOCR tries to tackle the issue, but I've tried it and it was far from what we need.)

So now that I'm retired and have lots of free time to catch up on the unimplemented coding projects that I've thought of over the years, I've decided to implement the algorithm I suggested back in '82.

To cut to the chase, I've prototyped the program and can successfully extract individual characters from the page (by automatically determining the line and character spacing), and can recognise the individual extracted characters much more accurately than any of the OCR suites I tried. That said, it's not fast! Segmenting a 300 page document into characters took a couple of days on my old computer, and recognising characters takes about 10 seconds per character. This sounds terrible for those of you used to current OCR suites, but the system is for listings that we haven't had the manpower to re-key in the several years since we acquired the scans, and the accuracy of this system is going to be much higher. Taking a few days to get a good OCR of an otherwise unreadable scan is not a problem for us.

Note that this will only work with scanned documents from a flatbed scanner, not images photographed with a hand-held cell phone etc. The input to this code should be trimmed and de-skewed. I supply a script that will do all the necessary preparation for the test document I've been using — you can copy it and modify it to do similar preparation for your own documents.

Work-in-progress results

This is the first output from this new code — it has not yet been tweaked in any way to improve the recognition. I expect it will get better at disambiguating some of the harder glyphs to recognise when I adjust the penalties for excess pixels versus missing pixels when comparing an unknown glyph to an ideal form.

Starting with this raw scan:

my code extracts this text:

 //XALGCL JCB  (T193***,914,1,2,200),' ED SATTERTHWAITE ',MSGLEVEL=1      JOB  18                                             Q
 //JO8LIB  DD   DSNAME=T123.PLLIB,ONIT=23I4,VOLUME=SER=SYSI1,           X   2.
 //             DISP=(OLD,PASS)                                             3.
 //PL360 EXEC   PGM=PL360                                                   4.
 //SYSPRINT OO  SYSOOT=A,DCB=BLKSIZE=133                                    5.
 //SYSGC    DO  OSNAME=&LOAOSET,ONIT=2314,SPACE=(CYL,(3,1)I,            X   6.
 //             OISP=(MOO,PASSI,OCB=ILRECL=80,3LKSIZE=3200,RECFM=FBI        7.
 //SYSIN    OO  *                                                           8.
 IEF2361 ALLOC. FOR XALGOL   PL360
 IEF237I JOBLIB   ON 434
 IEF237I SYSGO    ON 330
 IEF237I SYSIN    CN 00C                        -.

In comparison, the open source ocr program tesseract reads the same image as:

//XALGCL JCB (T193***191411y2;200)y’ ED SATTERTHWAITE *,MSGLEVEL=1 JﬂB 18
//JceLig DD DSNAME=T123.PLLIB,UNIT=2314yVOLUME=SER=SY511y X 2.
// DISP=10LDgPASSi ‘ 3.
/7PL3&C EXEC PGM=PL36D 4.
//SYSPRINT DD SYSBUT=A19£8=BLKSIZE=133 5.
//735YSGC DD DSNAﬁE=SLGABSET,UN17:231415PA£E=(CYL,(311,3, X 6.
/7 913?={M897PASS)yDCB=(LRECL=BQ,BLKSIZE=3200yRECFM=F83 Te
//7SYSIN DD *% 8.
TEF2361 ALLGQ. FOR XALGOL PL360
1EF2371 JCBLIB ON 434
TEF2371 SVSGO ON 330
IEF237I SYSIN CN 00C

so based on this early result, I think the algorithm used shows promise and I will continue working on the project.

The basic algorithm

Align (deskew) the pages, trim off perforations etc, leaving only the text
Determine the line and column spacing, and overlay every page with a grid. Visually check that all characters fall within grid squares. The image of the page with the grid superimposed is only for visual inspection; it's not used directly by the OCR code. An HTML page is also generated which uses an HTML <AREAMAP> to make each character clickable and links to the individual character files: the same program that generates the grid also cuts the original page up into both lines and individual characters. The HTML page and individual character files help considerably in quickly finding an example of each glyph for use in creating the initial font. (The font is stored as an ascii drawing to make the glyphs easy to edit by hand at first.) Note that the output from a 300 page document is maybe 2Gb and contains thousands of small files, which on Linux can mean that you might run out of free i-nodes rather than actual disk space. (As was the case on my Raspberry Pi 5 with a 32Gb SD card. By the way a Pi 5 is otherwise a fine system for this sort of work. Just use a 64Gb card or larger, or request more i-nodes when you prepare your drive).
The lack of a predefined font is not a significant hindrance and for some documents it is a great advantage, because (as long as the document is mono-spaced), a document with an obscure character set is just as recognisable as plain ASCII. One of my test documents is a phonetic dictionary using IPA characters. Very few OCR systems can decode IPA text without extensive training. The same argument applies to accented characters (accents both above and below the character) as well as to underlined text.
The grid alignment uses a rather brute-force algorithm to determine line and column spacing, however it does do so at a resolution of 0.1 of a pixel. With lines of 132 characters, a whole pixel resolution would have caused the characters at the end of a line to start creeping out of their grid into the next square — just be aware that the non-integer separation means that some grid squares are 1 pixel larger than others, once the sub-pixel boundary is rounded to the nearest whole pixel.
Every page in the document is examined to determine the grid spacing of that page. Then the results from every page are examined to identify the most common x and y values, because the results for each individual page may be off by a small amount. Once the mode values are determined, the pages are re-segmented again using that single value. This fixes problems with slight offsets and with the occasional double-spaced page or page with so few lines that the spacing cannot be determined.
Initially, we manually pick a good example of each character that's in the listing, and build a font using these examples. A fully automatic tool to convert a list of png files into a C array is supplied. Later we will use this first approximation to the ideal font to locate other instances of the same characters. We'll align and average these to generate ideal forms of each letter, which are smoother than any random single glyph. However, that said, it turned out that using hand-picked examples of each letter actually gave really good recognition, and at least for the test document might be all that's needed. Creating the 'super-resolution' exemplar characters may be put on hold until some specific document really needs it.
OCR the text using these ideal forms. Save each decoded letter beside the corresponding graphics file. (We use png's for viewable files and pnm's for intermediate files that need to be read by various programs internally. Talking of which, the individual character files were generated as pnm files, but the user views png files via the web — so there's a script colfix.sh which searches the tiles subdirectory for those pnm files and converts them to png using ImageMagick's convert. Unfortunately this conversion does takes a couple of days to run.).
The comparison algorithm is to take the unknown glyph and compare it bit by bit to the master form, with the two bitmaps initially overlapping at their central points. Then the unknown bitmap is displaced in a growing spiral pattern around that initial point to see if the same glyph matches better at some other offset. The movement allowed is roughly 1/3rd of the height and width of its size.
Note that this character recognition technique should work for dot-matrix characters with gaps between the dots, as well as for typewriter-like listings and the output from band printers.
In my original project proposal in 1982, I suggested using a clustering algorithm to identify similar glyphs (rather than manually identifying them), and then to Perform a 1:1 decryption of the OCR'd text to identify the individual glyphs. This turned out not to be necessary as manually identifying the 96 or fewer individual letters really wasn't difficult or time-consuming.
Manually identify any remaining glyphs.
Output correctly spaced and indented text.

In the computation that compares the exemplar against the candidates, I'm currently weighting all pixels the same. However there's a case for biasing some of the pixels to be more significant if missing, eg the serif on a lower-case L vs digit ONE vs upper-case I, or ZERO vs OH vs C — the latter looking like an O that was clipped at the right (as happens on the ALGOL-W listing). We might manually indicate this by highlighting parts of master letter forms in red.

Also, when we have some overlap in a grid that should belong to an adjacent cell, we might be able to identify it by using the 'connected-components' facility in imagemagick to identify blobs, and remove blobs that intersect the edge if they are smaller than other blobs in the same grid square. Note that for awkward documents, the extracted grid cells may deliberately include some extra pixels on all sides to compensate for bad registration or to pick up underlines etc that might have been missed. We need to keep real islands but eliminate those that are a spill-over from the adjacent cells. 'connected-components' ought to enable this. (Also it is likely that we can write our own flood-fill code to implement this rather than invoking image magick).

This grid technique is good at identifying letters with accents and underlines. There may be some minor tweaking needed for characters where an accent above the letter or an overline is on the same level as an underline or accent below a letter on the preceding line. The IPA example has one of these where a syllabic 'l' has a dot over it which was assigned to the previous line. This will be hard to identify without context as IPA also has some letters with a dot below them.

Dewarping: Dewarping is not the same as straightening by rotation or shearing — dewarping can cope with complex curves in the text such as you might get from an open book near the spine. Although automatic dewarping software exists, it's not obvious that the results will fit exactly into a regular grid. It may be worth some experimenting, but for now our technique is probably only applicable to pages from a flatbed scanner.

Although this code allows grey scale images on input, they're thresholded down to 1-bit binary. We have a parameter for thresholding that can be specified when compiling the font-building tool.

The evaluation function of the two pixels being compared is somewhat crude. It may need to be tweaked from experience and may also require tweaking on a per-document basis. But for now it works well enough. We base the match score on pixels that are common to the unknown glyph and a master character template. For now missing pixels and excess pixels are weighted the same, but I am considering weighing the excess pixels more heavily than the missing pixels. The glyph may be missing pixels that are present in the master because the printer strike was not perfect, but pixels in the unknown glyph which are not present in the master are a stronger indication of a poor match.

The segmentation code is in segmenter/ and the character recognition code (and font builder) are in recogniser/

Image Straightening

Some years ago I wrote some image straightening software myself, but then I found the 'deskew' program which was faster and more accurate than mine, so this project will simply use that too. However, even the output from Deskew is not always aligned well enough to be usable here — one image of a height of approximately 3000 pixels suffered a shear of 7 pixels between the top row and the bottom row. I don't yet have code to determine that shear automatically but it's relatively easy to determine manually, and then correct using the "pnmshear" program from the netpnm suite. (For some reason pnmshear takes an angle as the shear parameter, rather than an integer number of pixels.)

In the code below, we calculate the angle of a 7 x 3109 right-angled triangle to use as the parameter to pnmshear. In this case, those numbers were hard coded by inspection — ideally they would be determined automatically.

  # identify P-075.png
  # P-075.png PNG 3109x2113 3109x2113+0+0 8-bit sRGB 1.10411MiB 0.000u 0:00.000

  PX="7"
  X="3109"
  Y="2113"
  ANGLETAN=`echo "scale=7;pi=4*a(1);a(${PX}/${Y})/pi*180"|bc -l`
  echo "shearing $f from 0 pixels at top to ${PX} pixels (to the right) at the bottom. angle = $ANGLETAN degrees"
  pngtopnm < level/$filename.png | pnmshear -noantialias $ANGLETAN | pnmtopng > sheared-$$.png ; mv sheared-$$.png level/$filename.png

Segmentation Algorithm

The ability to impose a grid on a straightened listing relies on examining the image to recognise lines and columns of text. The easiest way to do this is to simply add up the pixels along each horizontal row and vertical row. The computer science way to infer the spacing from this is to apply an FFT and look for the strongest signal. However there's an easier way! Simply loop over the array of row totals, using different strides (eg every 20 pixels, every 21 pixels, every 22 pixels etc) and find the row spacing that gives the largest total amount of whitespace. Surprisingly this works, and it's sufficiently fast. However there's a small catch: if we use an integer step, the grid slowly gets out of synch as you examine from left to right or top to bottom. It's necessary to use a higher resolution, and in my code I initially used a step with 0.1 pixel resolution but am now uprading that to 0.01 pixels.

Character Recognition Algorithm

The primary code of this OCR program (in file 'chocr.c') is surprisingly simple. It's just a bitmap comparison — a master bitmap for each recognisable character is compared to an extracted glyph and awarded a score based on an evaluation function whose parameters are the number of matching pixels plus pixels that exist in the master but not the glyph, plus pixels that exist in the glyph but not the master. This comparison is done with the two bitmaps intially overlapping and with the center of their bounding boxes aligned, and then again multiple times as the unknown glyph is moved in a spiral pattern around that central point. Comparisons where part of the unknown glyph lie outside the bounding box of the master bitmap are handled correctly, which also compensates for the occasional failure to capture a glyph completely within its grid square.

Because of the printing technogies used in some old print-outs, it often happens that part of a character is missing, from an incomplete strike of the print head. Because of this, the evaluation function for the bitmap comparison weighs extra pixels in the unknown glyph as more of an error than missing pixels. Currently the weights for these two parameters are fixed, though I am considering making them values that can be set per scan so that they match the specific printer that was used.

The spiral displacement only moves the characters around by something like 1/3rd of the dimension of the grid square — this not only speeds up the recognition, but avoids confusion of items such as commas and apostrophes which may look identical in isolation, but are distinguished by their position within the cell. (There are a few specific cases such as double-spaced text where this rule of thumb fails - those are being worked on.)

Font preparation

This code relies on having a specific font built-in that matches the document being OCR'd. It does not support generic fonts although later in development we may have a recognition phase that checks a new document against the known fonts that were added with each scan.

People used to other OCR suites which require training to recognise unknown alphabets or strange fonts may think that defining a font is a daunting task to be performed before a scan can be done. It's not. Creating an initial font is quite simple: by this time, segmentation has extracted the individual characters and saved them to separate files. A web page has been created displaying the grid, which uses an HTML areamap to display the filename of any character that is clicked on. To create a new font, you edit the genfont.c file and add lines that list these filenames along with the character description and the text string to be output when the character is recognised. (This allows for UTF-8 support by the way). Here are some example entries from a recogniser for IPA (phonetic) symbols:

    f[next++] = & (CHAR) { " ", "../testdata/IPA/tiles/IPA-02_1_1/line001/col001.png", "space" };
    f[next++] = & (CHAR) { "n", "../testdata/IPA/tiles/IPA-02_1_1/line008/col007.png", "n" };
    f[next++] = & (CHAR) { "h", "../testdata/IPA/tiles/IPA-02_1_1/line030/col007.png", "h" };
    f[next++] = & (CHAR) { "æ", "../testdata/IPA/tiles/IPA-02_1_1/line008/col003.png", "IPA Ash" };
    f[next++] = & (CHAR) { "ə", "../testdata/IPA/tiles/IPA-02_1_1/line008/col006.png", "schwa" };
    f[next++] = & (CHAR) { "e", "../testdata/IPA/tiles/IPA-02_1_1/line038/col012.png", "e" };
    f[next++] = & (CHAR) { "ə̲", "../testdata/IPA/tiles/IPA-02_1_1/line020/col006.png", "underlined schwa" };
    f[next++] = & (CHAR) { "e̲", "../testdata/IPA/tiles/IPA-02_1_1/line048/col006.png", "underlined e" };

Note that the empty grid square (SPACE) should come first, if you want to speed up recognition of empty grid squares. The optimisation of empty spaces takes into consideration a small amount of pixel noise within the square. It does this by counting the pixels in every font character, looking for the character with the fewest pixels, and then taking a quarter of that number as a threshold below which the cell should be considered empty.

The output from genfont is the header file nchar.h. It contains the actual character font definitions as a C array. It's not efficient but it's easy to modify by hand, which was useful to me during development, and may be useful to you in preparing a font for handling a new scan type.


  {
    "n",
    "tiles/IPA-02_1_1/line008/col007.png",
    "n",
    {
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@",
      "@@@      @@@@@@@      @@@@@@@@@@@",
      "@@@                     @@@@@@@@@",
      "@@@                      @@@@@@@@",
      "@@@@             @@       @@@@@@@",
      "@@@@@          @@@@@@     @@@@@@@",
      "@@@@@@        @@@@@@@@     @@@@@@",
      "@@@@@@      @@@@@@@@@@@    @@@@@@",
      "@@@@@@     @@@@@@@@@@@@    @@@@@@",
      "@@@@@@     @@@@@@@@@@@@    @@@@@@",
      "@@@@@@     @@@@@@@@@@@@    @@@@@@",
      "@@@@@@     @@@@@@@@@@@@    @@@@@@",
      "@@@@@@     @@@@@@@@@@@@    @@@@@@",
      "@@@@@@    @@@@@@@@@@@@@    @@@@@@",
      "@@@@@@    @@@@@@@@@@@@@    @@@@@@",
      "@@@@@@    @@@@@@@@@@@@@    @@@@@@",
      "@@@@@@    @@@@@@@@@@@@@    @@@@@@",
      "@@@@@@    @@@@@@@@@@@@@    @@@@@@",
      "@@@@@@    @@@@@@@@@@@@@    @@@@@@",
      "@@@@@@    @@@@@@@@@@@@@    @@@@@@",
      "@@@@@@    @@@@@@@@@@@@@    @@@@@@",
      "@@@@@      @@@@@@@@@@@      @@@@@",
      "@@@@        @@@@@@@@@        @@@@",
      "@@@          @@@@@@@@         @@@",
      "@@@          @@@@@@@@         @@@",
      "@@@@        @@@@@@@@@@       @@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      NULL,
    },
  },
  {
    "h",
    "tiles/IPA-02_1_1/line030/col007.png",
    "h",
    {
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@    @@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@      @@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@       @@@@@@@@@@@@@@@@@@@@@@@@",
      "@@       @@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@     @@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@      @@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@     @@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@     @@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@     @@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@     @@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@    @@@@@@       @@@@@@@@@@@",
      "@@@@@                  @@@@@@@@@@",
      "@@@@@                   @@@@@@@@@",
      "@@@@@                    @@@@@@@@",
      "@@@@@           @@@      @@@@@@@@",
      "@@@@@        @@@@@@@@     @@@@@@@",
      "@@@@@      @@@@@@@@@@@    @@@@@@@",
      "@@@@@      @@@@@@@@@@@     @@@@@@",
      "@@@@@     @@@@@@@@@@@@     @@@@@@",
      "@@@@@     @@@@@@@@@@@@     @@@@@@",
      "@@@@@     @@@@@@@@@@@@     @@@@@@",
      "@@@@@     @@@@@@@@@@@@     @@@@@@",
      "@@@@@     @@@@@@@@@@@@     @@@@@@",
      "@@@@@     @@@@@@@@@@@@     @@@@@@",
      "@@@@@     @@@@@@@@@@@@     @@@@@@",
      "@@@@@     @@@@@@@@@@@@@    @@@@@@",
      "@@@@@     @@@@@@@@@@@@@    @@@@@@",
      "@@@@@     @@@@@@@@@@@@@    @@@@@@",
      "@@@@@     @@@@@@@@@@@@     @@@@@@",
      "@@@@@     @@@@@@@@@@@@     @@@@@@",
      "@@@@@     @@@@@@@@@@@@     @@@@@@",
      "@@@@      @@@@@@@@@@@       @@@@@",
      "@@@         @@@@@@@@         @@@@",
      "@@           @@@@@@@          @@@",
      "@@          @@@@@@@@          @@@",
      "@@@@       @@@@@@@@@         @@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
      NULL,
    },
  },

The example above shows a couple of features you should be aware of: the letter 'n' is almost exactly contained within the letter 'h'. In the current program (before I get around to improving the code) this causes the characters to be confused, mainly because the penalty for missing pixels is too low — yet it has to be low in order to handle printer strikes where part of the image is missing.

Two characters have underlined variants. Rather than handle underlining as a special case, we simply define the character to include the underline. There isn't an example here (yet), but accented characters will be handled the same way — as part of the character, rather than as an extra blob of ink that has to be recognised separately and linked to a nearby letter, which is how some less naïve OCR systems handle accents.

For some documents, the initial font prepared by genfont turns out to be sufficient and gives good recognition, but for other documents, especially when the print is low quality, you will get several misidentified characters. But there is a solution for this — you can take several copies of the same character and merge them into a new master font character which will be a better quality rendition of that character than any one copy alone. The superimposition of the multiple instances of the same character are done by the 'exemplar.c' program, which uses the same alignment algorithm as the character recognition program 'chocr'.

Improving recognition

One of my test documents was a table of numbers, and apart from 0 through 9, the only characters appearing on the page were ':' and '-'. By creating a font that only included the characters that were possible to appear on the page, the recognition of each character was significantly improved. For example if the letter O never appears in a document and is not defined as a font character, then anything that is unclear as to being an O or a 0 is always going to be recognised (correctly) as a 0.