okra pie: some simple ocr/hocr tests

A couple of years ago, while I was at the University of Alabama, we were using tesseract-ocr to OCR images of old printed texts. With the version of tesseract we had at the time, there didn't seem to be a way to get the actual coordinates of the words without editing the code.

This week I kind of got re-interested in seeing if there was a simple way to use tesseract to get the bounding box info, i.e. where the words are located on the image. With the newer (3+) version of tesseract I initially learned that one can get the box coordinates by passing the "makebox" argument a la:

$ tesseract foo.tif foo.txt makebox

This actually outputs the coordinates of each character, so I wrote a little Python script to take the text OCR output and compare it against the character coordinates to give me the location of each word. In turn, the script would use ImageMagick to make a PNG file from the TIFF and then dump out an HTML file that placed the words over the images, though the words were transparent. This allowed me to just use the browser's native Find feature (CTRL-F) to search and highlight HTML words as they rested on top of their respective image-based words.
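The matching step described above can be sketched roughly like this. It is not the Okra Pie script itself, just a minimal illustration: the function names `parse_box_lines` and `group_words` are made up for the example, and it assumes the usual box file layout of one character per line as `char left bottom right top page`, with y measured from the bottom of the image. It also assumes the box file and the text output list the same characters in the same order, which real OCR output won't always honor.

```python
from typing import List, Tuple

# (left, top, right, bottom), measured from the image's top-left corner
Box = Tuple[int, int, int, int]


def parse_box_lines(box_text: str, image_height: int) -> List[Tuple[str, Box]]:
    """Parse box-file text into (glyph, box) pairs.

    Assumes each line looks like 'char left bottom right top [page]'.
    Box files count y upward from the bottom edge, so the values are
    flipped to a top-left origin, which is what HTML positioning wants.
    """
    chars = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        glyph = parts[0]
        left, bottom, right, top = (int(p) for p in parts[1:5])
        chars.append((glyph, (left, image_height - top, right, image_height - bottom)))
    return chars


def group_words(text: str, chars: List[Tuple[str, Box]]) -> List[Tuple[str, Box]]:
    """Walk the OCR text word by word, consuming character boxes for each word."""
    words = []
    i = 0
    for word in text.split():
        covered = ""
        boxes = []
        # Take boxes until they spell out as many characters as the word has.
        while i < len(chars) and len(covered) < len(word):
            glyph, box = chars[i]
            covered += glyph
            boxes.append(box)
            i += 1
        if covered != word:
            # The two outputs drifted apart; a real script would need to resync.
            raise ValueError(f"box file and text disagree at {word!r} (got {covered!r})")
        # The word's box is the union of its character boxes.
        words.append((word, (min(b[0] for b in boxes), min(b[1] for b in boxes),
                             max(b[2] for b in boxes), max(b[3] for b in boxes))))
    return words
```

Raising on the first mismatch keeps the sketch honest about its main weakness: as soon as the two files disagree on a character, everything after that point would be misaligned.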

But, of course, then I learned you can just do this:

$ tesseract foo.tif foo hocr

to create an HTML (4.0) file with the coordinates of each word, eliminating the script's need to compare whole words against character coordinates.
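Here is a minimal sketch of pulling the word boxes out of that file and turning them into transparent, positioned words over the image. It assumes the words appear as elements with class `ocrx_word` (or `ocr_word`) whose `title` attribute holds `bbox x0 y0 x1 y1` measured from the top-left corner; the exact markup varies between tesseract versions, so check your own output. `HocrWords` and `overlay_html` are hypothetical names for this example, not part of tesseract or of the original script.

```python
import re
from html import escape
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WORD_CLASSES = {"ocr_word", "ocrx_word"}  # assumed class names for word elements


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each word element in hOCR markup."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word element currently open, if any
        self._depth = 0     # nesting depth inside that word element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1  # nested markup such as <strong> inside a word
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        match = BBOX_RE.search(attrs.get("title") or "")
        if classes & WORD_CLASSES and match:
            self._bbox = tuple(int(g) for g in match.groups())
            self._depth = 1
            self._text = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the word element itself just closed
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def overlay_html(words, image_src):
    """Build an HTML fragment with transparent words positioned over the image."""
    spans = []
    for text, (x0, y0, x1, y1) in words:
        style = (f"position:absolute;left:{x0}px;top:{y0}px;"
                 f"width:{x1 - x0}px;height:{y1 - y0}px;color:transparent")
        spans.append(f'<span style="{style}">{escape(text)}</span>')
    return (f'<div style="position:relative"><img src="{escape(image_src)}">'
            + "".join(spans) + "</div>")
```

To use it, feed the contents of the hocr output file to `HocrWords().feed(...)`, then pass the resulting `.words` list and the PNG's path to `overlay_html`. Because the words are real text sitting over the image, the browser's Find feature can still match and highlight them.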

Anyway, there's lots of work to do if I want to pursue this. I need to look further into some of the weird things that happen with text like multi-column newspapers (the text for some columns is severely offset from the image itself, etc.), but it's a nice little start.

I also want to see if there's a way to map the tesseract output to an ABBYY FineReader-like XML output; maybe that way tesseract could be used in conjunction with the Internet Archive's fine eBook reader. I'm sure someone's already done that, so a little research would be step #1. I think the IA's reader uses ABBYY output(?), and ABBYY's not open or free, if I understand correctly.

I'd also like to think about doing this for OCR-ed images of audio transcriptions and synchronizing the image and/or the HTML text with the media.

Anywho, here's a link to a sample HTML file. You'll probably want to zoom out, given that I didn't resize the image (yeah, so it loads slowly, too). You might not want to use IE, because I don't think it lets you search for more than one word at a time, like "Mulberry Tree". Also, there are two JS functions in the page called "hideImage()" and "showImage()" if anyone wants to play a little and see what the text looks like without the image in the background.

By the way, the image I tested on is from the Library of Congress' awesome American Memory collection. You can see it here.

… and I almost forgot the best part. The script is called "Okra Pie" because "ocr" is like "okra" and "pie" is for Python.

😛

--------------

2 Comments

  1. David Brandon

    Thanks for this post. Your sample HTML file is exactly the type of thing I'm looking to put on my new website. I had a question, though. I started by trying to recreate your top instruction:
    $ tesseract foo.tif foo.txt makebox
    I had already downloaded tesseract and tessnet2, and I ran it on the Windows 7 command line (ignoring the $). Using a test.jpg and a test.tif, I got a file called:
    test.txt.box
    What is this? How can I view the coordinates for each character in that file type?

    1. nitin (Post author)

      Hey David,

      I've never used tessnet2 but the tesseract command can yield a ".box" file that should contain the bounding box info for each character that's OCRed. It's just a plain text file.

      If you want to send the .box file in an email (or post a link to it), I can take a look and make sure we're both talking about the same thing. Thanks,

      nitaro74 AT gmail DOT com
