A couple of years ago, while at the University of Alabama, we were using tesseract-ocr to OCR images of old printed texts. With the version of tesseract we had then, there didn't seem to be a way to get the actual coordinates of the words without editing the code.
This week I got re-interested in finding a simple way to use tesseract to get bounding box info, i.e. where the words are located on the image. With the newer (3+) versions of tesseract, I initially learned that you can get box coordinates by passing the "makebox" argument, like so:
$ tesseract foo.tif foo makebox
This actually outputs the coordinates of each character, so I wrote a little Python script that compares the text OCR output against the character coordinates to get the location of each word. The script then uses ImageMagick to make a PNG from the TIFF and dumps out an HTML file that places the words, rendered transparently, over the image. That lets me use the browser's native Find feature (CTRL-F) to search and highlight HTML words as they sit on top of their image-based counterparts.
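This isn't the original "Okra Pie" script, but here's a minimal sketch of the word-grouping idea. Tesseract's .box file lists one character per line as "char x0 y0 x1 y1 page" (with y measured from the bottom of the image), and contains no spaces, so you can walk the OCR'd text word by word and consume one box per character. It naively assumes the characters in the text output line up one-for-one with the box file, which real output won't always honor:

```python
def word_boxes(ocr_text, box_lines):
    """Yield (word, (x0, y0, x1, y1)) pairs from OCR text plus .box lines.

    Assumes each non-space character in ocr_text corresponds, in order,
    to exactly one line of the box file.
    """
    # Parse "char x0 y0 x1 y1 page" lines into (char, x0, y0, x1, y1).
    boxes = []
    for line in box_lines:
        parts = line.split()
        if len(parts) >= 5:
            boxes.append((parts[0],) + tuple(int(v) for v in parts[1:5]))

    i = 0
    for word in ocr_text.split():
        chunk = boxes[i:i + len(word)]  # one box per character of the word
        if not chunk:
            break
        # The word's box is the union of its characters' boxes.
        x0 = min(b[1] for b in chunk)
        y0 = min(b[2] for b in chunk)
        x1 = max(b[3] for b in chunk)
        y1 = max(b[4] for b in chunk)
        yield word, (x0, y0, x1, y1)
        i += len(word)
```

A fuller version would also handle punctuation and characters the box file splits or merges differently than the text output does.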
But, of course, then I learned you can just do this:
$ tesseract foo.tif foo hocr
to create an HTML (4.0) file with the coordinates of each word, eliminating the script's need to compare whole words against character coordinates.
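As a rough sketch of what consuming that output looks like: in hOCR, each recognized word is a span with class "ocrx_word" whose title attribute carries "bbox x0 y0 x1 y1" (top-left origin, unlike the bottom-left origin of .box files). A real parser should use an actual HTML parser; a regex is enough to show the idea:

```python
import re

# Match spans like:
#   <span class='ocrx_word' title='bbox 10 20 60 40; x_wconf 91'>word</span>
WORD_RE = re.compile(
    r"<span[^>]*class=.ocrx_word.[^>]*"
    r"title=.bbox (\d+) (\d+) (\d+) (\d+)[^>]*>([^<]+)</span>"
)

def hocr_words(hocr_html):
    """Yield (word, (x0, y0, x1, y1)) pairs from an hOCR document."""
    for m in WORD_RE.finditer(hocr_html):
        x0, y0, x1, y1 = (int(v) for v in m.groups()[:4])
        yield m.group(5), (x0, y0, x1, y1)
```

From there, emitting the transparent-text overlay HTML is just a matter of absolutely positioning each word at its box.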
Anyway, there's lots of work to do if I want to pursue this. I need to investigate some of the weird things that happen with text like multi-column newspapers (text for some columns is severely offset from the image itself, etc.), but it's a nice little start.
I also want to see if there's a way to map the tesseract output to ABBYY FineReader-like XML output; maybe that way tesseract could be used in conjunction with the Internet Archive's fine eBook reader. I'm sure someone's already done that, so a little research would be step #1. I think the IA's reader uses ABBYY output(?), and ABBYY's software isn't open or free, if I understand correctly.
I'd also like to think about doing this for OCR-ed images of audio transcriptions and synchronizing the image and/or the HTML text with the media.
Anywho, here's a link to a sample HTML file. You'll probably want to zoom out, given that I didn't resize the image – yeah, so it loads slowly, too. You might not want to use IE, because I don't think it lets you search for a phrase like "Mulberry Tree". Also, there are two JS functions in the page called "hideImage()" and "showImage()" if anyone wants to play a little and see what the text looks like without the image in the background.
By the way, the image I tested on is from the Library of Congress' awesome American Memory collection. You can see it here.
… and I almost forgot the best part. The script is called "Okra Pie" because "ocr" is like "okra" and "pie" is for Python.