okra pie: the actual code I forgot to post

A while ago I did a simple test here [http://blog.humaneguitarist.org/2012/07/14/okra-pie-some-simple-ocrhocr-tests/] using Tesseract's HOCR output and ImageMagick to overlay - with CSS - invisible OCR text on top of images in an HTML file. This is a very simple way to generate a page on which one can use the browser's "find" function to search for words within the image - at least those that were accurately captured by the OCR process.

I can't remember why I didn't post the Python code, so here it is below. It could easily be modified to write to an SQL database or something, so that one could build a small digital library with SQL and PHP: a simple site that lets one search for snippets of text, click on a search result and then - via some JavaScript - be taken to the specific portion of the page/image that contains the text. Maybe I'll actually do something like that one day, but for now here's the code.

'''
@title: Okra Pie
@author: Nitin Arora

- this script will take an argument (e.g. "foo") and assume the existence of "foo.tif".
- it will then use Tesseract and ImageMagick to create, in the "output" folder, the following:
    - "foo.html" - the Tesseract HOCR/XHTML output,
    - "foo.png" - a PNG version of the TIFF file,
    - "foo.okra.html" - an HTML file with the OCR text overlaid on top of the PNG file.

example usage:
    $ python ./okra.py foo
'''

#### import modules
import sys, codecs
from PIL import Image #http://www.pythonware.com/products/pil/
from lxml import etree #http://lxml.de/
from lxml.html import *
import urllib

#### make hocr file with tesseract and PNG with ImageMagick.
try:
    fp = sys.argv[1]
    fps = tuple([fp]*4)
    import os
    run = ("tesseract %s.tif output/%s hocr | convert %s.tif output/%s.png") %fps
    print(run)
    os.system(run)
except:
    print("You must pass the filename prefix for your .tif file.")
    sys.exit()

#### get PNG image size.
im = Image.open(fp + ".tif") #the PNG has the same dimensions as the source TIFF.
im_width = im.size[0]
im_height = im.size[1]

#### parse hocr file.
fo = codecs.open("output/" + fp + ".html", "r", "utf-8")
fo_r = fo.read()
root = fromstring(fo_r)
ocrWords = root.findall('.//span[@class="ocr_word"]')

#### place each word and its coordinates into a list as a dictionary.
wordList = []
for ocrWord in ocrWords:
    inner = ocrWord.find('.//span[@class="ocrx_word"]')
    if inner is not None: #skip words without an inner "ocrx_word" span.
        word = {}
        word["text"] = inner.text_content()
        coordinates = ocrWord.get("title")
        coordinates = coordinates.split(" ")
        coordinate = coordinates.pop(0) #remove word "bbox" from attribute value.
        #convert the right/bottom edges into a width and a height.
        coordinates[2] = int(coordinates[2]) - int(coordinates[0])
        coordinates[3] = int(coordinates[3]) - int(coordinates[1])
        word["left"] = coordinates[0]
        word["top"] = coordinates[1]
        word["width"] = coordinates[2]
        word["height"] = coordinates[3]
        #keep only words whose top-left corner falls inside the image.
        if (int(word["left"]) <= int(im_width)) and (int(word["top"]) <= int(im_height)):
            wordList.append(word)
fo.close()

#### create output HTML file with image and words (overlaid).
fo = codecs.open("output/" + fp + ".okra.html","w","utf-8")
header = """<!DOCTYPE html>
<html>
  <head>
    <title>Okra Pie</title>
    <meta charset="UTF-8" />
    <script type="text/javascript">
      function hideImage(){
        var im = document.getElementById("image");
        var ocr = document.getElementById("ocr");
        im.style.display = "none";
        ocr.style.color = "black";
      }
      function showImage(){
        var im = document.getElementById("image");
        var ocr = document.getElementById("ocr");
        im.style.display = "block";
        ocr.style.color = "transparent";
      }
    </script>
  </head>
  <body>
    <div id= "image" style="position:absolute;z-index:-1">
      <img src="%s.png" />
    </div>
""" %fp
fo.write(header)
fo.write('\
    <div id="ocr" style="color:transparent;opacity:0.5;background-color:transparent;">\n')
for word in wordList:
    wordSpan = (word["left"], word["top"], word["width"], word["height"], word["height"], word["text"])
    #tag = '\t<span data-X="%s" data-Y="%s" data-W="%s" data-H="%s">%s</span>\n' %wordSpan
    tag = '\
      <span style="left:%spx;top:%spx;width:%spx;height:%spx;font-size:%spx;position:absolute;">%s </span>\n' %wordSpan #note the whitespace at the end so browsers can search for two or more words with a space in between.
    fo.write(tag)
fo.write("""\
    </div>
  </body>
</html>""")
fo.close()
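
As a rough illustration of the SQL idea mentioned above, the word boxes could be written to a database instead of (or as well as) an HTML file. The sketch below is only one possible shape for that, not part of the original script: the helper names (title_to_box, store_words, find_word) and the "words" table are hypothetical, and it uses Python's built-in sqlite3 module rather than a separate SQL server with PHP. It also pulls the coordinates out of the title attribute with a regular expression, on the assumption that some Tesseract versions append extra properties after the box (e.g. "bbox 10 20 110 60; x_wconf 93").

```python
import re
import sqlite3

# Matches the four integers that follow "bbox" in an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

def title_to_box(title):
    """Return (left, top, width, height) in pixels, or None if no bbox is found."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    x0, y0, x1, y1 = (int(n) for n in match.groups())
    # hOCR stores two corners; CSS positioning wants a corner plus a size.
    return (x0, y0, x1 - x0, y1 - y0)

def store_words(conn, page, words):
    """Insert (text, title) pairs for one page into a hypothetical 'words' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words "
        "(page TEXT, word TEXT, x INTEGER, y INTEGER, w INTEGER, h INTEGER)"
    )
    for text, title in words:
        box = title_to_box(title)
        if box is not None:  # silently skip words without usable coordinates
            conn.execute(
                "INSERT INTO words VALUES (?, ?, ?, ?, ?, ?)",
                (page, text) + box,
            )
    conn.commit()

def find_word(conn, snippet):
    """Return every stored word containing the snippet, with its page and box."""
    query = "SELECT page, word, x, y, w, h FROM words WHERE word LIKE ?"
    return conn.execute(query, ("%" + snippet + "%",)).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # a file path would make this persistent
    store_words(conn, "foo", [("okra", "bbox 10 20 110 60"),
                              ("pie", "bbox 120 20 180 60; x_wconf 90")])
    print(find_word(conn, "okr"))
```

A search page could then take the page name and box from a matching row and use a bit of JavaScript to scroll to that position on the image.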