okra pie: the actual code I forgot to post

A while ago I ran a simple test here using Tesseract's hOCR output and ImageMagick to overlay invisible OCR text, positioned with CSS, on top of images in an HTML file.

The result was a page on which one could use the browser's "find" function to search for words within the image, or at least for those words the OCR process captured accurately.

I can't remember why I didn't post the Python code, so here it is below. It could easily be modified to write to an SQL database so that one could build a small digital library with SQL and PHP: a simple site where one searches for snippets of text, clicks on a search result, and is then taken, via some JavaScript, to the specific portion of the page/image that contains the text.
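
A minimal sketch of the storage half of that idea, assuming SQLite through Python's built-in sqlite3 module rather than any particular SQL server. The `words` table, its column names, and the `save_words` and `find_word` functions are hypothetical and not part of the script; the input is a list of dictionaries shaped like the script's `wordList`.

```python
import sqlite3

def save_words(conn, page, word_list):
    """Store one row per OCR word (hypothetical schema, not from the script)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "page TEXT, text TEXT, left_px INTEGER, top_px INTEGER, "
        "width_px INTEGER, height_px INTEGER)"
    )
    rows = [
        (page, w["text"], int(w["left"]), int(w["top"]),
         int(w["width"]), int(w["height"]))
        for w in word_list
    ]
    conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()

def find_word(conn, term):
    """Return (page, left, top) for every stored word that contains the term."""
    cursor = conn.execute(
        "SELECT page, left_px, top_px FROM words "
        "WHERE text LIKE ? ORDER BY page, top_px, left_px",
        ("%" + term + "%",),
    )
    return cursor.fetchall()

# Example: an in-memory database holding a single word from page "foo".
conn = sqlite3.connect(":memory:")
save_words(conn, "foo", [
    {"text": "Okra", "left": "10", "top": "20", "width": 40, "height": 12},
])
print(find_word(conn, "okra"))
```

The page name and pixel coordinates returned by the search are what a front end would need in order to scroll to the matching region of the image. Note that `%` and `_` in the search term act as wildcards in a `LIKE` pattern.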

Maybe I'll actually do something like that one day, but for now here's the code.

"""
@title: Okra Pie
@author: Nitin Arora

- this script will take an argument (e.g. "foo") and assume the existence
  of "foo.tif".
- it will then use Tesseract and ImageMagick to create, in the "output"
  folder, the following:
  - "foo.html" - the Tesseract HOCR/XHTML output,
  - "foo.png" - a PNG version of the TIFF file,
  - "foo.okra.html" - an HTML file with the OCR text overlaid on top of the
    PNG file.

example usage:
  $ python ./okra.py foo
"""

#### import modules
import sys, codecs
from PIL import Image #http://www.pythonware.com/products/pil/
from lxml.html import fromstring #http://lxml.de/

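#### make hocr file with tesseract and PNG with ImageMagick.
try:
  fp = sys.argv[1]
  fps = tuple([fp]*4)
  import os
  #"&&" runs ImageMagick's convert only after tesseract has finished successfully.
  run = ("tesseract %s.tif output/%s hocr && convert %s.tif output/%s.png") %fps
  print(run)
  os.system(run)
except IndexError:
  print("You must pass the filename prefix for your .tif file.")
  sys.exit(1)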

#### get image size (read from the TIFF; the PNG should have the same pixel dimensions).
im = Image.open(fp + ".tif")
im_width, im_height = im.size

#### parse hocr file.
fo = codecs.open("output/" + fp + ".html", "r", "utf-8")
fo_r = fo.read()
root = fromstring(fo_r)
ocrWords = root.findall('.//span[@class="ocr_word"]')

#### place each word and its coordinates into a list as a dictionary.
wordList = []
for ocrWord in ocrWords:
  node = ocrWord.find('.//span[@class="ocrx_word"]')
  if node is not None:
    word = {}
    word["text"] = node.text_content()
    coordinates = ocrWord.get("title")
    coordinates = coordinates.split(" ")
    coordinates.pop(0) #remove word "bbox" from attribute value.
    #convert the right and bottom edges into a width and a height.
    coordinates[2] = int(coordinates[2]) - int(coordinates[0])
    coordinates[3] = int(coordinates[3]) - int(coordinates[1])
    word["left"] = coordinates[0]
    word["top"] = coordinates[1]
    word["width"] = coordinates[2]
    word["height"] = coordinates[3]
    #keep only words whose top-left corner falls inside the image.
    if (int(word["left"]) <= int(im_width)) and (int(word["top"]) <= int(im_height)):
      wordList.append(word)

#### create output HTML file with image and words (overlaid).
fo = codecs.open("output/" + fp + ".okra.html","w","utf-8")
header = """<!DOCTYPE html>
<html>
  <head>
    <title>Okra Pie</title>
    <meta charset="UTF-8" />
    <script type="text/javascript">
      function hideImage(){
        var im = document.getElementById("image");
        var ocr = document.getElementById("ocr");
        im.style.display = "none";
        ocr.style.color = "black";
      }
      function showImage(){
        var im = document.getElementById("image");
        var ocr = document.getElementById("ocr");
        im.style.display = "block";
        ocr.style.color = "transparent";
      }
    </script>
  </head>
  <body>
    <div id="image" style="position:absolute;z-index:-1">
      <img src="%s.png" />
    </div>
""" %fp
fo.write(header)
fo.write('    <div id="ocr" style="color:transparent;opacity:0.5;background-color:transparent;">\n')
for word in wordList:
    wordSpan = (word["left"], word["top"], word["width"], word["height"], word["height"], word["text"])
    #tag = '\t<span data-X="%s" data-Y="%s" data-W="%s" data-H="%s">%s</span>\n' %wordSpan
    #note the whitespace at the end so browsers can search for two or more words with a space in between.
    tag = '      <span style="left:%spx;top:%spx;width:%spx;height:%spx;font-size:%spx;position:absolute;">%s </span>\n' %wordSpan
    fo.write(tag)
#close the elements opened above and finish the file.
fo.write('    </div>\n  </body>\n</html>\n')
fo.close()
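
One thing worth noting about the coordinate handling: the script splits the `title` attribute on spaces, which works when the attribute holds nothing but `bbox x0 y0 x1 y1`. hOCR allows further properties after a semicolon (for example a word confidence such as `x_wconf`), and in that case the last coordinate would no longer convert cleanly to an integer. Below is a sketch of a more tolerant parser; `parse_bbox` and `BBOX_PATTERN` are hypothetical names, not part of the script above.

```python
import re

# Matches "bbox x0 y0 x1 y1" anywhere inside an hOCR title attribute.
BBOX_PATTERN = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

def parse_bbox(title):
    """Return left/top/width/height for an hOCR title, or None if it has no bbox."""
    match = BBOX_PATTERN.search(title or "")
    if match is None:
        return None
    x0, y0, x1, y1 = (int(value) for value in match.groups())
    # hOCR stores two corners; CSS positioning wants a corner plus a size.
    return {"left": x0, "top": y0, "width": x1 - x0, "height": y1 - y0}

print(parse_bbox("bbox 36 92 96 116; x_wconf 95"))
```

The returned dictionary uses the same keys as the entries in `wordList`, so it could be merged into `word` with `word.update(...)` after the text has been set.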
