redacting naughty words in images with Tesseract, ImageMagick, and dish soap

A major part of my current work is using natural language processing tools and good old regular expressions to identify instances of PII (personally identifiable information) in government emails and, from there, automatically redact them in the versions of the emails that might be shown to someone who has requested to see them.

And while it's a little out-of-scope for now, I was thinking the other week about doing the same for attachments, namely images that are good candidates for OCR.

I'm not going to write too much more, except to say that the code below was a proof-of-concept experiment to do the following:

1. get XHTML/HOCR output of a sample image [http://blog.humaneguitarist.org/uploads/001da.png] using Tesseract as I'd done earlier [http://blog.humaneguitarist.org/2013/09/01/okra-pie-the-actual-code-i-forgot-to-post/],
2. parse that output to get the coordinates of all instances of a given word - in this case "tree",
3. and output the ImageMagick command needed to create a new image in which those instances of the word "tree" are blocked out with a black box overlaid with the word "redacted" written in red.

The code outputs the following command:

    convert 001da.tif -fill black -pointsize 12 -stroke red -draw "rectangle 1900,832 2005,870" -draw "text 1900,851 ' {REDACTED} '" -draw "rectangle 1491,1596 1596,1633" -draw "text 1491,1614 ' {REDACTED} '" -draw "rectangle 989,1792 1069,1822" -draw "text 989,1807 ' {REDACTED} '" -draw "rectangle 1185,2488 1290,2525" -draw "text 1185,2506 ' {REDACTED} '" -draw "rectangle 508,3448 626,3492" -draw "text 508,3470 ' {REDACTED} '" 001da.png

That results in this image [http://blog.humaneguitarist.org/uploads/001da_redacted.png].

It all seems promising.

And here's the code in Python:

    #!/usr/bin/env python3

    # import modules.
    from lxml.html import parse

    # open HOCR XHTML file; get all OCR words as list "ocrWords".
    root = parse("001da.hocr")
    body = root.find('body')
    ocrWords = body.findall('.//span[@class="ocrx_word"]')

    # if we find word "tree" in "ocrWords", then store coordinates in dictionary "wordDict".
    i = 0
    wordDict = {}
    for ocrWord in ocrWords:
        node = ocrWord.text_content()
        if node is not None:
            node = "".join([n for n in node.lower() if n.isalnum()]) # alphanumeric characters only.
        if node != "tree":
            continue
        coordinates = ocrWord.get("title")
        coordinates = coordinates.split(" ")
        coordinates.pop(0) # remove word "bbox" from attribute value.
        word = {}
        word["text"] = node
        word["left"] = coordinates[0]
        word["top"] = coordinates[1]
        word["right"] = coordinates[2]
        word["bottom"] = coordinates[3][0:-1] # drop the trailing character after the last number.
        wordDict[i] = word
        i += 1

    # create and print ImageMagick command to:
    #   1) replace all instances of word "tree" with black box
    #   2) put phrase "{REDACTED}" inside the box.
    cmd = ["convert 001da.tif -fill black -pointsize 12 -stroke red"]
    for key in wordDict:
        word = wordDict[key]
        average = int(word["top"]) + int(word["bottom"])
        average = average // 2 # integer division keeps the text coordinate a whole number.
        cmd_part = '''-draw "rectangle %s,%s %s,%s" -draw "text %s,%s ' {REDACTED} '"''' %(word["left"], word["top"], word["right"], word["bottom"], word["left"], average)
        cmd.append(cmd_part)
    cmd.append("001da.png")
    cmd = " ".join(cmd)
    print(cmd)
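
The script above gets the coordinates by splitting the title attribute on spaces and trimming the last character of the fourth number. A slightly more defensive variant might look like the sketch below: it pulls the four bbox numbers out with a regular expression and builds the ImageMagick arguments as a list, so the command could be handed to subprocess without worrying about shell quoting. The names parse_bbox and build_convert_args are hypothetical helpers of my own, and the sample title string is an assumed example of the attribute format, not output copied from 001da.hocr.

```python
import re

# Matches "bbox left top right bottom" anywhere in an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (left, top, right, bottom) as ints, or None if no bbox is present."""
    match = BBOX_RE.search(title or "")
    if match is None:
        return None
    return tuple(int(n) for n in match.groups())


def build_convert_args(src, dst, boxes, label=" {REDACTED} "):
    """Build an argument list for ImageMagick's convert from a list of boxes."""
    args = ["convert", src, "-fill", "black", "-pointsize", "12", "-stroke", "red"]
    for left, top, right, bottom in boxes:
        middle = (top + bottom) // 2  # vertical midpoint, as in the script above
        args += ["-draw", "rectangle %d,%d %d,%d" % (left, top, right, bottom)]
        args += ["-draw", "text %d,%d '%s'" % (left, middle, label)]
    args.append(dst)
    return args


# Assumed example of a title value; real files may carry other properties too.
box = parse_bbox("bbox 1900 832 2005 870; x_wconf 90")
args = build_convert_args("001da.tif", "001da.png", [box])
print(args)

# To actually run it (requires ImageMagick on the PATH):
# import subprocess; subprocess.run(args, check=True)
```

Because each -draw primitive is its own list element, nothing needs escaping for the shell, and a title without a bbox simply yields None instead of a mangled coordinate.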