using PhantomJS to convert HTML email with boo-boos to plain text

[Sat, 28 Jan 2017 16:14:18 +0000]
I'm currently making my living working at home as a developer on a grant project that involves archiving, and doing some automated processing of, North Carolina state government emails (i.e. Microsoft Outlook). It's grant-funded, so the party's going to end one day, but I'll enjoy it while I can. Here and there, I'll post about things that might be of general interest outside of the project.

For instance, one of the things we need to do is get a nice plain text representation of HTML emails when there's no "text/plain" alternative already available from the PST file. Ideally, there would always be both versions if the sender chose HTML as the default format.

It's a discussion for another time, but Microsoft could stand to put some effort into their HTML email markup. It would be really nice if things were semantic-y and not just focused on appearance. Google, by contrast, looks like it marks up its email signatures so that one could easily identify what part of a sent email is a signature - at least for the sender's original email sent via Gmail. But I digress.

Anyway, HTML-email to TEXT-email isn't as simple as it sounds. Simply stripping tags leaves a mess even if one removes everything outside the "body" tag. There's all kinds of undesirable text, and trying to re-create the whitespace and line breaks that aid readability becomes an unholy nightmare. And the issue is really that the emails need to be readable - by future archivists or, potentially, the public (if a records request is made). If I just wanted the plain text for simple automated processing tasks, it wouldn't be as complicated an issue - I could probably get away with just stripping tags.

First, I looked for a pre-existing Python solution. But the Python html2text module wasn't a good option. The results weren't looking that great, especially because some of the sample emails I have contain HTML boo-boos such as empty or unclosed tags. Also, it converts HTML to Markdown.
Markdown retains link URLs in text, but Markdown isn't what I'd consider readable by the general public and a lot of archivists. In certain circumstances it's great (like writing this blog post on my desktop), but emails are their own animal and I don't think Markdown is a good plain text format for emails with all kinds of crap in them.

The node.js package html-to-text seems far more promising, but for the sample email I was using (which contained empty "li" tags), the output looked like a mistake wherever those occurred. I want the output to be purty. I do need to go back to that node.js package and investigate further. After all, it seems specifically designed for email conversion and has options I need to explore. Plus, I could do some pre-processing such as removing empty "li" tags.

For this week, however, what I'm liking is using PhantomJS to render a plain text version, since it doesn't worry about trying to add leading asterisks for each list item and whatnot. The only thing I decided to keep for now was hyperlink "href" values, so I just had to add a little code to do that. In other words, a hyperlink like this:

    <a href="">bar</a>

becomes:

    bar <>

PhantomJS lets me parse the document with a browser and JavaScript, which seems ideal for HTML that's not always totally valid. It seems to do a great job of adding whitespace where needed and turning tables into tab-separated rows. I prefer an all-in-one solution (parse and render), so I'm not really exploring lxml's HTML5 parser because I'd still need something that renders nice text.

Anyway, here's the PhantomJS code, below, if anyone's interested.

    /*
    FILE: htmlEmail2textEmail.js

    DESCRIPTION: Converts HTML email file to plain text while retaining
    hyperlink locations within brackets.
        Before: "<a href="">bar</a>"
        After: "bar <>"

    REQUIREMENTS: PhantomJS <>

    USAGE:
        $ phantomjs .\htmlEmail2textEmail.js
        >> You must pass a .html file (from your working folder).
        $ phantomjs .\htmlEmail2textEmail.js test.html
        >> Writing file: test.html.txt
    */

    // includes.
    var fs = require("fs");
    var webpage = require("webpage");
    var system = require("system");

    // test for file argument.
    if (system.args.length === 1) {
        console.log("You must pass a .html file (from your working folder).");
        phantom.exit();
    }

    // create browser path to HTML file.
    var html = "file:///" + fs.workingDirectory + "/" + system.args[1];

    // open the file in the headless browser.
    var page = webpage.create();
    page.open(html, function (status) {

        // bail out if the file could not be loaded.
        if (status !== "success") {
            console.log("Could not open: " + html);
            phantom.exit(1);
            return;
        }

        // rewrite "A" tag values for plain text version.
        page.evaluate(function () {
            var children = document.body.getElementsByTagName("A");
            for (var i = 0; i < children.length; i++) {
                var child = children[i];
                child.innerText = child.innerText + " <" + child.href + ">";
            }
        });

        // write plain text version to file.
        var txt = system.args[1] + ".txt";
        console.log("Writing file: " + txt);
        fs.write(txt, page.plainText, "w");
        phantom.exit();
    });

Get it? PhantomJS and Boo!-boos?
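
One variation on the link-rewriting loop that might be worth trying: anchors with no "href" (a named anchor, say) report an empty string, so the loop as written would tack an empty "<>" onto their text. Here's a rough sketch of the same loop that skips those, pulled out into a standalone function so it can be run under plain Node.js. The function name `appendHrefs`, the stand-in anchor objects, and the example.com URL are all made up for illustration. In PhantomJS the loop would still have to live inside the `page.evaluate` callback, since that code runs in the page's own sandbox.

```javascript
// Hypothetical helper; the name and the skip rule are assumptions,
// not part of htmlEmail2textEmail.js.
// Appends " <href>" to the text of each anchor-like object and
// returns the number of anchors that were changed.
function appendHrefs(anchors) {
  var changed = 0;
  for (var i = 0; i < anchors.length; i++) {
    var a = anchors[i];
    // Skip anchors with an empty href so their text doesn't end in "<>".
    if (!a.href) {
      continue;
    }
    a.innerText = a.innerText + " <" + a.href + ">";
    changed++;
  }
  return changed;
}

// Stand-in objects so the sketch runs without a browser.
var fakeAnchors = [
  { innerText: "bar", href: "http://example.com/" }, // placeholder URL
  { innerText: "top", href: "" }                     // e.g. a named anchor
];
var count = appendHrefs(fakeAnchors);
console.log(count, fakeAnchors[0].innerText, "|", fakeAnchors[1].innerText);
```

Whether skipping is the right call depends on the emails. For archival purposes it might be better to keep a visible marker for every link, even a broken one.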