spotting the John Hancock in emails

[Sat, 06 May 2017 17:15:24 +0000]
I think I mentioned before that I'm working on software to help state government archivists process emails. And as a chef I once knew said about chicken and salmonella: if email doesn't scare you, it should. Email looks like something with all kinds of rich data inside, when in reality a lot of that valuable data gets tossed from server to server as loose, unstructured text.

One of the things that has had me nervous from Day 1 of taking on the project was my co-worker telling me he wanted us to think about how to identify email signatures. That's because he said we might want to consider them as "noise" and omit them - or at least signatures from outgoing government emails - from natural language processing tasks. For example, if we know John Doe already works for our state government, we might not really want "John Doe" to be extracted as a Person entity from his signature block, since we'll already know the email is from "John Doe", a Person from our state government.

Anywho, signature extraction from plain text email is a dicey affair. While Gmail seems to encode the outgoing sender's signature in the HTML MIME version of the email, that doesn't help with signatures contained within replies and forwarded replies in the same email, or with signatures from different email clients. Also, the State of North Carolina uses Outlook, which doesn't encode the signature from what I can see. Because of this and other issues, we're using the plain text version of emails anyway. In cases where the email is available as HTML only, we convert it to plain text before any processing. I'll talk more about that conversion process in a future post. Perhaps I'll write that soon, too, as I've neglected this blog for a while.

So, what's a signature? Depends on who you ask. The people behind the Talon project [] suggest that it's usually something of 60 characters or less, as mentioned in this issue [].
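As a quick aside on the plain-text point: here's a minimal sketch of how you might pull the text/plain part out of a raw message with Python's standard library email package. The raw message here is a made-up example, not one of our sample emails:

```python
# Sketch: extract the text/plain body from a raw RFC 5322 message
# using the stdlib email package (Python 3).
from email import policy
from email.parser import BytesParser

def get_plain_text(raw_bytes):
    """Return the text/plain body of a raw message, or None if absent."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    # get_body() walks the MIME tree; preferencelist limits it to text/plain.
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content() if body is not None else None

# Hypothetical raw message for illustration.
raw = b"""From: John Doe <john@example.gov>\r
To: Jane Doe <jane@example.gov>\r
Subject: Hello\r
Content-Type: text/plain; charset="utf-8"\r
\r
Hi Jane,\r
\r
Sincerely,\r
John Doe\r
"""

print(get_plain_text(raw))
```

If the message is HTML-only, `get_body(preferencelist=("plain",))` comes back `None`, which is where our HTML-to-text conversion step kicks in.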
Example:

    Sincerely,
    John Doe

In government and the other places I've worked, they're more like the example in the Talon issue I already linked to:

    John Smith
    Co-Founder and CEO
    Xxxxxxxxx
    mobile: 555.115.4274 | book a mtg <> | @handle <> | linkedin <> | video <>

Unable to find anything pre-written, I worked up something with the idea that if it does a decent job, we can at least build some training data and try using machine learning to detect signatures down the road.

But first, before trying to detect signatures, we needed to detect replies within a single email string. Since we're just using Outlook-heavy sample stuff for now, I wrote a little Outlook splitter to detect reply blocks in the form:

    From: Sir John Doe <>
    To: Lady Jane Doe <>
    Subject: Email Signatures
    Sent: Apr 11, 2017 at 1:51 AM

... etc. Gmail, by contrast, splits a reply with a line like:

    On Tue, Apr 11, 2017 at 1:51 AM, Sir John Doe <> wrote:

It actually looks like the Talon developers have built splitters for Gmail and some mobile devices, too - so I do need to look at their code more closely at some point, because we'll want to split on more than Outlook formats. In other words, we may want to use the Talon splitters but not the Talon signature detection.

Anyway, once the replies are split off, the idea is simple:

1. Take the sender info (either from the email header for the most recent message within the email, or from the "From:" line for embedded replies) and split the email part from the name part, i.e. split "Sir John Doe <>" into "Sir John Doe" and "<>".

2. Normalize the name part by removing non-alphabetic characters and common titles/abbreviations. In other words, "Sir John Doe" becomes "John Doe".

3. Read the reply backward, line by line, all the while normalizing each line using the same method as in step 2.
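The reply splitting described above can be sketched roughly like this. These regexes are illustrative, not the project's actual splitter code - an Outlook-style block is a run of "Header: value" lines starting at "From:", while Gmail uses a single "On ... wrote:" line:

```python
import re

# Hypothetical patterns, for illustration only.
OUTLOOK_RE = re.compile(
    r"^From:[^\n]*\n(?:(?:To|Cc|Subject|Sent|Date):[^\n]*\n)+",
    re.MULTILINE,
)
GMAIL_RE = re.compile(r"^On .+ wrote:$", re.MULTILINE)

def split_replies(body):
    """Split an email body into reply blocks at Outlook/Gmail boundaries."""
    boundaries = [m.start() for m in OUTLOOK_RE.finditer(body)]
    boundaries += [m.start() for m in GMAIL_RE.finditer(body)]
    cuts = [0] + sorted(boundaries) + [len(body)]
    return [body[a:b].strip() for a, b in zip(cuts, cuts[1:]) if body[a:b].strip()]

email_body = """Thanks!

From: Sir John Doe <john@example.com>
To: Lady Jane Doe <jane@example.com>
Subject: Email Signatures
Sent: Apr 11, 2017 at 1:51 AM

Earlier message text.
"""

print(split_replies(email_body))
```

Each detected header block starts a new slice, so the most recent message (before the first "From:") comes out as its own block.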
When there's a match of at least two tokens - in this case "John" and "Doe" - between the name part from step 2 and the line from step 3, the code assumes that line is the start of the signature, and that everything from that line to the end of the reply is the signature.

I tested the code on about 7,500 emails we have from an ex-government employee. The script found nearly 5,000 signatures within about 3,500 emails. In other words, it didn't find signatures in all emails and found multiple signatures in some emails. It is true that not all emails had signatures. I spot-checked 1,000 random results and agreed that it did a good job on nearly 930 - or 93%. In the cases where it was "wrong", it often found the signature, but since it assumes the signature ends with the last line of the reply, it considered extraneous text to be part of the signature when I, as a human, only saw the signature as being a portion of what was extracted. Now that we're working on tagging some training data, we're going to take a stab at doing this in conjunction with machine learning.

The code for the latest commit as of this writing is here: []. We're moving things around, so if I just give a link to the file sans the branch, it might not be there after a couple weeks. And, yes, we're making somewhat of a "thinking out loud" mess and don't have a license specified. We know. :P

Anyway, here's a short description of the files that might be of interest:

* []: list of titles and abbreviations that are removed during normalization.
* []: the script that detects replies and signatures.
* []: I didn't talk about this - just an early test to extract signature features.
* sample_email.txt []: a sample email. Note, I can't post our sample government emails on GitHub.
* sample_email.txt.json []: output of the sample email after being analyzed for signatures, etc.
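For the curious, the backward-scan heuristic described above can be sketched like so. This is a simplified stand-in, not the actual script, and the titles list here is a tiny illustrative subset of the real one:

```python
import re

# Illustrative subset; the real list of titles/abbreviations is longer.
TITLES = {"sir", "lady", "dr", "mr", "mrs", "ms", "jr", "sr"}

def normalize(text):
    """Lowercase, strip non-alphabetic characters, and drop common titles."""
    tokens = re.sub(r"[^A-Za-z\s]", " ", text).lower().split()
    return [t for t in tokens if t not in TITLES]

def find_signature(sender_name, reply_text, min_overlap=2):
    """Scan the reply bottom-up; the first line sharing at least
    min_overlap tokens with the normalized sender name starts the
    signature, which runs to the end of the reply."""
    name_tokens = set(normalize(sender_name))
    lines = reply_text.splitlines()
    for i in range(len(lines) - 1, -1, -1):
        if len(name_tokens & set(normalize(lines[i]))) >= min_overlap:
            return "\n".join(lines[i:])
    return None

# Hypothetical reply for illustration.
reply = """Hi Jane,

See you Thursday.

Sincerely,
John Doe
Deputy Director, Example Agency
555.115.4274"""

print(find_signature("Sir John Doe", reply))
```

Note this sketch shares the limitation described above: everything from the matching line to the end of the reply is taken as the signature, extraneous trailing text included.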
Note that the sender metadata is null in the first reply block: the metadata for that part of the email lives in the email header, not in the email body itself, so it can't be extracted from the reply block.