hightailing it out of None with lxml

[Thu, 08 Mar 2018 02:56:09 +0000]
I've been doing a lot of work with XML and Python's lxml library lately. Some of the XML files we process are in the 5-10 gigabyte range. That kinda sucks in the first place, but while working on a bug today I ran into an aspect of lxml that struck me as odd.

Some background first. An element (an etree._Element object) created by lxml requires you to manually clean up control characters (most of them, anyway) before you write the element's text (yes, even CDATA), attribute, or tail values. The tail is any string that follows the close of an element and precedes the next element. Trying to write illegal characters to the element raises a ValueError with the following text:

    All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

That's fine. We found some code online that addresses this by removing control characters from a string. But we had to modify it because, looking deeper, lxml does let line breaks and tabs through, even though both are control characters. Carriage returns are OK too, but they get escaped as some kind of XML character reference, which is ugly, at least (I think) when writing them to an element's tail. A line break is a good enough substitute for us. We also replace form feeds and vertical tabs with a line break, as a way of translating those characters into something similar. Whitespace is significant for us.

In fairness, blowing away all control characters directly addresses the ValueError raised by lxml. But taking the error message literally leads to another problem, namely lost whitespace. Hence our modification.

In several places in our code base, we try to write an element's text, tail, and so on. If that raises a ValueError (lxml says it can't write the illegal characters), we legalize the string with our function and try again so that it works - or rather, doesn't break.

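To see why the filter in the legalizing function further down has to exempt tabs and line breaks explicitly, it helps to look at how Python's unicodedata module classifies these characters. The sketch below uses only the standard library; survives_filter() and KEEP are names made up for this illustration, and the function simply mirrors the condition used in the list comprehension of _legalize_xml_text().

```python
import unicodedata

# The two control characters the post's filter deliberately lets through.
KEEP = ("\t", "\n")


def survives_filter(char):
    """Mirror of the keep/drop condition in the post's list comprehension.

    Hypothetical helper for illustration only: a character is kept when its
    Unicode general category does not start with "C", or when it is one of
    the explicitly exempted characters.
    """
    return unicodedata.category(char)[0] != "C" or char in KEEP


# Print each sample's general category and whether the filter would keep it.
for char in ("\t", "\n", "\r", "\f", "\v", "\x00", "\u200b", "a"):
    print(repr(char), unicodedata.category(char), survives_filter(char))
```

Tab and line break are both in category Cc, the same as form feed and NUL, so without the exemption they would be stripped along with everything else. One side effect worth knowing: the test is on the first letter of the category, so it also drops format characters (category Cf) such as the zero-width space, not just the C0 control codes.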
Something that can get annoying with lxml is that the initial text and tail values are None rather than empty strings. So one has to check whether the value is None before appending a string - "needs more string!". Here's the little bit of inline documentation on an element's tail:

    >>> help(etree._Element.tail)
    Help on getset descriptor lxml.etree._Element.tail:

    tail
        Text after this element's end tag, but before the next sibling
        element's start tag. This is either a string or the value None, if
        there was no text.

What's odd (yes, I'm just now getting to this) is that if an attempt to append to an element's tail fails due to illegal characters, the tail appears to be reset to None, even if I'd previously set it to a string. So after we legalize the string and try to append it to the existing tail, we get a TypeError because we can't, of course, append a string to None. I'm not sure why it gets reset. By the way, the same thing appears to happen with the element's text: it also gets reset to None.

It sure would be nice if an empty string were the default instead of None, but that's just based on my current needs. It would also be nice if lxml could legalize one's text for you (maybe by setting some option). But I'm not going to wait around for that to happen any time soon. It's already an amazing library, so I guess I shouldn't be too needy.

Anyway, here's a Python snippet below with two functions, fails() and works(). Strangely enough, I decided that the first should be the one to raise an error. Go figure.

In fails(), I first force the tail to be an empty string rather than None. After I fail to append a form feed to the tail, I try again by legalizing the form feed, which converts it to a line break. But the tail has become None again, so I get a TypeError for trying to append a line break to None.

In works(), I force the tail to be an empty string and also store the tail's value in a variable. Once appending the form feed fails, I legalize the form feed and append the safe line break to the variable. Then I just set the tail to the variable's value and all seems OK.

Just something I wanted to share in case anyone else is spending their night debugging stuff like this. Now it's time to head to the bar - where I'll no doubt be checking the tail (no pun intended) of our server's log from my phone while I try to run our code again. Hopefully, this time without incurring the aforementioned ValueError or TypeError.

    #!/usr/bin/env python3

    import unicodedata
    from lxml import etree


    # our legalizing function.
    def _legalize_xml_text(xtext):
        """ Returns a copy of @xtext in which vertical tabs, form feeds, and
        carriage returns are replaced with line breaks and all remaining control
        characters except line breaks and tabs are removed. This is so that the
        result can be written to XML without raising a ValueError.

        Args:
            - xtext (str): The text to alter.

        Returns:
            str: The return value.
        """

        # legalize @xtext.
        for ws in ["\f", "\r", "\v"]:
            xtext = xtext.replace(ws, "\n")
        xtext = "".join([char for char in xtext
                         if unicodedata.category(char)[0] != "C"
                         or char in ("\t", "\n")])

        return xtext


    # create an element.
    root = etree.Element("root")
    print("It is {} that root.tail is None.".format(root.tail is None))

    # a wicked form feed to append to root's tail.
    ILLEGAL = "\f"


    def fails():
        root.tail = ""  # make it a string.
        try:
            print("The current tail is: {}".format(repr(root.tail)))  # still a string.
            root.tail += ILLEGAL
        except ValueError:
            print("The current tail is: {}".format(repr(root.tail)))  # it's None again!
            root.tail += _legalize_xml_text(ILLEGAL)
            # raises:
            # TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'


    def works():
        root.tail = ""  # make it a string.
        temp_tail = root.tail  # store the current tail value in a var.
        try:
            print("The current tail is: {}".format(repr(root.tail)))
            root.tail += ILLEGAL
        except ValueError:
            temp_tail += _legalize_xml_text(ILLEGAL)
            root.tail = temp_tail  # now replace the tail with the var's value.
            print("The current tail is: {}".format(repr(root.tail)))

Now, if I run the code I get this:

    It is True that root.tail is None.
    >>> fails()
    The current tail is: ''
    The current tail is: None
    Traceback (most recent call last):
      File "", line 42, in fails
        root.tail += ILLEGAL
      File "src\lxml\lxml.etree.pyx", line 1048, in lxml.etree._Element.tail.__set__ (src\lxml\lxml.etree.c:53346)
      File "src\lxml\apihelpers.pxi", line 728, in lxml.etree._setTailText (src\lxml\lxml.etree.c:24538)
      File "src\lxml\apihelpers.pxi", line 703, in lxml.etree._createTextNode (src\lxml\lxml.etree.c:24267)
      File "src\lxml\apihelpers.pxi", line 1443, in lxml.etree._utf8 (src\lxml\lxml.etree.c:31486)
    ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<pyshell#0>", line 1, in <module>
        fails()
      File "", line 45, in fails
        root.tail += _legalize_xml_text(ILLEGAL)
    TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'
    >>> works()
    The current tail is: ''
    The current tail is: '\n'
    ...

Update, March 8, 2018: I didn't want to poke around too much in the lxml source code, but it looks like the behavior emanates here: [] - maybe around lines 189 - 198? It seems that if lxml fails to write the tail (or, I'm guessing, the text) property, it just blows the existing value away, hence the return to None.
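The works() pattern can be folded into a small helper that reads the current value before attempting the write, so a failed assignment cannot lose it. The following is only a sketch under stated assumptions: safe_append() and StandInElement are hypothetical names, and StandInElement is not lxml. It merely imitates the reset-to-None behaviour described in this post (and assumes tab, line break, and carriage return are the only accepted control characters) so that the snippet runs without lxml installed. The legalize() function repeats the logic of _legalize_xml_text().

```python
import unicodedata


def legalize(text):
    """Same logic as _legalize_xml_text() in the post."""
    for ws in ("\f", "\r", "\v"):
        text = text.replace(ws, "\n")
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in ("\t", "\n")
    )


class StandInElement:
    """NOT lxml: a stand-in that imitates the behaviour seen in the post."""

    def __init__(self):
        self._tail = None

    @property
    def tail(self):
        return self._tail

    @tail.setter
    def tail(self, value):
        # Assumption: drop the old value first, then validate, which
        # reproduces the "tail is None again" symptom on failure.
        self._tail = None
        if any(ord(ch) < 32 and ch not in "\t\n\r" for ch in value):
            raise ValueError("All strings must be XML compatible")
        self._tail = value


def safe_append(element, attr, extra):
    """Append `extra` to element.<attr> ("text" or "tail"); hypothetical helper."""
    # Capture the current value up front. None becomes "", and this copy
    # survives even if the failed assignment below resets the attribute.
    current = getattr(element, attr) or ""
    try:
        setattr(element, attr, current + extra)
    except ValueError:
        setattr(element, attr, current + legalize(extra))


element = StandInElement()
safe_append(element, "tail", "abc")  # tail starts as None, handled by `or ""`
safe_append(element, "tail", "\f")   # rejected once, then retried legalized
print(repr(element.tail))
```

With a real lxml element the call would look the same, for example safe_append(root, "tail", ILLEGAL), though that usage is untested here. The design choice is simply to never rely on the attribute's value after a failed write.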