blog.humaneguitarist.org
fun with lxml, part 2
[Sat, 09 Apr 2011 15:56:34 +0000]
Just following up on a previous post [http://blog.humaneguitarist.org/2011/03/07/fun-with-libxml/] from about a month ago ...
Per a request, I need to tweak some software of mine to allow a user to specify a parent element in an XML document and in turn retrieve child element values. Big deal. That's what XSLT is for - blah, blah, blah. But this is particularly for PubMed XML exports and turning those into Excel files [http://blog.humaneguitarist.org/projects/pubmed2xl/].
Anyway, the user needs to be able to pick the value of a given child element (i.e. by position) and have it placed into an Excel cell. Alternatively, all the child values need to go into one cell, separated by a delimiter.
So before I try and tinker with the software I want to work a solution out using test code:
from lxml import etree
##### Step 1
# make an XML example
xml = '<a> \
<b> \
<c>cee1</c> \
<d>dee1</d> \
<c>cee2</c> \
<d>dee2</d> \
</b> \
<b>bee</b> \
<c>cee3</c> \
</a>'
##### Step 2
# parse the XML example
parseXML = etree.XML(xml)
##### Step 3
# get the first (i.e. the zero-th) <b> element
# (despite its name, b_list holds a single element, not a list)
b_list = parseXML.findall('.//b')[0]
##### Step 4
# get a list of all the children in that first <b> element
# (list(element) is used because lxml marks getchildren() as deprecated)
b_childList = list(b_list)
##### Step 5
# make a new list called "c_list" with only <c> elements
# that are children of our first <b> element
c_list = [] # make an empty list to put things in and
# place into that list only element *values* for child elements
# of first <b> element from children that are <c> elements only
for child in b_childList:
    if child.tag == 'c':
        c_list.append(child.text)
##### Step 6
# print desired results
for c in c_list: # print all values, one per line
    print(c)
print('-'*4) # print dash line for reading ease
print('; '.join(c_list)) # print all values on one line with delimiter
print('-'*4)
print(c_list[1]) # print only the second <c> element value
Here are the results:
>>>
cee1
cee2
----
cee1; cee2
----
cee2
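For what it's worth, the steps in the test code could be folded into a couple of small functions before going into the real software. This is only a rough sketch, not what pubmed2xl actually does: the names child_values and cell_value, the parent_index and position parameters, and the choice to return an empty string for an out-of-range position are all made up for illustration. It also falls back to the standard library's ElementTree if lxml isn't installed, on the assumption that the two behave the same for this simple case.

```python
try:
    from lxml import etree  # the library used in this post
except ImportError:
    # assumption: the standard library's ElementTree is close enough for this sketch
    import xml.etree.ElementTree as etree

# same structure as the XML example in the test code, minus the whitespace
XML = (
    '<a>'
    '<b><c>cee1</c><d>dee1</d><c>cee2</c><d>dee2</d></b>'
    '<b>bee</b>'
    '<c>cee3</c>'
    '</a>'
)


def child_values(root, parent_tag, child_tag, parent_index=0):
    """Return the text of each direct <child_tag> child of the chosen <parent_tag>."""
    parents = root.findall('.//' + parent_tag)
    if parent_index >= len(parents):
        return []
    # findall() with a bare tag name matches direct children only
    return [child.text for child in parents[parent_index].findall(child_tag)]


def cell_value(values, position=None, delimiter='; '):
    """Return one value by zero-based position, or all values joined by the delimiter."""
    if position is None:
        return delimiter.join(value or '' for value in values)
    if 0 <= position < len(values):
        return values[position]
    return ''  # hypothetical choice: an out-of-range position gives an empty cell


root = etree.XML(XML)
c_values = child_values(root, 'b', 'c')
print(cell_value(c_values))              # cee1; cee2
print(cell_value(c_values, position=1))  # cee2
```

Using findall() with just the child tag name skips the manual loop over every child and the tag comparison, since it only matches direct children with that tag.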