blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘XML’ Category

pretty printing XML with Python, lxml, and XSLT

Last week or so I was doing some work with Python and lxml. And, like a lot of people it seems, I found that lxml's pretty printing wasn't really doing anything for me.

I couldn't find any native lxml solutions to make my XML look pretty. All I found were some functions on various code sites written by people to pretty print the XML using a bunch of regular expressions. Yuck.

So I thought, "Why not use XSLT to pretty print my XML?" and I found an XSL written by none other than Michael Kay on this page (see comment #4).

And it seems to work just fine as a function to return pretty XML, not to mention it's super short and sweet.

Anyway, here's an example of using the XSL for pretty printing.

from lxml import etree

def prettify(someXML):
  #for more on lxml/XSLT see: http://lxml.de/xpathxslt.html#xslt-result-objects
  xslt_tree = etree.XML('''\
    <!-- XSLT taken from Comment 4 by Michael Kay found here:
    http://www.dpawson.co.uk/xsl/sect2/pretty.html#d8621e19 -->
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
      <xsl:strip-space elements="*"/>
      <xsl:template match="/">
        <xsl:copy-of select="."/>
      </xsl:template>
    </xsl:stylesheet>''')
  transform = etree.XSLT(xslt_tree)
  result = transform(someXML)
  return str(result) #on Python 2, use unicode(result) instead

myXML = etree.XML('<a><b><c><d/></c></b></a>')
print(prettify(myXML))

The example above would output the following:

>>>
<?xml version="1.0"?>
<a>
  <b>
    <c>
      <d/>
    </c>
  </b>
</a>
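
For what it's worth, one commonly cited reason lxml's own pretty printing appears to do nothing is that the parsed tree still holds whitespace-only text nodes, and the serializer leaves elements with text content alone. The sketch below is an alternative to the XSLT route, not a replacement for it. It assumes the input is a string, and the helper name prettify_native is made up for illustration.

```python
from lxml import etree

def prettify_native(xml_text):
    # hypothetical helper: drop whitespace-only text nodes while parsing,
    # so that the serializer is free to add its own indentation
    parser = etree.XMLParser(remove_blank_text=True)
    root = etree.fromstring(xml_text, parser)
    return etree.tostring(root, pretty_print=True, encoding='unicode')

# an input that already contains stray whitespace between elements
messy = '<a> <b>\n<c><d/></c> </b>\n</a>'
print(prettify_native(messy))
```

Unlike the XSLT version, this returns the markup without an XML declaration.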

By the way, I don't even need to see the XML I'm processing most of the time, so why all the pretty printing fuss?

Well, because it bothers me …

And all good XML should look like an X-wing starfighter. If it doesn't, you're probably doing something wrong or your schema just sucks.

It isn't called an X-wing for no reason.

:P

--------------

Written by nitin

November 12th, 2011 at 11:05 am

Posted in XML

indexing and searching timed text with Solr

I'm still learning about Solr so maybe this post is much ado about nothing. But according to this nabble.com thread, one can't index a source XML document in Solr with its native XML structure intact and then in turn search that structure as one can in an XML database like BaseX.

For most things, that's fine. I mean for indexing titles, creators, and descriptions, etc. I just need to index the value of a given element like <title> so that I can search for that element's value.

But for timed text, it's different. Or at least, it can be.

Say I have this DFXP snippet for an audio file with an "id" value of "XYZ".

<p begin="10.0s" end="30.0s">Hello world!</p>

I would need the user to search for the string "Hello world!" or part of it but I would also need to index at least the value of the "begin" attribute so that I can pass that to a page that will play the file "XYZ" starting at the 10 second mark – if the user clicks on the "Hello world!" line in their search result. And I don't want the "10" second value to be something they search against since they might be searching for the string "10" within the text itself.

So I'm wondering how to do that with Solr.

Maybe when I learn more I'll discover a better way to do this, but for now I'm thinking I could do the following:

First, I would pretty much index the timed text twice in Solr.

<doc>
  <field name="id">XYZ</field>
...
  <field name="timedText-stripped">Hello world!</field>
  <field name="timedText">Hello World! {10}</field>
</doc>

After indexing the "id" of the audio file this would index:

  • just the text "Hello world!"
  • the text of "Hello world!" with the "begin" attribute value in curly braces.

I guess this way the user could be made to search across the "timedText-stripped" field but, via the XSL that can be passed to Solr to display results, the "timedText" field could be displayed in a manner that would make the text "Hello World!" link to whatever page will play file "XYZ" starting at the 10 second mark. Basically, by planting the "begin" value in curly braces, I can parse the string for the text and the "begin" value as separate things.

So, here's a really crappy XSL snippet that would do something like that. It assumes a variable "$id" exists that equals "XYZ", the identifier for the example audio file.

<xsl:for-each select="//field[@name='timedText']">
  <xsl:variable name="whole">
    <xsl:value-of select="."/>
    <!-- Gets entire element string -->
  </xsl:variable>
  <xsl:variable name="text">
    <xsl:value-of select="substring-before($whole,'{')"/>
    <!-- Gets text prior to seconds -->
  </xsl:variable>
  <xsl:variable name="begin">
    <xsl:value-of select="substring-before(substring-after($whole,'{'),'}')"/>
    <!-- Gets seconds value from end of string -->
  </xsl:variable>
  <a href="someMediaPlayer.php?id={$id}&amp;begin={$begin}">
    <xsl:value-of select="$text"/>
  </a>
  <!-- So, I'm saying that
  "someMediaPlayer.php?id=XYZ&begin=10"
  would launch a player that would start file XYZ at the 10 second mark.
  -->
</xsl:for-each>

The search output would be some HTML code like so:

<a href="someMediaPlayer.php?id=XYZ&amp;begin=10">Hello World!</a>
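
The same string splitting can be sketched outside of XSL. This is only a rough Python equivalent of the substring-before/substring-after logic, assuming the "timedText" value always ends with the "begin" value in curly braces. The function name timed_text_link is made up for illustration.

```python
from html import escape
from urllib.parse import urlencode

def timed_text_link(whole, file_id):
    # split "Hello World! {10}" into the text and the "begin" value
    text, _, rest = whole.partition('{')
    begin = rest.partition('}')[0]
    query = urlencode({'id': file_id, 'begin': begin})
    # escape() turns the "&" in the query string into "&amp;" for HTML
    return '<a href="someMediaPlayer.php?{}">{}</a>'.format(
        escape(query), escape(text.strip()))

print(timed_text_link('Hello World! {10}', 'XYZ'))
```

One small difference from the XSL: the text is stripped here, so the space before the opening brace does not end up inside the link.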

It seems weird to index something twice, more or less, but as user Erick says in the nabble.com thread, "You've gotta take off your DB hat and not worry about duplicating data."

But now as I write this, I'm wondering if I can't just index as follows:

  <field name="text">Hello world!</field>
  <field name="begin">10</field>

and trust that for each "text" field there will be a matching "begin" field, so that the two can just be used in tandem to create the same HTML link as above. Sounds like I need to play around some more.

:)

Update, September 6, 2012: I wrote a related post to this yesterday in terms of searching across timed text with MySQL and in doing so I realized that the way I was thinking of doing it in Solr was off. Rather than doing it the way I outlined in the original post content (above) in which I was thinking to index all the timed text for a given recording in one Solr "doc" element, I think it makes much more sense to index each line in its own "doc" element as such:

<doc>
  <field name="id">someMediaPlayer.php?source=someFile.mp3&amp;begin=10&amp;end=30</field>
  ...
  <field name="startTime">10</field>
  <field name="stopTime">30</field> 
  <field name="timedText">Hello world!</field>
  <field name="source">someFile.mp3</field> 
</doc>

That way there's no need to post-parse any data fields to get the start and stop times. Moreover, rather than constructing the URL to launch that segment of audio, you can just put the URL directly in the "id" field. You can always use Solr's built-in support for facets to facet on the "source" field or on some descriptive metadata like "title".
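
Here is a minimal sketch of how the one-"doc"-per-line idea might be generated from the DFXP, using Python's standard library. It assumes the begin and end values are in the plain seconds form from the earlier snippet ("10.0s") and that the <p> elements carry no namespace, which real DFXP files normally do. The function names are made up, and the <add> wrapper is the usual envelope for Solr's XML update messages.

```python
import xml.etree.ElementTree as ET

def seconds(clock):
    # "10.0s" -> "10"; only the plain seconds form from the example is handled
    return '{:g}'.format(float(clock.rstrip('s')))

def solr_docs(dfxp_lines, source):
    # build one <doc> per timed text line
    add = ET.Element('add')
    for p in dfxp_lines:
        begin, end = seconds(p.get('begin')), seconds(p.get('end'))
        fields = [
            ('id', 'someMediaPlayer.php?source={}&begin={}&end={}'.format(
                source, begin, end)),
            ('startTime', begin),
            ('stopTime', end),
            ('timedText', p.text),
            ('source', source),
        ]
        doc = ET.SubElement(add, 'doc')
        for name, value in fields:
            ET.SubElement(doc, 'field', name=name).text = value
    return add

snippet = ET.fromstring(
    '<div><p begin="10.0s" end="30.0s">Hello world!</p></div>')
docs = solr_docs(snippet.iter('p'), 'someFile.mp3')
print(ET.tostring(docs, encoding='unicode'))
```

The serializer takes care of writing the "&" characters in the "id" value as "&amp;".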

I'll file the original post under the "thinking out loud yet poorly" category.

--------------

Written by nitin

October 16th, 2011 at 10:54 am

learning about XProc on a Sunday morning

There are some cool PowerPoint slides on the xfront.com page about XProc, which I didn't know anything about until today.

I like the idea of a one-stop-shop for all kinds of XML processing, but I think unless I had a specific need to use it I'd probably use a Python script or something to sequentially do some batch XML work on a given document. That kind of ad hoc scripting is exactly what XProc is meant to replace, but I guess it all depends on one's needs. I should certainly think about it in terms of doing things with MusicXML though.

Anyway, I've only been through one slide deck – and it's long at about 170 slides – but I found it well done and easy to understand.

Also, there's a list of XProc implementations here – Java, Java, Java …

Apparently, there used to be a Python implementation on GitHub, but it's pulling a 404. Bummer. Well, at least GitHub's 404 message is a cool homage to Star Wars!

Lastly, this daisy-pipeline for Daisy talking books looks interesting, too.

So is this post just a fancy way for me to save bookmarks for my future use or what?

:P

    Written by nitin

    August 28th, 2011 at 10:28 am

    Posted in XML

    fun with lxml, part 2

    Just following up on a previous post from about a month ago …

    Per a request, I need to tweak some software of mine to allow a user to specify a parent element in an XML document and in turn retrieve child element values. Big deal. That's what XSLT is for – blah, blah, blah. But this is particularly for PubMed XML exports and turning those into Excel files.

    Anyway, the user needs to be able to specify a given child element (i.e. by position) whose value gets placed into an Excel cell. Alternatively, all the child values need to be able to be placed into one cell, separated by a delimiter.

    So before I try and tinker with the software I want to work a solution out using test code:

    from lxml import etree
    
    ##### Step 1
        # make an XML example
    xml = '<a>  \
                <b>  \
                    <c>cee1</c>  \
                    <d>dee1</d>  \
                    <c>cee2</c>  \
                    <d>dee2</d>  \
                </b>  \
                <b>bee</b>  \
                <c>cee3</c> \
            </a>'
    
    ##### Step 2
        # parse the XML example
    parseXML = etree.XML(xml)
    
    ##### Step 3
        # get the first (i.e. the Zero-th) <b> element from the list of all <b> elements
    b_list = parseXML.findall('.//b')[0]
    
    ##### Step 4
        # get a list of all the children in that first <b> element
        # (list(element) replaces the deprecated getchildren() method)
    b_childList = list(b_list)
    
    ##### Step 5
        # make a new list called "c_list" with only <c> elements
        # that are children of our first <b> element
    
    c_list = [] # make an empty list to put things in and
    # place into that list only element *values* for child elements
    # of first <b> element from children that are <c> elements only
    for child in b_childList:
        if child.tag == 'c':
            c_list.append(child.text)
    
    ##### Step 6
        # print desired results
    
    for c in c_list: #print all values, one per line
        print (c)
    
    print ('-'*4) # print dash line for reading ease
    print ('; '.join(c_list)) # print all values on one line with delimiter
    
    print ('-'*4)
    print (c_list[1]) #print only the second <c> element value
    
    

    Here are the results:
    >>>
    cee1
    cee2
    ----
    cee1; cee2
    ----
    cee2
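
As a possible shortcut, steps 3 through 5 can be folded into a single path expression. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without installing anything. lxml's findall() should accept the same path, but treat that as an assumption to verify.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<a><b><c>cee1</c><d>dee1</d><c>cee2</c><d>dee2</d></b>'
    '<b>bee</b><c>cee3</c></a>')

# "./b[1]/c" means: the <c> children of the first <b> child of the root
c_values = [c.text for c in root.findall('./b[1]/c')]

print('; '.join(c_values))  # all values on one line with a delimiter
print(c_values[1])          # only the second <c> element value
```

Note that positions inside the path start at 1, while the Python list index still starts at 0.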

    --------------

    Written by nitin

    April 9th, 2011 at 10:56 am

    Posted in XML

    fun with lxml

    First off, I don't consider myself a programmer. I just know enough to dabble even though I try and learn new stuff all the time in the hope that I – as someone in digital libraries – can occasionally write something that can serve the needs of others, rather than serving my ego. Don't get me started on people who try and write software that has no utility other than patting themselves on the back …

    Anyway, that's another post for another time.

    So, the other day I got some questions/feature requests for PubMed2XL and so I started thinking about ways to tackle a few of the issues. It kinda makes me feel like a real programmer when people in the real world are asking about the software – but only for a few minutes before I make myself come back down to earth.

    :/

    Currently, the software places into a spreadsheet cell the value of one XML element, the position of which is defined by the user in the setup file. But there may potentially be a need it seems to be able to concatenate ALL the values for a given element into one spreadsheet cell. So I wrote a little function to help me get started with that.

    The code uses this simple restaurant-based XML file from W3Schools and uses the awesome lxml Python library.

    When run, it yields the following:

    >>>
    Calories for the first entree:
    650
    Calories for all entrees:
    650; 900; 900; 600; 950

    And here's the code:

    #import required modules (lxml is non-standard; it likely needs to be installed)
    from urllib.request import urlopen #makes it easy to read documents from the web!
                                       #(on Python 2, use urllib.urlopen instead)
    from lxml import etree #great XML parser and more!
                           #see: http://lxml.de/
    
    #retrieve values from an XML file
    def ElementCherryPicker(xpathArg, positionArg):
        '''
        This places all the element values for the element passed as the
        "xpathArg" argument into a list called "elementBox". It then returns
        the list item preceding the one specified by the "positionArg" argument.
        This means passing a "1" equates to the first item in the list instead
        of the traditional "0". If "0" is passed then the entire list will be
        returned as a string with a delimiter of '; '.
        '''
        positionArg = positionArg - 1
        elements = parseUrl.findall(xpathArg) #make list of all matching elements
        elementBox = [] #create empty list
        for element in elements:
            elementBox.append(element.text) #place element values into the list
        if positionArg != -1:
            try:
                elementBox = elementBox[positionArg]
            except IndexError:
                elementBox = [] #if no element at stated position exists,
                                #then make the list empty again
        else:
            delimiter = '; '
            elementBox = delimiter.join(elementBox)
        return elementBox
    
    #define, open, read, and parse an XML file
    url= 'http://www.w3schools.com/xml/simple.xml'
    readUrl = urlopen(url).read()
    parseUrl = etree.XML(readUrl)
    
    #print header and the values returned from ElementCherryPicker()
    print('Calories for the first entree:')
    print(ElementCherryPicker('.//calories', 1))
    print('Calories for all entrees:')
    print(ElementCherryPicker('.//calories', 0))
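
For trying the idea out without a network connection, here is a hedged variant that takes the parsed tree as an argument instead of relying on a global. It uses the standard library's xml.etree.ElementTree instead of lxml, the name cherry_pick is made up, and the two-item menu is a shortened stand-in for the W3Schools file, not a copy of it.

```python
import xml.etree.ElementTree as ET

def cherry_pick(root, path, position=0, delimiter='; '):
    # hypothetical variant of ElementCherryPicker() that takes the tree as an argument
    values = [el.text or '' for el in root.findall(path)]
    if position == 0:
        return delimiter.join(values)   # 0 means "all values in one string"
    if 1 <= position <= len(values):
        return values[position - 1]     # 1 means the first match, and so on
    return ''                           # nothing at that position

# made-up sample data for illustration only
menu = ET.fromstring(
    '<breakfast_menu>'
    '<food><name>Waffles</name><calories>650</calories></food>'
    '<food><name>Toast</name><calories>600</calories></food>'
    '</breakfast_menu>')

print(cherry_pick(menu, './/calories', 1))
print(cherry_pick(menu, './/calories', 0))
```

This version returns an empty string rather than an empty list when nothing exists at the requested position, so the caller always gets a string back to drop into a spreadsheet cell.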
    
    
    	
    
    	
    --------------

    Written by nitin

    March 7th, 2011 at 11:13 am

    Posted in scripts,XML

    of ADLs and SMIL and stuff

    Even more than usual – this post is me thinking out loud. So some of the stuff at the bottom might not make sense since it refers to some software of mine that really only I use.

    This morning I played around a little with Kino, an open-source video editor, and Adobe Audition, Adobe's flagship audio editor – which is based on their acquisition of Cool Edit.

    The reason I wanted to play around with Kino is because it can export the project timeline to SMIL. I was mainly interested in seeing if it could be used as a pseudo audio editor – the idea being it could be a quick and dirty SMIL exporter. Well, it doesn't seem to support importing audio formats. I couldn't get it to import WAV or OGG files. It's still a cool application though.

    The session exports from Audition are, as expected, pretty dense. For people like me who work in libraries there are issues involved in terms of setting limits for how much can and should be done in digital audio "preservation" (funny, I don't remember ordering jam and bread …). Well, at least I think there need to be limits, lest libraries want to start being creators, too, and admit that in doing so they are donating material of their own editorial designs back onto themselves. Anyway, by imposing limits I'm not sure XML session exports of thousands of lines for simple edits are a good idea.

    I'd like to see other session formats without downloading demos for all kinds of audio editing software (some more expensive packages don't even seem to offer demos). For a small fee, there's always AATranslator.

    But getting back to SMIL, I'm wondering how to use it in conjunction with AudioRegent without writing more code into the application – for now.

    It would seem pretty easy to create a SMIL to SimpleADL XSLT and set up a chain to create derivative files.

    Specifically, say I have a source file called source.wav. And I have two SMIL files as such:

    source-1.smil.xml

    <?xml version="1.0"?>
    <smil xmlns="http://www.w3.org/2001/SMIL20/Language">
      <body>
        <seq>
          <audio src="source.wav" clipBegin="00:00:00.000" clipEnd="00:00:30.000"/>
        </seq>
      </body>
    </smil>
    

    and source-2.smil.xml

    <?xml version="1.0"?>
    <smil xmlns="http://www.w3.org/2001/SMIL20/Language">
      <body>
        <seq>
          <audio src="source.wav" clipBegin="00:00:30.000" clipEnd="00:00:50.000"/>
        </seq>
        <seq>
          <audio src="source.wav" clipBegin="00:01:00.000" clipEnd="00:02:00.000"/>
        </seq>
      </body>
    </smil>
    

    For both, the assumption is that two clips are to be made from source.wav: source-1 and source-2.

    All I'd need to do is then set up a chain as such:

    1. Do source-1.smil.xml to temp.adl.xml via XSLT.
    2. Have AudioRegent make source.ogg by pointing it, via the command line options, to the source file, source.wav, and the SimpleADL file, temp.adl.xml.
    3. Rename source.ogg to source-1.ogg – i.e. with the same prefix as the corresponding SMIL file.
    4. Do source-2.smil.xml to temp.adl.xml via XSLT, overwriting temp.adl.xml.
    5. Have AudioRegent make source.ogg by pointing it, via the command line options, to the source file, source.wav, and the SimpleADL file, temp.adl.xml.
    6. Rename source.ogg to source-2.ogg – i.e. with the same prefix as the corresponding SMIL file.

    Here's what temp.adl.xml would look like initially (step 1):

    <?xml version="1.0" encoding="UTF-8"?>
    <audioDecisionList filename="source.wav">
      <region id="_01">
        <in unit="seconds">0</in>
        <duration unit="seconds">30</duration>
      </region>
      <outputAsTracks>false</outputAsTracks>
    </audioDecisionList>
    

    And then it would look like this during the second pass (step 4):

    <?xml version="1.0" encoding="UTF-8"?>
    <audioDecisionList filename="source.wav">
      <region id="_01">
        <in unit="seconds">30</in>
        <duration unit="seconds">20</duration>
      </region>
      <region id="_02">
        <in unit="seconds">60</in>
        <duration unit="seconds">60</duration>
      </region>
      <outputAsTracks>false</outputAsTracks>
    </audioDecisionList>
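
The conversion itself would be XSLT in the chain described above. Purely as a stand-in to show the arithmetic, here is a rough Python sketch of the same SMIL to SimpleADL mapping using the standard library. It assumes clip times in plain HH:MM:SS.mmm form and that every clip points at the same source file. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

SMIL_NS = '{http://www.w3.org/2001/SMIL20/Language}'

def to_seconds(clock):
    # "00:01:00.000" -> 60.0
    hours, minutes, secs = clock.split(':')
    return int(hours) * 3600 + int(minutes) * 60 + float(secs)

def smil_to_adl(smil_text):
    smil = ET.fromstring(smil_text)
    clips = list(smil.iter(SMIL_NS + 'audio'))
    # assumption: all clips share one source file, so the first src is used
    adl = ET.Element('audioDecisionList', filename=clips[0].get('src'))
    for number, clip in enumerate(clips, start=1):
        start = to_seconds(clip.get('clipBegin'))
        stop = to_seconds(clip.get('clipEnd'))
        region = ET.SubElement(adl, 'region', id='_{:02d}'.format(number))
        ET.SubElement(region, 'in', unit='seconds').text = '{:g}'.format(start)
        # SimpleADL wants a duration, so subtract the start from the end
        ET.SubElement(region, 'duration', unit='seconds').text = '{:g}'.format(stop - start)
    ET.SubElement(adl, 'outputAsTracks').text = 'false'
    return adl

smil = '''<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <body>
    <seq><audio src="source.wav" clipBegin="00:00:30.000" clipEnd="00:00:50.000"/></seq>
    <seq><audio src="source.wav" clipBegin="00:01:00.000" clipEnd="00:02:00.000"/></seq>
  </body>
</smil>'''
print(ET.tostring(smil_to_adl(smil), encoding='unicode'))
```

The key step is that SMIL stores a begin and an end while SimpleADL stores a start and a duration, so each duration is the clip end minus the clip begin.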

    By the way, since the SimpleADL files are temporary, I don't see why – rather than converting time format to seconds – I couldn't just use something like this:

    <in unit="time">00:01:00.000</in>
    <duration unit="time">00:01:00.000</duration>

    or something …

    --------------

    Written by nitin

    January 9th, 2011 at 12:02 pm

    MXMLiszt version 0.9.1 released

    I've made some minor changes to MXMLiszt to address a bug that began to appear after months of trouble-free performance.

    So here are the changes I made to address the issue related to the display of MODS metadata:

    • Created mods.css file to display MODS on a transparent background.
    • Changed displayMODS.php to display MODS files via an <iframe>. The previous version was using the mods.xsl stylesheet to parse the MODS element values in real-time.

    You can read the documentation and download the source code for version 0.9.1 here.

    --------------

    Written by nitin

    October 2nd, 2010 at 12:08 pm

    Posted in music notation,scripts,XML

    PubMed to Excel: PubMed2XL version 0.9

    4 comments

    I've released the first Beta version of PubMed2XL, a Windows application that converts article lists from pubmed.gov into Microsoft Excel files.

    If you'd like to use the software you can download it. Yes, it's free.

    :P

    Here's a little video tutorial on installing and using the software:

    PubMed2XL: Basic Installation and Use from nitin arora on Vimeo.

    PubMed2XL's documentation is available at: blog.humaneguitarist.org/projects/pubmed2xl/.

    The documentation includes a download link to the program files.

    --------------

    Written by nitin

    September 19th, 2010 at 7:03 pm

    Posted in scripts,XML

    MXMLiszt release 0.9.0

    4 comments

    MXMLiszt version 0.9.0 is now available for download.

    MXMLiszt is a web-based delivery and search/retrieval environment for MusicXML files and their manifestations.

    MXMLiszt was created in order to complete a Master’s in Library and Information Science at the University of Alabama under the direction of Dr. Steven L. MacCall.

    The documentation and source-code download links are available here.

    The accompanying research paper, “Beyond Images: Encoding Music for Access and Retrieval” can be accessed here.

    As of June, 2010 the live demo of MXMLiszt can be accessed at:

    http://opensourcelibrarian.org/MXMLiszt

    --------------

    Written by nitin

    June 13th, 2010 at 6:32 pm

    Posted in music notation,scripts,XML

    segmenting audio with AudioRegent, SoX and XML

    For some reason I feel obligated to point out that I haven’t blogged in a while for a few reasons:

    1. Christmas break from school/work at the University of Alabama
    2. the desire not to blog for the sake of blogging
    3. and …

    I’ve been working on something huge – at least for me. It’s a piece of software called AudioRegent that harnesses XML to create derivative "clips" of regions within WAV audio files. A region is simply a user-defined segment within an audio file, like a track on a Compact Disc.

    Besides writing the program in Python, which I pretty much finished in December, I also had to develop the XML format that AudioRegent reads, which I call SimpleADL (Simple Audio Decision List). AudioRegent then makes the derivative audio clips by leveraging SoX, the Sound Exchange command line audio editor. AudioRegent and SimpleADL can also be used to sync audio to text, like transcripts.

    Actually, the programming and devising SimpleADL were the easy part. The hard stuff was the documentation and deciding on a license for the software.

    I tried to find a balance in documenting the software: being thorough without writing a novel. I’m not sure I succeeded, but I can always improve it with time.

    I used the W3C’s Amaya editor to write the documentation in XHTML. Sure, you can use OpenOffice to export a document to XHTML, but man is it bloated and messy. Amaya writes really clean XHTML.

    As for the license, I chose the BSD license. As I understand it, this allows one to use the source code at will in future open or closed-source applications as long as you maintain the credits for AudioRegent. I was tempted to use the Mozilla Public License (MPL) which, again from what I can tell, is similar to the BSD license except that any source derived from AudioRegent would have to stay open-source though any peripheral code can be closed-source. I absolutely decided against the GNU General Public License which is viral and imposes its philosophy perpetually on all subsequent code, even peripheral code. Some have even argued that it works against its own objectives and is less "open" than the MPL.

    Now I realize that, practically speaking, a skilled programmer could write better code from scratch in 30 minutes as opposed to the 30 or so hours I needed, but I wanted to go about this quasi-professionally. And I learned more about licensing, which was cool.

    Anyway, rather than try and explain the software itself and how to get it, I’d be better off pointing you to the documentation if you have any interest …

    --------------

    Written by nitin

    January 16th, 2010 at 2:05 pm
