blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘XML’ Category

pixelation: custom XSLT functions with Python and lxml

I'll be brief.

Because the Python "lxml" module doesn't support XSLT 2.0 functions, I was looking at support for EXSLT

… but then stumbled on how to write my own functions and call them from stylesheets.

Freakin' cool.

I like calling it "pxslt" for "Python XSLT" and pronouncing it like "pixelate".

:P

Example below of the "module" I made, the script that calls it, and the results.

Told you I'd be brief.

Module:

#pxslt.py

def underscore(context, word):
  '''Replace whitespace with underscore.'''
  out = word[0].replace(' ', '_')
  return out

def multiply(context, int_val, int2_val):
  '''Multiply two integers.'''
  int_val, int2_val = int(int_val[0]), int(int2_val[0])
  return int_val * int2_val

def libraryThing(context, isbn):
  '''Get language for a work based on ISBN using LibraryThing API.'''
  isbn = isbn[0]
  import urllib
  res = urllib.urlopen('http://www.librarything.com/api/thingLang.php?isbn=' + isbn)
  res_r = res.read()
  return res_r

##### DO NOT EDIT
##### makes it possible to call the above functions with XSLT
def pxslt():
  myFunctions = []
  gbs = globals()
  from inspect import isfunction
  for gb in gbs:
    if isfunction(gbs[gb]) and gb != 'pxslt':
      #print gb
      myFunctions.append(gbs[gb])

  from lxml import etree
  #see: http://lxml.de/extensions.html
  ns = etree.FunctionNamespace('file://libs/pxslt.py')
  ns.prefix = 'pxslt' #same prefix the stylesheet below declares
  for myFunction in myFunctions:
    name = myFunction.__name__ #works in Python 2 and 3; func_name is Python 2 only
    ns[name] = myFunction
  return ns

Usage example:

from lxml import etree

#####
myXML = etree.XML('''\
<a>
  <b>Hello. This will appear with whitespaces replaced by underscores.</b>
  <c>3</c>
</a>''')

myXSL = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:pxslt="file://libs/pxslt.py">
  <xsl:output method="text" version="1.0" />
  <xsl:template match="a">
    <xsl:variable name="isbn">9955081260</xsl:variable>
    <xsl:value-of select="pxslt:libraryThing($isbn)" />
    <xsl:text>\n</xsl:text> <!-- Python will line break here -->
    <xsl:value-of select="pxslt:underscore(b/text())" />
    <xsl:text>\n</xsl:text> <!-- Python will line break here -->
    <xsl:call-template name="mathFunc">
    </xsl:call-template>
  </xsl:template>
  <xsl:template name="mathFunc">
    <xsl:variable name="myNum">10</xsl:variable>
    <xsl:value-of select="pxslt:multiply(c/text(), $myNum)" />
  </xsl:template>
</xsl:stylesheet>'''))

import pxslt
pxslt.pxslt() #get all set up with namespaces and function stuff

print(myXSL(myXML))

#myXSL_file = etree.XSLT(etree.parse('foo.xsl')) #for testing with a real XSL file
#print(myXSL_file(myXML))

Output:

>>>
lit
Hello._This_will_appear_with_whitespaces_replaced_by_underscores.
30

Written by nitin

November 2nd, 2012 at 5:28 pm

let’s fighting love: using Jinja templates with XSL

It's Friday and I should be out with a drink in my hand.

Instead, I have pinkeye.

Dammit.

Anyway, I've been finishing up a script at work that uses XSL and I have a way to pass some variables into the XSL code to augment what can be done. I've mentioned this a little before here, although now I use a prettier template-like syntax like this:

<xsl:variable name="baseURL" select="'{{$BASE_URI}}'" />
<xsl:variable name="URL_params" select="'{{$CURRENT_PARAMS}}'" />

But today I was wondering about eventually (i.e. not now) using Jinja2 with XSL, i.e. use its templating within XSL code. I really like what little I've done using Jinja templates with Google App Engine (Python).

So anyway, here's some test code. It seems very promising.

I'm sure it's been done before, but I'm just a little bored.

Hell, I named the XSL transformer function "gobot", m'kay?

;(

def jinjafy(xsl):
  from jinja2 import Template
  template = Template(xsl)
  result = template.render(name="John Doe")
  return result

def gobot(xml, xsl):
  from lxml import etree
  xml_tree = etree.XML(xml)
  xslt_tree = etree.XML(xsl)
  transform = etree.XSLT(xslt_tree)
  result = transform(xml_tree)
  return result
 
#####
myXML = ('''\
<a>
  <b>Hello </b>
</a>''')

myXSL = jinjafy('''\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" version="1.0" /> 
  <xsl:template match="a">
    <xsl:value-of select="b" />
    <xsl:value-of select="'{{ name }}'" />
    <xsl:text>!</xsl:text>
  </xsl:template>
</xsl:stylesheet>''')

print gobot(myXML, myXSL) #yields: "Hello John Doe!"

Here's a better example …

def jinjafy(xsl):
  from jinja2 import Template
  template = Template(xsl)
  result = template.render(template_values)
  return result

def gobot(xml, xsl):
  from lxml import etree
  xml_tree = etree.XML(xml)
  xslt_tree = etree.XML(xsl)
  transform = etree.XSLT(xslt_tree)
  result = transform(xml_tree)
  return result
 
#####
names = ["John Doe", "Jane Doe"]
template_values = {"names": names}

myXML = ('''\
<a>
  <b>Hello </b>
</a>''')

myXSL = jinjafy('''\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" version="1.0" />
  <xsl:template match="a">
    {% for name in names %}
      <xsl:value-of select="b" />
      <xsl:value-of select="'{{ name }}'" />
      <xsl:text>! </xsl:text>
    {% endfor %}
  </xsl:template>
</xsl:stylesheet>''')

print gobot(myXML, myXSL) #yields: "Hello John Doe! Hello Jane Doe!"

And while we're on the subject …

    Written by nitin

    October 26th, 2012 at 7:24 pm

    Posted in XML

    Python, lxml, and xsl:include

    Keeping this short because yes, dammit, I'm home sick.

    I needed/wanted to do some XSL transformations with Python using an <xsl:include> statement. But I kept getting errors along the lines of "lxml cannot resolve uri string".

    So anyway after deciding I didn't want to read through all the crap on the lxml site about this, I fumbled my way through to what appears to work.

    It seems the include statements work fine when I DO NOT read() the XSL file before using it for a transformation.

    In the interest of really keeping it short like I said, here's some code and the results below.

    from lxml import etree
                    
    def works(someXML):
      #don't even open() the XSL file ...
      xslt_tree = etree.parse(xslFile)
      transform = etree.XSLT(xslt_tree)
      result = transform(someXML)
      return result
    
    def also_works(someXML):
      #open() the XSL file, but don't read() it ...
      xsl_opened = open(xslFile, "r")
      xslt_tree = etree.parse(xsl_opened)
      transform = etree.XSLT(xslt_tree)
      result = transform(someXML)
      return result
    
    def fails(someXML):
      #open() and read() the XSL file ...
      xsl_opened = open(xslFile, "r")
      xsl_read = xsl_opened.read()
      xsl_parsed = etree.XML(xsl_read)
      transform = etree.XSLT(xsl_parsed)
      result = transform(someXML)
      return result
    
    #####
    xslFile = "b.xsl"
    
    myXML = etree.XML('''\
    <a>
      <b>b-val</b>
      <c>c-val</c>
      <d>d-val</d>
    </a>''')
    
    print "Trying works() ..."
    print works(myXML)
    
    print "Trying also_works() ..."
    print also_works(myXML)
    
    print "Trying fails() ..."
    print fails(myXML)
    

    Here's what the code spits out …

    Trying works() ...
    <?xml version="1.0" encoding="iso-8859-1"?>
    <div>
      <p>I'm from a.xsl.</p>
      <p>I'm from b.xsl.</p>
      <p>b-val c-val d-val</p>
    </div>

    Trying also_works() ...
    <?xml version="1.0" encoding="iso-8859-1"?>
    <div>
      <p>I'm from a.xsl.</p>
      <p>I'm from b.xsl.</p>
      <p>b-val c-val d-val</p>
    </div>

    Trying fails() ...

    Traceback (most recent call last):
      File "C:\Users\nitaro\Dropbox\lxml_include\inc.py", line 44, in <module>
        print fails(myXML)
      File "C:\Users\nitaro\Dropbox\lxml_include\inc.py", line 23, in fails
        style = etree.XSLT(xsl_parsed)
      File "xslt.pxi", line 399, in lxml.etree.XSLT.__init__ (src/lxml/lxml.etree.c:118852)
      File "lxml.etree.pyx", line 280, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:7959)
    XSLTParseError: Cannot resolve URI string://__STRING__XSLT__/a.xsl

    Oh and here are the XSL files, "a.xsl" and "b.xsl" …

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output method="xml" version="1.0" encoding="iso-8859-1" indent="yes"/>  
      <xsl:template match="a">
        <div>     
          <p>I'm from a.xsl.</p>    
          <xsl:call-template name="canUCme">
            <xsl:with-param name="name" select="/" />
          </xsl:call-template>  
        </div>
      </xsl:template>
    </xsl:stylesheet>

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:include href="a.xsl"/>
      <xsl:template name="canUCme">
        <xsl:param name="name" />
        <p>I'm from b.xsl.</p> 
        <p><xsl:value-of select="normalize-space($name)" /></p>
      </xsl:template>
    </xsl:stylesheet>
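
My understanding of why `fails()` fails: a stylesheet parsed from a string has no file location, so lxml has nothing to resolve the relative `href="a.xsl"` against (hence the made-up `string://__STRING__XSLT__/` URL in the traceback). `etree.XML()` accepts a `base_url` keyword, so a minimal sketch of a read()-friendly version might look like the one below. The function name `works_with_read` and the two tiny stand-in stylesheets are mine, written to a temporary folder just for the demo, and I've only reasoned about this with plain absolute file paths.

```python
import os
import tempfile
from lxml import etree

XSL_NS = 'xmlns:xsl="http://www.w3.org/1999/XSL/Transform"'

# stand-ins for a.xsl and b.xsl (hypothetical, much smaller than the real ones)
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, 'a.xsl'), 'w') as f:
  f.write('<xsl:stylesheet %s version="1.0">'
          '<xsl:output method="text"/>'
          '<xsl:template match="a"><xsl:call-template name="canUCme"/></xsl:template>'
          '</xsl:stylesheet>' % XSL_NS)
b_path = os.path.join(tmp_dir, 'b.xsl')
with open(b_path, 'w') as f:
  f.write('<xsl:stylesheet %s version="1.0">'
          '<xsl:include href="a.xsl"/>'
          '<xsl:template name="canUCme">included OK</xsl:template>'
          '</xsl:stylesheet>' % XSL_NS)

def works_with_read(some_xml, xsl_path):
  '''open() and read() the XSL file, but tell lxml where the text came from.'''
  with open(xsl_path, 'rb') as xsl_file:
    xsl_read = xsl_file.read()
  # base_url gives the parsed string a location to resolve "a.xsl" against
  xsl_parsed = etree.XML(xsl_read, base_url=xsl_path)
  transform = etree.XSLT(xsl_parsed)
  return transform(some_xml)

result = str(works_with_read(etree.XML('<a/>'), b_path))
print(result)
```

If that holds up, it would also be a way to run a stylesheet through some text templating first and still keep its includes working.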
    
    Written by nitin

    October 25th, 2012 at 12:27 pm

    Posted in scripts,XML

    awesome sauce: augmenting PubMed Central’s OAI response

    Update, 9 pm EST, May 27, 2012: Well, this is interesting. After reading this page, I see that by setting the "metadataPrefix" to "pmc_fm" I can bypass steps #3 and #4 altogether it seems – provided one's OAI harvester/indexer is set to ingest the data in that format instead of Dublin Core or provided the script below transforms the data to Dublin Core before returning it. Anyway … score one for documentation and reading it after-the-fact!

    I saw a post from a Metadata Librarian on the code4lib list about their work with placing article data from PubMed into DSpace. They are doing some metadata additions and cleanup in Excel so I emailed them off-list and let them know about PubMed2XL and we went back and forth on a few things. Among the things I learned from them was that PubMed Central has an OAI feed. Cool!

    But that OAI feed doesn't return all the data they need.

    Here's an example: http://www.pubmedcentral.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=aac.

    One of the additional bits of data they wanted was author affiliation which is available from PubMed.gov's XML output. Same for the MESH terms.

    Anyway, besides pushing PubMed2XL, I also mentioned that it would be interesting to make a sauce, if you will, for PubMed Central's OAI feed. In other words, rather than using the OAI link above, one would use a service on top of that a la: http://myPubMedCentralOAI_sauce.com/oai?verb=ListRecords&metadataPrefix=oai_dc&set=aac. And when one went to that URL, the service would fetch the real OAI feed from PubMed Central and then get the additional metadata from the NCBI EFetch APIs. It would then drop the additional metadata into the original OAI response and finally serve it up to the user (e.g. the OAI harvester).

    I went ahead and played with a proof-of-concept using Google App Engine and it's working although it's adding about 20 – 25 seconds to the OAI response time. BTW: it's faster when I run it from localhost and not actually live on App Engine.

    Here's how it's done.

    1. The user goes to http://localhost:8084/oai?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
    2. The app then fetches http://www.pubmedcentral.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
    3. For each record, the app parses out the PubMed Central ID and uses the EFetch API with PubMed Central as the database to get more data about the item.
    4. Unfortunately, the API for PubMed Central doesn't return MESH terms, so in step #3 the app just uses the returned data to translate the PubMed Central ID to the regular PubMed ID.
    5. With the PubMed ID now in hand, the app goes to the EFetch API and specifies PubMed as the database and hands the API the PubMed ID from step #4.
    6. Now the app gets the <Affiliation> value and the MESH terms and adds them to the real OAI response from step #2.
    7. Finally (whew!), the app returns the OAI feed with more metadata than before.

    This seems super klunky, so I'd love to hear about more elegant ways to do this … like having more options from PubMed Central without 3rd party hacks!

    But it is working. And it's just a proof-of-concept …

    Below, I've pasted a snippet of the augmented OAI data.

    Below that is the Python code if anyone's interested.

    ps: Python users will notice I used Google App Engine's "urlfetch" instead of "urllib" to request URLs. This is because using the latter was causing 500 errors due to timeouts. I don't think, from what I've read, that you can set the timeout with "urllib" in App Engine, so I used "urlfetch" which lets one set it up to 60 seconds.

    <!--
      This is just a test to use the NCBI EFetch APIs to augment the output of PubMed Central's OAI feed.
      In short, it's a web service that sits on top of the PubMed Central OAI API.
    
      *** DO NOT use this service to harvest OAI records from PubMed Central ... you will probably mess up your repository!
      ... and I haven't verified that the additional data being added to the OAI feed is accurate per the item.
    
      Currently, this supports the following OAI parameters:
     
       - ListRecords
       - set
       - metadataPrefix (must use "oai_dc"/Dublin Core)
       - resumptionToken
     
      Thanks, Nitin Arora (humaneguitarist.org), May 2012.
     
      ps: adding metadata increased the OAI response time by 22.6178297997 seconds.
      -->
    <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
     <responseDate>2012-05-27T13:34:17Z</responseDate>
     <request verb="ListRecords" metadataPrefix="oai_dc" set="aac">http://www.pubmedcentral.nih.gov/oai/oai.cgi</request>
     <ListRecords>
      <record>
       <header>
        <identifier>oai:pubmedcentral.nih.gov:89011</identifier>
        <datestamp>2002-09-12</datestamp>
        <setSpec>aac</setSpec>
       </header>
       <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
         <dc:title>Antifungal Peptides: Novel Therapeutic Compounds against Emerging Pathogens</dc:title>
         <dc:creator>De Lucca, Anthony J.</dc:creator>
         <dc:creator>Walsh, Thomas J.</dc:creator>
         <dc:subject>Minireviews</dc:subject>
         <dc:description/>
         <dc:publisher>American Society for Microbiology</dc:publisher>
         <dc:identifier>http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=89011</dc:identifier>
         <dc:type>Text</dc:type>
         <dc:language>en</dc:language>
         <dc:rights/>
         <dc:contributor.affiliation>Southern Regional Research Center, Agricultural Research Service, U. S. Department of Agriculture, New Orleans, Louisiana 70124, USA. adelucca@nola.srrc.usda.gov</dc:contributor.affiliation>
         <dc:subject.mesh>Animals</dc:subject.mesh>
         <dc:subject.mesh>Anti-Bacterial Agents</dc:subject.mesh>
         <dc:subject.mesh>Antifungal Agents</dc:subject.mesh>
         <dc:subject.mesh>Fungi</dc:subject.mesh>
         <dc:subject.mesh>Humans</dc:subject.mesh>
         <dc:subject.mesh>Mycoses</dc:subject.mesh>
         <dc:subject.mesh>Peptides</dc:subject.mesh>
        </oai_dc:dc>
       </metadata>
      </record>
      <resumptionToken>oai%3Apubmedcentral.nih.gov%3A89061!!!oai_dc!aac</resumptionToken>
     </ListRecords>
    </OAI-PMH>
    

    Python:

    ### pmc-oai-topper.py
    ### 2012, Nitin Arora
    
    ### import modules
    ##import urllib #DELETE
    from google.appengine.api import urlfetch #see: https://developers.google.com/appengine/docs/python/urlfetch/overview
    from lxml import etree
    import time
    import webapp2
    
    ### set what additional metadata to get from the EFetch API
    additions = [('contributor.affiliation', 'Affiliation'),
                 ('subject.mesh', 'DescriptorName')] #(name of element to output to, XPath); eventually needs to be in external config file
                #note: the XPath has to refer to elements in the EFetch XML output for the PubMed database as in "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12654674&retmode=xml"
    
    #####
    class pmctopper(webapp2.RequestHandler):
      def get(self):
    
        #GET OAI parameter values
        verb_value = self.request.get('verb')
        metadataPrefix_value = self.request.get('metadataPrefix')
        set_value = self.request.get('set')
        resumptionToken_value = self.request.get('resumptionToken')
    
        #define the *real* OAI feed URL and read it
        if resumptionToken_value: #if a resumptionToken is being used
          url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&resumptionToken=%s' %(verb_value, resumptionToken_value)
        elif set_value:
          url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&set=%s&metadataPrefix=%s' %(verb_value, set_value, metadataPrefix_value)
        else:
          url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&metadataPrefix=%s' %(verb_value, metadataPrefix_value)
    
    ##    oai_in = urllib.urlopen(url).read() #DELETE
        oai_in = urlfetch.fetch(url=url, deadline=60).content
        time_in = time.time() #tracking how long this takes
    
        #parse OAI response as XML
        oai_parsed = etree.XML(oai_in)
        root = oai_parsed.xpath('.') #root node
        dc = root[0].xpath('//oai_dc:dc',
                                namespaces={'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
                                'dc': 'http://purl.org/dc/elements/1.1/'}) #access dc:* nodes (i.e. each item)
    
        #loop through all items and for each go fetch additional metadata via the EFetch APIs for PubMed Central and PubMed
        #place that additional data into the original OAI feed
        #the same namespace map is used for the XPath above; this just gives it a name
        ns = {'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
              'dc': 'http://purl.org/dc/elements/1.1/'}
        for record in dc:
          #relative XPath: only this record's identifier, not every identifier in the feed
          identifier = record.xpath('dc:identifier', namespaces=ns)
          pmc_id = identifier[0].text.replace('http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=','') #get the article's unique ID
    
          #request PubMed ID from PubMed Central API ... ugh!
          efetch_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=%s' %pmc_id #this is the URL to get metadata about the article per its ID
    ##      efetch_read = urllib.urlopen(efetch_url).read() #DELETE
          efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content #read the API response
          efetch_parsed = etree.XML(efetch_read) #parse as XML
          pubmed_ids = efetch_parsed.xpath('//article-id[@pub-id-type="pmid"]/text()') #xpath() returns a list
          if not pubmed_ids:
            continue #no PubMed ID for this article, so there's nothing to add
          pubmed_id = pubmed_ids[0]
    
          #now(!) get the additional data from the PubMed API
          efetch_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%s&retmode=xml' %pubmed_id
    ##      efetch_read = urllib.urlopen(efetch_url).read() #DELETE
          efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content
          efetch_parsed = etree.XML(efetch_read)
    
          for addition in additions:
            added_element = efetch_parsed.xpath('//%s/text()' %addition[1]) #get data from API XML tree
            for added_value in added_element:
              etree.SubElement(record, '{http://purl.org/dc/elements/1.1/}%s' %addition[0]).text = added_value
    
        #for reporting how long this all takes
        time_out = time.time()
        time_diff = str(time_out - time_in)
        
        #output the *new* OAI results with the additional metadata
        self.response.headers['Content-Type'] = 'text/xml' #output as XML doc
        disclaimer= '''<!--
        This is just a test to use the NCBI EFetch APIs to augment the output of PubMed Central's OAI feed.
        In short, it's a web service that sits on top of the PubMed Central OAI API.
    
        *** DO NOT use this service to harvest OAI records from PubMed Central ... you will probably mess up your repository!
        ... and I haven't verified that the additional data being added to the OAI feed is accurate per the item.
    
        Currently, this supports the following OAI parameters:
        
          - ListRecords
          - set
          - metadataPrefix (must use "oai_dc"/Dublin Core)
          - resumptionToken
        
        Thanks, Nitin Arora (humaneguitarist.org), May 2012.
        
        ps: adding metadata increased the OAI response time by %s seconds.
        -->''' %time_diff
        self.response.out.write(disclaimer)
        for node in root:
          self.response.out.write(etree.tostring(node))
    
    ### app engine stuff ...
    app = webapp2.WSGIApplication([('/oai', pmctopper)],
                                  debug=True)

    Written by nitin

    May 27th, 2012 at 10:11 am

    museline: charting melodic contours via web service

    In the last post, I mentioned I was playing with Google App Engine and Google Chart Tools.

    Last night, with some silly movie streaming in the background, I was in bed tinkering with a little idea that I'm sure has been done a-thousand times already and that may be built into high end music notation applications. But it hasn't been done by anyone as stoopid as me!

    :P

    What I did was whip up a little App Engine/Python app where one can pass it a partwise MusicXML file and it will use Google Chart Tools to create a little line chart of the melodic contour of the first <part> element.

    Here's a screenshot below of the results using the MusicXML sample file available on the MakeMusic site of Schumann's "Im wunderschönen Monat Mai" from the Dichterliebe. The app has an "mxml" parameter that tells it which MusicXML file to use a la "http://localhost:8083/?mxml=http://downloads2.makemusic.com/musicxml/Dichterliebe01.xml".

     

    I've embedded a really nice performance on YouTube if anyone wants to follow along. The contour graph represents the vocal part only.

     

    Now, this is just a start. There's a lot of work to do if I pursue this. For starters, I'd like to make the chart synced with an audio/video recording. I don't know if I can do that with Chart Tools, but probably with the <canvas> element if nothing else. Also, I haven't tried this yet with any non-homophonic parts. Anyway, it's a start and it's kinda fun.

    I tried to add another line for the actual pitch class contour but it wasn't as interesting to look at as the melodic contour so I disabled that "feature". By pitch class, I mean I was using octave equivalency so that all "C" notes, for example, were plotted at the exact same vertical position as opposed to the screenshot above where two "C" notes an octave apart would have different vertical points on the graph to depict the intervallic difference.

    As far as plotting the notes, I ignored rests and durations. I just plotted the pitches as below, starting with "C" with a value of "1" and with the "B" a seventh up from that "C" receiving a "12".

    • C : 1
    • D : 3
    • E : 5
    • F : 6
    • G : 8
    • A : 10
    • B : 12

    This way a "C-sharp" and "D-flat" receive a score of "2", for example, because they lie between "C as 1" and "D as 3".

    In MusicXML, the <step> element has the note name and the optional <alter> element, which is a number, tells you if it's sharp or flat, etc. The numerical <octave> element tells you what octave range the pitch is in.

    So what I'm doing is pulling out the <step> value and converting it to a number as above, adding the <alter> value (a flat is a negative number), and then adding that sum to 12 times the <octave> value. Then, I multiply the value by ".01" just to reduce the number because I want the graph's vertical limit to be a small number even though this shouldn't change the contour itself.
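
A rough sketch of that arithmetic in Python, in case it helps. The names `STEP_VALUES` and `contour_value` are made up for illustration and aren't taken from the actual app, and it assumes a missing <alter> means a natural note.

```python
# Hypothetical names; the step-to-number table is the one listed above.
STEP_VALUES = {'C': 1, 'D': 3, 'E': 5, 'F': 6, 'G': 8, 'A': 10, 'B': 12}

def contour_value(step, octave, alter=0):
  '''Return the vertical plot position for one MusicXML pitch.'''
  semitone = STEP_VALUES[step] + alter  # a flat is a negative <alter>
  return (semitone + 12 * octave) * 0.01  # scale down to keep the chart's axis small

# C-sharp and D-flat in the same octave land on the same point
print(contour_value('C', 4, 1) == contour_value('D', 4, -1))  # prints True
```

Because the octave is multiplied by 12, the "C" above a "B" still plots one step higher than that "B", which is what keeps this a melodic contour rather than a pitch class contour.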

    Last, I'm trying to pull some basic descriptive metadata if they are present in the MusicXML file and show it below the graph.

    Maybe I'll do more with this later. Just goofin' for now.

    Written by nitin

    May 3rd, 2012 at 3:55 pm

    geo this, geo that: easy acquisition of KML files with BatchGeo

    Geolocation/geocoding is so "hip" these days. Everyone's so obsessed with where they and other things are. There's almost a comparison with 3-D filmmaking …

    Funny. Not too many folks seem all that concerned with when things are.

    Anyway …

    At work, we have a database with all the libraries we serve and their addresses. And the other week we needed to quickly make a map with all their locations.

    If necessity is the mother of invention, laziness is its favorite uncle.

    Enter BatchGeo. We were able to take those values from our database and get a map generated in minutes. But it gets better.

    One of the nice things about this process is that in addition to a map, you also get a KML file download option. Taking this little XML file, it's a simple process (via XSL or other) to make a delimited file containing the inputted names of institutions and their latitude and longitude (altitude is also available).

    From there, it's not brain surgery to get those coordinates into a database and use an SQL JOIN to push out an institution's name and now its coordinates, too, whenever.
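
I used XSL for this, but here's a minimal sketch of the "or other" route in Python 3 with only the standard library. It assumes the KML uses the standard 2.2 namespace and puts each location in a <Placemark> with a <name> and a <Point>/<coordinates>; I haven't checked that against a real BatchGeo download, and the sample placemark below is invented. The function name `kml_to_rows` is also just for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

KML_NS = {'kml': 'http://www.opengis.net/kml/2.2'}

def kml_to_rows(kml_text):
  '''Yield (name, latitude, longitude) for each Placemark that has coordinates.'''
  root = ET.fromstring(kml_text)
  for placemark in root.iter('{http://www.opengis.net/kml/2.2}Placemark'):
    name = placemark.findtext('kml:name', default='', namespaces=KML_NS)
    coords = placemark.findtext('.//kml:coordinates', namespaces=KML_NS)
    if not coords:
      continue
    lon, lat = coords.strip().split(',')[:2]  # KML order is longitude,latitude[,altitude]
    yield name, lat, lon

# invented sample data, standing in for a real KML download
sample = '''<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example Library</name>
      <Point><coordinates>-86.30,32.37,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>'''

out = io.StringIO()
csv.writer(out, delimiter='\t').writerows(kml_to_rows(sample))
print(out.getvalue())
```

The one thing to watch is that KML lists longitude before latitude, which is the opposite of how most people say it out loud.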

    Just in case someone wants/needs to do something similar with an address book or a list of businesses, etc.

    Written by nitin

    January 28th, 2012 at 9:52 am

    Posted in technophilia,XML

    layer cake: XML config files with XSL inside CDATA

    Sometimes in life – or coding projects – there are regrets.

    But there is cake, too.

    [Image: yummy looking layer cake]

    Anyway, for a current project I want to place some XSL inside an XML config file.

    But of course, you can't just drop XML inside XML without coating it in something.

    So for another project, PubMed2XL, I did something like this:

    <cell>{{?xml version="1.0" encoding="UTF-8"?}}
      {{xsl:stylesheet version="1.0" encoding="UTF-8" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"}}
      {{xsl:output method = "text" /}}
      {{xsl:template match="/"}}
        {{xsl:value-of select="//PMID" /}}
      {{/xsl:template}}
      {{/xsl:stylesheet}}
    </cell>
    

    Putting XSL inside double curly brackets works just fine, but now I know a better way: just put it inside a CDATA section!

    <map name="LibriVox">
      <XSLT>./XSLT/LibriVox_to_Solr.xsl</XSLT>
      <nextXSL>
      <![CDATA[
      <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="text"/>
        <xsl:template match="/">
          <xsl:variable name="baseURL" select="'%s'" />
          <xsl:variable name="URL_params" select="'%s'" />
          <xsl:variable name="offset_" select="substring-after($URL_params,'=')" />
          <xsl:variable name="offset" select="substring-before($offset_,'&amp;')" />
          <xsl:variable name="limit_" select="substring-after($URL_params,'&amp;')" />
          <xsl:variable name="limit" select="substring-after($limit_,'=')" />
          <xsl:variable name="output">
            <xsl:value-of select="$baseURL" />
            <xsl:text>?offset=</xsl:text>
            <xsl:value-of select="$offset+50" />
            <xsl:text>&amp;limit=</xsl:text>
            <xsl:value-of select="50" />
          </xsl:variable>
          <xsl:value-of select="$output" />
        </xsl:template>
      </xsl:stylesheet>
      ]]>
      </nextXSL>
    </map>
    

    Duh. I guess that, like a good XML parser, I always just ignored anything inside a CDATA section. Never thought I'd need to use one.

    Putting the XSL inside a CDATA section worked like a charm in terms of being able to read it with a script and perform an XSLT with it.

    Luckily, the PubMed2XL script can use either the CDATA way of embedding XSL or my Curly solution – not that I knew that when I wrote it!

    It's certainly easier to cut/paste the XSL in the CDATA block without having to replace the angle brackets with curly brackets or vice versa. It's also just easier to read, which makes it easier to edit and troubleshoot. And it tastes better, too.
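
For the curious, a minimal sketch of the reading side in Python 3, using only the standard library. This isn't the actual script: the function name `xsl_from_config`, the trimmed-down config, and the `http://example.org/api` value are all stand-ins. It just shows that the CDATA contents come back as ordinary element text, which can then have its `%s` placeholders filled in before being parsed as XSL.

```python
import xml.etree.ElementTree as ET

# a cut-down, hypothetical version of the config file above
config = '''<map name="LibriVox">
  <nextXSL>
  <![CDATA[
  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="/">
      <xsl:variable name="baseURL" select="'%s'" />
      <xsl:value-of select="$baseURL" />
    </xsl:template>
  </xsl:stylesheet>
  ]]>
  </nextXSL>
</map>'''

def xsl_from_config(config_text, *values):
  '''Pull the XSL out of <nextXSL> and fill in its %s placeholders.'''
  xsl_text = ET.fromstring(config_text).findtext('nextXSL')  # CDATA arrives as plain text
  return xsl_text.strip() % values

xsl = xsl_from_config(config, 'http://example.org/api')
stylesheet = ET.fromstring(xsl)  # raises ParseError if the result isn't well-formed XML
print(stylesheet.tag)
```

From there the string can be handed to whatever does the transformation. One caveat with `%` formatting is that any other percent sign in the XSL would need to be doubled.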

    Written by nitin

    November 12th, 2011 at 12:18 pm

    Posted in XML

    pretty printing XML with Python, lxml, and XSLT

    Last week or so I was doing some work with Python and lxml. And, as seems to be the case for a lot of people, lxml's pretty printing wasn't really doing anything for me.

    I couldn't find any native lxml solutions to make my XML look pretty. All I found were some functions on various code sites written by people to pretty print the XML using a bunch of regular expressions. Yuck.

    So I thought, "Why not use XSLT to pretty print my XML?" and I found an XSL written by none other than Michael Kay on this page (see comment #4).

    And it seems to work just fine as a function to return pretty XML, not to mention it's super short and sweet.

    Anyway, here's an example of using the XSL for pretty printing.

    from lxml import etree
    
    def prettify(someXML):
      #for more on lxml/XSLT see: http://lxml.de/xpathxslt.html#xslt-result-objects
      xslt_tree = etree.XML('''\
        <!-- XSLT taken from Comment 4 by Michael Kay found here:
        http://www.dpawson.co.uk/xsl/sect2/pretty.html#d8621e19 -->
        <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
          <xsl:strip-space elements="*"/>
          <xsl:template match="/">
            <xsl:copy-of select="."/>
          </xsl:template>
        </xsl:stylesheet>''')
      transform = etree.XSLT(xslt_tree)
      result = transform(someXML)
      return unicode(result)
    
    myXML = etree.XML('<a><b><c><d/></c></b></a>')
    print prettify(myXML)

    The example above would output the following:

    >>>
    <?xml version="1.0"?>
    <a>
      <b>
        <c>
          <d/>
        </c>
      </b>
    </a>
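
For what it's worth, my understanding is that lxml's own pretty printing tends to do nothing when the parsed tree still carries whitespace-only text nodes from the source document. Assuming that's the cause, a minimal sketch of a no-XSLT alternative is to parse with `remove_blank_text=True` and then serialize with `pretty_print=True`. The function name `prettify_native` is made up, and this is written for Python 3.

```python
from lxml import etree

def prettify_native(xml_text):
  '''Pretty print with lxml alone by discarding whitespace-only text first.'''
  parser = etree.XMLParser(remove_blank_text=True)
  root = etree.XML(xml_text, parser)
  # encoding='unicode' returns a str rather than bytes
  return etree.tostring(root, pretty_print=True, encoding='unicode')

pretty = prettify_native('<a> <b><c>  <d/></c>\n</b></a>')
print(pretty)
```

Unlike the XSLT version, this one doesn't add an XML declaration unless asked to.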

    By the way I don't even need to see the XML I'm processing most of the time, so why all the pretty printing fuss?

    Well, because it bothers me …

    And all good XML should look like an X-wing starfighter. If it doesn't, you're probably doing something wrong or your schema just sucks.

    It isn't called an X-wing for no reason.

    :P

    Written by nitin

    November 12th, 2011 at 11:05 am

    Posted in XML

    indexing and searching timed text with Solr

    I'm still learning about Solr so maybe this post is much ado about nothing. But according to this nabble.com thread, one can't index a source XML document in Solr with its native XML structure intact and then in turn search that structure as one can in an XML database like BaseX.

    For most things, that's fine. I mean for indexing titles, creators, and descriptions, etc. I just need to index the value of a given element like <title> so that I can search for that element's value.

    But for timed text, it's different. Or at least, it can be.

    Say I have this DFXP snippet for an audio file with an "id" value of "XYZ".

    <p begin="10.0s" end="30.0s">Hello world!</p>

    I would need the user to search for the string "Hello world!" or part of it but I would also need to index at least the value of the "begin" attribute so that I can pass that to a page that will play the file "XYZ" starting at the 10 second mark – if the user clicks on the "Hello world!" line in their search result. And I don't want the "10" second value to be something they search against since they might be searching for the string "10" within the text itself.

    So I'm wondering how to do that with Solr.

    Maybe when I learn more I'll discover a better way to do this, but for now I'm thinking I could do the following:

    First, I would pretty much index the timed text twice in Solr.

    <doc>
      <field name="id">XYZ</field>
    ...
      <field name="timedText-stripped">Hello world!</field>
      <field name="timedText">Hello World! {10}</field>
    </doc>

    After indexing the "id" of the audio file this would index:

    • just the text "Hello world!"
    • the text of "Hello world!" with the "begin" attribute value in curly braces.

    I guess this way the user could be made to search across the "timedText-stripped" field while, via the XSL that can be passed to Solr to display results, the "timedText" field is displayed so that the text "Hello World!" links to whatever page will play file "XYZ" starting at the 10 second mark. Basically, by planting the "begin" value in curly braces, I can parse the string for the text and the "begin" value as separate things.

    So, here's a really crappy XSL snippet that would do something like that. It assumes a variable "$id" exists that equals "XYZ", the identifier for the example audio file.

    <xsl:for-each select="//field[@name='timedText']">
      <xsl:variable name="whole">
        <xsl:value-of select="."/>
        <!-- Gets entire element string -->
      </xsl:variable>
      <xsl:variable name="text">
        <xsl:value-of select="substring-before($whole,'{')"/>
        <!-- Gets text prior to seconds -->
      </xsl:variable>
      <xsl:variable name="begin">
        <xsl:value-of select="substring-before(substring-after($whole,'{'),'}')"/>
        <!-- Gets seconds value from end of string -->
      </xsl:variable>
      <a href="someMediaPlayer.php?id={$id}&amp;begin={$begin}">
        <xsl:value-of select="$text"/>
      </a>
      <!-- So, I'm saying that
      "someMediaPlayer.php?id=XYZ&begin=10"
      would launch a player that would start file XYZ at the 10 seconds mark.
      -->
    </xsl:for-each>
    

    The search output would be some HTML code like so:

    <a href="someMediaPlayer.php?id=XYZ&amp;begin=10">Hello World!</a>
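
    For comparison, the same split can be sketched outside of XSL. This is only a rough Python illustration of the idea, not anything Solr provides: the helper name split_timed_text is made up, and it assumes the "begin" value always sits in one pair of curly braces at the very end of the indexed string.

```python
import re

# Hypothetical pattern: display text, optional whitespace, then a trailing
# {begin} marker such as "{10}" at the end of the indexed value.
MARKER = re.compile(r'^(?P<text>.*?)\s*\{(?P<begin>[^{}]*)\}\s*$', re.DOTALL)

def split_timed_text(value):
    '''Split "Hello World! {10}" into the text and the planted begin value.'''
    match = MARKER.match(value)
    if match is None:
        # No trailing {begin} marker: treat the whole value as plain text.
        return value, None
    return match.group('text'), match.group('begin')

text, begin = split_timed_text('Hello World! {10}')
print(text)
print(begin)
```

    Unlike the substring-before() call in the XSL, this version splits on the last pair of braces, so a line of text that happens to contain a "{" of its own would still come out whole.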

    It seems weird to index something twice, more or less, but as user Erick says in the nabble.com thread, "You've gotta take off your DB hat and not worry about duplicating data."

    But now as I write this, I'm wondering if I can't just index as follows:

      <field name="text">Hello world!</field>
      <field name="begin">10</field>

    and trust that for each "text" field there will be a matching "begin" field, so that the two can be used in tandem to create the same HTML link as above. Sounds like I need to play around some more.

    :)
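
    To make the "in tandem" idea concrete, here is a hedged Python sketch. The sample document, the second line of text, and the build_links helper are all invented for illustration, and it assumes the repeated "text" and "begin" values come back in the same order they were indexed, which is exactly the part that needs playing around with.

```python
import xml.etree.ElementTree as ET
from html import escape
from urllib.parse import urlencode

# Made-up result document with two lines of timed text for file "XYZ".
doc = ET.fromstring('''
<doc>
  <field name="id">XYZ</field>
  <field name="text">Hello world!</field>
  <field name="begin">10</field>
  <field name="text">Goodbye world!</field>
  <field name="begin">30</field>
</doc>''')

def build_links(doc):
    '''Pair each "text" field with the "begin" field at the same position.'''
    file_id = doc.findtext("field[@name='id']")
    texts = [f.text for f in doc.findall("field[@name='text']")]
    begins = [f.text for f in doc.findall("field[@name='begin']")]
    if len(texts) != len(begins):
        # The pairing is only trustworthy if the two lists line up.
        raise ValueError('"text" and "begin" fields are out of step')
    links = []
    for text, begin in zip(texts, begins):
        query = urlencode({'id': file_id, 'begin': begin})
        href = escape('someMediaPlayer.php?' + query)
        links.append('<a href="%s">%s</a>' % (href, escape(text)))
    return links

for link in build_links(doc):
    print(link)
```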

    Update, September 6, 2012: I wrote a related post to this yesterday in terms of searching across timed text with MySQL and in doing so I realized that the way I was thinking of doing it in Solr was off. Rather than doing it the way I outlined in the original post content (above) in which I was thinking to index all the timed text for a given recording in one Solr "doc" element, I think it makes much more sense to index each line in its own "doc" element as such:

    <doc>
      <field name="id">someMediaPlayer.php?source=someFile.mp3&amp;begin=10&amp;end=30</field>
      ...
      <field name="startTime">10</field>
      <field name="stopTime">30</field> 
      <field name="timedText">Hello world!</field>
      <field name="source">someFile.mp3</field> 
    </doc>
    

    That way there's no need to post-parse any data fields to get the start and stop time. And, moreover, rather than construct the URL to launch that segment of audio you can just put the URL directly in the "id" field. You can always use Solr's built-in support for facets to facet on the "source" field or some descriptive metadata like "title".
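
    As a rough sketch of how those per-line "doc" elements might be generated, the Python below turns DFXP-style lines into a Solr "add" message. The function names are made up, and it assumes a namespace-free snippet wrapped in a single parent element with times in the simple "10.0s" form; a real DFXP file uses XML namespaces and richer time expressions that this does not handle.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def seconds(clock_value):
    '''Convert the simple offset form, e.g. "10.0s", to whole seconds.'''
    return int(float(clock_value.rstrip('s')))

def solr_add_xml(dfxp_snippet, source):
    '''Build one Solr <doc> per <p> line found in the snippet.'''
    add = ET.Element('add')
    for p in ET.fromstring(dfxp_snippet).iter('p'):
        start, stop = seconds(p.get('begin')), seconds(p.get('end'))
        query = urlencode({'source': source, 'begin': start, 'end': stop})
        fields = [
            ('id', 'someMediaPlayer.php?' + query),
            ('startTime', start),
            ('stopTime', stop),
            ('timedText', ''.join(p.itertext())),
            ('source', source),
        ]
        doc = ET.SubElement(add, 'doc')
        for name, value in fields:
            field = ET.SubElement(doc, 'field', name=name)
            field.text = str(value)
    # Serializing escapes the "&" characters in the "id" URL as "&amp;".
    return ET.tostring(add, encoding='unicode')

snippet = '<body><p begin="10.0s" end="30.0s">Hello world!</p></body>'
print(solr_add_xml(snippet, 'someFile.mp3'))
```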

    I'll file the original post under the "thinking out loud yet poorly" category.

    --------------

    Written by nitin

    October 16th, 2011 at 10:54 am

    learning about XProc on a Sunday morning

    There are some cool PowerPoint slides on the xfront.com page about XProc, which I didn't know anything about until today.

    I like the idea of a one-stop-shop for all kinds of XML processing, but unless I had a specific need for it I'd probably use a Python script or something to run a sequence of batch XML operations on a given document. That kind of ad hoc scripting is exactly what XProc is meant to replace, but I guess it all depends on one's needs. I should certainly think about it in terms of doing things with MusicXML though.

    Anyway, I've only been through one slide deck – and it's long at about 170 slides – but I found it well done and easy to understand.

    Also, there's a list of XProc implementations here – Java, Java, Java …

    Apparently, there used to be a Python implementation on GitHub, but it's pulling a 404. Bummer. Well, at least GitHub's 404 message is a cool homage to Star Wars!

    [Image: GitHub 404 page]

    Lastly, this daisy-pipeline for Daisy talking books looks interesting, too.

    So is this post just a fancy way for me to save bookmarks for my future use or what?

    :P

      Written by nitin

      August 28th, 2011 at 10:28 am

      Posted in XML
