blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘XSLT’ tag

pixelation: custom XSLT functions with Python and lxml

I'll be brief.

Because the Python "lxml" module doesn't support XSLT 2.0 functions, I was looking at support for EXSLT

… but then stumbled on how to write my own functions and call them from stylesheets.

Freakin' cool.

I like calling it "pxslt" for "Python XSLT" and pronouncing it like "pixelate".

:P

Example below of the "module" I made, the script that calls it, and the results.

Told you I'd be brief.

Module:

#pxslt.py

def underscore(context, word):
  '''Replace whitespace with underscore.'''
  out = word[0].replace(' ', '_')
  return out

def multiply(context, int_val, int2_val):
  '''Multiply two integers.'''
  int_val, int2_val = int(int_val[0]), int(int2_val[0])
  return int_val * int2_val

def libraryThing(context, isbn):
  '''Get language for a work based on ISBN using LibraryThing API.'''
  #Python 3: urlopen() lives in urllib.request and read() returns bytes
  from urllib.request import urlopen
  isbn = isbn[0]
  res = urlopen('http://www.librarything.com/api/thingLang.php?isbn=' + isbn)
  res_r = res.read().decode('utf-8')
  return res_r

##### DO NOT EDIT
##### makes it possible to call the above functions with XSLT
def pxslt():
  myFunctions = []
  gbs = globals()
  from inspect import isfunction
  for gb in gbs:
    if isfunction(gbs[gb]) and gb != 'pxslt':
      #print gb
      myFunctions.append(gbs[gb])

  from lxml import etree
  #see: http://lxml.de/extensions.html
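  #this URI must match the xmlns:pxslt declaration in the stylesheet
  ns = etree.FunctionNamespace('file://libs/pxslt.py')
  ns.prefix = 'pxslt'
  for myFunction in myFunctions:
    #__name__ works in Python 2.6+ and 3; func_name is Python 2 only
    name = str(myFunction.__name__)
    ns[name] = myFunction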
  return ns

Usage example:

from lxml import etree

#####
myXML = etree.XML('''\
<a>
  <b>Hello. This will appear with whitespaces replaced by underscores.</b>
  <c>3</c>
</a>''')

myXSL = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:pxslt="file://libs/pxslt.py">
  <xsl:output method="text" version="1.0" />
  <xsl:template match="a">
    <xsl:variable name="isbn">9955081260</xsl:variable>
    <xsl:value-of select="pxslt:libraryThing($isbn)" />
    <xsl:text>\n</xsl:text> <!-- Python will line break here -->
    <xsl:value-of select="pxslt:underscore(b/text())" />
    <xsl:text>\n</xsl:text> <!-- Python will line break here -->
    <xsl:call-template name="mathFunc">
    </xsl:call-template>
  </xsl:template>
  <xsl:template name="mathFunc">
    <xsl:variable name="myNum">10</xsl:variable>
    <xsl:value-of select="pxslt:multiply(c/text(), $myNum)" />
  </xsl:template>
</xsl:stylesheet>'''))

import pxslt
pxslt.pxslt() #get all set up with namespaces and function stuff

print(myXSL(myXML))

#myXSL_file = etree.XSLT(etree.parse('foo.xsl')) #for testing with a real XSL file
#print(myXSL_file(myXML))

Output:

>>>
lit
Hello._This_will_appear_with_whitespaces_replaced_by_underscores.
30
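
For anyone who wants to poke at this without hitting the LibraryThing API, here's a stripped-down, self-contained sketch of the same idea. It assumes Python 3 and registers two functions by hand instead of looping over globals(). The "px" prefix, the extra <d> element, and the tiny stylesheet are made up for illustration.

```python
from lxml import etree

def underscore(context, word):
    '''Replace spaces with underscores in the first matched text node.'''
    return word[0].replace(' ', '_')

def multiply(context, first, second):
    '''Multiply two integers that arrive as XPath node-sets (lists).'''
    return int(first[0]) * int(second[0])

# Same namespace URI as the module in this post; the stylesheet declares it too.
ns = etree.FunctionNamespace('file://libs/pxslt.py')
ns['underscore'] = underscore
ns['multiply'] = multiply

transform = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:px="file://libs/pxslt.py">
  <xsl:output method="text" />
  <xsl:template match="/a">
    <xsl:value-of select="px:underscore(b/text())" />
    <xsl:text>|</xsl:text>
    <xsl:value-of select="px:multiply(c/text(), d/text())" />
  </xsl:template>
</xsl:stylesheet>'''))

source = etree.XML('<a><b>Hello world</b><c>3</c><d>10</d></a>')
result = str(transform(source))
print(result)
```

The only real requirement seems to be that the namespace URI given to FunctionNamespace() and the one declared in the stylesheet are the same string. The prefix can be anything.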

--------------

Written by nitin

November 2nd, 2012 at 5:28 pm

Python, lxml, and xsl:include

Keeping this short because yes, dammit, I'm home sick.

I needed/wanted to do some XSL transformations with Python using an <xsl:include> statement. But I kept getting errors along the lines of "lxml cannot resolve uri string".

So anyway after deciding I didn't want to read through all the crap on the lxml site about this, I fumbled my way through to what appears to work.

It seems the include statements work fine when I DO NOT read() the XSL file before using it for a transformation.

In the interest of really keeping it short like I said, here's some code and the results below.

from lxml import etree
                
def works(someXML):
  #don't even open() the XSL file ...
  xslt_tree = etree.parse(xslFile)
  transform = etree.XSLT(xslt_tree)
  result = transform(someXML)
  return result

def also_works(someXML):
  #open() the XSL file, but don't read() it ...
  #lxml can still take the base URL from the file object's name
  with open(xslFile, "rb") as xsl_opened:
    xslt_tree = etree.parse(xsl_opened)
  transform = etree.XSLT(xslt_tree)
  result = transform(someXML)
  return result

def fails(someXML):
  #open() and read() the XSL file ...
  #the parsed string has no base URL, so href="a.xsl" can't be resolved
  with open(xslFile, "rb") as xsl_opened:
    xsl_read = xsl_opened.read()
  xsl_parsed = etree.XML(xsl_read)
  transform = etree.XSLT(xsl_parsed)
  result = transform(someXML)
  return result

#####
xslFile = "b.xsl"

myXML = etree.XML('''\
<a>
  <b>b-val</b>
  <c>c-val</c>
  <d>d-val</d>
</a>''')

print("Trying works() ...")
print(works(myXML))

print("Trying also_works() ...")
print(also_works(myXML))

print("Trying fails() ...")
print(fails(myXML))

Here's what the code spits out …

Trying works() ...
<?xml version="1.0" encoding="iso-8859-1"?>
<div>
  <p>I'm from a.xsl.</p>
  <p>I'm from b.xsl.</p>
  <p>b-val c-val d-val</p>
</div>

Trying also_works() ...
<?xml version="1.0" encoding="iso-8859-1"?>
<div>
  <p>I'm from a.xsl.</p>
  <p>I'm from b.xsl.</p>
  <p>b-val c-val d-val</p>
</div>

Trying fails() ...

Traceback (most recent call last):
  File "C:\Users\nitaro\Dropbox\lxml_include\inc.py", line 44, in <module>
    print fails(myXML)
  File "C:\Users\nitaro\Dropbox\lxml_include\inc.py", line 23, in fails
    style = etree.XSLT(xsl_parsed)
  File "xslt.pxi", line 399, in lxml.etree.XSLT.__init__ (src/lxml/lxml.etree.c:118852)
  File "lxml.etree.pyx", line 280, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:7959)
XSLTParseError: Cannot resolve URI string://__STRING__XSLT__/a.xsl
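
The "string://__STRING__XSLT__/a.xsl" in that last line is the giveaway: a stylesheet parsed from a string has no real location, so the relative href has nothing to resolve against. If you really do need to read() the file first, lxml's etree.XML() accepts a base_url keyword. Here's a minimal sketch, assuming Python 3. The temporary directory and the cut-down stylesheets are stand-ins so the example runs by itself.

```python
import os
import tempfile
from lxml import etree

A_XSL = '''<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="a"><p>from a.xsl</p></xsl:template>
</xsl:stylesheet>'''

B_XSL = '''<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:include href="a.xsl"/>
</xsl:stylesheet>'''

with tempfile.TemporaryDirectory() as tmp:
    # Write both stylesheets side by side, like a.xsl and b.xsl in the post.
    for name, text in (("a.xsl", A_XSL), ("b.xsl", B_XSL)):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(text)

    b_path = os.path.join(tmp, "b.xsl")
    with open(b_path) as handle:
        xsl_read = handle.read()

    # base_url tells lxml where the string came from, so the
    # relative href="a.xsl" resolves next to b.xsl.
    xsl_parsed = etree.XML(xsl_read, base_url=b_path)
    transform = etree.XSLT(xsl_parsed)
    result = str(transform(etree.XML("<a/>")))

print(result)
```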

Oh and here are the XSL files, "a.xsl" and "b.xsl" …

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" version="1.0" encoding="iso-8859-1" indent="yes"/>  
  <xsl:template match="a">
    <div>     
      <p>I'm from a.xsl.</p>    
      <xsl:call-template name="canUCme">
        <xsl:with-param name="name" select="/" />
      </xsl:call-template>  
    </div>
  </xsl:template>
</xsl:stylesheet>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:include href="a.xsl"/>
  <xsl:template name="canUCme">
    <xsl:param name="name" />
    <p>I'm from b.xsl.</p> 
    <p><xsl:value-of select="normalize-space($name)" /></p>
  </xsl:template>
</xsl:stylesheet>
--------------

Written by nitin

October 25th, 2012 at 12:27 pm

Posted in scripts,XML

pretty printing XML with Python, lxml, and XSLT

Last week or so I was doing some work with Python and lxml. And, as seems to be the case for a lot of people, lxml's pretty printing wasn't really doing anything for me.

I couldn't find any native lxml solutions to make my XML look pretty. All I found were some functions on various code sites written by people to pretty print the XML using a bunch of regular expressions. Yuck.

So I thought, "Why not use XSLT to pretty print my XML?" and I found an XSL written by none other than Michael Kay on this page (see comment #4).

And it seems to work just fine as a function to return pretty XML, not to mention it's super short and sweet.

Anyway, here's an example of using the XSL for pretty printing.

from lxml import etree

def prettify(someXML):
  #for more on lxml/XSLT see: http://lxml.de/xpathxslt.html#xslt-result-objects
  xslt_tree = etree.XML('''\
    <!-- XSLT taken from Comment 4 by Michael Kay found here:
    http://www.dpawson.co.uk/xsl/sect2/pretty.html#d8621e19 -->
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
      <xsl:strip-space elements="*"/>
      <xsl:template match="/">
        <xsl:copy-of select="."/>
      </xsl:template>
    </xsl:stylesheet>''')
  transform = etree.XSLT(xslt_tree)
  result = transform(someXML)
  return str(result) #str() replaces the Python 2-only unicode()

myXML = etree.XML('<a><b><c><d/></c></b></a>')
print(prettify(myXML))

The example above would output the following:

>>>
<?xml version="1.0"?>
<a>
  <b>
    <c>
      <d/>
    </c>
  </b>
</a>
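
For comparison, here's a sketch of a route that stays inside lxml. It rests on an assumption the post doesn't confirm: that pretty printing was doing nothing because the parsed tree still held whitespace-only text nodes. The remove_blank_text parser option and the pretty_print keyword are real lxml features. The messy sample string is made up.

```python
from lxml import etree

messy = '<a>\n<b><c><d/></c></b>\n</a>'

# Drop whitespace-only text nodes while parsing so the serializer
# is free to add its own indentation.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.XML(messy, parser)

pretty = etree.tostring(root, pretty_print=True, encoding='unicode')
print(pretty)
```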

By the way I don't even need to see the XML I'm processing most of the time, so why all the pretty printing fuss?

Well, because it bothers me …

And all good XML should look like an X-wing starfighter. If it doesn't, you're probably doing something wrong or your schema just sucks.

It isn't called an X-wing for no reason.

:P

--------------

Written by nitin

November 12th, 2011 at 11:05 am

Posted in XML

indexing and searching timed text with Solr

I'm still learning about Solr so maybe this post is much ado about nothing. But according to this nabble.com thread, one can't index a source XML document in Solr with its native XML structure intact and then in turn search that structure as one can in an XML database like BaseX.

For most things, that's fine. I mean for indexing titles, creators, and descriptions, etc. I just need to index the value of a given element like <title> so that I can search for that element's value.

But for timed text, it's different. Or at least, it can be.

Say I have this DFXP snippet for an audio file with an "id" value of "XYZ".

<p begin="10.0s" end="30.0s">Hello world!</p>

I would need the user to search for the string "Hello world!" or part of it but I would also need to index at least the value of the "begin" attribute so that I can pass that to a page that will play the file "XYZ" starting at the 10 second mark – if the user clicks on the "Hello world!" line in their search result. And I don't want the "10" second value to be something they search against since they might be searching for the string "10" within the text itself.

So I'm wondering how to do that with Solr.

Maybe when I learn more I'll discover a better way to do this, but for now I'm thinking I could do the following:

First, I would pretty much index the timed text twice in Solr.

<doc>
  <field name="id">XYZ</field>
...
  <field name="timedText-stripped">Hello world!</field>
  <field name="timedText">Hello World! {10}</field>
</doc>

After indexing the "id" of the audio file this would index:

  • just the text "Hello world!"
  • the text of "Hello world!" with the "begin" attribute value in curly braces.

I guess this way the user could be made to search across the "timedText-stripped" field but, via the XSL that can be passed to Solr to display results, the "timedText" field could be displayed in a manner that would make the text "Hello World!" linked to whatever file will play file "XYZ" starting at the 10 second mark. Basically, by planting the "begin" value in curly braces, I can parse the string for the text and the "begin" value as separate things.

So, here's a really crappy XSL snippet that would do something like that. It assumes a variable "$id" exists that equals "XYZ", the identifier for the example audio file.

<xsl:for-each select="//field[@name='timedText']">
  <xsl:variable name="whole">
    <xsl:value-of select="."/>
    <!-- Gets entire element string -->
  </xsl:variable>
  <xsl:variable name="text">
    <xsl:value-of select="substring-before($whole,'{')"/>
    <!-- Gets text prior to seconds -->
  </xsl:variable>
  <xsl:variable name="begin">
    <xsl:value-of select="substring-before(substring-after($whole,'{'),'}')"/>
    <!-- Gets seconds value from end of string -->
  </xsl:variable>
  <a href="someMediaPlayer.php?id={$id}&amp;begin={$begin}">
    <xsl:value-of select="$text"/>
  </a>
  <!-- So, I'm saying that
  "someMediaPlayer.php?id=XYZ&begin=10"
  would launch a player that would start file XYZ at the 10 seconds mark.
  -->
</xsl:for-each>

The search output would be some HTML code like so:

<a href="someMediaPlayer.php?id=XYZ&amp;begin=10">Hello World!</a>
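
The same split-on-the-braces trick can be sketched outside XSLT. This is a rough Python 3 equivalent using only the standard library. The function names are made up, and it assumes the text itself never contains a "{".

```python
from html import escape
from urllib.parse import urlencode

def split_timed_text(whole):
    '''Split "Hello World! {10}" into ("Hello World!", "10").'''
    text, _, rest = whole.partition('{')   # like substring-before / substring-after
    begin = rest.partition('}')[0]         # value between the braces
    return text.strip(), begin

def make_link(file_id, whole):
    '''Build the HTML link for one indexed "timedText" value.'''
    text, begin = split_timed_text(whole)
    href = 'someMediaPlayer.php?' + urlencode({'id': file_id, 'begin': begin})
    return '<a href="{}">{}</a>'.format(escape(href), escape(text))

link = make_link('XYZ', 'Hello World! {10}')
print(link)
```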

It seems weird to index something twice, more or less, but as user Erick says in the nabble.com thread, "You've gotta take off your DB hat and not worry about duplicating data."

But now as I write this, I'm wondering if I can't just index as follows:

  <field name="text">Hello world!</field>
  <field name="begin">10</field>

and trust that for each "text" field there will be a matching "begin" field, so that the two can be used in tandem to create the same HTML link as above. Sounds like I need to play around some more.

:)

Update, September 6, 2012: I wrote a related post to this yesterday in terms of searching across timed text with MySQL and in doing so I realized that the way I was thinking of doing it in Solr was off. Rather than doing it the way I outlined in the original post content (above) in which I was thinking to index all the timed text for a given recording in one Solr "doc" element, I think it makes much more sense to index each line in its own "doc" element as such:

<doc>
  <field name="id">someMediaPlayer.php?source=someFile.mp3&amp;begin=10&amp;end=30</field>
  ...
  <field name="startTime">10</field>
  <field name="stopTime">30</field> 
  <field name="timedText">Hello world!</field>
  <field name="source">someFile.mp3</field> 
</doc>

That way there's no need to post-parse any data fields to get the start and stop time. And, moreover, rather than construct the URL to launch that segment of audio, you can just put the URL directly in the "id" field. You can always use Solr's built-in support for facets to facet off of the "source" field or some descriptive metadata like "title".
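
Here's a rough Python 3 sketch of how those per-line "doc" elements might be generated from the timed text. It makes a few assumptions: the <p> elements are un-namespaced like the snippet in this post (real DFXP files declare a namespace), the times always look like "10.0s", and the result gets wrapped in the <add> element that Solr's XML update format expects. The function names and the <div> wrapper in the sample are made up.

```python
import xml.etree.ElementTree as ET

def seconds(value):
    '''Turn "10.0s" into "10"; fractional values such as "10.5" are kept.'''
    number = float(value.rstrip('s'))
    return str(int(number)) if number.is_integer() else str(number)

def solr_docs(timed_text_xml, source):
    '''Build one Solr <doc> per <p> line of timed text.'''
    add = ET.Element('add')
    for p in ET.fromstring(timed_text_xml).iter('p'):
        begin, end = seconds(p.get('begin')), seconds(p.get('end'))
        url = 'someMediaPlayer.php?source=%s&begin=%s&end=%s' % (source, begin, end)
        fields = [
            ('id', url),
            ('startTime', begin),
            ('stopTime', end),
            ('timedText', ''.join(p.itertext())),
            ('source', source),
        ]
        doc = ET.SubElement(add, 'doc')
        for name, value in fields:
            ET.SubElement(doc, 'field', name=name).text = value
    # tostring() escapes the "&" characters in the URL for us.
    return ET.tostring(add, encoding='unicode')

sample = '<div><p begin="10.0s" end="30.0s">Hello world!</p></div>'
xml_out = solr_docs(sample, 'someFile.mp3')
print(xml_out)
```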

I'll file the original post under the "thinking out loud yet poorly" category.

--------------

Written by nitin

October 16th, 2011 at 10:54 am

PubMed2XL 1.0 available

I've uploaded a new version of PubMed2XL, a Windows application that converts article lists from PubMed.gov into Microsoft Excel files.

Unlike downloading the CSV directly from PubMed.gov, PubMed2XL gives users (OK … advanced users) the ability to customize the output. Even the default format includes the Abstract, links to each article, and links to related articles and reviews.

Here's an example of a spreadsheet made with PubMed2XL and here's the source file used to make it. The source file was downloaded from PubMed.gov using a search for "Mexican flu".

If you'd like to use the software you can download it for free.

If you notice any bugs or have any questions or remarks, please feel free to leave a comment on the site. Thanks!

--------------

Written by nitin

June 18th, 2011 at 2:28 pm

Posted in news,scripts

XSLT: a practical usage example with Pubmed records

Update, December 10, 2010: If you are interested in getting PubMed citations into a spreadsheet application (Excel, etc.) please see PubMed2XL. PubMed2XL is free software that can convert PubMed citations into a Microsoft Excel file.

As part of my coursework for the University of Alabama SLIS program, I took a database class last year. Long story short, one of the assignments was to create a Microsoft Access dbase based on Medline records.

The records were already provided for us, as well as a Java-based script to parse the information into a tab-delimited format prior to import into Access.

For extra credit, we were given another script that would parse records from an Ovid database. If we could find access to an Ovid dbase (I couldn't as they were all password protected, understandably), we could run the script, parse the records and bring them into Access for additional credit.

But there was a way to use a free source, Pubmed, and still get the job done.

How? Well, Pubmed allows article information to be exported as XML.

Once in XML, there was no need for a script to parse the information. From there it was simple to bring the information into Access. I found it easier to import it into Excel, clean it up, and then import that Excel data source into Access.

But what if you have OpenOffice?

I'm not aware of a simple way to import XML documents into OpenOffice Calc (their spreadsheet app) or Base (their dbase app).

But by using XSLT, there's a way around this issue.

Here are the steps:

  1. Conduct searches in Pubmed.
  2. Send your articles to the Clipboard.
  3. Set display to "XML".
  4. Send the results to "File" (see image below).
  5. Save the file as "pubmed_results.txt".
  6. Change the file's extension from "txt" to "xml".
  7. Open the document in a text editor.
  8. Above the DTD (i.e. <!DOCTYPE PubmedArticleSet PUBLIC … ">), add the following line:

<?xml-stylesheet type="text/xsl" href="pubmed_xslt.xsl"?>

  9. Re-save the file.
  10. Then, download this file to the same directory as your "pubmed_results.xml" file.
  11. Now click on "pubmed_results.xml"; your browser should now display select data in an HTML tabular format.
  12. From here, simply copy/paste the tabular data into OpenOffice Calc, clean it up as desired, save it as a ".ods" file, hook it up to OpenOffice Base, and design your queries, etc.
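
If hand-editing the file gets old, the "add a line above the DTD" step can be scripted. Here's a minimal Python 3 sketch. The function name is made up, and the sample string is a shortened stand-in for a real PubMed export, not the actual DOCTYPE.

```python
STYLESHEET_PI = '<?xml-stylesheet type="text/xsl" href="pubmed_xslt.xsl"?>\n'

def add_stylesheet_pi(xml_text, pi=STYLESHEET_PI):
    '''Insert the stylesheet line directly above the DOCTYPE declaration.'''
    index = xml_text.find('<!DOCTYPE')
    if index == -1:
        raise ValueError('no DOCTYPE declaration found')
    return xml_text[:index] + pi + xml_text[index:]

# Shortened stand-in for the contents of "pubmed_results.xml".
sample = ('<?xml version="1.0"?>\n'
          '<!DOCTYPE PubmedArticleSet>\n'
          '<PubmedArticleSet/>\n')

patched = add_stylesheet_pi(sample)
print(patched)
```

To use it on a real export, read "pubmed_results.xml" into a string, pass it through the function, and write the result back out.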

And now you've got a totally Free (minus the cost of a laptop, internet connexion, etc.) desktop dbase of Medline results.

* Note that the XML stylesheet I provided only displays certain info. You can always open the stylesheet in a text editor and set it to display more information, such as Abstract, etc.


 

--------------

Written by nitin

August 15th, 2009 at 1:48 pm

Posted in XML

XSLT transformations: "more than meets the eye"

A few months ago, my department head had encouraged us to learn about XML stylesheets and XSLT transformations. After picking at it here and there, I finally had my breakthrough with it this weekend. Of course, were I more patient, I could have gotten paid to do this at work tomorrow.

As usual, the majority of the work is in finding examples and explanations that speak to me. This thread was particularly helpful.

One of the biggest breakthroughs – as embarrassing as it is to admit – was my realization that one needed an XSLT processor to actually create a new XML document based on the instructions provided in the stylesheet.
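
To make that concrete, here's a minimal sketch of a processor at work: the stylesheet is only a set of instructions, and nothing happens until a processor applies it to a source document. The sketch uses Python's lxml as the processor, which is a stand-in and not one of the processors mentioned in this post. The element names are invented.

```python
from lxml import etree

# The stylesheet: instructions for mapping <record>/<name> to <work>/<title>.
stylesheet = etree.XML('''\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/record">
    <work>
      <title><xsl:value-of select="name"/></title>
    </work>
  </xsl:template>
</xsl:stylesheet>''')

source = etree.XML('<record><name>Prelude</name></record>')

# The processor step: compile the stylesheet, then run it on the source.
transform = etree.XSLT(stylesheet)
result = transform(source)

new_root = result.getroot()
print(etree.tostring(new_root, encoding='unicode'))
```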

I’ve been experimenting with both the Saxon and Microsoft processors. Rather than run them from the Windows command prompt, I’ve been using the command line interface in the jEdit text editor. There’s a built-in XSLT processor plug-in with jEdit, but I couldn’t get it to work, hence the use of the aforementioned methods.

If I understand correctly, one of the uses of this will be to take XML data about audio files generated from the JSTOR/Harvard Object Validation Environment (JHOVE) and map the pertinent information to another schema/XML document. That’s a bit out of my league right now, but a modest start is yet a start.

I’ll also be interested in using transformations to make customized XML documents from MusicXML sources and Zotero exports. Admittedly, I have no real ideas as to what I’d need to do this for, but I simply have a hankering to think of related projects. Maybe pulling the lyrics out of a MusicXML document into a TEI verse document?

--------------

Written by nitin

August 9th, 2009 at 6:38 pm

Posted in XML
