blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘XSLT’ tag

pretty printing XML with Python, lxml, and XSLT

Last week or so I was doing some work with Python and lxml. And, as seems to be the case for a lot of people, lxml's pretty printing wasn't really doing anything for me.

I couldn't find any native lxml solutions to make my XML look pretty. All I found were some functions on various code sites written by people to pretty print the XML using a bunch of regular expressions. Yuck.

So I thought, "Why not use XSLT to pretty print my XML?" and I found an XSL written by none other than Michael Kay on this page (see comment #4).

And it seems to work just fine as a function to return pretty XML, not to mention it's super short and sweet.

Anyway, here's an example of using the XSL for pretty printing.

from lxml import etree

def prettify(someXML):
  # Returns an indented string version of an lxml element or tree by
  # running it through an XSLT copy with indent="yes".
  # for more on lxml/XSLT see: http://lxml.de/xpathxslt.html#xslt-result-objects
  xslt_tree = etree.XML('''\
    <!-- XSLT taken from Comment 4 by Michael Kay found here:
    http://www.dpawson.co.uk/xsl/sect2/pretty.html#d8621e19 -->
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
      <xsl:strip-space elements="*"/>
      <xsl:template match="/">
        <xsl:copy-of select="."/>
      </xsl:template>
    </xsl:stylesheet>''')
  transform = etree.XSLT(xslt_tree)
  result = transform(someXML)
  # str() serializes the XSLT result tree to text
  return str(result)

myXML = etree.XML('<a><b><c><d/></c></b></a>')
print(prettify(myXML))

The example above would output the following:

>>>
<?xml version="1.0"?>
<a>
  <b>
    <c>
      <d/>
    </c>
  </b>
</a>
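
For comparison, here is a sketch that is separate from the XSL approach above. As far as I can tell, lxml's own pretty printing tends to look like it does nothing when the parsed document still carries whitespace-only text nodes. Assuming the XML is parsed from a string, asking the parser to drop those nodes with remove_blank_text=True before serializing with pretty_print=True gives similar indentation. The sample string is made up for illustration.

```python
from lxml import etree

# Made-up sample: the same nesting as above, but with line breaks between tags.
messy = "<a>\n<b>\n<c>\n<d/>\n</c>\n</b>\n</a>"

# remove_blank_text=True drops whitespace-only text nodes while parsing,
# which leaves the serializer free to add its own indentation.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.XML(messy, parser)

# encoding="unicode" returns a str instead of bytes.
pretty = etree.tostring(root, pretty_print=True, encoding="unicode")
print(pretty)
```

This only helps when the whitespace really is insignificant; for mixed content the XSL route and this one both change the text.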

By the way I don't even need to see the XML I'm processing most of the time, so why all the pretty printing fuss?

Well, because it bothers me …

And all good XML should look like an X-wing starfighter. If it doesn't, you're probably doing something wrong or your schema just sucks.

It isn't called an X-wing for no reason.

:P

--------------

Written by nitin

November 12th, 2011 at 11:05 am

Posted in XML

indexing and searching timed text with Solr

I'm still learning about Solr so maybe this post is much ado about nothing. But according to this nabble.com thread, one can't index a source XML document in Solr with its native XML structure intact and then in turn search that structure as one can in an XML database like BaseX.

For most things, that's fine. I mean, for indexing titles, creators, descriptions, etc., I just need to index the value of a given element like <title> so that I can search for that element's value.

But for timed text, it's different. Or at least, it can be.

Say I have this DFXP snippet for an audio file with an "id" value of "XYZ".

<p begin="10.0s" end="30.0s">Hello world!</p>

I need the user to be able to search for the string "Hello world!" or part of it, but I also need to index at least the value of the "begin" attribute so that, if the user clicks on the "Hello world!" line in their search results, I can pass that value to a page that will play the file "XYZ" starting at the 10 second mark. And I don't want the "10" value to be something they search against, since they might be searching for the string "10" within the text itself.

So I'm wondering how to do that with Solr.

Maybe when I learn more I'll discover a better way to do this, but for now I'm thinking I could do the following:

First, I would pretty much index the timed text twice in Solr.

<doc>
  <field name="id">XYZ</field>
...
  <field name="timedText-stripped">Hello world!</field>
  <field name="timedText">Hello world! {10}</field>
</doc>

After indexing the "id" of the audio file this would index:

  • just the text "Hello world!"
  • the text of "Hello world!" with the "begin" attribute value in curly braces.
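
Building that doc from the DFXP snippet could be sketched like this. It's a rough illustration using Python's standard library, not something from a real Solr client: the helper name build_solr_doc is hypothetical, it assumes "begin" is always written as seconds with a trailing "s" (like "10.0s"), and it assumes dropping the fractional part is acceptable.

```python
import xml.etree.ElementTree as ET

def build_solr_doc(file_id, dfxp_p):
    """Build a Solr <doc> element from one DFXP <p> element (hypothetical helper)."""
    p = ET.fromstring(dfxp_p)
    text = (p.text or "").strip()
    # Assumption: "begin" looks like "10.0s"; keep only the whole seconds.
    begin = str(int(float(p.get("begin").rstrip("s"))))
    doc = ET.Element("doc")
    for name, value in (
        ("id", file_id),
        ("timedText-stripped", text),
        # Doubled braces in an f-string produce literal { and }.
        ("timedText", f"{text} {{{begin}}}"),
    ):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return doc

doc = build_solr_doc("XYZ", '<p begin="10.0s" end="30.0s">Hello world!</p>')
print(ET.tostring(doc, encoding="unicode"))
```

DFXP allows other time formats too, so a real version would need a proper time parser instead of the rstrip("s") shortcut.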

I guess this way the user could be made to search across the "timedText-stripped" field but, via the XSL that can be passed to Solr to display results, the "timedText" field could be displayed so that the text "Hello world!" links to whatever page will play file "XYZ" starting at the 10 second mark. Basically, by planting the "begin" value in curly braces, I can parse the string for the text and the "begin" value as separate things.

So, here's a really crappy XSL snippet that would do something like that. It assumes a variable "$id" exists that equals "XYZ", the identifier for the example audio file.

<xsl:for-each select="//field[@name='timedText']">
  <xsl:variable name="whole">
    <xsl:value-of select="."/>
    <!-- Gets entire element string -->
  </xsl:variable>
  <xsl:variable name="text">
    <xsl:value-of select="substring-before($whole,'{')"/>
    <!-- Gets text prior to seconds -->
  </xsl:variable>
  <xsl:variable name="begin">
    <xsl:value-of select="substring-before(substring-after($whole,'{'),'}')"/>
    <!-- Gets seconds value from end of string -->
  </xsl:variable>
  <a href="someMediaPlayer.php?id={$id}&amp;begin={$begin}">
    <xsl:value-of select="$text"/>
  </a>
  <!-- So, I'm saying that
  "someMediaPlayer.php?id=XYZ&begin=10"
  would launch a player that would start file XYZ at the 10 seconds mark.
  -->
</xsl:for-each>

The search output would be some HTML code like so:

<a href="someMediaPlayer.php?id=XYZ&amp;begin=10">Hello world!</a>
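
The same split can be sketched outside of XSL. This is only an illustration in Python's standard library of the parsing idea (text before the brace, seconds inside the braces); the function name timed_text_link is hypothetical, and someMediaPlayer.php is the same placeholder page used above.

```python
from html import escape
from urllib.parse import urlencode

def timed_text_link(file_id, timed_text):
    """Turn 'Hello world! {10}' into an HTML link (hypothetical helper)."""
    # Everything before the first "{" is the text to display.
    text, _, rest = timed_text.partition("{")
    # Everything between "{" and "}" is the "begin" value.
    begin = rest.partition("}")[0]
    query = urlencode({"id": file_id, "begin": begin})
    # escape() turns "&" into "&amp;" so the href is valid HTML.
    return f'<a href="someMediaPlayer.php?{escape(query)}">{escape(text.strip())}</a>'

link = timed_text_link("XYZ", "Hello world! {10}")
print(link)
```

Unlike the XSL, this strips the space left in front of the brace. It also shares the XSL's weak spot: a caption that itself contains a "{" would be cut short.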

It seems weird to index something twice, more or less, but as user Erick says in the nabble.com thread, "You've gotta take off your DB hat and not worry about duplicating data."

But now as I write this, I'm wondering if I can't just index as follows:

  <field name="text">Hello world!</field>
  <field name="begin">10</field>

and trust that for each "text" field there will be a matching "begin" field, so that the two can be used in tandem to create the same HTML link as above. Sounds like I need to play around some more.
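
Here's a sketch of what "in tandem" might look like, pairing the fields by position. It assumes the "text" and "begin" fields come back in the same order they were indexed, which I haven't verified for Solr. The second caption line is made up for illustration, and pair_lines is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up result document: the first text/begin pair is from the example above,
# the second pair is invented to show more than one line.
doc = ET.fromstring("""\
<doc>
  <field name="id">XYZ</field>
  <field name="text">Hello world!</field>
  <field name="begin">10</field>
  <field name="text">Goodbye world!</field>
  <field name="begin">30</field>
</doc>""")

def pair_lines(doc):
    """Pair each 'text' field with the 'begin' field at the same position."""
    texts = [f.text for f in doc.findall("field[@name='text']")]
    begins = [f.text for f in doc.findall("field[@name='begin']")]
    # Refuse to guess if the two lists don't line up one-to-one.
    if len(texts) != len(begins):
        raise ValueError("every 'text' field needs a matching 'begin' field")
    return list(zip(texts, begins))

for text, begin in pair_lines(doc):
    print(begin, text)
```

The length check is the "trust" part made explicit: if a "begin" ever goes missing, the pairing fails loudly instead of silently shifting every later line.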

:)

--------------

Written by nitin

October 16th, 2011 at 10:54 am

PubMed2XL 1.0 available

I've uploaded a new version of PubMed2XL, a Windows application that converts article lists from PubMed.gov into Microsoft Excel files.

Unlike downloading the CSV directly from PubMed.gov, PubMed2XL gives users (OK … advanced users) the ability to customize the output. But even the default format includes the abstract, links to each article, and even links to related articles and reviews.

Here's an example of a spreadsheet made with PubMed2XL and here's the source file used to make it. The source file was downloaded from PubMed.gov using a search for "Mexican flu".

If you'd like to use the software you can download it for free.

If you notice any bugs or have any questions or remarks, please feel free to leave a comment on the site. Thanks!

--------------

Written by nitin

June 18th, 2011 at 2:28 pm

Posted in news,scripts

XSLT: a practical usage example with PubMed records

Update, December 10, 2010: If you are interested in getting PubMed citations into a spreadsheet application (Excel, etc.) please see PubMed2XL. PubMed2XL is free software that can convert PubMed citations into a Microsoft Excel file.

As part of my coursework for the University of Alabama SLIS program, I took a database class last year. Long story short, one of the assignments was to create a Microsoft Access dbase based on Medline records.

The records were already provided for us, as well as a Java-based script to parse the information into a tab-delimited format prior to import into Access.

For extra credit, we were given another script that would parse records from an Ovid database. If we could find access to an Ovid dbase (I couldn't as they were all password protected, understandably), we could run the script, parse the records and bring them into Access for additional credit.

But there was a way to use a free source, PubMed, and still get the job done.

How? Well, PubMed allows article information to be exported as XML.

Once in XML, there was no need for a script to parse the information. From there it was simple to bring the information into Access. I found it easier to import it into Excel, clean it up, and then import that Excel data source into Access.

But what if you have OpenOffice?

I'm not aware of a simple way to import XML documents into OpenOffice Calc (their spreadsheet app) or Base (their dbase app).

But by using XSLT, there's a way around this issue.

Here are the steps:

  1. Conduct searches in PubMed.
  2. Send your articles to the Clipboard.
  3. Set display to "XML".
  4. Send the results to "File" (see image below).
  5. Save the file as "pubmed_results.txt".
  6. Change the file's extension from "txt" to "xml".
  7. Open the document in a text editor.
  8. Above the DTD (i.e. <!DOCTYPE PubmedArticleSet PUBLIC … ">), add the following line:

<?xml-stylesheet type="text/xsl" href="pubmed_xslt.xsl"?>

  9. Re-save the file.
  10. Then, download this file to the same directory as your "pubmed_results.xml" file.
  11. Now click on "pubmed_results.xml"; your browser should now display select data in an HTML tabular format.
  12. From here, simply copy/paste the tabular data into OpenOffice Calc, clean it up as desired, save it as a ".ods" file, hook it up to OpenOffice Base, and design your queries, etc.
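
The hand edit described above (adding the xml-stylesheet line above the DTD and re-saving) could also be scripted. This is a minimal sketch in plain Python, assuming the export contains a "<!DOCTYPE" declaration. The helper name add_stylesheet_line is hypothetical, and the sample string is a shortened stand-in, not a real PubMed export.

```python
def add_stylesheet_line(xml_text, href="pubmed_xslt.xsl"):
    """Insert an xml-stylesheet line directly above the DOCTYPE (hypothetical helper)."""
    line = f'<?xml-stylesheet type="text/xsl" href="{href}"?>\n'
    position = xml_text.find("<!DOCTYPE")
    if position == -1:
        raise ValueError("no DOCTYPE declaration found")
    # Keep everything before the DOCTYPE, add the new line, then the rest.
    return xml_text[:position] + line + xml_text[position:]

# Shortened stand-in for an export; a real file has a full DOCTYPE and records.
sample = (
    '<?xml version="1.0"?>\n'
    "<!DOCTYPE PubmedArticleSet>\n"
    "<PubmedArticleSet/>\n"
)
updated = add_stylesheet_line(sample)
print(updated)
```

For a real file you would read "pubmed_results.xml" into a string, pass it through the function, and write the result back out.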

And now you've got a totally Free (minus the cost of a laptop, internet connexion, etc.) desktop dbase of Medline results.

* Note that the XML stylesheet I provided only displays certain info. You can always open the stylesheet in a text editor and set it to display more information, such as Abstract, etc.


 

--------------

Written by nitin

August 15th, 2009 at 1:48 pm

Posted in XML

XSLT transformations: "more than meets the eye"

A few months ago, my department head had encouraged us to learn about XML stylesheets and XSLT transformations. After picking at it here and there, I finally had my breakthrough with it this weekend. Of course, were I more patient, I could have gotten paid to do this at work tomorrow.

As usual, the majority of the work is in finding examples and explanations that speak to me. This thread was particularly helpful.

One of the biggest breakthroughs – as embarrassing as it is to admit – was my realization that one needed an XSLT processor to actually create a new XML document based on the instructions provided in the stylesheet.
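
To make that concrete, here's a small sketch of a processor at work. It uses Python with lxml (which wraps libxslt) rather than the Saxon or Microsoft processors, and the catalog/item/titles element names are made up for illustration. The stylesheet alone is just instructions; the new document only exists once a processor applies it to a source.

```python
from lxml import etree

# Made-up stylesheet: turn each <item name="..."/> into a <title> element.
stylesheet = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/catalog">
    <titles>
      <xsl:for-each select="item">
        <title><xsl:value-of select="@name"/></title>
      </xsl:for-each>
    </titles>
  </xsl:template>
</xsl:stylesheet>""")

# Made-up source document.
source = etree.XML('<catalog><item name="one"/><item name="two"/></catalog>')

# etree.XSLT compiles the stylesheet; calling it runs the transformation
# and returns a new result tree, leaving the source untouched.
transform = etree.XSLT(stylesheet)
result = transform(source)

titles = [t.text for t in result.getroot().iter("title")]
print(str(result))
```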

I’ve been experimenting with both the Saxon and Microsoft processors. Rather than run them from the Windows command prompt, I’ve been using the command line interface in the jEdit text editor. There’s a built-in XSLT processor plug-in with jEdit, but I couldn’t get it to work, hence the use of the aforementioned methods.

If I understand correctly, one of the uses of this will be to take XML data about audio files generated from the JSTOR/Harvard Object Validation Environment (JHOVE) and map the pertinent information to another schema/XML document. That’s a bit out of my league right now, but a modest start is yet a start.

I’ll also be interested in using transformations to make customized XML documents from MusicXML sources and Zotero exports. Admittedly, I have no real ideas as to what I’d need to do this for, but I simply have a hankering to think of related projects. Maybe pulling the lyrics out of a MusicXML document into a TEI verse document?

--------------

Written by nitin

August 9th, 2009 at 6:38 pm

Posted in XML
