blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘information retrieval’ Category

do you two know each other? Bash meet Python


I'm working on a cool project at work that's about harvesting metadata, indexing it with Solr, and providing a simple UI so that people wanting to search for digital items from North Carolina libraries can have some fun searching from a single interface. It's fun working with other people re: making decisions and all, but also with coding. I'm totally the "backend" guy re: harvesting metadata and indexing and the UI is being handled, very awesomely, by one of the programmers who works at one of the partner institutions. Once the site's up and running on a non-development server (hopefully in just a few weeks), I'll offer up more information and a link or two.

Anyway, once a user makes a selection through the UI and clicks on a link, they go straight to the corresponding page on the originating website. Right now, everything is using an OAI feed for the pilot project, but the Python script that does the harvesting can support lots of other things, like WordPress sites, for example, by harvesting RSS feeds or whatever.

It's nothing new, but what we have works and has a very small footprint in terms of scripts and setup files. The only real requirements are that the data be openly available via HTTP and that there's a programmatic way to construct a new URL to get the next "batch" of metadata.

For instance:

http://blog.humaneguitarist.org/?feed=rss&paged=1

http://blog.humaneguitarist.org/?feed=rss&paged=2

etc.

… oh and that the data be parse-able by XSLT 1.0, but as I mentioned before I'll eventually add support, in an extensible manner, for what I hope is just about any scripting language.
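To make the "programmatic way to construct a new URL" idea concrete, here's a minimal Python sketch that builds the paged feed URLs shown above. The function name `paged_feed_urls` is made up for illustration; it isn't part of the harvester.

```python
from urllib.parse import urlencode


def paged_feed_urls(base_url, pages, feed="rss"):
    """Yield one feed URL per batch: ...?feed=rss&paged=1, ...&paged=2, etc."""
    for page in range(1, pages + 1):
        yield base_url + "?" + urlencode({"feed": feed, "paged": page})


# Print the first two batch URLs for this blog's RSS feed.
for url in paged_feed_urls("http://blog.humaneguitarist.org/", 2):
    print(url)
```

A real harvester wouldn't know the page count up front, so it would more likely keep requesting pages until a feed comes back empty.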

Anyway, I wanted to set up a cron job to run the harvester, so I wrote a Bash script that runs the harvester and the cron job in turn runs the Bash script.

All the partners involved for the pilot agreed that we'd harvest and index every two weeks. Currently, I'm running it nightly, but same difference. The real thing I want to say is that, after harvest, I delete the entire index before re-indexing. This keeps the thing up-to-date and prevents old items from lingering in the index if, in fact, they've been taken down from the originating collection. And, let's face it, that's the reality of it. Things change.

Of course, this entails a huge risk. If something goes wrong with the harvesting script (which is still in its early stages of development) or with one or more of the feeds, then deleting the index is potentially disastrous. So I discussed this with our main IT/programming guy in the office. And he said, "You gotta make your Python script talk to your Bash script."

What he meant was that while the Python script will push through most issues, foreseen and not, I needed the Python script to report if something went wrong with a feed or whatnot along the way. So, what I did was simply set it to print a "0" if all went well and a "1" if anything I identified as a point of concern occurred: Python script failed, one of the feeds returned a non-200, etc. The Bash script, in turn, reads this output and will only delete the index if a "0" was returned by the Python script, called "pOAIndexter.py".
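The Python side of that handshake might look something like the sketch below. This is not the actual pOAIndexter code; `harvest_status` and `report` are hypothetical names, and I'm assuming the script has collected one HTTP status code per feed. The important bit is that the "0" or "1" is the only thing written to standard output, since that's what the Bash script captures.

```python
import sys


def harvest_status(http_statuses):
    """Return "0" if at least one feed was fetched and all answered 200, else "1"."""
    if http_statuses and all(code == 200 for code in http_statuses):
        return "0"
    return "1"


def report(http_statuses):
    """Write the status flag to stdout and any diagnostics to stderr."""
    status = harvest_status(http_statuses)
    if status != "0":
        # Diagnostics go to stderr so they never pollute the captured flag.
        print("harvest problem, statuses:", http_statuses, file=sys.stderr)
    print(status)
    return status


if __name__ == "__main__":
    report([200, 200, 404])  # sample data: one feed returned a non-200
```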

So, here's the Bash. I think the logic is laid out well enough with the echo statements, so I'll just cough it up, as is, below:

#!/bin/bash

#####
echo "HARVESTING metadata (this may take a long time)."
cd /srv/heritageIndexing/pOAIndexter/
output=$(./pOAIndexter.py)
echo ""

echo "Return code: $output"
echo ""

#####
cd /srv/heritageIndexing/apache-solr-3.4.0/example/exampledocs
if [ "$output" != "0" ]; then
 echo "NOT deleting existing index."
else
 echo "DELETING existing index."
 java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
fi
echo ""

#####
echo "INDEXING harvested metadata."
java -jar post.jar /srv/heritageIndexing/pOAIndexter/output/*.xml
echo ""

#####
echo "DELETING temporary harvested metadata files."
cd /srv/heritageIndexing/pOAIndexter
rm output/*.xml
echo ""

#####
echo "Farewell."
--------------


Written by nitin

February 19th, 2012 at 7:56 am

trying to easily format Solr results as HTML with Python


Just a quick Saturday morning post …

One of the nice things about Solr is the ability to pass parameters that will return the results in various formats, including a Python dictionary.

I wanted to see if I could whip up a little function that would let me pass to it both the name of a Solr element (like "title") and then the HTML element I want it mapped to.

It doesn't seem that bad, and is a good reminder that building UIs is in large part about parsing data into HTML, upon which things like CSS and JavaScript can enter and act re: display and interface.

Anyway, so here's an example of some code that gets five results from a Solr instance and then uses the function I wrote to output some HTML elements:

import urllib2 as urllib

### first, get 5 Solr results formatted as a Python dictionary
query_url = 'http://data.twigkit.com/solr-gutenberg/select/?q=poe&version=2.2&start=0&rows=5&wt=python&explainOther'
solr = urllib.urlopen(query_url).read() #read the results
print type(solr) #returns that it's a string :-[
solr = eval(solr) #turns the string into a dictionary. yay.
print type(solr) #returns that it's now a dictionary!

### second, write a function that converts stuff to HTML
def pysolr2html(tagIn, tagOut):
    tagVal = solr['response']['docs'][i][tagIn]
    htmlVars = (tagOut, tagIn, tagVal, tagOut)
    return '<%s class="solr_%s">%s</%s>' %htmlVars

### third, iterate over the response
i = 0
for doc in solr['response']['docs']:
    print pysolr2html('id','span')
    print pysolr2html('title','p')
    i = i + 1

And here's the output from IDLE:

>>>
<type 'str'>
<type 'dict'>
<span class="solr_id">etext8893</span>
<p class="solr_title">Selections from Poe</p>
<span class="solr_id">etext9511</span>
<p class="solr_title">Several Works by Edgar Allan Poe</p>
<span class="solr_id">etext9512</span>
<p class="solr_title">The Works of Edgar Allan Poe, Volume 1</p>
<span class="solr_id">etext9516</span>
<p class="solr_title">The Works of Edgar Allan Poe, Volume 5</p>
<span class="solr_id">etext9513</span>
<p class="solr_title">The Works of Edgar Allan Poe, Volume 2</p>
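For what it's worth, here's a hedged Python 3 take on the same idea that avoids both `eval` and the global counter. It assumes the response text is a plain Python literal, which is what `wt=python` is meant to return, so `ast.literal_eval` can parse it without executing anything. The `sample` string is hand-made from the output above rather than fetched live, and `doc_to_html` is just an illustrative name.

```python
import ast
from html import escape

# Hand-made stand-in for a Solr "wt=python" response body.
sample = (
    "{'response': {'numFound': 2, 'start': 0, 'docs': ["
    "{'id': 'etext8893', 'title': 'Selections from Poe'}, "
    "{'id': 'etext9511', 'title': 'Several Works by Edgar Allan Poe'}]}}"
)

# literal_eval only accepts literals (dicts, lists, strings, numbers, ...).
solr = ast.literal_eval(sample)


def doc_to_html(doc, field, tag):
    """Wrap one field of one Solr document in the given HTML element."""
    value = escape(str(doc[field]))  # escape &, <, > and quotes for HTML
    return '<{0} class="solr_{1}">{2}</{0}>'.format(tag, field, value)


for doc in solr['response']['docs']:
    print(doc_to_html(doc, 'id', 'span'))
    print(doc_to_html(doc, 'title', 'p'))
```

Passing the document into the function means it no longer depends on a loop index defined elsewhere.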

I'll probably do this with PHP in the end and see how easy it might be to make a small Solr wrapper, kind of like Tempo which is super light-weight. But for now, I need to remind myself it's the weekend.

:P

--------------


Written by nitin

January 21st, 2012 at 12:03 pm

simple point and search with a maps API


I'm currently working with some folks on a pilot project to build a shared index of digital collection metadata from libraries in North Carolina. My part entails harvesting the metadata and indexing it with Solr.

Since most of the stuff is North Carolina centric, I thought it might be neat to use the API that the index will have to drag a marker on a map and then use the marker location to send a search to the index. My co-worker also wants to do something similar so people can search using a map marker for a project of hers. So, I thought I'd investigate.

I wanted to see how easy it was to do this with the Google Maps API and, well, it is pretty easy. Especially, since I found this marker dragging example.

There's a lot of stuff that can be done, like dynamically populating the page with a set of results via AJAX, but for now I'm just using the city name plus a pre-written string to create a search string that the user has to manually click.

In the little example I made (below), I'm using my home state of South Carolina. If you move the marker within SC and drop it, a simple search string is created for sending a search, ironically, to Bing Maps for the term "restaurants" and that city.
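The "city name plus a pre-written string" step is simple enough to sketch in a few lines of Python. The base URL here is a placeholder (example.org), not the real Bing Maps endpoint, and `search_link` is a made-up name.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint, standing in for whatever service gets the query.
SEARCH_BASE = "http://example.org/search"


def search_link(city, term="restaurants"):
    """Combine a fixed search term with the city under the map marker."""
    query = "{0} {1}".format(term, city)
    return SEARCH_BASE + "?" + urlencode({"q": query})


print(search_link("Columbia, SC"))
```

`urlencode` takes care of the spaces and the comma, so the city name can be dropped in as-is.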

I think a search like this is really for fun, but it might be a nice way to search and learn a little geography along the way. I've also got an idea for a game using this kind of search that's sort of a spin on Concentration for digital collections, but I'll write that up later. I'm hungry and need to start enjoying my Saturday. Looking at these restaurant searches is making me even hungrier.

:/

--------------


Written by nitin

January 7th, 2012 at 11:29 am

choose your own toppings: whatever code inside CDATA


I really should be packing for an overseas vacation that begins tomorrow, but I wanted to jot some stuff down before I forget – and I intend to forget a lot!

Anywho, in a previous post I wrote about putting XSL inside a CDATA block inside an XML config file. I had the following example:

<map name="LibriVox">
  <XSLT>./XSLT/LibriVox_to_Solr.xsl</XSLT>
  <nextXSL>
  <![CDATA[
  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="/">
      <xsl:variable name="baseURL" select="'%s'" />
      <xsl:variable name="URL_params" select="'%s'" />
      <xsl:variable name="offset_" select="substring-after($URL_params,'=')" />
      <xsl:variable name="offset" select="substring-before($offset_,'&amp;')" />
      <xsl:variable name="limit_" select="substring-after($URL_params,'&amp;')" />
      <xsl:variable name="limit" select="substring-after($limit_,'=')" />
      <xsl:variable name="output">
        <xsl:value-of select="$baseURL" />
        <xsl:text>?offset=</xsl:text>
        <xsl:value-of select="$offset+50" />
        <xsl:text>&amp;limit=</xsl:text>
        <xsl:value-of select="50" />
      </xsl:variable>
      <xsl:value-of select="$output" />
    </xsl:template>
  </xsl:stylesheet>
  ]]>
  </nextXSL>
</map>

This is part of this pOAIndexter script I'm working on for, well, work.

The XML code above is from one of the config files where the <XSLT> element points to an XSL file used to process metadata retrieved from a website. In this case, the point is to make a Solr-compatible XML document that can be used for indexing purposes. The second element, <nextXSL>, is used to return to pOAIndexter the URL for the next batch of metadata for a given feed, i.e. the next page or the next set within a collection, etc.

And as you can see there are two weird looking variables at the top:

      <xsl:variable name="baseURL" select="'%s'" />
      <xsl:variable name="URL_params" select="'%s'" />

The reason being that the pOAIndexter script actually populates these with the actual base URL for the batch just retrieved and the parameters, respectively, before the XSL within the <nextXSL> element is run, returning the string of the next URL.
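Here's a rough Python sketch of both halves of that: filling in the two `%s` placeholders, and the same "add 50 to the offset" logic the XSL performs. I'm guessing at the details, so treat `fill_placeholders` and `next_url` as hypothetical helpers and the example.org URL as a stand-in. Escaping the values before they go into the XML is my own assumption about what would be needed.

```python
from urllib.parse import parse_qs, urlencode
from xml.sax.saxutils import escape

PAGE_SIZE = 50  # the XSL above always asks for 50 records at a time

XSL_SNIPPET = (
    '<xsl:variable name="baseURL" select="\'%s\'" />\n'
    '<xsl:variable name="URL_params" select="\'%s\'" />'
)


def fill_placeholders(base_url, url_params):
    """Substitute the two %s markers, XML-escaping '&' and friends first."""
    return XSL_SNIPPET % (escape(base_url), escape(url_params))


def next_url(base_url, url_params):
    """Python equivalent of the XSL: bump the offset by one page."""
    offset = int(parse_qs(url_params)["offset"][0])
    return base_url + "?" + urlencode({"offset": offset + PAGE_SIZE, "limit": PAGE_SIZE})


print(fill_placeholders("http://example.org/api", "offset=0&limit=50"))
print(next_url("http://example.org/api", "offset=0&limit=50"))
```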

I chose XSL because, as a librarian, I think it's common to a lot of metadata and digital library folk, and such people could extend the capabilities of pOAIndexter without having to know Python. But all along I wanted people to be able to process the metadata and return the next URL with whatever scripting language they want, provided the interpreter exists on their system.

So, say for example you like PHP instead. Instead of using XSL you could use something like this:

<map name="LibriVox">
  <PHP>./PHP/LibriVox_to_Solr.php</PHP>
  <nextPHP>
  <![CDATA[
  <?php
  $baseURL=%s;
  $URL_params=%s;

  //some PHP code here

  echo $output; //where $output is the next URL ...
  ?>
  ]]>
  </nextPHP>
</map>

That way PHP could be used to make the Solr-XML file and to return to pOAIndexter the next URL string so that the next batch of metadata from a feed could be processed/transformed. Of course, you could mix and match, too – XSLT for making the Solr-XML file and PHP just for getting the next URL.
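On the Python side, supporting "whatever scripting language" mostly comes down to running an interpreter and reading what it prints. Below is a minimal sketch of that, not the actual pOAIndexter code. To keep it runnable anywhere, the child process is just another Python interpreter standing in for PHP, Perl, Ruby or the like, and the URL it prints is invented.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run an external interpreter and return whatever it wrote to stdout."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Stand-in for a user-supplied script that echoes the next URL.
child_code = "print('http://example.org/feed?offset=50&limit=50')"
captured = run_and_capture([sys.executable, "-c", child_code])
print(captured)
```

With `check=True`, a script that exits with an error raises an exception instead of quietly handing back an empty URL.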

That's actually pretty simple to do with the common scripting languages like Python, PHP, Perl, Ruby, etc. But what I really wanted to support was JavaScript because, well, it would just be cool, but also because that's another one of those common languages that a lot of people might know even though there might be great variation amongst the other scripting languages they know when compared to a lot of their colleagues.

But I didn't know how to execute JavaScript via the command line so that pOAIndexter can capture the next URL via the standard output stream.

Well, enter PhantomJS.

That is all. Time to pack.

--------------


Written by nitin

December 21st, 2011 at 11:03 pm

on facets and unordered combinations


I have to come up with some sort of basic taxonomy for our files at work, so today I was doing some research to help my mind get where it needs to be.

First, I need to mention this really nice PowerPoint on developing a taxonomy for work called "Getting Started with Business Taxonomy Design". It got me thinking about tags and how if I tagged something with four tags, "a", "b", "c", and the equally dull "d", then I could calculate the number of ways I could filter down to a set of results that would contain that document.

I don't know if my thinking is correct, but I'm thinking there should be 15 ways to get to that document with four tags because I should be able to filter, inclusively, not only by each individual tag but also by unordered combinations of those tags, i.e. tagged with both "b" AND "a", etc.

So, I read this neat page on calculating unordered combinations for a set and wrote a Python script, below, to calculate the total. I had to also see this page about what the value of zero factorial is as I'd assumed it would be undefinable.

The script reports how many unordered combinations there are when choosing all four tags at once, then three, then two, then one.

import math

tags = ['a','b','c','d'] #say I have assigned 4 tags to a file
tags_len = len(tags)
numerator = math.factorial(tags_len)
combinations_value = 0
i = tags_len

def calculateCombinations(i):
    denominator = (math.factorial(i)) * (math.factorial(tags_len-i))
    value = (numerator/denominator)
    return value

while i > 0:
    print calculateCombinations(i), 'unordered combination(s) per', i
    combinations_value = combinations_value + calculateCombinations(i)
    i = i - 1

print '-----\n=', combinations_value, 'unordered combinations'

Here's the output:

>>>
1 unordered combination(s) per 4
4 unordered combination(s) per 3
6 unordered combination(s) per 2
4 unordered combination(s) per 1
-----
= 15 unordered combinations
>>>

And here's me manually calculating it just to be sure …

QUADS (1):
    abcd

TRIPLES (4):
    abc; abd; acd
    bcd;

DOUBLES (6):
    ab; ac; ad
    bc; bd
    cd

SINGLES (4):
    a
    b
    c
    d
-----
1 + 4 + 6 + 4 = 15 unordered combinations
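As one more sanity check, here's a short Python 3 sketch (it needs 3.8 or later for `math.comb`) that enumerates the combinations instead of only counting them, then compares three totals: the enumeration, the binomial formula, and the shortcut 2^n - 1, which is the number of non-empty subsets of n tags.

```python
from itertools import combinations
from math import comb

tags = ['a', 'b', 'c', 'd']
sizes = range(1, len(tags) + 1)

# Every non-empty, unordered selection of tags, e.g. ('a', 'b').
subsets = [combo for size in sizes for combo in combinations(tags, size)]

# The same total from the binomial coefficients: C(4,1) + C(4,2) + C(4,3) + C(4,4).
total_by_formula = sum(comb(len(tags), size) for size in sizes)

print(len(subsets), total_by_formula, 2 ** len(tags) - 1)
```

All three numbers should agree, so for five tags there would be 31 ways in, for six tags 63, and so on.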

--------------


Written by nitin

October 20th, 2011 at 10:13 pm

PivotViewer: oh, the possibilities


I've been casually learning Solr in order to easily create a faceted search/retrieval interface for some digital collections with OAI feeds but now I kinda wish I was learning to do that with Microsoft's PivotViewer instead.

I won't do that, at least for now, but I probably won't be able to resist in the near future.

Here's a cool demo of it with some Netflix OData.

Pivot View of Netflix Instant Watch Movies

For a description of how it's done, check out this post: Pivot, OData, and Windows Azure: Visual Netflix Browsing.

Adding this follow-up thought the next day: Actually, Pivot would be a really good way to make eBooks and eAudio titles discoverable … all those book covers. Sheet music (first page) would be cool, too.

Oh, the possibilities.

--------------


Written by nitin

October 17th, 2011 at 12:16 pm

indexing and searching timed text with Solr


I'm still learning about Solr so maybe this post is much ado about nothing. But according to this nabble.com thread, one can't index a source XML document in Solr with its native XML structure intact and then in turn search that structure as one can in an XML database like BaseX.

For most things, that's fine. I mean for indexing titles, creators, and descriptions, etc. I just need to index the value of a given element like <title> so that I can search for that element's value.

But for timed text, it's different. Or at least, it can be.

Say I have this DFXP snippet for an audio file with an "id" value of "XYZ".

<p begin="10.0s" end="30.0s">Hello world!</p>

I would need the user to search for the string "Hello world!" or part of it but I would also need to index at least the value of the "begin" attribute so that I can pass that to a page that will play the file "XYZ" starting at the 10 second mark – if the user clicks on the "Hello world!" line in their search result. And I don't want the "10" second value to be something they search against since they might be searching for the string "10" within the text itself.

So I'm wondering how to do that with Solr.

Maybe when I learn more I'll discover a better way to do this, but for now I'm thinking I could do the following:

First, I would pretty much index the timed text twice in Solr.

<doc>
  <field name="id">XYZ</field>
...
  <field name="timedText-stripped">Hello world!</field>
  <field name="timedText">Hello world! {10}</field>
</doc>

After indexing the "id" of the audio file this would index:

  • just the text "Hello world!"
  • the text of "Hello world!" with the "begin" attribute value in curly braces.

I guess this way the user could be made to search across the "timedText-stripped" field but, via the XSL that can be passed to Solr to display results, the "timedText" field could be displayed in a manner that would make the text "Hello world!" linked to whatever page will play file "XYZ" starting at the 10 second mark. Basically, by planting the "begin" value in curly braces, I can parse the string for the text and the "begin" value as separate things.

So, here's a really crappy XSL snippet that would do something like that. It assumes a variable "$id" exists that equals "XYZ", the identifier for the example audio file.

<xsl:for-each select="//field[@name='timedText']">
  <xsl:variable name="whole">
    <xsl:value-of select="."/>
    <!-- Gets entire element string -->
  </xsl:variable>
  <xsl:variable name="text">
    <xsl:value-of select="substring-before($whole,'{')"/>
    <!-- Gets text prior to seconds -->
  </xsl:variable>
  <xsl:variable name="begin">
    <xsl:value-of select="substring-before(substring-after($whole,'{'),'}')"/>
    <!-- Gets seconds value from end of string -->
  </xsl:variable>
  <a href="someMediaPlayer.php?id={$id}&amp;begin={$begin}">
    <xsl:value-of select="$text"/>
  </a>
  <!-- So, I'm saying that
  "someMediaPlayer.php?id=XYZ&begin=10"
  would launch a player that would start file XYZ at the 10 seconds mark.
  -->
</xsl:for-each>

The search output would be some HTML code like so:

<a href="someMediaPlayer.php?id=XYZ&amp;begin=10">Hello world!</a>
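The same string-splitting trick can be sketched in Python, in case the results get post-processed outside of XSL. This is only an illustration: `timed_text_link` is a made-up name, and "someMediaPlayer.php" is the same hypothetical player page used in the XSL.

```python
import re
from html import escape
from urllib.parse import urlencode

# Text first, then the "begin" value wrapped in curly braces at the very end.
TIMED = re.compile(r'^(?P<text>.*?)\s*\{(?P<begin>[^{}]+)\}\s*$')


def timed_text_link(field_value, file_id):
    """Split 'Hello world! {10}' into its text and begin value, then build a link."""
    match = TIMED.match(field_value)
    if match is None:
        raise ValueError("no {begin} marker found in: %r" % field_value)
    query = urlencode({"id": file_id, "begin": match.group("begin")})
    href = escape("someMediaPlayer.php?" + query)  # '&' becomes '&amp;' for HTML
    return '<a href="{0}">{1}</a>'.format(href, escape(match.group("text")))


print(timed_text_link("Hello world! {10}", "XYZ"))
```

Anchoring the pattern to the end of the string means braces that happen to appear earlier in the spoken text are left alone.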

It seems weird to index something twice, more or less, but as user Erick says in the nabble.com thread, "You've gotta take off your DB hat and not worry about duplicating data."

But now as I write this, I'm wondering if I can't just index as follows:

  <field name="text">Hello world!</field>
  <field name="begin">10</field>

and trust that for each "text" field there will be a matching "begin" field, so that the two can be used in tandem to create the same HTML link as above. Sounds like I need to play around some more.
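If the two fields do come back as parallel lists, pairing them up is a one-liner. Here's a small Python sketch under that assumption; whether Solr guarantees matching order across two multi-valued fields is exactly the thing I'd need to test. The second cue ("Goodbye world!" at 30 seconds) is invented sample data.

```python
# Hypothetical document with two parallel multi-valued fields.
doc = {
    "id": "XYZ",
    "text": ["Hello world!", "Goodbye world!"],
    "begin": ["10", "30"],
}


def paired_cues(doc):
    """Pair each line of text with its begin time, refusing mismatched lists."""
    texts, begins = doc["text"], doc["begin"]
    if len(texts) != len(begins):
        raise ValueError("text and begin fields are out of step")
    return list(zip(texts, begins))


for text, begin in paired_cues(doc):
    print(begin, text)
```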

:)

--------------


Written by nitin

October 16th, 2011 at 10:54 am

on why search and cloud tags will ruin your dinner parties

2 comments

Just shooting from the hip here …

I'm imagining a library in which I couldn't browse the collection physically by walking up and down the aisles.

Where all I can do is approach a reference librarian and have them bring me back items they thought matched my needs based on a short "interview".

Where I'd then assess what's before me, clarify a few things, and have them bring me back more things – only to learn that a few items I sent back are ones I now want back.

What a friggin' mess. It's like shopping for a suit at a men's clothing store. I don't even want to think about having to do this to buy groceries. Oh, the horror …

But isn't that what search is?

And yet, the Kool-Aid tells us search is better than browsing.

Bull poop.

Search and even keyword tags on a website (like this blog!) are easy to implement programmatically: a matter of some simple SQL, assuming SQL is the backend for the site's metadata. Maybe that's the real reason they're so prevalent.
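Just to show how little it takes, here's a toy Python sketch of keyword search over a posts table, using an in-memory SQLite database. The table layout and the sample rows are invented; a real blog engine's schema will differ.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE posts (title TEXT, body TEXT)")
connection.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [
        ("indexing timed text", "Solr and DFXP"),
        ("dinner parties", "search versus browsing"),
    ],
)


def search_posts(term):
    """Return titles of posts whose title or body contains the term."""
    pattern = "%" + term + "%"
    rows = connection.execute(
        "SELECT title FROM posts WHERE title LIKE ? OR body LIKE ? ORDER BY title",
        (pattern, pattern),
    )
    return [row[0] for row in rows]


print(search_posts("search"))
```

One query with a couple of LIKE clauses and you have "search". A browsable taxonomy has no such shortcut.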

Employing some kind of extensible taxonomy to categorize the information takes more work … and more thought. For sites with a lot of content that will be used over and over for research and referential purposes, it's better if the user has both search and true browsing opportunities, says I.

--------------


Written by nitin

July 4th, 2011 at 10:11 am

MIR article opportunities


I just saw the posting below in my email today. I'm certainly going to see if I can submit something on MXMLiszt.

Call for articles: Music information retrieval (MIR) special issue

_OCLC Systems & Services: International Digital Library Perspectives_ (OSS:IDLP) will be publishing a special issue on music information retrieval (MIR) and libraries. The editor is looking for articles that articulate the planning, development, testing, systems work, marketing, etc. related to MIR, as well as the challenges of providing access to MIR materials. Articles can be of any length, and figures and screen shots are encouraged. OSS:IDLP is a peer-reviewed journal.

If you are interested in contributing, please send the editor your name, a short proposal of the topic, and a tentative title for the article. Deadline for proposals is September 1, 2010. Articles would be due to the editor by February 1, 2011. Any questions and proposal should be directed to the editor, not to this listserv. Thank you.

Dr. Brad Eden
Editor, _OCLC Systems & Services: International Digital Library Perspectives_
Associate University Librarian for Technical Services and Scholarly Communication
University of California, Santa Barbara
eden@library.ucsb.edu

--------------


Written by nitin

July 22nd, 2010 at 8:23 pm

MusicSQL: initial thoughts

one comment

One of the nice things about an emerging standard, namely MusicXML, having a command center (Recordare LLC) is having a central place to learn about what’s new.

On Friday, I was looking at Recordare’s page of MusicXML related software for software that worked from the command line and noticed something new and really interesting: MusicSQL.

According to the Google Code page that hosts this project, MusicSQL is:

… a system for conducting complex searches of symbolic music databases. The database can import and export MusicXML files. In the current version searches are constructed using a command line interface or through simple Python scripting tools.

Basically, at least as I understand it, MusicSQL is a Python program that sits on top of a MySQL database – now I really hope Oracle doesn’t kill MySQL if it buys Sun.

I was so excited to get MusicSQL working that I didn’t notate all the little problems I had along the way. The documentation for MusicSQL is very good and is written for Windows, Mac, and Linux (Ubuntu) users. But I’m inconceivably impatient, so I just mowed through the installation with little care for remembering what I was doing.

I do remember that I had to install Python 2.5, whereas I already have Python 2.6 installed; now I have both. I put/installed all the dependencies in my Python 2.5 directory just to compartmentalize everything, the exception being MySQL, which I installed wherever the default is.

So far, I only ran the first query in the documentation that uses "scientific" musical notation in the form Nx, where "N" is the alphabetical note name, say C, and "x" is an integer that denotes what octave the note is a member of. In other words, a C-Major scale would be "Cx Dx Ex Fx Gx Ax Bx Cx+1", something like "C5 D5 … B5 C6", etc. You can place an integer before the note name to denote its duration.
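The Nx naming scheme is easy to generate, for anyone who wants to build test strings. The Python sketch below only illustrates the note-plus-octave naming described above. It is not MusicSQL's API, and it leaves out the duration prefix since I haven't explored that part.

```python
NOTE_NAMES = ["C", "D", "E", "F", "G", "A", "B"]


def c_major_scale(octave):
    """Spell a one-octave C major scale in note-plus-octave form."""
    notes = ["{0}{1}".format(name, octave) for name in NOTE_NAMES]
    notes.append("C{0}".format(octave + 1))  # top note belongs to the next octave
    return " ".join(notes)


print(c_major_scale(5))
```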

Running the query from the command line, I was really happy with the speed and the output of MusicSQL for the test query.

One problem I did have, though, is I kept getting errors for another great feature of MusicSQL. Basically, after you run your query, you can see a PDF of the results (i.e. the music excerpt pertaining to the query results). The PDF is made by Lilypond, a text-based notation software that produces – in my opinion – the absolute best looking engraving out there, that’s why I use it (and yes, it’s free).

Now Lilypond doesn’t natively read MusicXML; it uses its own encoding. So MusicSQL takes advantage of a Python script that comes with the Lilypond install called "xml2ly" that converts MusicXML to Lilypond format. I left a message on the project forum for MusicSQL, so I’m hoping I can figure out what I need to do to get the Lilypond output of the query results to work. At any rate, I do wonder how effective it can be since the conversion from MusicXML to Lilypond can sometimes get ugly.

I wonder if an alternative solution is to use the command line options for the MuseScore notation software to generate a PDF of the query results. MuseScore can also convert MusicXML to other graphics formats (PNG) and even audio (WAV, FLAC, OGG), so theoretically it could be leveraged to make audio files for the corresponding query results.

At any rate, I’m really looking forward to the future developments of MusicSQL.

And as for using MuseScore’s command line in conjunction with MusicXML and how it can add value to a web collection of MusicXML docs – there will be more to that later …

--------------


Written by nitin

November 15th, 2009 at 3:54 pm
