blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

museline: trying to add support for compressed MusicXML

Just a quick follow up to the last post about using Google Chart Tools to outline melodic contours from MusicXML files …

I wanted to add support for compressed MusicXML files in addition to the non-compressed ones. So far, the code I've got seems to be working with the two or three compressed MusicXML files from Wikifonia I tested.
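For what it's worth, a compressed MusicXML file is just a ZIP archive, and the format has the archive's "META-INF/container.xml" name the main score in a <rootfile> element. Here's a minimal Python 3 sketch of reading the score that way from in-memory bytes. It is not the code pasted further down (that one is Python 2.7 and simply looks for a top-level ".xml" file); the function names and the toy archive are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_mxl_root(mxl_bytes):
    """Return the bytes of the main score file inside a compressed MusicXML archive."""
    with zipfile.ZipFile(io.BytesIO(mxl_bytes)) as archive:
        # META-INF/container.xml points at the score via <rootfile full-path="...">.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find(".//rootfile")
        if rootfile is None:
            raise ValueError("no <rootfile> element in container.xml")
        return archive.read(rootfile.get("full-path"))


def make_toy_mxl():
    """Build a tiny, hypothetical .mxl archive in memory (demo data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "META-INF/container.xml",
            '<container><rootfiles>'
            '<rootfile full-path="score.xml"/>'
            '</rootfiles></container>',
        )
        archive.writestr("score.xml", "<score-partwise/>")
    return buffer.getvalue()


if __name__ == "__main__":
    print(read_mxl_root(make_toy_mxl()))
```

In a real app the bytes would come from the downloaded URL instead of make_toy_mxl(), and the result would go to the XML parser.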

Here's a screenshot below of A-Ha's "Take On Me", one of the best songs from the '80s with one of the absolute best videos, too! To make the graph, I passed the file's URL to the app via the "mxml" parameter, a la "http://localhost:8083/?mxml=http://static.wikifonia.org/1934/musicxml.mxl".

museline_aha_screenshot.png

Here's the video:

Keep in mind the contour script doesn't take repeats into account and that the entire melody repeats three times in the song.

Also, I don't like to make code downloadable if I'm still working on it because I don't want to junk up my web directory, but I'll paste everything essential below: the Google App Engine YAML file, the Python code, and the Jinja/HTML template.

YAML:

application: museline
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /stylesheets
  static_dir: stylesheets
- url: /.*
  script: museline.app

libraries:
- name: jinja2
  version: latest
- name: lxml
  version: latest

Python:

### museline.py
### 2012, Nitin Arora

### import modules
import urllib
from lxml import etree
import math
import re
import webapp2
import jinja2
import os

jinja_environment = jinja2.Environment(
  loader=jinja2.FileSystemLoader(os.path.dirname(__file__)))

#####
class museline(webapp2.RequestHandler):
  def get(self):

    ### read MusicXML file
    try:
      url = self.request.get('mxml')
##      url = 'http://blog.humaneguitarist.org/uploads/i_heart_thee.xml' #test line
      if url[-4:] == '.xml': # uncompressed MusicXML
        readUrl = urllib.urlopen(url).read()

      else: # compressed MusicXML
      ### References:
        # http://stackoverflow.com/a/8858735
        # http://stackoverflow.com/questions/1313845/if-i-have-the-contents-of-a-zipfile-in-a-python-string-can-i-decompress-it-with
        from cStringIO import StringIO
        compressed = urllib.urlopen(url)
        compressedString = StringIO(compressed.read())
        import zipfile
        zipped = zipfile.ZipFile(compressedString, "r")

        archiveFiles = zipped.namelist()
##        self.response.out.write(archiveFiles) # test line
        for archiveFile in archiveFiles:
          if archiveFile[-4:] == ".xml" and "/" not in archiveFile:
            realXML = archiveFile
        extracted = zipped.open(realXML,'r')
        readUrl = extracted.read()

##      self.response.out.write(readUrl) # test line

    except:
      errorMessage = '''<pre>
You must pass an "mxml" parameter.
If you have but still see this message, then there is a problem accessing/reading the MusicXML file.
</pre>'''
      self.response.out.write(errorMessage)
      return

    ### setup pitch values
    notes = ['C','D','E','F','G','A','B']
    i = 0
    noteVals = {}
    for note in notes:
      if note == 'C' or note == 'F':
        noteVals[note] = i + 1
        i = i + 1
      else:
        noteVals[note] = i + 2
        i = i + 2

    ### parse MusicXML file
    parsed = etree.XML(readUrl)

    ### get basic descriptive metadata
    metadata = []
    elementList = ['work-title',
                   'work-number',
                   'movement-number',
                   'movement-title',
                   'creator[@type="composer"]',
                   'creator[@type="lyricist"]']
    for element in elementList:
      xpath = str(".//%s") %element
      if parsed.find(xpath) is not None:
        found = parsed.find(xpath).text
        att = re.match(r'(.*)type="(.*)\"', element)
        if att:
          element = att.group(2)
        if found:
          metadata.append((element,found))
##    self.response.out.write(metadata) # test line

    ### access part one tree
    part = parsed.find('.//part[@id="P1"]')
    pitches = part.findall('.//pitch')
##    self.response.out.write(str(len(pitches)) + " pitches.\n") # test line, number of notes (non-rests)
##    self.response.out.write(str(len(pitches)*.618) + " Golden Ratio.\n") # test line, maybe something for the future.

    ### put pitch values in a list
    pitchList = []
    i = 1
    for pitch in pitches:
      if pitch.find('.//alter') is not None:
        alter = int(pitch.find('.//alter').text)
      else:
        alter = 0
      step = pitch.find('.//step')
      octave = int(pitch.find('.//octave').text)
      pitchPos = str('pitch: ' + str(i))
      pitchClassVal = ((int(noteVals[step.text]) + alter)) * .01
      pitchVal = ((int(noteVals[step.text]) + alter) + (octave * 12)) * .01
      label = (pitchPos, pitchVal, pitchClassVal)
      pitchList.append(label)
      i = i + 1

##    for pitch in pitchList: # test block
##      self.response.out.write(str(pitch)+'<br>')

    #data for the Jinja template
    template_values = {
      'pitchList': pitchList,
      'url': url,
      'metadata': metadata}

    template = jinja_environment.get_template('museline.html')
    self.response.out.write(template.render(template_values)) #write data to the html template

app = webapp2.WSGIApplication([('/', museline)],
                              debug=True)

Template:

<!DOCTYPE HTML>
<!-- museline.html -->
<html>
  <head>
    <title>
      museline
    </title>
    <link type="text/css" rel="stylesheet" href="/stylesheets/style.css" />
    <script type="text/javascript" src="http://www.google.com/jsapi"></script>
    <script type="text/javascript">
      google.load('visualization', '1', {packages: ['corechart']});
    </script>
    <script type="text/javascript">
      function drawVisualization() {
        // Create and populate the data table.
        var data = google.visualization.arrayToDataTable([
        ['pitch position', 'melodic contour'],
        {% for pitch in pitchList %}
          ['{{ pitch[0] }}', {{ pitch[1] }}],
        {% endfor %}
        ]);

        // Create and draw the visualization.
        new google.visualization.LineChart(document.getElementById('visualization')).
        draw(data, {curveType: "function",
          width: 800, height: 400,
        vAxis: {maxValue: 1}}
        );
      }
      google.setOnLoadCallback(drawVisualization);
    </script>
  </head>
  <body>
    <div id="visualization"></div>
    <p>Metadata:</p>
    <ul>
    {% for metadatum in metadata %}
      <li>{{ metadatum[0] }} : {{ metadatum[1] }}</li>
    {% endfor %}
      <li>URL: <a href="{{ url }}">{{ url }}</a></li>
    </ul>
  </body>
</html>
--------------

Written by nitin

May 5th, 2012 at 5:36 pm

museline: charting melodic contours via web service

In the last post, I mentioned I was playing with Google App Engine and Google Chart Tools.

Last night, with some silly movie streaming in the background, I was in bed tinkering with a little idea that I'm sure has been done a thousand times already and that may be built into high-end music notation applications. But it hasn't been done by anyone as stoopid as me!

:P

What I did was whip up a little App Engine/Python app that takes a partwise MusicXML file and uses Google Chart Tools to create a little line chart of the melodic contour of the first <part> element.

Here's a screenshot below of the results using the MusicXML sample file available on the MakeMusic site of Schumann's "Im wunderschönen Monat Mai" from the Dichterliebe. The app has an "mxml" parameter that tells it which MusicXML file to use a la "http://localhost:8083/?mxml=http://downloads2.makemusic.com/musicxml/Dichterliebe01.xml".

 

I've embedded a really nice performance on YouTube if anyone wants to follow along. The contour graph represents the vocal part only.

 

Now, this is just a start. There's a lot of work to do if I pursue this. For starters, I'd like to make the chart synced with an audio/video recording. I don't know if I can do that with Chart Tools, but probably with the <canvas> element if nothing else. Also, I haven't tried this yet with any non-homophonic parts. Anyway, it's a start and it's kinda fun.

I tried to add another line for the actual pitch class contour but it wasn't as interesting to look at as the melodic contour so I disabled that "feature". By pitch class, I mean I was using octave equivalency so that all "C" notes, for example, were plotted at the exact same vertical position as opposed to the screenshot above where two "C" notes an octave apart would have different vertical points on the graph to depict the intervallic difference.

As far as plotting the notes, I ignored rests and durations. I just plotted the pitches as below, starting with "C" with a value of "1" and with the "B" a seventh up from that "C" receiving a "12".

  • C : 1
  • D : 3
  • E : 5
  • F : 6
  • G : 8
  • A : 10
  • B : 12

This way a "C-sharp" and "D-flat" receive a score of "2", for example, because they lie between "C as 1" and "D as 3".

In MusicXML, the <step> element has the note name and the optional <alter> element, which is a number, tells you if it's sharp or flat, etc. The numerical <octave> element tells you what octave range the pitch is in.

So what I'm doing is pulling out the <step> value and converting it to a number as above, adding the <alter> value (a flat is a negative number), and then adding that sum to 12 times the <octave> value. Then, I multiply the value by ".01" just to reduce the number because I want the graph's vertical limit to be a small number even though this shouldn't change the contour itself.
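As a rough illustration of that arithmetic, here's a small Python 3 sketch. It uses the standard library's ElementTree instead of lxml, leaves out the ".01" scaling, and runs on a made-up four-note snippet, so treat the names and sample data as assumptions and not as the app's actual code.

```python
import xml.etree.ElementTree as ET

# Values from the list above: C=1, D=3, E=5, F=6, G=8, A=10, B=12.
NOTE_VALUES = {"C": 1, "D": 3, "E": 5, "F": 6, "G": 8, "A": 10, "B": 12}


def pitch_number(step, alter, octave):
    """Step value plus alteration plus 12 per octave (before any scaling)."""
    return NOTE_VALUES[step] + alter + 12 * octave


def contour(musicxml_text):
    """Return one number per <pitch> element, in document order; rests are skipped."""
    root = ET.fromstring(musicxml_text)
    values = []
    for pitch in root.iter("pitch"):
        step = pitch.findtext("step")
        alter = int(pitch.findtext("alter", default="0"))  # <alter> is optional
        octave = int(pitch.findtext("octave"))
        values.append(pitch_number(step, alter, octave))
    return values


# Hypothetical sample: C4, C-sharp 4, D-flat 4, a rest, then B3.
SAMPLE = """
<part id="P1">
  <note><pitch><step>C</step><octave>4</octave></pitch></note>
  <note><pitch><step>C</step><alter>1</alter><octave>4</octave></pitch></note>
  <note><pitch><step>D</step><alter>-1</alter><octave>4</octave></pitch></note>
  <note><rest/></note>
  <note><pitch><step>B</step><octave>3</octave></pitch></note>
</part>
"""

if __name__ == "__main__":
    print(contour(SAMPLE))  # [49, 50, 50, 48]
```

Note how C-sharp and D-flat land on the same value, and how the B below middle C sits one step under the C.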

Last, I'm trying to pull some basic descriptive metadata if they are present in the MusicXML file and show it below the graph.

Maybe I'll do more with this later. Just goofin' for now.

--------------

Written by nitin

May 3rd, 2012 at 3:55 pm

North Carolina grants, Google App Engine, and pie … mmm.

I took April off from blogging after realizing I was over blogging, as opposed to over logging.

I'll keep this short. Well, I'll try.

I'm shacked up in the apartment due to some unexpected circumstances and yesterday I decided to try and be a little productive and learn something I could potentially use in the workplace.

I learned a little about Google App Engine. I was drawn to it because of the Python support and because it gives me a free environment where I can deploy Python apps using the ever-elusive lxml library.

While I wrote some silly stuff using lxml and data available from the Business.gov API, I ended up uploading a simple app – if you can call it that – that parses a CSV file from North Carolina's (USA) NCOpenBook.

I didn't use the csv module because the CSV file I used has like three lines at the top that aren't headers (people: don't do that!). I don't know if there's a way to handle that with the csv module (there probably is) but I wasn't interested in digging around. Instead, I used a modified version of this code I wrote previously.
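For the record, one way the csv module could probably cope with junk lines at the top is to skip them before handing the rest to a reader. This is a hedged Python 3 sketch, not what the app does; the column names and sample rows are invented, and it assumes the junk lines contain no embedded newlines.

```python
import csv
import io
from itertools import islice


def read_grants(text, skip=3):
    """Parse CSV text whose real header row follows `skip` non-header lines."""
    lines = io.StringIO(text, newline="")
    # islice drops the leading lines; DictReader then treats the next
    # line as the header row.
    reader = csv.DictReader(islice(lines, skip, None))
    return list(reader)


# Hypothetical sample data with three non-header lines at the top.
SAMPLE = (
    "Some report title\n"
    "Generated on some date\n"
    "\n"
    '"Name","Cumulative Total Award"\n'
    '"Agency A","$1,200.50"\n'
    '"Agency B","$300.00"\n'
)

if __name__ == "__main__":
    for row in read_grants(SAMPLE):
        raw = row["Cumulative Total Award"].replace("$", "").replace(",", "")
        print(row["Name"], float(raw))
```

The csv reader also takes care of the quoted cells, so there's no need to split on '","' and strip stray quotes afterwards.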

The CSV file lists grantees who've received funding from North Carolina and the app pulls out the top ten since 2007 based on cumulative grant totals. The app uses Google Chart Tools to make a pie chart of the top ten recipients. I'm not so sure about the colors in the pie chart – it's hard to see the difference between some of the colors associated with each grantee – but it's a simple start.

Here's a screenshot:

Top Ten NC Grants by Grantee

.. and here's the link to the app online: http://top-ten-nc-totals-by-grantee.appspot.com.

I've also pasted the app.yaml file, my Python code, and the Jinja/HTML template below if anyone's interested.

YAML:

application: top-ten-nc-totals-by-grantee
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /stylesheets
  static_dir: stylesheets
- url: /.*
  script: nctotals.app

libraries:
- name: jinja2
  version: latest

Python:

#import modules
import urllib
import webapp2

import jinja2
import os

jinja_environment = jinja2.Environment(
  loader=jinja2.FileSystemLoader(os.path.dirname(__file__)))

#####

#see: http://stackoverflow.com/a/2827664
class Object(object):
  pass

#my CSV parser
def csv2dict(fileName, delimiter):
  f = urllib.urlopen(fileName) #open file
  lines = f.read() #read file

  rows = lines.split("\n") #put lines in list

  #cut out non-header rows at top of this particular CSV file
  for i in range(0,3):
    rows.pop(0)

  #shorten the CSV data to 10 rows (there were too many damn rows in the CSV file!)
  for i in range(12,len(rows)+1):
    rows.pop(-1)

  headers = rows[0].split(delimiter) #put header titles in list
  rows.pop(0) #remove header from "rows" list

  i = 0
  worksheet = {}
  for header in headers: #for each header, i.e. each column
    columnCells = []
    #print header #test line
    for row in rows: #for each non-header row in delimited file
      if row != "": #!!!you need to also add a test for lines that don't split on the delimiter (i.e. notes)
        rowCells = row.split(delimiter) #get cells in row
        columnCells.append(rowCells[i].strip()) #put column's cells in list
    worksheet[header] = columnCells #set header as KEY and set "columnCells" list as VALUE
    i = i + 1

  return worksheet

#####

class MainPage(webapp2.RequestHandler):
  def get(self):
    parsed = csv2dict("http://data.osbm.state.nc.us/openbook/comma_grant_cumulative_awards_and_annual_disbursements_by_grantee.csv", '","') #pass filename and delimiter

    topTen = range(0,len(parsed['"Non-Profit Name (*)'])) #i.e. range is 1 to 10, or 0 to 9 depending on your p.o.v.

    for i in topTen: #add attributes to each of the ten agencies in the CSV file
      topTen[i] = Object()
      topTen[i].name = parsed['"Non-Profit Name (*)'][i].replace('"','')
      topTen[i].total = parsed['Cumulative Total Award'][i]
      raw_total = parsed['Cumulative Total Award'][i]
      raw_total = raw_total.replace('$','')
      raw_total = raw_total.replace(',','')
      topTen[i].raw_total = raw_total

    #data for the Jinja template
    template_values = {
      'topTen': topTen}

    template = jinja_environment.get_template('index.html')
    self.response.out.write(template.render(template_values)) #write data to the index.html template

app = webapp2.WSGIApplication([('/', MainPage)],debug=True)

Template:

<!DOCTYPE HTML>
<html>
  <head>
    <title>
      Top Ten NC Grants by Grantee (since 2007)
    </title>
    <link type="text/css" rel="stylesheet" href="/stylesheets/style.css" />
    <script type="text/javascript" src="http://www.google.com/jsapi"></script>
    <script type="text/javascript">
      google.load('visualization', '1', {packages: ['imagepiechart']});
    </script>
    <script type="text/javascript">
      function drawVisualization() {
        // Create and populate the data table.
        var data = new google.visualization.DataTable();
        data.addColumn('string', 'name');
        data.addColumn('number', 'raw_total');
        data.addRows([
          {% for topper in topTen %}
          ["{{ topper.name }} - {{ topper.total }}", {{ topper.raw_total }}],
          {% endfor %}
        ]);

        // Create and draw the visualization.
        new google.visualization.ImagePieChart(document.getElementById('visualization')).
          draw(data, null);
      }
      google.setOnLoadCallback(drawVisualization);
    </script>
  </head>
  <body>
    <h3>Top Ten <a href="http://www.ncopenbook.gov/NCOpenBook/GrantsHome.jsp">NC Grants</a> by Grantee (cumulative totals since 2007)</h3>
    <p>see the source CSV file <a href="http://data.osbm.state.nc.us/openbook/comma_grant_cumulative_awards_and_annual_disbursements_by_grantee.csv">here</a></p>
    <div id="visualization"></div>
    <p>Made with:</p>
    <ul>
      <li><a href="https://developers.google.com/appengine/docs/python/gettingstartedpython27/">Google App Engine (Python 2.7)</a></li>
      <li><a href="https://developers.google.com/chart/">Google Chart Tools</a></li>
    </ul>
    <p>More info (blog post):</p>
    <ul>
      <li><a href="http://blog.humaneguitarist.org/2012/05/01/north-carolina-grants-google-app-engine-and-pie-mmm/">North Carolina grants, Google App Engine, and pie ... mmm.</a></li>
    </ul>
  </body>
</html>
--------------

Written by nitin

May 1st, 2012 at 10:42 am

Full Metal Alchemyapi.com or “more term extraction crap and linky data crud”

As I mentioned before, I'm playing with the idea of using term generating APIs to build facets in a Solr index project that I'm working on with some people.

The results seem really promising.

If I wasn't in need of a nap before some more college basketball gets underway, I'd say more than I'm about to.

Instead, I'm going to do three quick things here:

  1. Provide a screenshot of the index UI using Calais "social tags" for facets.
    1. This is a local (my computer) copy of the index using a very small set of item metadata. That's to say we currently have about 37k items in the index, and I'm just using about 1k.
    2. I'm only using Calais tags if the "importance" attribute is equal to "1", so I'm leaving out tags Calais considers less relevant because, well, some of the terms generated with an "importance" of greater than "1" were making me think "WTF?".
    3. Some of the terms with underscores like "War_Conflict" appear to be those used in the news industry and are potentially ones to throw out.
  2. Post a small Python script to make a call to Alchemyapi.com, which is similar to – and possibly better than – Calais.
  3. Post the Alchemyapi.com results XML document and talk a little about what I think it can be used for in our project.
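The tag filtering described in the first item could look something like the Python 3 sketch below. It's only an illustration: the sample is trimmed from the sort of "Text/Simple" response Calais returns, the function name is made up, and dropping the underscored terms is the "potentially throw out" idea, not a settled rule.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative sample of Calais social tags.
SAMPLE = """
<SocialTags>
  <SocialTag importance="2">Training camp<originalValue>Training camp (National Football League)</originalValue></SocialTag>
  <SocialTag importance="1">Tim Tebow<originalValue>Tim Tebow</originalValue></SocialTag>
  <SocialTag importance="1">Entertainment_Culture</SocialTag>
  <SocialTag importance="1">Sports</SocialTag>
</SocialTags>
"""


def facet_terms(xml_text, drop_underscored=True):
    """Keep social tags with importance "1", optionally dropping underscored ones."""
    root = ET.fromstring(xml_text)
    terms = []
    for tag in root.iter("SocialTag"):
        if tag.get("importance") != "1":
            continue
        # .text is the tag's own text, i.e. what precedes <originalValue>.
        term = (tag.text or "").strip()
        if drop_underscored and "_" in term:
            continue
        if term:
            terms.append(term)
    return terms


if __name__ == "__main__":
    print(facet_terms(SAMPLE))  # ['Tim Tebow', 'Sports']
```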

So, here's the Calais screenshot (you'll need to view the image at full-resolution to read it):

Calais Facets

Here's the Python script to call the Alchemyapi.com API:

import urllib, urllib2

#set API url and API key
url = 'http://access.alchemyapi.com/calls/text/TextGetRankedConcepts'
apikey = '' #your API key goes here
#get Alchemy API key from: http://www.alchemyapi.com/api/register.html

#set some text for the API
text = '''
Episcopal churches
Churches Cemeteries
Tombs and sepulchral monuments
Postcards--North Carolina.
Flat Rock (N.C.)
Henderson County (N.C.)
'''

#send data to API
params = urllib.urlencode({
  'apikey': apikey,
  'text': text,
  'showSourceText': '1', #shows the original text sent to the API
})
alchemyThis = urllib2.urlopen(url, params).read()

#view results
print alchemyThis

And here's the output for the code above:

<?xml version="1.0" encoding="UTF-8"?>
<results>
  <status>OK</status>
  <usage>By accessing AlchemyAPI or using information generated by AlchemyAPI, you are agreeing to be bound by the AlchemyAPI Terms of Use: http://www.alchemyapi.com/company/terms.html</usage>
  <url/>
  <language>english</language>
  <text>Episcopal churches Churches Cemeteries Tombs and sepulchral monuments Postcards--North Carolina. Flat Rock (N.C.) Henderson County (N.C.)</text>
  <concepts>
    <concept>
      <text>North Carolina</text>
      <relevance>0.920839</relevance>
      <website>http://www.nc.gov</website>
      <dbpedia>http://dbpedia.org/resource/North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000002b62d</freebase>
      <opencyc>http://sw.opencyc.org/concept/Mx4rvViyspwpEbGdrcN5Y29ycA</opencyc>
      <yago>http://mpii.de/yago/resource/North_Carolina</yago>
      <geonames>http://sws.geonames.org/4482348/</geonames>
    </concept>
    <concept>
      <text>Tomb</text>
      <relevance>0.837256</relevance>
      <geo>29.855 31.219</geo>
      <dbpedia>http://dbpedia.org/resource/Tomb</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000007ff03</freebase>
      <opencyc>http://sw.opencyc.org/concept/Mx4rwQw2p5wpEbGdrcN5Y29ycA</opencyc>
    </concept>
    <concept>
      <text>Burial monuments and structures</text>
      <relevance>0.773605</relevance>
      <dbpedia>http://dbpedia.org/resource/Burial_monuments_and_structures</dbpedia>
    </concept>
    <concept>
      <text>Flat Rock, Henderson County, North Carolina</text>
      <relevance>0.718415</relevance>
      <geo>35.266666666666666 -82.45333333333333</geo>
      <website>http://villageofflatrock.org/</website>
      <dbpedia>http://dbpedia.org/resource/Flat_Rock,_Henderson_County,_North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000ebc28</freebase>
      <yago>http://mpii.de/yago/resource/Flat_Rock,_Henderson_County,_North_Carolina</yago>
    </concept>
    <concept>
      <text>Henderson County, North Carolina</text>
      <relevance>0.615825</relevance>
      <geo>35.34 -82.48</geo>
      <website>http://www.hendersoncountync.org</website>
      <dbpedia>http://dbpedia.org/resource/Henderson_County,_North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000a10b4</freebase>
      <yago>http://mpii.de/yago/resource/Henderson_County,_North_Carolina</yago>
    </concept>
    <concept>
      <text>Asheville, North Carolina</text>
      <relevance>0.610351</relevance>
      <website>http://www.ashevillenc.gov/</website>
      <dbpedia>http://dbpedia.org/resource/Asheville,_North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000eb2ac</freebase>
      <census>http://www.rdfabout.com/rdf/usgov/geo/us/nc/counties/buncombe_county/asheville</census>
      <yago>http://mpii.de/yago/resource/Asheville,_North_Carolina</yago>
      <geonames>http://sws.geonames.org/4453066/</geonames>
    </concept>
    <concept>
      <text>Episcopal Church in the United States of America</text>
      <relevance>0.610029</relevance>
      <dbpedia>http://dbpedia.org/resource/Episcopal_Church_in_the_United_States_of_America</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000015f1b</freebase>
      <yago>http://mpii.de/yago/resource/Episcopal_Church_in_the_United_States_of_America</yago>
    </concept>
    <concept>
      <text>New York</text>
      <relevance>0.592008</relevance>
      <geo>43.0 -75.0</geo>
      <website>http://www.ny.gov</website>
      <dbpedia>http://dbpedia.org/resource/New_York</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000054dd5d</freebase>
      <opencyc>http://sw.opencyc.org/concept/Mx4rvViNs5wpEbGdrcN5Y29ycA</opencyc>
      <census>http://www.rdfabout.com/rdf/usgov/geo/us/ny</census>
      <yago>http://mpii.de/yago/resource/New_York</yago>
    </concept>
  </concepts>
</results>

As you can see, "New York" shows up but it has less than 60% relevance, so maybe that's a threshold to consider when indexing automated subject terms with Alchemyapi. That's just my theory and only lots of testing will help determine what the threshold really is – if there's one at all.
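To make the threshold idea concrete, here's a minimal Python 3 sketch that keeps only concepts at or above a cutoff. The 0.6 default is just the guess floated above, not a tested value, and the sample is a trimmed copy of the output with the function name invented for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the results above: text and relevance only.
SAMPLE = """
<results>
  <concepts>
    <concept><text>North Carolina</text><relevance>0.920839</relevance></concept>
    <concept><text>Tomb</text><relevance>0.837256</relevance></concept>
    <concept><text>New York</text><relevance>0.592008</relevance></concept>
  </concepts>
</results>
"""


def concepts_above(xml_text, threshold=0.6):
    """Return (text, relevance) pairs whose relevance meets the threshold."""
    root = ET.fromstring(xml_text)
    kept = []
    for concept in root.iter("concept"):
        relevance = float(concept.findtext("relevance"))
        if relevance >= threshold:
            kept.append((concept.findtext("text"), relevance))
    return kept


if __name__ == "__main__":
    for text, relevance in concepts_above(SAMPLE):
        print(text, relevance)
```

With the sample above, "New York" drops out while "North Carolina" and "Tomb" stay.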

As you can also see, there's a lot of potential for linked data with this output: links to data from relevant dbpedia pages, etc. One neat thing would be to make it so that if the user hovers over a facet, the UI pops up more information from these linked data sources, like relevant websites, mapped geo-coords using the Google Maps API, definitions of the faceted term, similar concept visualizations, etc.

That's all. Sleepy time and B-ball starts soon …

--------------

Written by nitin

March 25th, 2012 at 4:57 pm

easy calls to OpenCalais with Python, daggummit!

Yesterday, I wrote this post about using Yahoo's deprecated term extraction web service to generate "subjects" – or whatever you want to call them – for an item based on the metadata housed in a Solr-compatible XML file. I'd also wondered about doing the same thing with OpenCalais.

Before we go any further, I'd just like to say I wrote that post from my hotel room. I'm writing today's from the Denver airport with about 2 hours to kill before my flight departs. And I'd also like to point out that when writing blog posts with spotty Wi-Fi connections, one should not compose their post online through WordPress. I'm using WordPad, and I should probably make that a habit.

Yeah, so anyway there's not that much good documentation on how to make calls on the Calais site. By "good" I mean there's no code sample to rip off. I'm sure it's perfectly fine for people who actually know what they're doing.

Using "The Google" I found this helpful post on making calls to OpenCalais. While I found it very well written and the code very helpful, I didn't want to have "httplib2" as a dependency since it's not available out-of-the-box with Python 2.7, as far as I know. Nor did I want to do anything with JSON. I'm just trying to make a simple POST request to the OpenCalais REST API – is all.

Using that post's code as a starting point, I whipped up some simple Python without "httplib2".

Note that this code passes three parameters to the API through the following variables:

  • "myCalaisAPI_key": this is where to paste your API key once you get it from Calais here.
  • "sampleText": this is a string of plain text to send to Calais for it to analyze and build terms for.
  • "calaisParams": these are the options to pass to the service in XML format. 

Note that I'm specifically requesting what I really want, "social tags", via the following option:

c:enableMetadataType="GenericRelations,SocialTags"

… and I'm specifically requesting a simple result format as follows:

c:outputFormat="Text/Simple"

There are other options, including RDF, that can be requested per the options mentioned on this page.

If you look at the code, you can see I'm asking Calais to analyze some text about Tim Tebow since I was in Denver when the Denver Broncos football team acquired Peyton Manning and traded Tebow to the New York Jets. The text is from a USA Today article from, um, yesterday.

The Jets, I'd like to state, are not worthy of a hyperlink. And that's only part of the reason I'm sad to see Tebow go there. Alas.

Anyway, here's the output below, followed by the code. Note that – as mentioned in the code – I'm using the slightly older REST API. But what do I care right now. I'm just testing.

Here's the output:

<!--Use of the Calais Web Service is governed by the Terms of Service located at http://www.opencalais.com. By using this service or the results of the service you agree to these terms of service.-->
<!--
Company: HBO,
Organization: New York Jets,
Person: Tim Tebow,
TVShow: Hard Knocks,
-->
<OpenCalaisSimple>
  <Description>
    <calaisRequestID>dafa6c80-b4f6-77b1-1363-de96bb7764f4</calaisRequestID>
    <id>http://id.opencalais.com/ODNr1ciDte8wwv0nU3G1jw</id>
    <about>http://d.opencalais.com/dochash-1/895ba8ff-4c32-3ae1-9615-9a9a9a1bcb39</about>
    <docTitle/>
    <docDate>2012-03-23 00:56:09.679</docDate>
    <externalMetadata/>
  </Description>
  <CalaisSimpleOutputFormat>
    <Company count="1" relevance="0.643" normalized="HBO &amp; Company">HBO</Company>
    <Organization count="1" relevance="0.643">New York Jets</Organization>
    <Person count="1" relevance="0.643">Tim Tebow</Person>
    <TVShow count="1" relevance="0.643">Hard Knocks</TVShow>
    <SocialTags>
      <SocialTag importance="2">Training camp<originalValue>Training camp (National Football League)</originalValue>
      </SocialTag>
      <SocialTag importance="2">New York Jets<originalValue>New York Jets</originalValue>
      </SocialTag>
      <SocialTag importance="2">Florida Gators football team<originalValue>2008 Florida Gators football team</originalValue>
      </SocialTag>
      <SocialTag importance="1">Tim Tebow<originalValue>Tim Tebow</originalValue>
      </SocialTag>
      <SocialTag importance="1">HBO<originalValue>HBO</originalValue>
      </SocialTag>
      <SocialTag importance="1">Hard Knocks<originalValue>Hard Knocks (TV series)</originalValue>
      </SocialTag>
      <SocialTag importance="1">Entertainment_Culture</SocialTag>
      <SocialTag importance="1">Sports</SocialTag>
    </SocialTags>
    <Topics>
      <Topic Taxonomy="Calais" Score="1.000">Entertainment_Culture</Topic>
      <Topic Taxonomy="Calais" Score="1.000">Sports</Topic>
    </Topics>
  </CalaisSimpleOutputFormat>
</OpenCalaisSimple>

And the code:

# this code is based on: http://www.flagonwiththedragon.com/2011/06/08/dead-simple-python-calls-to-open-calais-api/

import urllib, urllib2

#########################
##### set API key and REST URL values.

myCalaisAPI_key = '' # your Calais API key.
calaisREST_URL = 'http://api.opencalais.com/enlighten/rest/' # this is the older REST interface.
# info on the newer one: http://www.opencalais.com/documentation/calais-web-service-api/api-invocation/rest

# alert user and shut down if the API key variable is still null.
if myCalaisAPI_key == '':
  print "You need to set your Calais API key in the 'myCalaisAPI_key' variable."
  import sys
  sys.exit()

#########################
##### set the text to ask Calais to analyze.

# text from: http://www.usatoday.com/sports/football/nfl/story/2012-03-22/Tim-Tebow-Jets-hoping-to-avoid-controversy/53717542/1
sampleText = '''
Like millions of football fans, Tim Tebow caught a few training camp glimpses of the New York Jets during the summer of 2010 on HBO's Hard Knocks.
'''

#########################
##### set XML parameters for Calais.

# see "Input Parameters" at: http://www.opencalais.com/documentation/calais-web-service-api/forming-api-calls/input-parameters
calaisParams = '''
<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <c:processingDirectives c:contentType="text/txt"
      c:enableMetadataType="GenericRelations,SocialTags"
      c:outputFormat="Text/Simple"/>
  <c:userDirectives/>
  <c:externalMetadata/>
</c:params>
'''

#########################
##### send data to Calais API.

# see: http://www.opencalais.com/APICalls
dataToSend = urllib.urlencode({
    'licenseID': myCalaisAPI_key,
    'content': sampleText,
    'paramsXML': calaisParams
})

#########################
##### get API results and print them.

results = urllib2.urlopen(calaisREST_URL, dataToSend).read()
print results
--------------

Written by nitin

March 23rd, 2012 at 1:28 pm

make you some facets, boy!

As I mentioned the other day in this post, I've been working with some awesome people to harvest, index, and make searchable metadata for digital library collections from multiple institutions across the state of North Carolina, USA.

In the post I just linked to, I talked about the problems of inconsistent metadata across institutions and how that negatively impacts browsing via facets with Solr. I also wondered out loud about resolving/aligning small discrepancies via text analysis.

Well, another way to tackle this problem is – after harvesting the metadata but before indexing it – to "make" facet-able terms via some sort of term extraction. While at DrupalCon 2012 in Denver, CO this week I went to a presentation where the presenter mentioned a project he'd worked on pulling in RSS feeds. In passing, he mentioned using OpenCalais to make a tag cloud. I totally forgot I had an API key for OpenCalais!

Anyway, now I see there are lots of similar web services. Which one is best in terms of term extraction and which one allows the most API hits per day is a matter for another day, but today – in my hotel now that the conference has ended – I thought I'd do a little scripting to get me on the path to really testing this.

Using the soon-to-be deprecated Yahoo Term Extraction Web Service, I tested taking a sample Solr-compatible XML index file and sending the metadata in it to the service to retrieve new subject terms. While my test script doesn't do it here, the idea is that after retrieving these new terms from the API, they could be placed into the Solr-compatible index file. After indexing the updated file, these new terms could be exposed to the user as click-able facets.
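That "place the terms into the index file" step might look roughly like this in Python 3. It's a sketch under assumptions: the function name is hypothetical, the "yahooTerm" field name is just a label for the new fields, the terms are hard-coded instead of coming from an API call, and only the first <doc> is touched.

```python
import xml.etree.ElementTree as ET

# Cut-down Solr-compatible index file with a single <doc>.
SAMPLE = """
<add>
  <doc>
    <field name="title">Buy an electric refrigerator</field>
    <field name="subject">Silent films.</field>
  </doc>
</add>
"""


def add_terms(solr_xml, terms, field_name="yahooTerm"):
    """Append one <field> per extracted term to the first <doc> and return the XML."""
    root = ET.fromstring(solr_xml)
    doc = root.find("doc")
    for term in terms:
        field = ET.SubElement(doc, "field", {"name": field_name})
        field.text = term
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_terms(SAMPLE, ["silent films", "pittsburgh pa"]))
```

The returned string could then be written back out and indexed in place of the original file.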

I'll have to test this with lots of real-world metadata from across our test-set of metadata to see if the term extraction service can be used to produce nicer facets with disparate metadata than what we currently see, but for now I just needed to write a play/test script.

Below, I've pasted the Python script and the output, which explains a little of what it's doing.

Actually, I've pasted the output first since people might not need or want to see the code. At the end, I've posted the "social tags" that OpenCalais would seem to generate for the same metadata – for comparison purposes.

The output:

Here's an XML file that can be indexed by Solr (it was generated via harvesting data from the Library of Congress using Python and XSL).

<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="creator">AFI/Kalinowski (Eugene) Collection (Library of Congress)</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Advertising--Electric household appliances--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Trade shows--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Silent films.</field>
    <field name="subject">Pittsburgh (Pa.)--Manufactures.</field>
    <field name="description">Largely graphic commercial for electric refrigerators in general and a refrigerator show, presumably in Pittsburgh, in particular.</field>
  </doc>
 </add>

-----

After using the Yahoo term extraction service we can create more <field> elements.

<field name="yahooTerm">electric household appliances</field>
<field name="yahooTerm">electric refrigerators</field>
<field name="yahooTerm">electric refrigerator</field>
<field name="yahooTerm">library of congress</field>
<field name="yahooTerm">silent films</field>
<field name="yahooTerm">collection library</field>
<field name="yahooTerm">pittsburgh pa</field>
<field name="yahooTerm">pennsylvania</field>

-----

If we place those new terms into the original XML file and reindex the item, we'll have new facets to play with.

This is a *potential* solution for creating practical, useable, and consistent(?) facets for metadata harvested from different institutions that use different subject terms and internal taxonomies, etc.

I think the basic Yahoo term extractor is deprecated(?), but there are other options such as their newer Context Analysis API, OpenCalais, and AlchemyAPI.com, etc.

The script:

#####
## merge all <fields> into one string; place in "context" variable.
SolrXML = '''
<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="creator">AFI/Kalinowski (Eugene) Collection (Library of Congress)</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Advertising--Electric household appliances--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Trade shows--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Silent films.</field>
    <field name="subject">Pittsburgh (Pa.)--Manufactures.</field>
    <field name="description">Largely graphic commercial for electric refrigerators in general and a refrigerator show, presumably in Pittsburgh, in particular.</field>
  </doc>
 </add>
'''

from lxml import etree # see: http://lxml.de/ for this library.

SolrXML_parsed = etree.XML(SolrXML)
SolrXML_combined = SolrXML_parsed.findall(".//field")
SolrXML_combined.pop(0) #remove <field name="identifier"> since we don't want
                        #a term generated from the URL; ideally this should be
                        #removed by having an attribute of "identifier" rather
                        #than by position, but this is just a test.

SolrXML_combinedList = []
for element in SolrXML_combined:
  SolrXML_combinedList.append(element.text)
context = (" ".join(SolrXML_combinedList))
#print context #test line

#####
## send XML example to Yahoo termExtraction service; print generated terms
## reference example: http://developer.yahoo.com/python/python-rest.html#post
import urllib, urllib2

url = 'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction'
appid = 'YahooTermTest'

params = urllib.urlencode({
  'appid': appid,
  'context': context,
})

yahooResultsXML = urllib2.urlopen(url, params).read()
#print yahooResultsXML #test line

yahooResultsXML_parsed = etree.XML(yahooResultsXML)
newSolrTerms = ""
for yahooTerm in yahooResultsXML_parsed:
  newSolrTerms = newSolrTerms + "<field name=\"yahooTerm\">" + yahooTerm.text \
  + "</field>\n"

#####
## print what the script is trying to do and the results ...
print "Here's an XML file that can be indexed by Solr\
 (it was generated via harvesting data from the Library of Congress using Python and XSL)."

print SolrXML

print "-"*5 + "\n"

print "After using the Yahoo term extraction service we can create more\
 <field> elements.\n"

print newSolrTerms

print "-"*5 + "\n"

print "If we place those new terms into the original XML file and reindex the\
 item, we'll have new facets to play with.\n"

print "This is a *potential* solution for creating practical, useable, and\
 consistent(?) facets for metadata harvested from different institutions that use\
 different subject terms and internal taxonomies, etc.\n"

print "I think the basic Yahoo term extractor is deprecated(?), but there are\
 other options such as their newer Context Analysis API, OpenCalais, and\
 AlchemyAPI.com, etc."

And here's what OpenCalais extracted as "social tags":

  • Business Finance
  • Entertainment Culture
  • Food storage
  • Food preservation
  • Home appliances
  • Pittsburgh
  • Refrigerator
--------------

Written by nitin

March 22nd, 2012 at 7:58 pm

facet mashing, a tragedy in 0.987 acts

Update, March 21, 2012: I'm at DrupalCon 2012 and after going to a session on node.js – which I've had in the back of my head as a potential replacement for Python for some metadata harvesting software I'm working on – I was reminded of OpenCalais which I haven't looked at in forever, probably because I wouldn't have understood it before. Anyway, maybe that's a solution to the issues I'm describing below in terms of generating some sort of browse-able facets. This is definitely something to look into.

Home sick again, so that means another meaningless contribution to the "blogosphere" …

So, I've been working with some folks on a project to make a single site search for digital collections across the state I work in.

We're using Solr for the index and OAI feeds for now even though the metadata harvesting software is agnostic of OAI and can support other feed types, etc. But that's not the point here …

The point is that metadata coming in from different places makes for a mess if you want to expose facets … and we might veer to not showing them because no one wants to get into the murky waters of trying to control for that across multiple places.

I think subject facets are still useful though because I like to "play around", to stumble in the dark, and just have fun.

But, of course, there's still the fact-of-the-matter that across multiple institutions you might see subjects from one place written as "Asheville, NC" and another as "Asheville, (N.C.)".

Well, that stinks. They are essentially the same thing, but they would get exposed as two separate facets.

So, in the spirit of stumbling in the dark, last Saturday morning I worked on a preliminary little function in Python to try and merge strings like the Asheville example above.

The idea is that the function should present to the user the version that has more "votes", i.e. the one that has more matches in the current search results. So, if "Asheville, NC" appeared 10 times and "Asheville, (N.C.)" appeared 15 times in the user's search results, the function would display "Asheville, (N.C.)" to the user and say it has 25 matches. When the user clicks "Asheville, (N.C.)" a search would be launched for either "Asheville, (N.C.)" or "Asheville, NC". Essentially, the idea is to beautify the facets at the last possible moment (i.e. through a function in the user interface) so the user doesn't have to see the ugly reality of metadata from all over the place; it's also about rectifying things based on text similarity not on semantic similarity – which is another ballgame altogether.
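Just to pin down the "votes" idea with the numbers above, here's a rough sketch in Python 3. It assumes the variants have already been judged to be the same facet (that's the hard part, and it isn't shown here). The name merge_variants and the OR-joined query string are placeholders of mine, not anything from the real project.

```python
def merge_variants(variants):
    """variants: list of (label, count) pairs already judged to be the same facet."""
    # The label with the most matches "wins" and is the one shown to the user.
    display_label, _ = max(variants, key=lambda pair: pair[1])
    # The displayed count is the sum across all variants.
    total = sum(count for _, count in variants)
    # Clicking the facet should search for any of the variants; each one is
    # quoted so spaces and punctuation stay together (no escaping is done here).
    query = " OR ".join('"%s"' % label for label, _ in variants)
    return display_label, total, query

label, total, query = merge_variants([("Asheville, NC", 10), ("Asheville, (N.C.)", 15)])
print("%s (%d)" % (label, total))
print(query)
```

A real search backend would have its own quoting rules, so the query string is the part most likely to need changing.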

The function uses some known string similarity methods. It's promising but there's still lots of work to do if I really decide to pursue this. And by "lots of work" I really mean seeing if someone with the proper computer science and linguistic background has already written a library for this kind of thing. And (adding this the day after I originally wrote this), I also need to play with s-match.

Anyway, the test code is below and the results are below that but I need to stop writing because I'm dropping out and need to take a nap.

:/

#####
def facetMasher(x,y):
  info = "Comparing \"%s\" with %s facets, against \"%s\" with %s facets." %(x[0],x[1],y[0],y[1])
  print info

  output = ""

  import Levenshtein #Windows32/Python 2.7 installer: http://sourceforge.net/projects/translate/files/python-Levenshtein/
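  # note: Levenshtein.jaro is the plain Jaro similarity, even though the label
  # printed below says "Jaro-Winkler" (that would be Levenshtein.jaro_winkler).
  lev = Levenshtein.jaro
  myJaro = lev(x[0], y[0])

  lev2 = Levenshtein.distance
  myDist = lev2(x[0], y[0])

  print "Jaro-Winkler score: ", myJaro
  print "Levenshtein distance: ", myDist
  # merge only if the strings are similar enough ...
  if myJaro > .95 or (myJaro > .75 and myDist < 10):
      # ... but not if they differ by a single edit, e.g. "World War 1" vs. "World War 2".
      if myDist > 1: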
          totalFacets = x[1] + y[1]
          if (x[1] >= y[1]):
              mergedString = x[0]
          else:
              mergedString = y[0]
          output =  "Merging to \"%s\" with %s facets." %(mergedString, totalFacets)
  if output == "":
    output = "Keeping \"%s\" with %s facets, and \"%s\" with %s facets." %(x[0],x[1],y[0],y[1])

  print output
  print ("--\n")

##### tests ...
facetMasher (("Bibles",3),("bible",2)) #interesting ...
facetMasher (("Fibles",3),("fible",2))

facetMasher (("World War 1",3),("World War 2",2))

facetMasher (("Images",4),("image",3))
facetMasher (("Images",2),("movies",3))

facetMasher (("Asheville, NC",3),("Asheville (N.C.)",2))
facetMasher (("Asheville, (NC)",3),("Asheville (N.C.)",2))
facetMasher (("Granville County (N.C.)",120),("Granville County, N.C.",2))

facetMasher (("foo & bar",3),("foo and bar",2))

facetMasher (("United States--History--Civil War, 1861-1865",3),("United States--History--Civil War, 1861-1865--Correspondence",2))

facetMasher (("United States--History--World War II",3),("United States--History--World War I",2))
facetMasher (("United States--History--World War Two",3),("United States--History--World War 2",2))
facetMasher (("United States--History--World War Two",3),("United States--History--World War 1",2))
facetMasher (("United States--History--World War 1",3),("United States--History--World War 2",2))

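And here are the results, below. It's interesting how "Bibles" vs. "bible" doesn't merge, yet "Fibles" and "fible" do. The comparison is case-sensitive, so neither capital letter finds a match; but the second "b" in "Bibles" pairs up with the first "b" in "bible", which counts as a transposition and drags the score just under the .75 cutoff. Also, there are some undesired results such as merging "United States--History--World War Two" with "United States--History--World War 1" because the algorithm still sucks.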

Comparing "Bibles" with 3 facets, against "bible" with 2 facets.
Jaro-Winkler score:  0.738888888889
Levenshtein distance:  2
Keeping "Bibles" with 3 facets, and "bible" with 2 facets.
--

Comparing "Fibles" with 3 facets, against "fible" with 2 facets.
Jaro-Winkler score:  0.822222222222
Levenshtein distance:  2
Merging to "Fibles" with 5 facets.
--

Comparing "World War 1" with 3 facets, against "World War 2" with 2 facets.
Jaro-Winkler score:  0.939393939394
Levenshtein distance:  1
Keeping "World War 1" with 3 facets, and "World War 2" with 2 facets.
--

Comparing "Images" with 4 facets, against "image" with 3 facets.
Jaro-Winkler score:  0.822222222222
Levenshtein distance:  2
Merging to "Images" with 7 facets.
--

Comparing "Images" with 2 facets, against "movies" with 3 facets.
Jaro-Winkler score:  0.666666666667
Levenshtein distance:  4
Keeping "Images" with 2 facets, and "movies" with 3 facets.
--

Comparing "Asheville, NC" with 3 facets, against "Asheville (N.C.)" with 2 facets.
Jaro-Winkler score:  0.891025641026
Levenshtein distance:  5
Merging to "Asheville, NC" with 5 facets.
--

Comparing "Asheville, (NC)" with 3 facets, against "Asheville (N.C.)" with 2 facets.
Jaro-Winkler score:  0.936111111111
Levenshtein distance:  3
Merging to "Asheville, (NC)" with 5 facets.
--

Comparing "Granville County (N.C.)" with 120 facets, against "Granville County, N.C." with 2 facets.
Jaro-Winkler score:  0.955862977602
Levenshtein distance:  3
Merging to "Granville County (N.C.)" with 122 facets.
--

Comparing "foo & bar" with 3 facets, against "foo and bar" with 2 facets.
Jaro-Winkler score:  0.809553872054
Levenshtein distance:  3
Merging to "foo & bar" with 5 facets.
--

Comparing "United States--History--Civil War, 1861-1865" with 3 facets, against "United States--History--Civil War, 1861-1865--Correspondence" with 2 facets.
Jaro-Winkler score:  0.911111111111
Levenshtein distance:  16
Keeping "United States--History--Civil War, 1861-1865" with 3 facets, and "United States--History--Civil War, 1861-1865--Correspondence" with 2 facets.
--

Comparing "United States--History--World War II" with 3 facets, against "United States--History--World War I" with 2 facets.
Jaro-Winkler score:  0.990740740741
Levenshtein distance:  1
Keeping "United States--History--World War II" with 3 facets, and "United States--History--World War I" with 2 facets.
--

Comparing "United States--History--World War Two" with 3 facets, against "United States--History--World War 2" with 2 facets.
Jaro-Winkler score:  0.963449163449
Levenshtein distance:  3
Merging to "United States--History--World War Two" with 5 facets.
--

Comparing "United States--History--World War Two" with 3 facets, against "United States--History--World War 1" with 2 facets.
Jaro-Winkler score:  0.963449163449
Levenshtein distance:  3
Merging to "United States--History--World War Two" with 5 facets.
--

Comparing "United States--History--World War 1" with 3 facets, against "United States--History--World War 2" with 2 facets.
Jaro-Winkler score:  0.980952380952
Levenshtein distance:  1
Keeping "United States--History--World War 1" with 3 facets, and "United States--History--World War 2" with 2 facets.
--
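
For anyone curious where scores like the "Bibles" vs. "bible" one come from, here's a small plain-Python (Python 3) version of the standard Jaro similarity. It's my own sketch for illustration, not the code inside the Levenshtein library, though it gives the same numbers for the pairs checked here.

```python
def jaro(s1, s2):
    """Plain Jaro similarity between two strings (case-sensitive)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0

for a, b in [("Bibles", "bible"), ("Fibles", "fible"), ("World War 1", "World War 2")]:
    print(a, "|", b, "|", round(jaro(a, b), 12))
```

In "Bibles" the lowercase "b" matches the first letter of "bible", out of order with the "i", so one transposition gets counted. "Fibles" has no second "f" to cause that, which is why it scores higher.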
--------------

Written by nitin

March 15th, 2012 at 11:57 am

less is more, a SAVS update

Just a quick post before I find a movie to stream on Netflix and ride out my Sunday …

So, I've been working a tad on SAVS, aka the "Simple Audio/Verse Synchronizer". And the changes really have to do with the data model for the timed text and for the backend/technical requirements. It's now all done with HTML, CSS, and JavaScript – as it should be.

First, the data model for a line of timed text in what I'm calling "st2" or "SAVS timed text" is now like this:

  <span
    class="savs-st2"
    data-startTime="10"
    data-stopTime="13">My mistress' eyes are nothing like the sun
  </span>

Before, it was much clunkier, like this:

  <p onclick="seekTo(10)" id="1">
    <span class="savs-text">My mistress' eyes are nothing like the sun</span>
    <span class="savs-time">10</span>
  </p>

That's to say, now – using the HTML5 "data-" attribute – the demands for the HTML markup are far fewer given that the JavaScript file "savs.js" takes care of more.

Before, with the older mark up model, there was no support for a stop time value and one also had to take the responsibility for adding several attributes related to calling JavaScript functions and for creating "id" attributes for both the corresponding <audio> or <video> element as well as for the timed text, etc.

I actually have thought about doing this as a jQuery plugin, but I'm not sure I see the point. Simply including the "savs.js" file is easier. By editing the "savs.css" file, one can control the look of their page. But I digress …

Now that the data model is different and the JavaScript file does more, one can generate a "SAVS compliant" HTML doc with whatever they want.

See, before I was thinking I'd write a PHP script that would build the page, etc, etc. but then I realized that "No, that's not my job." People should be able to store their timed data however they want, generate their HTML however they want, and only have to use the "savs.js" file and the "st2" data model to get this to work.

Sort of.

One also needs to give their HTML5 <audio> or <video> element an id of "savs-player" and also needs to put a tag somewhere in their HTML doc with an id of "savs-caption" a la:

<span class="savs-caption"></span>

That's where the captions go and it's currently required. If someone doesn't want to display captions, then they can just use CSS to hide that element.
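
As one example of generating the "st2" markup "however they want", here's a tiny Python 3 sketch that turns a list of timed lines into spans using the attribute names shown above. The helper name st2_span and the timings for the second line are made up for the demo.

```python
from html import escape

def st2_span(text, start, stop):
    """Build one line of "st2" timed text (hypothetical helper)."""
    return ('<span class="savs-st2" data-startTime="%s" data-stopTime="%s">%s</span>'
            % (start, stop, escape(text, quote=False)))

# (start, stop, text); the times for the second line are invented for the example.
lines = [
    (10, 13, "My mistress' eyes are nothing like the sun"),
    (13, 16, "Coral is far more red than her lips' red"),
]
for start, stop, text in lines:
    print(st2_span(text, start, stop))
```

The timed data could just as easily come from a database or a spreadsheet; the only thing "savs.js" should care about is the resulting markup.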

Anyway, I'm not explaining anything well since I'm in a rush to watch a movie and have a soda, so here's the latest demo and below is the original version shown via a screencast.

SAVS: a Simple Audio/Verse Synchronizer from nitin arora on Vimeo.

--------------

Written by nitin

March 11th, 2012 at 8:50 pm

Posted in digital audio,scripts

on adding a JavaScript API to our Flash player at work

Sometimes being home sick means finding things to work on that I wouldn't have done had I been in the office, but need to be done eventually.

So, today I worked on augmenting a JavaScript API for our Flash player at work. Before I go any further, here's a screenshot below. Note that the JavaScript console for Google's Chrome browser is visible at the bottom.

NC Live Media Player

As you can see, the "movie" is not just the "movie", so to speak. That's to say, there's a very cool bookmarking feature that, if clicked, returns a URL that if visited will start the given video at the point in time at which the user clicked the bookmark. The "movie" also includes some whitespace on the left hand side where links to "part 2"s of certain videos appear if they exist. That's great, but it totally makes our Flash player inappropriate for providing embed codes, etc. since the "movie" is more than the actual video screen and the controls (play, pause, captions, etc.). And, yes, we have to use our own player given that we have to deal with all kinds of authentication and rights issues, which this player supports via calls to PHP scripts.

Anyway, a few months ago I'd created a basic JavaScript API so that I could make a demo using our player for SAVS, which is completely reliant on two things: being able to receive the current time of the player and also being able to send a new current time to the player.

Today, I expanded the API a little bit, though in the image above you can see a call to the function that moves the player, in this case, to the 10 second mark. I added support for changing the volume, pausing the video, playing the video if it's paused, turning captions on/off, and getting the total duration of the media file, etc. Basically, I'm trying to add support for anything we'd need to replace the bookmark button and other features with HTML buttons, etc.

There's still lots of work to do, but it's working well provided I embed the SWF file with the <object> tag:

<object
  id="thisMovie"
  data="video2_js.swf"
  style="height: 500px; width: 500px;"
  type="application/x-shockwave-flash">
  <param name="movie" value="video2_js.swf" />
</object>

Then I can use these JavaScript functions on the page that embeds the player …

<script type="text/javascript">
//see: http://kirill-poletaev.blogspot.com/2011/02/exchange-data-between-actionscript-3.html
function getFlashMovie(movieName) {
  var isIE = navigator.appName.indexOf("Microsoft") != -1;
  return (isIE) ? window[movieName] : document[movieName];
}

// ... more functions were here, but there's too much for a blog post. :-]

function ncl_getCurrentTime() {
  var callResult = getFlashMovie("thisMovie").getCurrentTime("");
  return callResult;
}

function ncl_getTotalTime() {
  var callResult = getFlashMovie("thisMovie").getTotalTime("");
  return callResult;
}
</script>

… provided I make sure the ActionScript in the Flash player is prepared for those callbacks …

//see: http://kirill-poletaev.blogspot.com/2011/02/exchange-data-between-actionscript-3.html

//send current time value to JavaScript function
function sendCurrentTimeToJS(name:Number):Number
{
    var now:int = cfp.playheadTime;
    return now;
}
ExternalInterface.addCallback("getCurrentTime", sendCurrentTimeToJS);

//send total time value to JavaScript function
function sendTotalTimeToJS(name:Number):Number
{
    var total:int = cfp.totalTime;
    return total;
}
ExternalInterface.addCallback("getTotalTime", sendTotalTimeToJS);
--------------

Written by nitin

February 23rd, 2012 at 6:14 pm

Posted in scripts

do you two know each other? Bash meet Python

I'm working on a cool project at work that's about harvesting metadata, indexing it with Solr, and providing a simple UI so that people wanting to search for digital items from North Carolina libraries can have some fun searching from a single interface. It's fun working with other people re: making decisions and all, but also with coding. I'm totally the "backend" guy re: harvesting metadata and indexing and the UI is being handled, very awesomely, by one of the programmers who works at one of the partner institutions. Once the site's up and running on a non-development server (hopefully in just a few weeks), I'll offer up more information and a link or two.

Anyway, once a user makes a selection through the UI and clicks on a link, they go straight to the corresponding page on the originating website. Right now, everything is using an OAI feed for the pilot project, but the Python script that does the harvesting can support lots of other things, like WordPress sites, for example, by harvesting RSS feeds or whatever.

It's nothing new, but what we have works and has a very small footprint in terms of scripts and setup files. The only real requirements are that the data be openly available via HTTP and that there's a programmatic way to construct a new URL to get the next "batch" of metadata.

For instance:

http://blog.humaneguitarist.org/?feed=rss&paged=1

http://blog.humaneguitarist.org/?feed=rss&paged=2

etc.

… oh and that the data be parse-able by XSLT 1.0, but as I mentioned before I'll eventually add support, in an extensible manner, for what I hope is just about any scripting language.
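
Here's roughly what "a programmatic way to construct a new URL" can look like, as a Python 3 sketch. The function name paged_urls and its parameters are mine for illustration; the real harvester may well do this differently, and it would also need its own rule for when to stop asking for more pages.

```python
from itertools import count, islice

def paged_urls(base_url, param="paged", start=1):
    """Yield base_url with an increasing page number added as a query parameter."""
    separator = "&" if "?" in base_url else "?"
    for page in count(start):
        yield "%s%s%s=%d" % (base_url, separator, param, page)

# Only take the first two, since the generator itself never ends.
for url in islice(paged_urls("http://blog.humaneguitarist.org/?feed=rss"), 2):
    print(url)
```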

Anyway, I wanted to set up a cron job to run the harvester, so I wrote a Bash script that runs the harvester and the cron job in turn runs the Bash script.

All the partners involved for the pilot agreed that we'd harvest and index every two weeks. Currently, I'm running it nightly, but same difference. The real thing I want to say is that, after harvest, I delete the entire index before re-indexing. This keeps the thing up-to-date and prevents old items from lingering in the index if, in fact, they've been taken down from the originating collection. And, let's face it, that's the reality of it. Things change.

Of course, this entails a huge risk. If something goes wrong with the harvesting script (which is still in its early stages of development) or with one or more of the feeds, then deleting the index is potentially disastrous. So I discussed this with our main IT/programming guy in the office. And he said, "You gotta make your Python script talk to your Bash script."

What he meant was that while the Python script will push through most issues, foreseen and not, I needed the Python script to report if something went wrong with a feed or whatnot along the way. So, what I did was simply set it to print a "0" if all went well and a "1" if anything I identified as a point of concern occurred: Python script failed, one of the feeds returned a non-200, etc. The Bash script, in turn, reads this output and will only delete the index if a "0" was returned by the Python script, called "pOAIndexter.py".
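
The reporting side of that can be sketched like this in Python 3. To be clear, this is not the actual pOAIndexter code: harvest_status and fake_fetch are hypothetical names, and the fetch function is passed in so the example runs without touching the network.

```python
def harvest_status(feed_urls, fetch):
    """Return "0" if every feed came back with HTTP 200, otherwise "1"."""
    try:
        for url in feed_urls:
            if fetch(url) != 200:
                return "1"
    except Exception:
        # Any unexpected failure counts as a point of concern too.
        return "1"
    return "0"

def fake_fetch(url):
    # Stand-in for a real HTTP request; returns a status code.
    return 404 if "broken" in url else 200

status = harvest_status(["http://example.org/oai"], fake_fetch)
print(status)
```

Since the Bash script captures everything the Python script writes to standard output, that single character needs to be the only thing printed there; any progress messages would have to go to standard error or a log file.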

So, here's the Bash. I think the logic is laid out well enough with the echo statements, so I'll just cough it up, as is, below:

#!/bin/bash

#####
echo "HARVESTING metadata (this may take a long time)."
cd /srv/heritageIndexing/pOAIndexter
output=$(./pOAIndexter.py)
echo ""

echo "Return code:" $output
echo ""

#####
cd /srv/heritageIndexing/apache-solr-3.4.0/example/exampledocs
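# $output is quoted on purpose: unquoted, an empty or multi-word value makes
# the test itself fail, and the script would fall through to the delete branch.
if [ "$output" != "0" ]; then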
 echo "NOT deleting existing index."
else
 echo "DELETING existing index."
 java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
fi
echo ""

#####
echo "INDEXING harvested metadata."
java -jar post.jar /srv/heritageIndexing/pOAIndexter/output/*.xml
echo ""

#####
echo "DELETING temporary harvested metadata files."
cd /srv/heritageIndexing/pOAIndexter
rm output/*.xml
echo ""

#####
echo "Farewell."
--------------

Written by nitin

February 19th, 2012 at 7:56 am
