blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘pOAIndexter’ tag

HammerFlicks and pOAIndexter source codes available

Quickie:

I've made the source code available for HammerFlicks (click on the "Source Code" link) and "pOAIndexter" (scroll down to "pOAIndexter").

The "pOAIndexter" script is used to drive the metadata harvesting for NC ECHO.

I seriously doubt anyone else will download/use these, but making them downloadable forces me to do a decent job – I hope – of being organized.

--------------

Written by nitin

March 31st, 2013 at 2:52 pm

Posted in news,scripts

do you two know each other? Bash meet Python

I'm working on a cool project at work that's about harvesting metadata, indexing it with Solr, and providing a simple UI so that people wanting to search for digital items from North Carolina libraries can have some fun searching from a single interface. It's fun working with other people re: making decisions and all, but also with coding. I'm totally the "backend" guy re: harvesting metadata and indexing and the UI is being handled, very awesomely, by one of the programmers who works at one of the partner institutions. Once the site's up and running on a non-development server (hopefully in just a few weeks), I'll offer up more information and a link or two.

Anyway, once a user makes a selection through the UI and clicks on a link, they go straight to the corresponding page on the originating website. Right now, everything is using an OAI feed for the pilot project, but the Python script that does the harvesting can support lots of other things, like WordPress sites, for example, by harvesting RSS feeds or whatever.

It's nothing new, but what we have works and has a very small footprint in terms of scripts and setup files. The only real requirements are that the data be openly available via HTTP and that there's a programmatic way to construct a new URL to get the next "batch" of metadata.

For instance:

http://blog.humaneguitarist.org/?feed=rss&paged=1

http://blog.humaneguitarist.org/?feed=rss&paged=2

etc.
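Here's a minimal sketch of what "a programmatic way to construct a new URL" can look like for the paged RSS example above. The function name `next_page_url` is made up for illustration and isn't taken from pOAIndexter; it just bumps the `paged` query parameter by one using Python's standard library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_page_url(url, param="paged"):
    """Return url with its page-number query parameter incremented by one."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    # Only the paging parameter changes; every other parameter is kept as-is.
    bumped = [(k, str(int(v) + 1) if k == param else v) for k, v in query]
    return urlunsplit(parts._replace(query=urlencode(bumped)))


print(next_page_url("http://blog.humaneguitarist.org/?feed=rss&paged=1"))
# prints: http://blog.humaneguitarist.org/?feed=rss&paged=2
```

A harvester would presumably call something like this in a loop until a batch comes back empty or the server stops answering with a 200.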

… oh and that the data be parse-able by XSLT 1.0, but as I mentioned before I'll eventually add support, in an extensible manner, for what I hope is just about any scripting language.

Anyway, I wanted to set up a cron job to run the harvester, so I wrote a Bash script that runs the harvester and the cron job in turn runs the Bash script.

All the partners involved for the pilot agreed that we'd harvest and index every two weeks. Currently, I'm running it nightly, but same difference. The real thing I want to say is that, after harvest, I delete the entire index before re-indexing. This keeps the thing up-to-date and prevents old items from lingering in the index if, in fact, they've been taken down from the originating collection. And, let's face it, that's the reality of it. Things change.

Of course, this entails a huge risk. If something goes wrong with the harvesting script (which is still in its early stages of development) or with one or more of the feeds, then deleting the index is potentially disastrous. So I discussed this with our main IT/programming guy in the office. And he said, "You gotta make your Python script talk to your Bash script."

What he meant was that while the Python script will push through most issues, foreseen and not, I needed the Python script to report if something went wrong with a feed or whatnot along the way. So, what I did was simply set it to print a "0" if all went well and a "1" if anything I identified as a point of concern occurred: Python script failed, one of the feeds returned a non-200, etc. The Bash script, in turn, reads this output and will only delete the index if a "0" was returned by the Python script, called "pOAIndexter.py".
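To make the "0" or "1" idea concrete, here's a rough sketch of the reporting side. It is not the actual pOAIndexter.py code: `harvest_status` and the `fetch` callable are hypothetical names, and `fetch` is assumed to return the HTTP status code for a feed URL.

```python
def harvest_status(feed_urls, fetch):
    """Return "0" if every feed came back with HTTP 200, otherwise "1"."""
    try:
        for url in feed_urls:
            if fetch(url) != 200:
                return "1"  # a feed returned a non-200 response
    except Exception:
        return "1"  # the harvest itself blew up
    return "0"


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    return 200


# The calling shell script captures whatever is printed on standard output.
print(harvest_status(["http://example.org/oai"], fake_fetch))
# prints: 0
```

Printing a single token keeps the contract with the shell script simple: anything other than "0" means "leave the index alone".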

So, here's the Bash. I think the logic is laid out well enough with the echo statements, so I'll just cough it up, as is, below:

#!/bin/bash

#####
echo "HARVESTING metadata (this may take a long time)."
cd /srv/heritageIndexing/pOAIndexter
output=$(./pOAIndexter.py)
echo ""

echo "Return code:" $output
echo ""

#####
cd /srv/heritageIndexing/apache-solr-3.4.0/example/exampledocs
if [ "$output" != "0" ]; then
 echo "NOT deleting existing index."
else
 echo "DELETING existing index."
 java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
fi
echo ""

#####
echo "INDEXING harvested metadata."
java -jar post.jar /srv/heritageIndexing/pOAIndexter/output/*.xml
echo ""

#####
echo "DELETING temporary harvested metadata files."
cd /srv/heritageIndexing/pOAIndexter
rm output/*.xml
echo ""

#####
echo "Farewell."

--------------

Written by nitin

February 19th, 2012 at 7:56 am

choose your own toppings: whatever code inside CDATA

I really should be packing for an overseas vacation that begins tomorrow, but I wanted to jot some stuff down before I forget – and I intend to forget a lot!

Anywho, in a previous post I wrote about putting XSL inside a CDATA block inside an XML config file. I had the following example:

<map name="LibriVox">
  <XSLT>./XSLT/LibriVox_to_Solr.xsl</XSLT>
  <nextXSL>
  <![CDATA[
  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="/">
      <xsl:variable name="baseURL" select="'%s'" />
      <xsl:variable name="URL_params" select="'%s'" />
      <xsl:variable name="offset_" select="substring-after($URL_params,'=')" />
      <xsl:variable name="offset" select="substring-before($offset_,'&amp;')" />
      <xsl:variable name="limit_" select="substring-after($URL_params,'&amp;')" />
      <xsl:variable name="limit" select="substring-after($limit_,'=')" />
      <xsl:variable name="output">
        <xsl:value-of select="$baseURL" />
        <xsl:text>?offset=</xsl:text>
        <xsl:value-of select="$offset+50" />
        <xsl:text>&amp;limit=</xsl:text>
        <xsl:value-of select="50" />
      </xsl:variable>
      <xsl:value-of select="$output" />
    </xsl:template>
  </xsl:stylesheet>
  ]]>
  </nextXSL>
</map>

This is part of this pOAIndexter script I'm working on for, well, work.

The XML code above is from one of the config files where the <XSLT> element points to an XSL file used to process metadata retrieved from a website. In this case, the point is to make a Solr-compatible XML document that can be used for indexing purposes. The second element, <nextXSL>, is used to return to pOAIndexter the URL for the next batch of metadata for a given feed, i.e. the next page or the next set within a collection, etc.

And as you can see there are two weird looking variables at the top:

      <xsl:variable name="baseURL" select="'%s'" />
      <xsl:variable name="URL_params" select="'%s'" />

That's because the pOAIndexter script populates these with the base URL for the batch just retrieved and its parameters, respectively, before the XSL within the <nextXSL> element is run, returning the string of the next URL.
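For anyone who doesn't read XSL, here's a sketch of both steps in plain Python: filling the two '%s' placeholders, and the arithmetic the stylesheet performs to build the next URL. The names `fill_placeholders` and `next_url` and the example.org base URL are made up for illustration. Escaping the ampersand before substitution is an assumption on my part about what's needed, since the filled-in text has to remain well-formed XML.

```python
from xml.sax.saxutils import escape

# The two placeholder lines from the <nextXSL> block.
TEMPLATE_HEAD = (
    '<xsl:variable name="baseURL" select="\'%s\'" />\n'
    '<xsl:variable name="URL_params" select="\'%s\'" />'
)


def fill_placeholders(template, base_url, url_params):
    """Substitute the two '%s' slots, XML-escaping '&' so the XSL still parses."""
    return template % (escape(base_url), escape(url_params))


def next_url(base_url, url_params, step=50):
    """Mirror the stylesheet: read the offset, add 50, rebuild the URL."""
    offset_ = url_params.partition("=")[2]  # substring-after($URL_params, '=')
    offset = offset_.partition("&")[0]      # substring-before($offset_, '&')
    return "%s?offset=%d&limit=%d" % (base_url, int(offset) + step, step)


print(fill_placeholders(TEMPLATE_HEAD, "http://example.org/api/feed", "offset=0&limit=50"))
print(next_url("http://example.org/api/feed", "offset=0&limit=50"))
# last line printed: http://example.org/api/feed?offset=50&limit=50
```

Plain `%` formatting only works because the stylesheet contains no other percent signs; a stylesheet that did would need them doubled or a different placeholder scheme.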

I chose XSL because, as a librarian, I think it's common to a lot of metadata and digital library folk, and such people could extend the capabilities of pOAIndexter without having to know Python. But all along I wanted people to be able to process the metadata and return the next URL with whatever scripting language they want, provided the interpreter exists on their system.

So, say for example you like PHP instead. Instead of using XSL you could use something like this:

<map name="LibriVox">
  <PHP>./PHP/LibriVox_to_Solr.php</PHP>
  <nextPHP>
  <![CDATA[
  <?php
  $baseURL='%s';
  $URL_params='%s';
  
  //some PHP code here

  echo $output; //where $output is the next URL ...
  ?>
  ]]>
  </nextPHP>
</map>

That way PHP could be used to make the Solr-XML file and to return to pOAIndexter the next URL string so that the next batch of metadata from a feed could be processed/transformed. Of course, you could mix and match, too – XSLT for making the Solr-XML file and PHP just for getting the next URL.
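A sketch of how the "whatever scripting language" idea could work on the Python side: write the filled-in code to a temporary file, hand it to the interpreter, and read the next URL from standard output. This is an assumption about the mechanics, not the actual pOAIndexter implementation, and `run_next_url_code` is a hypothetical name. The demo uses the running Python interpreter as the external one so it works without PHP installed.

```python
import os
import subprocess
import sys
import tempfile


def run_next_url_code(interpreter, code, suffix):
    """Write code to a temp file, run it with interpreter, return its stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as fh:
        fh.write(code)
        path = fh.name
    try:
        done = subprocess.run([interpreter, path], capture_output=True,
                              text=True, check=True)
        return done.stdout.strip()
    finally:
        os.remove(path)  # clean up the temporary script either way


snippet = 'print("http://example.org/feed?offset=50&limit=50")'
print(run_next_url_code(sys.executable, snippet, ".py"))
# prints: http://example.org/feed?offset=50&limit=50
```

With PHP on the PATH, the same call with "php" as the interpreter and ".php" as the suffix should behave the same way, which is the whole appeal: the only contract is "print the next URL".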

That's actually pretty simple to do with the common scripting languages like Python, PHP, Perl, Ruby, etc. But what I really wanted to support was JavaScript because, well, it would just be cool, but also because that's another one of those common languages that a lot of people might know even though there might be great variation amongst the other scripting languages they know when compared to a lot of their colleagues.

But I didn't know how to execute JavaScript via the command line so that pOAIndexter can capture the next URL via the standard output stream.

Well, enter PhantomJS.

That is all. Time to pack.

--------------

Written by nitin

December 21st, 2011 at 11:03 pm

bidi bidi bidi and more on pOAIndexter-ing metadata

It's shaping up to be a sunny day and this means I need to go on a long walk.

But before I do that, I'll follow up to this post about grabbing OAI metadata from an online source and throwing the metadata into Solr for searching purposes, etc.

Last night – while streaming the Gil Gerard iteration of Buck Rogers – I wrote a small PHP script to grab this OAI metadata from the Library of Congress' site. BTW: this is a cool page of theirs that helps one get started with OAI feeds, etc.

Aside: Is it only since the advent of hypertext that the word "this" began appearing in a referential context within documents?

As I mentioned in the previous post, an XML config file will instruct the code where to get the metadata and which XSL file will be used to transform the data into something Solr can chew on. I haven't bothered with the config file yet, so for now I just tested it on the specific metadata linked to above since the config file aspect of this is the most trivial component of the whole thing.
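Just to show how small the config-reading part could be, here's a sketch using Python's standard library. It borrows the `<map>` and `<XSLT>` element names from the December 2011 post on this page; the `<maps>` wrapper, the `<url>` element, the "LoC" map name, and the `read_maps` function are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical config: one <map> per feed, naming the source URL and the XSL file.
CONFIG = """
<maps>
  <map name="LoC">
    <url>http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=papr</url>
    <XSLT>OAI_to_solr.xsl</XSLT>
  </map>
</maps>
"""


def read_maps(config_xml):
    """Return {map name: {"url": ..., "xslt": ...}} for every <map> element."""
    root = ET.fromstring(config_xml)
    return {
        m.get("name"): {"url": m.findtext("url"), "xslt": m.findtext("XSLT")}
        for m in root.iter("map")
    }


for name, settings in read_maps(CONFIG).items():
    print(name, settings["url"], settings["xslt"])
```

Note that the ampersands in the URL have to be written as `&amp;` inside the XML file; the parser hands them back as plain `&`.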

Anyway, below is the PHP file, the OAI to Solr XSL file, and a snippet of the output. Last is a Python script that does the same thing as the PHP. It's not OO like the PHP file, but I just whipped it up this morning for shiggles.

Here's the PHP …

<?php

function grabMetadata($urlArg) {
    $ch = curl_init(); // see: http://php.net/manual/en/book.curl.php
    curl_setopt($ch, CURLOPT_URL, $urlArg);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $curlOut = curl_exec($ch);
    curl_close($ch); // release the cURL handle before returning
    return $curlOut;
}

// See "http://www.php.net/manual/en/xsltprocessor.transformtoxml.php" for instructions re: XSL processing as below.
function useXSL($output) {
    $search_results = new DOMDocument;
    $search_results->loadXML($output);
    // If you just use "load" instead of "loadXML" it won't work unless you first stored the XML results in a file (boo!).
    // For info on "loadXML" see: http://www.php.net/manual/en/domdocument.loadxml.php
    $proc = new XSLTProcessor;
    $xsl = new DOMDocument;
    $xsl->load('OAI_to_solr.xsl');
    $proc->importStyleSheet($xsl);
    $processed = $proc->transformToXML($search_results);
    return $processed;
}

function writeSOLR($solrXML) {
    $myFile = "for_solr-PHP.xml";
    $fh = fopen($myFile, 'w') or die("can't open file");
    fwrite($fh, utf8_encode($solrXML)); // For UTF-8, see: http://www.php.net/manual/en/function.fwrite.php#73764
    fclose($fh);
}

// Do stuff ...
$output = grabMetadata('http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc&set=papr');
writeSOLR(useXSL($output));
?>

The XSL file …

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
exclude-result-prefixes="oai_dc dc">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:template match="/">
    <add>
      <xsl:for-each select="//oai_dc:dc">
        <doc>
          <field name="identifier">
            <xsl:value-of select="dc:identifier" />
          </field>
          <field name="title">
            <xsl:value-of select="dc:title" />
          </field>
          <field name="creator">
            <xsl:value-of select="dc:creator" />
          </field>
          <xsl:for-each select="dc:subject">
            <field name="subject">
              <xsl:value-of select="." />
            </field>
          </xsl:for-each>
          <field name="description">
            <xsl:value-of select="dc:description" />
          </field>
        </doc>
      </xsl:for-each>
    </add>
  </xsl:template>
</xsl:stylesheet>

The Millionaire and his wife … er, wrong show. I mean the sample Solr XML snippet …

<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="creator">AFI/Kalinowski (Eugene) Collection (Library of Congress)</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Advertising--Electric household appliances--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Trade shows--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Silent films.</field>
    <field name="subject">Pittsburgh (Pa.)--Manufactures.</field>
    <field name="description">Largely graphic commercial for electric refrigerators in general and a refrigerator show, presumably in Pittsburgh, in particular.</field>
  </doc>

...

</add>

Some Python for fun …

# Requires Python 3 and lxml, see: http://lxml.de/
import urllib.request
from lxml import etree

## some OAI metadata from the Library of Congress
url = 'http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc&set=papr'
metadata = urllib.request.urlopen(url).read()
metadata = etree.XML(metadata)

## the XSL file that will transform the OAI metadata to Solr
xsl = etree.parse('OAI_to_solr.xsl')

## XSL transformation
style = etree.XSLT(xsl)
result = style(metadata)

## the outputted Solr XML
with open('for_solr-PY.xml', 'w', encoding='utf-8-sig') as fw:
    fw.write(str(result))

And most importantly, the introduction to Buck Rogers in the 25th Century – Season 1, of course! I couldn't even make it through the first ten minutes of the Season 2 opener. I mean they changed the introduction which was brilliant and brilliantly narrated – as you shall see!

I'd prefer to watch the South Park spoof over the Season 2 insult-to-perfection any day of the week.

And here's a bad-ass fan trailer that I think respects the greatness of the first season.

--------------

Written by nitin

October 15th, 2011 at 9:05 am

Posted in scripts

pOAIndexter: grabbing and indexing online metadata

As per usual, a good bit of my computer-y stuff at home relates to something that's come up at work. And as usual, I'm pretty ignorant of what I'm getting myself into, but I don't mind.

The other week, my boss and I met with some great people at digitalnc.org and we started talking about the idea of having a super simple, lightweight approach to providing a one-stop-shop search interface for collections across the state – provided those collections expose their metadata somehow. For now, we talked about limiting this to people who do so with an OAI feed and grabbing that metadata. But eventually, this thing should be metadata agnostic – in the sense that it isn't about a metadata format, but just the data itself.

By the way, I guess "grabbing" and "feed" aren't what I typically see with OAI – about which I admittedly don't know much – but I don't care. Same difference.

Of course, there's nothing new to this. I guess one could use Blacklight or VuFind to do this kind of thing, but even though those are existing open source projects, I'm not sure that doing so isn't overkill and won't in turn increase dependencies and maintenance overhead.

Actually, that's a topic for another time – I mean the idea that just because part of something is capable of doing what you want doesn't necessarily make it a better option than rolling one's own if using and updating said something entails more cost in the long run. Paved roads often get you there faster, but a willingness to get lost now and then is how you learn where all the really cool local bars are …

;)

Anyway, here's what I'm thinking. A small script would simply look at an XML setup file from which it would know which places to go grab metadata from, the type of feed, the last time the metadata was requested, and stuff like the resumptionToken if applicable. It would also store the appropriate XSL file to process the metadata with so that the metadata could be passed into Solr to be indexed and searchable. Anyone whose site doesn't provide metadata as XML could simply create a web service that does so, e.g. a RESTful MySQL to XML thingamajig. The outputted XML just needs to have an XSL that will facilitate passing it to Solr for that data to be part of the shared metadata store. And since XSL is the universal translator in this context, other metadata types such as RSS/Atom feeds could be grabbed, too. All one needs to do is add to the XML config file so the script knows to retrieve metadata from that site and make sure there's an XSL file that can be used to facilitate passing the data into Solr. So in the end all this should take in terms of coding is a small script, one XML config file, and as many XSL files as needed.
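Since the resumptionToken came up: in OAI-PMH, a partial ListRecords response carries a `<resumptionToken>` element, and the next batch is requested by sending that token back; an empty token marks the last batch. Here's a small sketch of that step. The `next_oai_url` name, the example.org endpoint, and the sample response are made up for illustration, and the sketch assumes the original request used the ListRecords verb.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# A trimmed-down, made-up response; real ones also contain the records.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def next_oai_url(base_url, response_xml):
    """Return the URL for the next batch, or None when there is no token."""
    root = ET.fromstring(response_xml)
    token = root.findtext(".//%sresumptionToken" % OAI_NS)
    if not token or not token.strip():
        return None  # missing or empty token: nothing left to harvest
    params = {"verb": "ListRecords", "resumptionToken": token.strip()}
    return base_url + "?" + urlencode(params)


print(next_oai_url("http://example.org/oai", SAMPLE))
# prints: http://example.org/oai?verb=ListRecords&resumptionToken=abc123
```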

For fun and to start learning about Solr, I just manually grabbed some OAI metadata from CalTech yesterday – it was for some oral histories. And then I ran them through an XSL file and then posted them to Solr. Within no time I had a searchable, local metadata store to play around with (screenshot below). Since I was using all the defaults from the Solr tutorial I had to map the <dc:creator> field to things like manufacturer, since the default is set up for an electronics store.

Solr screenshot

BTW if we use this, at some point I won't be able to call it "pOAIndexter" but for now I can.

Since I don't know if I'll do this in Python or PHP and since OAI is what we'll work on first, I guess it stands for "Python or PHP OAI Indexer".

Yes, I'm a dork.

--------------

Written by nitin

October 2nd, 2011 at 11:20 am
