I'm working on a cool project at work that's about harvesting metadata, indexing it with Solr, and providing a simple UI so that people wanting to search for digital items from North Carolina libraries can have some fun searching from a single interface. It's fun working with other people re: making decisions and all, but also with coding. I'm totally the "backend" guy re: harvesting metadata and indexing and the UI is being handled, very awesomely, by one of the programmers who works at one of the partner institutions. Once the site's up and running on a non-development server (hopefully in just a few weeks), I'll offer up more information and a link or two.
Anyway, once a user makes a selection through the UI and clicks on a link, they go straight to the corresponding page on the originating website. Right now, everything is using an OAI feed for the pilot project, but the Python script that does the harvesting can support lots of other things, like WordPress sites, for example, by harvesting RSS feeds or whatever.
It's nothing new, but what we have works and has a very small footprint in terms of scripts and setup files. The only real requirements are that the data be openly available via HTTP and that there's a programmatic way to construct a new URL to get the next "batch" of metadata.
… oh and that the data be parse-able by XSLT 1.0, but as I mentioned before I'll eventually add support, in an extensible manner, for what I hope is just about any scripting language.
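To make the "next batch" requirement concrete, here's a rough sketch of how it plays out for an OAI feed. This is not the actual harvester code: the function names and the example base URL are made up for illustration. It leans on the OAI-PMH convention that a ListRecords response carries a resumptionToken element when more records are waiting, and an empty or missing one when the list is done.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def first_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL for the first batch of records (hypothetical helper)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def next_url(base_url, response_xml):
    """Return the URL for the next batch, or None when the feed is exhausted."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    # A missing or empty token means there is nothing left to fetch.
    if token is None or not (token.text or "").strip():
        return None
    # resumptionToken is an exclusive argument: it is sent with the verb only.
    query = urlencode({"verb": "ListRecords", "resumptionToken": token.text.strip()})
    return base_url + "?" + query
```

A harvest loop would start at first_url("http://example.org/oai") and keep calling next_url() on each response until it gets None back. For a non-OAI source like an RSS feed, only the body of next_url() would need to change.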
Anyway, I wanted to set up a cron job to run the harvester, so I wrote a Bash script that runs the harvester and the cron job in turn runs the Bash script.
All the partners involved for the pilot agreed that we'd harvest and index every two weeks. Currently, I'm running it nightly, but same difference. The real thing I want to say is that, after harvest, I delete the entire index before re-indexing. This keeps the thing up-to-date and prevents old items from lingering in the index if, in fact, they've been taken down from the originating collection. And, let's face it, that's the reality of it. Things change.
Of course, this entails a huge risk. If something goes wrong with the harvesting script (which is still in its early stages of development) or with one or more of the feeds, then deleting the index is potentially disastrous. So I discussed this with our main IT/programming guy in the office. And he said, "You gotta make your Python script talk to your Bash script."
What he meant was that while the Python script will push through most issues, foreseen and not, I needed the Python script to report if something went wrong with a feed or whatnot along the way. So, what I did was simply set it to print a "0" if all went well and a "1" if anything I identified as a point of concern occurred: the Python script hit an error, one of the feeds returned a non-200 HTTP status, etc. The Bash script, in turn, reads this output and will only delete the index if the Python script, called "pOAIndexter.py", printed a "0".
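On the Python side, the idea looks something like the sketch below. To be clear, this isn't lifted from pOAIndexter.py: fetch() and harvest_status() are hypothetical names, and the real script checks more than this. It just shows the pattern of pushing through every feed while remembering whether anything went wrong.

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body, or None if the feed did not answer with a 200."""
    try:
        with opener(url) as response:
            if response.status != 200:
                return None
            return response.read()
    except (URLError, OSError):
        # Covers HTTP errors, DNS failures, timeouts, refused connections.
        return None


def harvest_status(feed_urls, fetch_func=fetch):
    """Return "0" if every feed came back cleanly, otherwise "1"."""
    problems = 0
    for url in feed_urls:
        try:
            body = fetch_func(url)
        except Exception:
            body = None
        if body is None:
            problems += 1  # keep going, but remember that something went wrong
    return "0" if problems == 0 else "1"
```

The script would then end with print(harvest_status(feeds)) as the only thing it writes to standard output. That matters because the Bash side compares everything the script printed against "0", so any progress messages are safer sent to standard error.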
So, here's the Bash. I think the logic is laid out well enough with the echo statements, so I'll just cough it up, as is, below:
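# Run the harvester and capture what it prints: "0" = all good, "1" = trouble.
# (Assumes pOAIndexter.py is in the current directory.)
echo "HARVESTING metadata (this may take a long time)."
output=$(python pOAIndexter.py)
echo "Return code: $output"
# Quote $output so an empty or multi-word result can't slip past the test.
if [ "$output" != "0" ]; then
    echo "NOT deleting existing index."
    exit 1
fi
echo "DELETING existing index."
java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
echo "INDEXING harvested metadata."
java -jar post.jar /srv/heritageIndexing/pOAIndexter/output/*.xml
echo "DELETING temporary harvested metadata files."
# Assumed cleanup target: the same files that were just indexed.
rm -f /srv/heritageIndexing/pOAIndexter/output/*.xml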