blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘metadata harvesting’ tag

awesome sauce: augmenting PubMed Central’s OAI response

Update, 9 pm EST, May 27, 2012: Well, this is interesting. After reading this page, I see that by setting the "metadataPrefix" to "pmc_fm" I can bypass steps #3 and #4 altogether it seems – provided one's OAI harvester/indexer is set to ingest the data in that format instead of Dublin Core or provided the script below transforms the data to Dublin Core before returning it. Anyway … score one for documentation and reading it after-the-fact!
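A minimal sketch of what that change looks like when building the request URL, assuming Python 3 and only the standard library. The helper name `build_oai_url` is mine and not part of the script further down; the base URL and the parameter names are the ones used in this post.

```python
from urllib.parse import urlencode

# Base URL of the PubMed Central OAI service, as used elsewhere in this post.
PMC_OAI = 'http://www.pubmedcentral.gov/oai/oai.cgi'

def build_oai_url(verb, metadata_prefix, set_spec=None):
    """Build an OAI request URL (hypothetical helper, for illustration only)."""
    params = [('verb', verb), ('metadataPrefix', metadata_prefix)]
    if set_spec:
        params.append(('set', set_spec))
    return PMC_OAI + '?' + urlencode(params)

# Dublin Core, as in the example request in this post
dc_url = build_oai_url('ListRecords', 'oai_dc', 'aac')
# the richer format mentioned in the update
fm_url = build_oai_url('ListRecords', 'pmc_fm', 'aac')
print(dc_url)
print(fm_url)
```

Only the `metadataPrefix` value differs between the two URLs; whatever consumes the response still has to understand the format it asked for.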

I saw a post from a Metadata Librarian on the code4lib list about their work with placing article data from PubMed into DSpace. They are doing some metadata additions and cleanup in Excel so I emailed them off-list and let them know about PubMed2XL and we went back and forth on a few things. Among the things I learned from them was that PubMed Central has an OAI feed. Cool!

But that OAI feed doesn't return all the data they need.

Here's an example: http://www.pubmedcentral.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=aac.

One of the additional bits of data they wanted was author affiliation which is available from PubMed.gov's XML output. Same for the MESH terms.

Anyway, besides pushing PubMed2XL, I also mentioned that it would be interesting to make a sauce, if you will, for PubMed Central's OAI feed. In other words, rather than using the OAI link above, one would use a service on top of that a la: http://myPubMedCentralOAI_sauce.com/oai?verb=ListRecords&metadataPrefix=oai_dc&set=aac. And when one went to that URL, the service would fetch the real OAI feed from PubMed Central and then get the additional metadata from the NCBI EFetch APIs. It would then drop the additional metadata into the original OAI response and finally serve it up to the user (e.g. the OAI harvester).

I went ahead and played with a proof-of-concept using Google App Engine and it's working although it's adding about 20 – 25 seconds to the OAI response time. BTW: it's faster when I run it from localhost and not actually live on App Engine.

Here's how it's done.

  1. The user goes to http://localhost:8084/oai?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
  2. The app then fetches http://www.pubmedcentral.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
  3. For each record, the app parses out the PubMed Central ID and uses the EFetch API with PubMed Central as the database to get more data about the item.
  4. Unfortunately, the API for PubMed Central doesn't return MESH terms, so in step #3 the app just uses the returned data to translate the PubMed Central ID to the regular PubMed ID.
  5. With the PubMed ID now in hand, the app goes to the EFetch API and specifies PubMed as the database and hands the API the PubMed ID from step #4.
  6. Now the app gets the <Affiliation> value and the MESH terms and adds them to the real OAI response from step #2.
  7. Finally (whew!), the app returns the OAI feed with more metadata than before.
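Steps 3 through 6 boil down to "for each record, look something up and append it". Here is a rough Python 3 sketch of just that part, using the standard library instead of lxml and no network at all. The `lookup` callable and `fake_lookup` are hypothetical stand-ins for the two EFetch requests; the namespaces and the article URL prefix are the ones from this post.

```python
import xml.etree.ElementTree as ET

NS = {'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
      'dc': 'http://purl.org/dc/elements/1.1/'}
DC = '{http://purl.org/dc/elements/1.1/}'
ARTICLE_URL = 'http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid='

def augment(oai_xml, lookup):
    """Add extra dc:* elements to each record (steps 3 to 6).

    lookup is a hypothetical callable that takes a PubMed Central ID and
    returns a list of (element name, value) pairs; in the real app it
    would wrap the two EFetch requests.
    """
    root = ET.fromstring(oai_xml)
    for record in root.iter('{%s}dc' % NS['oai_dc']):
        identifier = record.find('dc:identifier', NS)
        if identifier is None or not identifier.text:
            continue  # nothing to look up for this record
        pmc_id = identifier.text.replace(ARTICLE_URL, '')
        for name, value in lookup(pmc_id):
            ET.SubElement(record, DC + name).text = value
    return root

# Canned data standing in for the EFetch responses.
def fake_lookup(pmc_id):
    return [('subject.mesh', 'Fungi'), ('subject.mesh', 'Peptides')]

sample = '''<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords><record><metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:identifier>http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=89011</dc:identifier>
</oai_dc:dc></metadata></record></ListRecords></OAI-PMH>'''

augmented = augment(sample, fake_lookup)
mesh = [e.text for e in augmented.iter(DC + 'subject.mesh')]
print(mesh)
```

Keeping the lookup separate from the XML handling makes it possible to try the merging logic without waiting on NCBI for every record.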

This seems super klunky, so I'd love to hear about more elegant ways to do this … like having more options from PubMed Central without 3rd party hacks!

But it is working. And it's just a proof-of-concept …

Below, I've pasted a snippet of the augmented OAI data.

Below that is the Python code if anyone's interested.

ps: Python users will notice I used Google App Engine's "urlfetch" instead of "urllib" to request URLs. This is because using the latter was causing 500 errors due to timeouts. I don't think, from what I've read, that you can set the timeout with "urllib" in App Engine, so I used "urlfetch" which lets one set it up to 60 seconds.
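For comparison, outside App Engine and in Python 3, `urllib.request.urlopen` takes a `timeout` argument in seconds. A small sketch of a fetch helper built on that; the `opener` parameter and `fake_opener` are my own additions so the helper can be tried without touching the network, and none of this is part of the App Engine script.

```python
import io
import urllib.request

def fetch(url, timeout=60, opener=urllib.request.urlopen):
    """Read a URL, giving up after `timeout` seconds.

    `opener` is injectable only so the function can be exercised
    without network access; by default it is urllib.request.urlopen.
    """
    with opener(url, timeout=timeout) as response:
        return response.read()

# Offline demonstration with a stand-in opener (hypothetical, for testing only).
def fake_opener(url, timeout):
    return io.BytesIO(b'<OAI-PMH/>')

body = fetch('http://example.org/oai', opener=fake_opener)
print(body)
```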

<!--
  This is just a test to use the NCBI EFetch APIs to augment the output of PubMed Central's OAI feed.
  In short, it's a web service that sits on top of the PubMed Central OAI API.

  *** DO NOT use this service to harvest OAI records from PubMed Central ... you will probably mess up your repository!
  ... and I haven't verified that the additional data being added to the OAI feed is accurate per the item.

  Currently, this supports the following OAI parameters:
 
   - ListRecords
   - set
   - metadataPrefix (must use "oai_dc"/Dublin Core)
   - resumptionToken
 
  Thanks, Nitin Arora (humaneguitarist.org), May 2012.
 
  ps: adding metadata increased the OAI response time by 22.6178297997 seconds.
  -->
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>2012-05-27T13:34:17Z</responseDate>
 <request verb="ListRecords" metadataPrefix="oai_dc" set="aac">http://www.pubmedcentral.nih.gov/oai/oai.cgi</request>
 <ListRecords>
  <record>
   <header>
    <identifier>oai:pubmedcentral.nih.gov:89011</identifier>
    <datestamp>2002-09-12</datestamp>
    <setSpec>aac</setSpec>
   </header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
     <dc:title>Antifungal Peptides: Novel Therapeutic Compounds against Emerging Pathogens</dc:title>
     <dc:creator>De Lucca, Anthony J.</dc:creator>
     <dc:creator>Walsh, Thomas J.</dc:creator>
     <dc:subject>Minireviews</dc:subject>
     <dc:description/>
     <dc:publisher>American Society for Microbiology</dc:publisher>
     <dc:identifier>http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=89011</dc:identifier>
     <dc:type>Text</dc:type>
     <dc:language>en</dc:language>
     <dc:rights/>
     <dc:contributor.affiliation>Southern Regional Research Center, Agricultural Research Service, U. S. Department of Agriculture, New Orleans, Louisiana 70124, USA. adelucca@nola.srrc.usda.gov</dc:contributor.affiliation>
     <dc:subject.mesh>Animals</dc:subject.mesh>
     <dc:subject.mesh>Anti-Bacterial Agents</dc:subject.mesh>
     <dc:subject.mesh>Antifungal Agents</dc:subject.mesh>
     <dc:subject.mesh>Fungi</dc:subject.mesh>
     <dc:subject.mesh>Humans</dc:subject.mesh>
     <dc:subject.mesh>Mycoses</dc:subject.mesh>
     <dc:subject.mesh>Peptides</dc:subject.mesh>
    </oai_dc:dc>
   </metadata>
  </record>
  <resumptionToken>oai%3Apubmedcentral.nih.gov%3A89061!!!oai_dc!aac</resumptionToken>
 </ListRecords>
</OAI-PMH>

Python:

### pmc-oai-topper.py
### 2012, Nitin Arora

### import modules
##import urllib #DELETE
from google.appengine.api import urlfetch #see: https://developers.google.com/appengine/docs/python/urlfetch/overview
from lxml import etree
import time
import webapp2

### set what additional metadata to get from the EFetch API
additions = [('contributor.affiliation', 'Affiliation'),
             ('subject.mesh', 'DescriptorName')] #(name of element to output to, XPath); eventually needs to be in external config file
            #note: the XPath has to refer to elements in the EFetch XML output for the PubMed database as in "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12654674&retmode=xml"

#####
class pmctopper(webapp2.RequestHandler):
  def get(self):

    #GET OAI parameter values
    verb_value = self.request.get('verb')
    metadataPrefix_value = self.request.get('metadataPrefix')
    set_value = self.request.get('set')
    resumptionToken_value = self.request.get('resumptionToken')

    #define the *real* OAI feed URL and read it
    if resumptionToken_value: #if a resumptionToken is being used
      url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&resumptionToken=%s' %(verb_value, resumptionToken_value)
    elif set_value:
      url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&set=%s&metadataPrefix=%s' %(verb_value, set_value, metadataPrefix_value)
    else:
      url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&metadataPrefix=%s' %(verb_value, metadataPrefix_value)

##    oai_in = urllib.urlopen(url).read() #DELETE
    oai_in = urlfetch.fetch(url=url, deadline=60).content
    time_in = time.time() #tracking how long this takes

    #parse OAI response as XML
    oai_parsed = etree.XML(oai_in)
    root = oai_parsed.xpath('.') #root node
    dc = root[0].xpath('//oai_dc:dc',
                            namespaces={'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
                            'dc': 'http://purl.org/dc/elements/1.1/'}) #access dc:* nodes (i.e. each item)

    #loop through all items and for each go fetch additional metadata via the EFetch APIs for PubMed Central and PubMed
    #place that additional data into the original OAI feed
    i = 0
    for record in dc:
      identifier = record.xpath('dc:identifier',
                            namespaces={'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
                            'dc': 'http://purl.org/dc/elements/1.1/'}) #relative path: only this record's dc:identifier elements
      pmc_id =(identifier[0].text).replace('http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=','') #get the article's unique ID

      #request PubMed ID from Pubmed Central API ... ugh!
      efetch_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=%s' %pmc_id #this is the URL to get metadata about the article per its ID
##      efetch_read = urllib.urlopen(efetch_url).read() #DELETE
      efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content #read the API response
      efetch_parsed = etree.XML(efetch_read) #parse as XML
      pubmed_ids = efetch_parsed.xpath('//article-id[@pub-id-type="pmid"]/text()') #xpath() returns a list of matches
      if not pubmed_ids: #no PubMed ID found, so there is nothing to add to this record
        i = i + 1
        continue
      pubmed_id = pubmed_ids[0] #use the first match as the PubMed ID

      #now(!) get the additional data from the PubMed API
      efetch_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%s&retmode=xml' %pubmed_id
##      efetch_read = urllib.urlopen(efetch_url).read() #DELETE
      efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content
      efetch_parsed = etree.XML(efetch_read)

      for addition in additions:
        added_element = efetch_parsed.xpath('//%s/text()' %addition[1]) #get data from API XML tree
        for added_value in added_element:
          etree.SubElement(record, '{http://purl.org/dc/elements/1.1/}%s' %addition[0]).text = added_value

      i = i + 1

    #for reporting how long this all takes
    time_out = time.time()
    time_diff = str(time_out - time_in)
    
    #output the *new* OAI results with the additional metadata
    self.response.headers['Content-Type'] = 'text/xml' #output as XML doc
    disclaimer= '''<!--
    This is just a test to use the NCBI EFetch APIs to augment the output of PubMed Central's OAI feed.
    In short, it's a web service that sits on top of the PubMed Central OAI API.

    *** DO NOT use this service to harvest OAI records from PubMed Central ... you will probably mess up your repository!
    ... and I haven't verified that the additional data being added to the OAI feed is accurate per the item.

    Currently, this supports the following OAI parameters:
    
      - ListRecords
      - set
      - metadataPrefix (must use "oai_dc"/Dublin Core)
      - resumptionToken
    
    Thanks, Nitin Arora (humaneguitarist.org), May 2012.
    
    ps: adding metadata increased the OAI response time by %s seconds.
    -->''' %time_diff
    self.response.out.write(disclaimer)
    for node in root:
      self.response.out.write(etree.tostring(node))

### app engine stuff ...
app = webapp2.WSGIApplication([('/oai', pmctopper)],
                              debug=True)
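
Since the handler passes `resumptionToken` through, a harvester can page through a whole set by following the tokens. Here is a rough Python 3 sketch of that loop from the client's side. The `fetch` callable and the canned `PAGES` are hypothetical so it runs offline; the localhost URL is the one from step 1, and the tokens are made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'

def harvest(base_url, first_params, fetch):
    """Yield each <record>, following resumptionTokens until none is left.

    fetch is a hypothetical callable: URL in, XML text out.
    """
    params = dict(first_params)
    while True:
        root = ET.fromstring(fetch(base_url + '?' + urlencode(params)))
        for record in root.iter(OAI_NS + 'record'):
            yield record
        token = root.find('.//' + OAI_NS + 'resumptionToken')
        if token is None or not (token.text or '').strip():
            break  # no token, or an empty one: this was the last page
        # a follow-up request carries only the verb and the token
        params = {'verb': params['verb'], 'resumptionToken': token.text.strip()}

# Two canned pages keyed by query string, standing in for real responses.
PAGES = {
    'verb=ListRecords&metadataPrefix=oai_dc&set=aac':
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        '<record/><resumptionToken>page2</resumptionToken></ListRecords></OAI-PMH>',
    'verb=ListRecords&resumptionToken=page2':
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        '<record/><record/><resumptionToken/></ListRecords></OAI-PMH>',
}

def fake_fetch(url):
    return PAGES[url.split('?', 1)[1]]

records = list(harvest('http://localhost:8084/oai',
                       {'verb': 'ListRecords', 'metadataPrefix': 'oai_dc', 'set': 'aac'},
                       fake_fetch))
print(len(records))
```

With 20-odd seconds added per page, a real harvester would want a generous timeout on each of those requests.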
--------------

Written by nitin

May 27th, 2012 at 10:11 am

facet mashing, a tragedy in 0.987 acts

Update, March 21, 2012: I'm at DrupalCon 2012 and after going to a session on node.js – which I've had in the back of my head as a potential replacement for Python for some metadata harvesting software I'm working on – I was reminded of OpenCalais which I haven't looked at in forever, probably because I wouldn't have understood it before. Anyway, maybe that's a solution to the issues I'm describing below in terms of generating some sort of browse-able facets. This is definitely something to look into.

Home sick again, so that means another meaningless contribution to the "blogosphere" …

So, I've been working with some folks on a project to make a single site search for digital collections across the state I work in.

We're using Solr for the index and OAI feeds for now even though the metadata harvesting software is agnostic of OAI and can support other feed types, etc. But that's not the point here …

The point is that metadata coming in from different places makes for a mess if you want to expose facets … and we might veer to not showing them because no one wants to get into the murky waters of trying to control for that across multiple places.

I think subject facets are still useful though because I like to "play around", to stumble in the dark, and just have fun.

But, of course, there's still the fact-of-the-matter that across multiple institutions you might see subjects from one place written as "Asheville, NC" and another as "Asheville, (N.C.)".

Well, that stinks. These are essentially the same thing, but would get exposed as two separate facets.

So, in the spirit of stumbling in the dark, last Saturday morning I worked on a preliminary little function in Python to try and merge strings like the Asheville example above.

The idea is that the function should present to the user the version that has more "votes", i.e. the one that has more matches in the current search results. So, if "Asheville, NC" appeared 10 times and "Asheville, (N.C.)" appeared 15 times in the user's search results, the function would display "Asheville, (N.C.)" to the user and say it has 25 matches. When the user clicks "Asheville, (N.C.)" a search would be launched for either "Asheville, (N.C.)" or "Asheville, NC". Essentially, the idea is to beautify the facets at the last possible moment (i.e. through a function in the user interface) so the user doesn't have to see the ugly reality of metadata from all over the place; it's also about rectifying things based on text similarity not on semantic similarity – which is another ballgame altogether.
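To make the "votes" idea concrete, here is a small Python 3 sketch. It is not the function shown further down: it swaps in `difflib.SequenceMatcher` from the standard library as the similarity measure, and the 0.8 threshold and the `OR` query string are arbitrary choices of mine for illustration.

```python
from difflib import SequenceMatcher

def merge_facets(facets, threshold=0.8):
    """Group similar facet labels; show the label with the most matches.

    facets: list of (label, count) pairs from the current search results.
    threshold: similarity cutoff, an arbitrary value picked for this sketch.
    """
    groups = []  # each group is a list of (label, count) pairs
    for label, count in sorted(facets, key=lambda f: -f[1]):
        for group in groups:
            # compare against the group's top-voted label
            if SequenceMatcher(None, label, group[0][0]).ratio() >= threshold:
                group.append((label, count))
                break
        else:
            groups.append([(label, count)])
    merged = []
    for group in groups:
        display = group[0][0]  # most votes, since input was sorted by count
        total = sum(count for _, count in group)
        query = ' OR '.join('"%s"' % label for label, _ in group)
        merged.append((display, total, query))
    return merged

result = merge_facets([('Asheville, NC', 10), ('Asheville, (N.C.)', 15), ('Raleigh', 4)])
for display, total, query in result:
    print(display, total, query)
```

For the Asheville example this shows "Asheville, (N.C.)" with 25 matches and a query that searches for either spelling, while "Raleigh" is left alone.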

The function uses some known string similarity methods. It's promising but there's still lots of work to do if I really decide to pursue this. And by "lots of work" I really mean seeing if someone with the proper computer science and linguistic background has already written a library for this kind of thing. And (adding this the day after I originally wrote this), I also need to play with s-match.

Anyway, the test code is below and the results are below that but I need to stop writing because I'm dropping out and need to take a nap.

:/

#####
def facetMasher(x,y):
  info = "Comparing \"%s\" with %s facets, against \"%s\" with %s facets." %(x[0],x[1],y[0],y[1])
  print info
 
  output = ""
 
  import Levenshtein #Windows32/Python 2.7 installer: http://sourceforge.net/projects/translate/files/python-Levenshtein/
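  lev = Levenshtein.jaro #note: this is the plain Jaro similarity (the Winkler variant is Levenshtein.jaro_winkler), so the "Jaro-Winkler" label printed below is a misnomer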
  myJaro = lev(x[0], y[0])
 
  lev2 = Levenshtein.distance
  myDist = lev2(x[0], y[0])
 
  print "Jaro-Winkler score: ", myJaro
  print "Levenshtein distance: ", myDist
  if myJaro > .95 or (myJaro > .75 and myDist < 10):
      if myDist > 1:
          totalFacets = x[1] + y[1]
          if (x[1] >= y[1]):
              mergedString = x[0]
          else:
              mergedString = y[0]
          output =  "Merging to \"%s\" with %s facets." %(mergedString, totalFacets)
  if output == "":
    output = "Keeping \"%s\" with %s facets, and \"%s\" with %s facets." %(x[0],x[1],y[0],y[1])

  print output
  print ("--\n")
      
##### tests ...
facetMasher (("Bibles",3),("bible",2)) #interesting ...
facetMasher (("Fibles",3),("fible",2))

facetMasher (("World War 1",3),("World War 2",2))

facetMasher (("Images",4),("image",3))
facetMasher (("Images",2),("movies",3))

facetMasher (("Asheville, NC",3),("Asheville (N.C.)",2))
facetMasher (("Asheville, (NC)",3),("Asheville (N.C.)",2))
facetMasher (("Granville County (N.C.)",120),("Granville County, N.C.",2))

facetMasher (("foo & bar",3),("foo and bar",2))

facetMasher (("United States--History--Civil War, 1861-1865",3),("United States--History--Civil War, 1861-1865--Correspondence",2))

facetMasher (("United States--History--World War II",3),("United States--History--World War I",2))
facetMasher (("United States--History--World War Two",3),("United States--History--World War 2",2))
facetMasher (("United States--History--World War Two",3),("United States--History--World War 1",2))
facetMasher (("United States--History--World War 1",3),("United States--History--World War 2",2))

And here are the results, below. It's interesting how "Bibles" vs. "bible" doesn't merge, yet "Fibles" and "fible" do. Also, there are some undesired results such as merging "United States--History--World War Two" with "United States--History--World War 1" because the algorithm still sucks.

Comparing "Bibles" with 3 facets, against "bible" with 2 facets.
Jaro-Winkler score:  0.738888888889
Levenshtein distance:  2
Keeping "Bibles" with 3 facets, and "bible" with 2 facets.
--

Comparing "Fibles" with 3 facets, against "fible" with 2 facets.
Jaro-Winkler score:  0.822222222222
Levenshtein distance:  2
Merging to "Fibles" with 5 facets.
--

Comparing "World War 1" with 3 facets, against "World War 2" with 2 facets.
Jaro-Winkler score:  0.939393939394
Levenshtein distance:  1
Keeping "World War 1" with 3 facets, and "World War 2" with 2 facets.
--

Comparing "Images" with 4 facets, against "image" with 3 facets.
Jaro-Winkler score:  0.822222222222
Levenshtein distance:  2
Merging to "Images" with 7 facets.
--

Comparing "Images" with 2 facets, against "movies" with 3 facets.
Jaro-Winkler score:  0.666666666667
Levenshtein distance:  4
Keeping "Images" with 2 facets, and "movies" with 3 facets.
--

Comparing "Asheville, NC" with 3 facets, against "Asheville (N.C.)" with 2 facets.
Jaro-Winkler score:  0.891025641026
Levenshtein distance:  5
Merging to "Asheville, NC" with 5 facets.
--

Comparing "Asheville, (NC)" with 3 facets, against "Asheville (N.C.)" with 2 facets.
Jaro-Winkler score:  0.936111111111
Levenshtein distance:  3
Merging to "Asheville, (NC)" with 5 facets.
--

Comparing "Granville County (N.C.)" with 120 facets, against "Granville County, N.C." with 2 facets.
Jaro-Winkler score:  0.955862977602
Levenshtein distance:  3
Merging to "Granville County (N.C.)" with 122 facets.
--

Comparing "foo & bar" with 3 facets, against "foo and bar" with 2 facets.
Jaro-Winkler score:  0.809553872054
Levenshtein distance:  3
Merging to "foo & bar" with 5 facets.
--

Comparing "United States--History--Civil War, 1861-1865" with 3 facets, against "United States--History--Civil War, 1861-1865--Correspondence" with 2 facets.
Jaro-Winkler score:  0.911111111111
Levenshtein distance:  16
Keeping "United States--History--Civil War, 1861-1865" with 3 facets, and "United States--History--Civil War, 1861-1865--Correspondence" with 2 facets.
--

Comparing "United States--History--World War II" with 3 facets, against "United States--History--World War I" with 2 facets.
Jaro-Winkler score:  0.990740740741
Levenshtein distance:  1
Keeping "United States--History--World War II" with 3 facets, and "United States--History--World War I" with 2 facets.
--

Comparing "United States--History--World War Two" with 3 facets, against "United States--History--World War 2" with 2 facets.
Jaro-Winkler score:  0.963449163449
Levenshtein distance:  3
Merging to "United States--History--World War Two" with 5 facets.
--

Comparing "United States--History--World War Two" with 3 facets, against "United States--History--World War 1" with 2 facets.
Jaro-Winkler score:  0.963449163449
Levenshtein distance:  3
Merging to "United States--History--World War Two" with 5 facets.
--

Comparing "United States--History--World War 1" with 3 facets, against "United States--History--World War 2" with 2 facets.
Jaro-Winkler score:  0.980952380952
Levenshtein distance:  1
Keeping "United States--History--World War 1" with 3 facets, and "United States--History--World War 2" with 2 facets.
--
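
The "Bibles" vs. "bible" oddity can be traced by hand. The comparison is case-sensitive, so "B" matches nothing, and the lowercase "b" in the middle of "Bibles" gets paired with the first "b" of "bible". That leaves the matched characters out of order, which costs a transposition; "Fibles" has no such stray "f", so it scores higher. Below is a textbook Jaro similarity in plain Python 3 that makes this visible. It is my own sketch and not python-Levenshtein's code, so other inputs may not agree to the last digit, but for these three pairs it gives the scores printed above.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity with earliest-position matching."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)   # which characters of s2 are already matched
    matched1 = []              # matched characters of s1, in s1 order
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    matched2 = [s2[j] for j in range(len(s2)) if used[j]]
    # matched characters that line up out of order, counted in pairs
    t = sum(a != b for a, b in zip(matched1, matched2)) // 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3.0

for a, b in [('Bibles', 'bible'), ('Fibles', 'fible'), ('World War 1', 'World War 2')]:
    print(a, b, round(jaro(a, b), 4))
```

Lowercasing both strings before comparing would be one way to take case out of the picture.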
--------------

Written by nitin

March 15th, 2012 at 11:57 am

bidi bidi bidi and more on pOAIndexter-ing metadata

It's shaping up to be a sunny day and this means I need to go on a long walk.

But before I do that, I'll follow up to this post about grabbing OAI metadata from an online source and throwing the metadata into Solr for searching purposes, etc.

Last night – while streaming the Gil Gerard iteration of Buck Rogers – I wrote a small PHP script to grab this OAI metadata from the Library of Congress' site. BTW: this is a cool page of theirs that helps one get started with OAI feeds, etc.

Aside: Is it only since the advent of hypertext that the word "this" began appearing in a referential context within documents?

As I mentioned in the previous post, an XML config file will instruct the code where to get the metadata and which XSL file will be used to transform the data into something Solr can chew on. I haven't bothered with the config file yet, so for now I just tested it on the specific metadata linked to above since the config file aspect of this is the most trivial component of the whole thing.

Anyway, below is the PHP file, the OAI to Solr XSL file, and a snippet of the output. Last is a Python script that does the same thing as the PHP. It's not OO like the PHP file, but I just whipped it up this morning for shiggles.

Here's the PHP …

<?php

function grabMetadata($urlArg) {
    $ch = curl_init(); // see: http://php.net/manual/en/book.curl.php
    curl_setopt($ch, CURLOPT_URL, $urlArg);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $curlOut = curl_exec($ch);
    curl_close($ch); // close the handle before returning; anything after "return" never runs
    return $curlOut;
}

// See "http://www.php.net/manual/en/xsltprocessor.transformtoxml.php" for instructions re: XSL processing as below.
function useXSL($output) {
    $search_results = new DOMDocument;
    $search_results->loadXML($output);
    // If you just use "load" instead of "loadXML" it won't work unless you first stored the XML results in a file (boo!).
    // For info on "loadXML" see: http://www.php.net/manual/en/domdocument.loadxml.php
    $proc = new XSLTProcessor;
    $xsl = new DOMDocument;
    $xsl->load('OAI_to_solr.xsl');
    $proc->importStyleSheet($xsl);
    $processed = $proc->transformToXML($search_results);
    return $processed;
}

function writeSOLR($solrXML) {
    $myFile = "for_solr-PHP.xml";
    $fh = fopen($myFile, 'w') or die("can't open file");
    fwrite($fh, utf8_encode($solrXML)); // For UTF-8, see: http://www.php.net/manual/en/function.fwrite.php#73764
    fclose($fh);
}

// Do stuff ...
$output = grabMetadata('http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc&set=papr');
writeSOLR(useXSL($output));
?>
The XSL file …
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
exclude-result-prefixes="oai_dc dc">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:template match="/">
    <add>
      <xsl:for-each select="//oai_dc:dc">
        <doc>
          <field name="identifier">
            <xsl:value-of select="dc:identifier" />
          </field>
          <field name="title">
            <xsl:value-of select="dc:title" />
          </field>
          <field name="creator">
            <xsl:value-of select="dc:creator" />
          </field>
          <xsl:for-each select="dc:subject">
            <field name="subject">
              <xsl:value-of select="." />
            </field>
          </xsl:for-each>
          <field name="description">
            <xsl:value-of select="dc:description" />
          </field>
        </doc>
      </xsl:for-each>
    </add>
  </xsl:template>
</xsl:stylesheet>
The Millionaire and his wife … er, wrong show. I mean the sample Solr XML snippet …
<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="creator">AFI/Kalinowski (Eugene) Collection (Library of Congress)</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Advertising--Electric household appliances--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Trade shows--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Silent films.</field>
    <field name="subject">Pittsburgh (Pa.)--Manufactures.</field>
    <field name="description">Largely graphic commercial for electric refrigerators in general and a refrigerator show, presumably in Pittsburgh, in particular.</field>
  </doc>

...

</add>
Some Python for fun …
import codecs
import urllib
from lxml import etree, _elementpath # see: http://lxml.de/
from lxml.etree import XSLT,fromstring

## some OAI metadata from the Library of Congress
url = 'http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc&set=papr'
metadata = urllib.urlopen(url).read()
metadata = etree.XML(metadata)

## the XSL file that will transform the OAI metadata to Solr
xsl = open('OAI_to_solr.xsl', 'r')
xsl = xsl.read()
xsl = etree.XML(xsl)

## XSL transformation
style = XSLT(xsl)
result = style(metadata) # calling the XSLT object runs the transformation

## the outputted Solr XML
fw = codecs.open('for_solr-PY.xml', 'w', 'utf-8-sig')
utf8_result = unicode(str(result), encoding='utf8')
fw.write(utf8_result)
fw.close()
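
For anyone without lxml handy, roughly the same mapping can be sketched in Python 3 with only the standard library. There's no XSLT in the standard library, so this builds the Solr XML directly instead of applying OAI_to_solr.xsl; `oai_to_solr` is a hypothetical helper that mirrors the stylesheet's fields, and the sample record is trimmed from the snippet above.

```python
import xml.etree.ElementTree as ET

NS = {'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
      'dc': 'http://purl.org/dc/elements/1.1/'}

def oai_to_solr(oai_xml):
    """Map each oai_dc:dc record to a Solr <doc>, mirroring OAI_to_solr.xsl."""
    add = ET.Element('add')
    for record in ET.fromstring(oai_xml).iter('{%s}dc' % NS['oai_dc']):
        doc = ET.SubElement(add, 'doc')
        # single-valued fields: first occurrence only, like xsl:value-of
        for name in ('identifier', 'title', 'creator'):
            field = ET.SubElement(doc, 'field', name=name)
            field.text = record.findtext('dc:' + name, default='', namespaces=NS)
        # one <field> per dc:subject, like the inner xsl:for-each
        for subject in record.findall('dc:subject', NS):
            ET.SubElement(doc, 'field', name='subject').text = subject.text
        field = ET.SubElement(doc, 'field', name='description')
        field.text = record.findtext('dc:description', default='', namespaces=NS)
    return add

sample = '''<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>Buy an electric refrigerator</dc:title>
<dc:subject>Refrigerators.</dc:subject>
<dc:subject>Silent films.</dc:subject>
<dc:identifier>http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</dc:identifier>
</oai_dc:dc>
</metadata>'''

solr = oai_to_solr(sample)
subjects = [f.text for f in solr.iter('field') if f.get('name') == 'subject']
print(ET.tostring(solr, encoding='unicode'))
```

The XSL route is still the better fit for the bigger plan, since a new source then only needs a new stylesheet rather than new code.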

And most importantly, the introduction to Buck Rogers in the 25th Century – Season 1, of course! I couldn't even make it through the first ten minutes of the Season 2 opener. I mean they changed the introduction which was brilliant and brilliantly narrated – as you shall see!

I'd prefer to watch the South Park spoof over the Season 2 insult-to-perfection any day of the week.

And here's a bad-ass fan trailer that I think respects the greatness of the first season.

--------------

Written by nitin

October 15th, 2011 at 9:05 am

Posted in scripts

pOAIndexter: grabbing and indexing online metadata

As per usual, a good bit of my computer-y stuff at home relates to something that's come up at work. And as usual, I'm pretty ignorant of what I'm getting myself into, but I don't mind.

The other week, my boss and I met with some great people at digitalnc.org and we started talking about the idea of having a super simple, lightweight approach to providing a one-stop-shop search interface for collections across the state – provided those collections expose their metadata somehow. For now, we talked about limiting this to people who do so with an OAI feed and grabbing that metadata. But eventually, this thing should be metadata agnostic – in the sense that it isn't about a metadata format, but just the data itself.

By the way, I guess "grabbing" and "feed" aren't what I typically see with OAI – about which I admittedly don't know much – but I don't care. Same difference.

Of course, there's nothing new to this. I guess one could use Blacklight or VuFind to do this kind of thing. But even though those are existing open source projects, I'm not sure that using them wouldn't be overkill and wouldn't in turn increase dependencies and maintenance overhead.

Actually, that's a topic for another time – I mean the idea that just because part of something is capable of doing what you want doesn't necessarily make it a better option than rolling one's own if using and updating said something entails more cost in the long run. Paved roads often get you there faster, but a willingness to get lost now and then is how you learn where all the really cool local bars are …

;)

Anyway, here's what I'm thinking. A small script would simply look at an XML setup file from which it would know which places to go grab metadata from, the type of feed, the last time the metadata was requested, and stuff like the resumptionToken if applicable. It would also store the appropriate XSL file to process the metadata with so that the metadata could be passed into Solr to be indexed and searchable. Anyone whose site doesn't provide metadata as XML could simply create a web service that does so, e.g. a RESTful MySQL to XML thingamajig. The outputted XML just needs to have an XSL that will facilitate passing it to Solr for that data to be part of the shared metadata store. And since XSL is the universal translator in this context, other metadata types such as RSS/ATOM feeds could be grabbed, too. All one needs to do is add to the XML config file so the script knows to retrieve metadata from that site and make sure there's an XSL file that can be used to facilitate passing the data into Solr. So in the end all this should take in terms of coding is a small script, one XML config file, and as many XSL files as needed.
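
To give a feel for it, here is one guess at what such a setup file and the code that reads it might look like, in Python 3 with the standard library. Every element and attribute name in `CONFIG` is made up for illustration, as is the `lastHarvest` date; the feed URL and the XSL filename are borrowed from the Library of Congress example in the follow-up post.

```python
import xml.etree.ElementTree as ET

# Hypothetical setup file: one <source> per place to grab metadata from.
CONFIG = '''<sources>
  <source name="loc-papr" type="oai">
    <url>http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=papr</url>
    <xsl>OAI_to_solr.xsl</xsl>
    <lastHarvest>2011-10-01</lastHarvest>
    <resumptionToken/>
  </source>
</sources>'''

def load_sources(config_xml):
    """Turn the setup file into a list of plain dicts, one per source."""
    sources = []
    for node in ET.fromstring(config_xml).iter('source'):
        sources.append({
            'name': node.get('name'),
            'type': node.get('type'),
            'url': node.findtext('url'),
            'xsl': node.findtext('xsl'),
            'last_harvest': node.findtext('lastHarvest'),
            # an empty element means there is no harvest to resume
            'resumption_token': node.findtext('resumptionToken') or None,
        })
    return sources

sources = load_sources(CONFIG)
for source in sources:
    print(source['name'], source['type'], source['xsl'])
```

Adding another collection would then mean adding one more `<source>` block and, if its format is new, one more XSL file.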

For fun and to start learning about Solr, I just manually grabbed some OAI metadata from CalTech yesterday – it was for some oral histories. And then I ran them through an XSL file and then posted them to Solr. Within no time I had a searchable, local metadata store to play around with (screenshot below). Since I was using all the defaults from the Solr tutorial I had to map the <dc:creator> field to things like manufacturer, since the default is set up for an electronics store.

Solr screenshot

BTW if we use this, at some point I won't be able to call it "pOAIndexter" but for now I can.

Since I don't know if I'll do this in Python or PHP and since OAI is what we'll work on first, I guess it stands for "Python or PHP OAI Indexer".

Yes, I'm a dork.

--------------

Written by nitin

October 2nd, 2011 at 11:20 am
