awesome sauce: augmenting PubMed Central's OAI response

Update, 9 pm EST, May 27, 2012: Well, this is interesting. After reading this page [http://www.ncbi.nlm.nih.gov/pmc/tools/oai/], I see that by setting the "metadataPrefix" to "pmc_fm" I can apparently bypass steps #3 and #4 altogether, provided one's OAI harvester/indexer is set to ingest the data in that format instead of Dublin Core, or provided the script below transforms the data to Dublin Core before returning it. Anyway ... score one for documentation and reading it after the fact!

...

I saw a post from a Metadata Librarian on the code4lib [http://www.code4lib.org/] list about their work with placing article data from PubMed into DSpace. They are doing some metadata additions and cleanup in Excel, so I emailed them off-list to let them know about PubMed2XL [http://blog.humaneguitarist.org/projects/pubmed2xl/], and we went back and forth on a few things.

Among the things I learned from them was that PubMed Central has an OAI feed. Cool! But that OAI feed doesn't return all the data they need. Here's an example: http://www.pubmedcentral.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=aac. One of the additional bits of data they wanted was author affiliation, which is available from PubMed.gov's XML output. Same for the MeSH terms.

Anyway, besides pushing PubMed2XL, I also mentioned that it would be interesting to make a sauce, if you will, for PubMed Central's OAI feed. In other words, rather than using the OAI link above, one would use a service on top of that a la: http://myPubMedCentralOAI_sauce.com/oai?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
And when one went to that URL, the service would fetch the real OAI feed from PubMed Central and then get the additional metadata from the NCBI EFetch APIs [http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html]. It would then drop the additional metadata into the original OAI response and finally serve it up to the user (e.g. the OAI harvester).

I went ahead and played with a proof-of-concept using Google App Engine, and it's working, although it's adding about 20 - 25 seconds to the OAI response time. BTW: it's faster when I run it from localhost and not actually live on App Engine.

Here's how it's done.

1. The user goes to http://localhost:8084/oai?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
2. The app then fetches http://www.pubmedcentral.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
3. For each record, the app parses out the PubMed Central ID and uses the EFetch API with PubMed Central as the database to get more data about the item.
   + Here's a sample URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=152494.
4. Unfortunately, the API for PubMed Central doesn't return MeSH terms, so the app just uses the data returned in step #3 to translate the PubMed Central ID to the regular PubMed ID.
5. With the PubMed ID now in hand, the app goes to the EFetch API again, this time specifying PubMed as the database and handing it the PubMed ID from step #4.
   + Here's a sample URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12654674&retmode=xml.
     o BTW, this is for the same article as the example link in step #3 ... I think.
6. Now the app gets the Affiliation value and the MeSH terms and adds them to the real OAI response from step #2.
7. Finally (whew!), the app returns the OAI feed with more metadata than before.

This seems super klunky, so I'd love to hear about more elegant ways to do this ... like having more options from PubMed Central without 3rd party hacks! But it is working. And it's just a proof-of-concept ...

Video: http://www.youtube.com/embed/m3dZl3yfGpc

Below, I've pasted a snippet of the augmented OAI data. Below that is the Python code if anyone's interested.

ps: Python users will notice I used Google App Engine's "urlfetch" instead of "urllib" to request URLs. This is because using the latter was causing 500 errors due to timeouts. From what I've read, I don't think you can set the timeout with "urllib" in App Engine, so I used "urlfetch", which lets one set it to as much as 60 seconds.

    <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
      <responseDate>2012-05-27T13:34:17Z</responseDate>
      <request verb="ListRecords" metadataPrefix="oai_dc" set="aac">http://www.pubmedcentral.nih.gov/oai/oai.cgi</request>
      <ListRecords>
        <record>
          <header>
            <identifier>oai:pubmedcentral.nih.gov:89011</identifier>
            <datestamp>2002-09-12</datestamp>
            <setSpec>aac</setSpec>
          </header>
          <metadata>
            <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
              <dc:title>Antifungal Peptides: Novel Therapeutic Compounds against Emerging Pathogens</dc:title>
              <dc:creator>De Lucca, Anthony J.</dc:creator>
              <dc:creator>Walsh, Thomas J.</dc:creator>
              <dc:subject>Minireviews</dc:subject>
              <dc:description/>
              <dc:publisher>American Society for Microbiology</dc:publisher>
              <dc:identifier>http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=89011</dc:identifier>
              <dc:type>Text</dc:type>
              <dc:language>en</dc:language>
              <dc:rights/>
              <dc:contributor.affiliation>Southern Regional Research Center, Agricultural Research Service, U. S. Department of Agriculture, New Orleans, Louisiana 70124, USA. adelucca@nola.srrc.usda.gov</dc:contributor.affiliation>
              <dc:subject.mesh>Animals</dc:subject.mesh>
              <dc:subject.mesh>Anti-Bacterial Agents</dc:subject.mesh>
              <dc:subject.mesh>Antifungal Agents</dc:subject.mesh>
              <dc:subject.mesh>Fungi</dc:subject.mesh>
              <dc:subject.mesh>Humans</dc:subject.mesh>
              <dc:subject.mesh>Mycoses</dc:subject.mesh>
              <dc:subject.mesh>Peptides</dc:subject.mesh>
            </oai_dc:dc>
          </metadata>
        </record>
        <resumptionToken>oai%3Apubmedcentral.nih.gov%3A89061!!!oai_dc!aac</resumptionToken>
      </ListRecords>
    </OAI-PMH>

Python:

    ### pmc-oai-topper.py
    ### 2012, Nitin Arora

    ### import modules
    #urlfetch is used instead of urllib so that a request deadline can be set
    from google.appengine.api import urlfetch #see: https://developers.google.com/appengine/docs/python/urlfetch/overview
    from lxml import etree
    import time
    import webapp2

    ### set what additional metadata to get from the EFetch API
    #(name of element to output to, XPath); eventually needs to be in external config file
    #note: the XPath has to refer to elements in the EFetch XML output for the PubMed database as in "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12654674&retmode=xml"
    additions = [('contributor.affiliation', 'Affiliation'), ('subject.mesh', 'DescriptorName')]

    ### namespaces used in the OAI response
    namespaces = {'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/', 'dc': 'http://purl.org/dc/elements/1.1/'}

    #####
    class pmctopper(webapp2.RequestHandler):
        def get(self):

            #GET OAI parameter values
            verb_value = self.request.get('verb')
            metadataPrefix_value = self.request.get('metadataPrefix')
            set_value = self.request.get('set')
            resumptionToken_value = self.request.get('resumptionToken')

            #define the *real* OAI feed URL and read it
            if resumptionToken_value: #if a resumptionToken is being used
                url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&resumptionToken=%s' %(verb_value, resumptionToken_value)
            elif set_value:
                url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&set=%s&metadataPrefix=%s' %(verb_value, set_value, metadataPrefix_value)
            else:
                url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&metadataPrefix=%s' %(verb_value, metadataPrefix_value)
            oai_in = urlfetch.fetch(url=url, deadline=60).content
            time_in = time.time() #tracking how long this takes

            #parse OAI response as XML
            oai_parsed = etree.XML(oai_in)
            dc = oai_parsed.xpath('//oai_dc:dc', namespaces=namespaces) #access oai_dc:dc nodes (i.e. each item)

            #loop through all items and for each go fetch additional metadata via the EFetch APIs for PubMed Central and PubMed
            #place that additional data into the original OAI feed
            for record in dc:
                #use a relative XPath so only this record's own dc:identifier is read
                identifier = record.xpath('dc:identifier/text()', namespaces=namespaces)
                if not identifier:
                    continue #no identifier, so nothing to look up
                pmc_id = identifier[0].replace('http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=','') #get the article's unique ID

                #request PubMed ID from Pubmed Central API ... ugh!
                efetch_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=%s' %pmc_id #this is the URL to get metadata about the article per its ID
                efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content #read the API response
                efetch_parsed = etree.XML(efetch_read) #parse as XML
                pubmed_id = efetch_parsed.xpath('//article-id[@pub-id-type="pmid"]/text()') #pubmed id; note that xpath() returns a list
                if not pubmed_id:
                    continue #no PubMed ID, so there is nothing to add for this record

                #now(!) get the additional data from the PubMed API
                efetch_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%s&retmode=xml' %pubmed_id[0]
                efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content
                efetch_parsed = etree.XML(efetch_read)
                for addition in additions:
                    added_element = efetch_parsed.xpath('//%s/text()' %addition[1]) #get data from API XML tree
                    for added_value in added_element:
                        etree.SubElement(record, '{http://purl.org/dc/elements/1.1/}%s' %addition[0]).text = added_value

            #for reporting how long this all takes
            time_out = time.time()
            time_diff = str(time_out - time_in)

            #output the *new* OAI results with the additional metadata
            self.response.headers['Content-Type'] = 'text/xml' #output as XML doc
            #stand-in wording: the original disclaimer text is missing from this post
            disclaimer = '<!-- extra metadata added by pmc-oai-topper; processing took %s seconds -->' %time_diff
            self.response.out.write(disclaimer)
            self.response.out.write(etree.tostring(oai_parsed))

    ### app engine stuff ...
    app = webapp2.WSGIApplication([('/oai', pmctopper)], debug=True)
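
For anyone who wants to poke at the URL-building part of the handler outside App Engine, here is a minimal Python 3 sketch using only the standard library. The function name build_upstream_url and its parameter names are my own inventions for illustration, not part of the script above. It follows the same three branches as the handler, and it relies on the OAI-PMH rule that resumptionToken is an exclusive argument, so when a token is present only verb and resumptionToken get sent. Unlike the handler, it percent-encodes the values with urlencode.

```python
from urllib.parse import urlencode

# Base URL of the real OAI feed, as used in the post.
PMC_OAI_BASE = "http://www.pubmedcentral.gov/oai/oai.cgi"


def build_upstream_url(verb, metadata_prefix="", set_spec="", resumption_token=""):
    """Return the PubMed Central OAI URL for the given request arguments.

    Hypothetical helper: mirrors the if/elif/else in the handler above.
    """
    if resumption_token:
        # resumptionToken is exclusive: send it with the verb and nothing else.
        params = [("verb", verb), ("resumptionToken", resumption_token)]
    else:
        params = [("verb", verb), ("metadataPrefix", metadata_prefix)]
        if set_spec:
            params.append(("set", set_spec))
    # urlencode percent-encodes each value, e.g. ":" becomes "%3A".
    return PMC_OAI_BASE + "?" + urlencode(params)


first_page_url = build_upstream_url("ListRecords", "oai_dc", "aac")
next_page_url = build_upstream_url(
    "ListRecords",
    resumption_token="oai:pubmedcentral.nih.gov:89061!!!oai_dc!aac",
)
print(first_page_url)
print(next_page_url)
```

The first call reproduces the URL from step #2; the second shows what a follow-up request looks like once the harvester sends back a resumptionToken.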
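
And here is a rough offline sketch of step #6 (dropping the extra values into a Dublin Core record), again in Python 3 and with no network calls. It swaps lxml for the standard library's xml.etree.ElementTree, so it is not the same code path as the script above. The function name add_pubmed_metadata is hypothetical, and the two sample XML strings are trimmed, hand-made stand-ins rather than real API output, although the values in them are taken from the snippet earlier in this post.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Keep the familiar prefixes when serializing instead of ns0, ns1, ...
ET.register_namespace("dc", DC_NS)
ET.register_namespace("oai_dc", OAI_DC_NS)

# (Dublin Core element to write, element name to read from the PubMed XML)
ADDITIONS = [
    ("contributor.affiliation", "Affiliation"),
    ("subject.mesh", "DescriptorName"),
]

# Hand-made stand-in for one oai_dc:dc record from the OAI feed.
SAMPLE_DC = (
    '<oai_dc:dc xmlns:oai_dc="%s" xmlns:dc="%s">'
    "<dc:title>Antifungal Peptides: Novel Therapeutic Compounds "
    "against Emerging Pathogens</dc:title>"
    "</oai_dc:dc>" % (OAI_DC_NS, DC_NS)
)

# Hand-made stand-in for the PubMed EFetch response (heavily simplified).
SAMPLE_PUBMED = (
    "<PubmedArticleSet><PubmedArticle>"
    "<Affiliation>Southern Regional Research Center</Affiliation>"
    "<DescriptorName>Antifungal Agents</DescriptorName>"
    "<DescriptorName>Peptides</DescriptorName>"
    "</PubmedArticle></PubmedArticleSet>"
)


def add_pubmed_metadata(dc_record, pubmed_root, additions=ADDITIONS):
    """Append one dc:* child to dc_record for each matching PubMed element."""
    for dc_name, pubmed_name in additions:
        # iter() walks the whole tree, much like the '//name' XPath in the script.
        for found in pubmed_root.iter(pubmed_name):
            if found.text:
                child = ET.SubElement(dc_record, "{%s}%s" % (DC_NS, dc_name))
                child.text = found.text
    return dc_record


record = add_pubmed_metadata(ET.fromstring(SAMPLE_DC), ET.fromstring(SAMPLE_PUBMED))
mesh_terms = [el.text for el in record.findall("{%s}subject.mesh" % DC_NS)]
print(mesh_terms)
print(ET.tostring(record, encoding="unicode"))
```

Keeping the (output element, source element) pairs in one list, as the script does with "additions", means pulling in another PubMed field is a one-line change.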