discoveries in digital audio, music notation, and information encoding

Archive for the ‘PubMed’ tag

PubMed2XL 2.0 now available for download

PubMed2XL 2.0 is now available.

You can read the documentation and download the newest version here.

There are a few notable changes to the graphical user interface (GUI) and lots of huge changes under the hood.

As for the GUI, the visible changes are as follows:

  • The software now notifies you when a newer version is available;
  • You can now turn the processing of book (non-journal) citations on/off;
  • You can toggle between Excel 2007 (.xls) and OpenDocument (.ods) output;
  • You can now save your preferences;
  • You can inspect a stylesheet and see the column title names before selecting that stylesheet;
  • … and there's even a simple logo (below) thanks to klukeart's icon on IconArchive.

And for programmers there is now a "" Python library that has lots of functions that I hope might be useful for folks. The GUI is now built on top of the library so there's a clear separation of concerns between data processing functions and user interface. Non-Python programmers can also call the library functions via the command line.

OK, here's the logo (made with Inkscape):

PubMed2XL 2.0 logo

I'd also like to say congrats to Andy Murray for winning Wimbledon today!


Written by nitin

July 7th, 2013 at 4:11 pm

Posted in news,scripts,XML

awesome sauce: augmenting PubMed Central’s OAI response

Update, 9 pm EST, May 27, 2012: Well, this is interesting. After reading this page, I see that by setting the "metadataPrefix" to "pmc_fm" I can bypass steps #3 and #4 altogether it seems – provided one's OAI harvester/indexer is set to ingest the data in that format instead of Dublin Core or provided the script below transforms the data to Dublin Core before returning it. Anyway … score one for documentation and reading it after-the-fact!
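The two request shapes discussed in this update differ only in the query string. Here is a minimal sketch of building such a request, assuming the local test address that appears in the walkthrough in this post as the base URL; the function name is made up for illustration. It also follows the OAI-PMH rule that a resumptionToken is sent with the verb only.

```python
from urllib.parse import urlencode

# Assumption: the local test address used in this post, not the real PubMed Central endpoint.
BASE_URL = "http://localhost:8084/oai"


def build_oai_url(verb, metadata_prefix=None, set_spec=None,
                  resumption_token=None, base_url=BASE_URL):
    """Return the request URL for an OAI-PMH verb such as ListRecords."""
    if resumption_token:
        # resumptionToken is an exclusive argument: only the verb goes with it.
        params = [("verb", verb), ("resumptionToken", resumption_token)]
    else:
        params = [("verb", verb), ("metadataPrefix", metadata_prefix)]
        if set_spec:
            params.append(("set", set_spec))
    return base_url + "?" + urlencode(params)


print(build_oai_url("ListRecords", "oai_dc", "aac"))
print(build_oai_url("ListRecords", "pmc_fm", "aac"))
```

Switching from Dublin Core to "pmc_fm" is then a one-argument change, provided whatever consumes the response can read that format.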

I saw a post from a Metadata Librarian on the code4lib list about their work with placing article data from PubMed into DSpace. They are doing some metadata additions and cleanup in Excel so I emailed them off-list and let them know about PubMed2XL and we went back and forth on a few things. Among the things I learned from them was that PubMed Central has an OAI feed. Cool!

But that OAI feed doesn't return all the data they need.

Here's an example:

One of the additional bits of data they wanted was author affiliation, which is available from PubMed's XML output. Same for the MeSH terms.

Anyway, besides pushing PubMed2XL, I also mentioned that it would be interesting to make a sauce, if you will, for PubMed Central's OAI feed. In other words, rather than using the OAI link above, one would use a service that sits on top of it. When one went to that service's URL, the service would fetch the real OAI feed from PubMed Central and then get the additional metadata from the NCBI EFetch APIs. It would then drop the additional metadata into the original OAI response and finally serve it up to the user (e.g. the OAI harvester).

I went ahead and played with a proof-of-concept using Google App Engine and it's working although it's adding about 20 – 25 seconds to the OAI response time. BTW: it's faster when I run it from localhost and not actually live on App Engine.

Here's how it's done.

  1. The user goes to http://localhost:8084/oai?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
  2. The app then fetches the real OAI feed from PubMed Central.
  3. For each record, the app parses out the PubMed Central ID and uses the EFetch API with PubMed Central as the database to get more data about the item.
  4. Unfortunately, the API for PubMed Central doesn't return MeSH terms, so the app just uses the data returned in step #3 to translate the PubMed Central ID to the regular PubMed ID.
  5. With the PubMed ID now in hand, the app goes to the EFetch API and specifies PubMed as the database and hands the API the PubMed ID from step #4.
  6. Now the app gets the <Affiliation> value and the MeSH terms and adds them to the real OAI response from step #2.
  7. Finally (whew!), the app returns the OAI feed with more metadata than before.
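Steps 3 to 6 are mostly XML lookups, so they can be tried offline. The sketch below uses the standard library's ElementTree rather than lxml and works on two made-up, heavily trimmed sample responses. The function names and sample values are placeholders for illustration, and the Dublin Core namespace handling is left out.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified stand-ins for the two EFetch responses.
PMC_XML = """<pmc-articleset><article><front><article-meta>
  <article-id pub-id-type="pmc">000001</article-id>
  <article-id pub-id-type="pmid">1234567</article-id>
</article-meta></front></article></pmc-articleset>"""

PUBMED_XML = """<PubmedArticleSet><PubmedArticle><MedlineCitation>
  <Article><Affiliation>Example University, Example City.</Affiliation></Article>
  <MeshHeadingList>
    <MeshHeading><DescriptorName>Antifungal Agents</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Peptides</DescriptorName></MeshHeading>
  </MeshHeadingList>
</MedlineCitation></PubmedArticle></PubmedArticleSet>"""

# (name of element to output to, element name in the PubMed XML)
ADDITIONS = [("contributor.affiliation", "Affiliation"),
             ("subject.mesh", "DescriptorName")]


def pmid_from_pmc(pmc_xml):
    """Step 4: translate a PubMed Central response into a PubMed ID."""
    for node in ET.fromstring(pmc_xml).iter("article-id"):
        if node.get("pub-id-type") == "pmid":
            return node.text
    return None


def collect_additions(pubmed_xml):
    """Step 6, first half: gather (output name, value) pairs."""
    root = ET.fromstring(pubmed_xml)
    return [(out_name, node.text)
            for out_name, tag in ADDITIONS
            for node in root.iter(tag)]


def augment(record, pairs):
    """Step 6, second half: append the new values to one OAI record."""
    for out_name, value in pairs:
        ET.SubElement(record, out_name).text = value
    return record


record = augment(ET.Element("record"), collect_additions(PUBMED_XML))
print(pmid_from_pmc(PMC_XML))
print([(child.tag, child.text) for child in record])
```

Keeping the lookups separate from the fetching makes it easier to see that the slow part is the two network round trips per record, not the parsing.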

This seems super klunky, so I'd love to hear about more elegant ways to do this … like having more options from PubMed Central without 3rd party hacks!

But it is working. And it's just a proof-of-concept …

Below, I've pasted a snippet of the augmented OAI data.

Below that is the Python code if anyone's interested.

ps: Python users will notice I used Google App Engine's "urlfetch" instead of "urllib" to request URLs. This is because using the latter was causing 500 errors due to timeouts. I don't think, from what I've read, that you can set the timeout with "urllib" in App Engine, so I used "urlfetch" which lets one set it up to 60 seconds.

<!--
  This is just a test to use the NCBI EFetch APIs to augment the output of PubMed Central's OAI feed.
  In short, it's a web service that sits on top of the PubMed Central OAI API.

  *** DO NOT use this service to harvest OAI records from PubMed Central ... you will probably mess up your repository!
  ... and I haven't verified that the additional data being added to the OAI feed is accurate per the item.

  Currently, this supports the following OAI parameters:
   - ListRecords
   - set
   - metadataPrefix (must use "oai_dc"/Dublin Core)
   - resumptionToken
  Thanks, Nitin Arora (, May 2012.
  ps: adding metadata increased the OAI response time by 22.6178297997 seconds.
-->
<OAI-PMH xmlns="" xmlns:xsi="" xsi:schemaLocation="">
 <request verb="ListRecords" metadataPrefix="oai_dc" set="aac"></request>
    <oai_dc:dc xmlns:oai_dc="" xmlns:dc="" xmlns:xsi="" xsi:schemaLocation="">
     <dc:title>Antifungal Peptides: Novel Therapeutic Compounds against Emerging Pathogens</dc:title>
     <dc:creator>De Lucca, Anthony J.</dc:creator>
     <dc:creator>Walsh, Thomas J.</dc:creator>
     <dc:publisher>American Society for Microbiology</dc:publisher>
     <dc:contributor.affiliation>Southern Regional Research Center, Agricultural Research Service, U. S. Department of Agriculture, New Orleans, Louisiana 70124, USA.</dc:contributor.affiliation>
     <dc:subject.mesh>Anti-Bacterial Agents</dc:subject.mesh>
     <dc:subject.mesh>Antifungal Agents</dc:subject.mesh>


### 2012, Nitin Arora

### import modules
##import urllib #DELETE
from google.appengine.api import urlfetch #see:
from lxml import etree
import time
import webapp2

### set what additional metadata to get from the EFetch API
additions = [('contributor.affiliation', 'Affiliation'),
             ('subject.mesh', 'DescriptorName')] #(name of element to output to, XPath); eventually needs to be in external config file
            #note: the XPath has to refer to elements in the EFetch XML output for the PubMed database as in ""

class pmctopper(webapp2.RequestHandler):
  def get(self):

    #GET OAI parameter values
    verb_value = self.request.get('verb')
    metadataPrefix_value = self.request.get('metadataPrefix')
    set_value = self.request.get('set')
    resumptionToken_value = self.request.get('resumptionToken')

    #define the *real* OAI feed URL and read it
    if resumptionToken_value: #if a resumptionToken is being used
      url = '' %(verb_value, resumptionToken_value)
    elif set_value: #if a set is being requested
      url = '' %(verb_value, set_value, metadataPrefix_value)
    else: #no set: request every record for the given metadataPrefix
      url = '' %(verb_value, metadataPrefix_value)

##    oai_in = urllib.urlopen(url).read() #DELETE
    oai_in = urlfetch.fetch(url=url, deadline=60).content
    time_in = time.time() #tracking how long this takes

    #parse OAI response as XML
    oai_parsed = etree.XML(oai_in)
    root = oai_parsed.xpath('.') #root node
    dc = root[0].xpath('//oai_dc:dc',
                            namespaces={'oai_dc': '',
                            'dc': ''}) #access dc:* nodes (i.e. each item)

    #loop through all items and for each go fetch additional metadata via the EFetch APIs for PubMed Central and PubMed
    #place that additional data into the original OAI feed
    i = 0
    for record in dc:
      identifier = record.xpath('//dc:identifier',
                            namespaces={'oai_dc': '',
                            'dc': ''})
      pmc_id =(identifier[i].text).replace('','') #get the article's unique ID

      #request PubMed ID from Pubmed Central API ... ugh!
      efetch_url = '' %pmc_id #this is the URL to get metadata about the article per its ID
##      efetch_read = urllib.urlopen(efetch_url).read() #DELETE
      efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content #read the API response
      efetch_parsed = etree.XML(efetch_read) #parse as XML
      pubmed_id = efetch_parsed.xpath('//article-id[@pub-id-type="pmid"]/text()') #pubmed id

      #now(!) get the additional data from the PubMed API
      efetch_url = '' %pubmed_id
##      efetch_read = urllib.urlopen(efetch_url).read() #DELETE
      efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content
      efetch_parsed = etree.XML(efetch_read)

      for addition in additions:
        added_element = efetch_parsed.xpath('//%s/text()' %addition[1]) #get data from API XML tree
        for added_value in added_element:
          etree.SubElement(record, '{}%s' %addition[0]).text = added_value

      i = i + 1

    #for reporting how long this all takes
    time_out = time.time()
    time_diff = str(time_out - time_in)
    #output the *new* OAI results with the additional metadata
    self.response.headers['Content-Type'] = 'text/xml' #output as XML doc
    disclaimer= '''<!--
    This is just a test to use the NCBI EFetch APIs to augment the output of PubMed Central's OAI feed.
    In short, it's a web service that sits on top of the PubMed Central OAI API.

    *** DO NOT use this service to harvest OAI records from PubMed Central ... you will probably mess up your repository!
    ... and I haven't verified that the additional data being added to the OAI feed is accurate per the item.

    Currently, this supports the following OAI parameters:
      - ListRecords
      - set
      - metadataPrefix (must use "oai_dc"/Dublin Core)
      - resumptionToken
    Thanks, Nitin Arora (, May 2012.
    ps: adding metadata increased the OAI response time by %s seconds.
    -->''' %time_diff
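    #NOTE: the body of this loop was lost from the archived post; the two writes below are a reconstruction, not the original code
    self.response.write(disclaimer)
    for node in root:
      self.response.write(etree.tostring(node))

### app engine stuff ...
app = webapp2.WSGIApplication([('/oai', pmctopper)]) #the original call is cut off after the route list; any further arguments are unknown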

Written by nitin

May 27th, 2012 at 10:11 am

PubMed2XL 1.0 available


I've uploaded a new version of PubMed2XL, a Windows application that converts article lists from PubMed into Microsoft Excel files.

Unlike downloading the CSV directly from PubMed, PubMed2XL lets users (OK … advanced users) customize the output. Even the default format includes the abstract, links to each article, and links to related articles and reviews.

Here's an example of a spreadsheet made with PubMed2XL and here's the source file used to make it. The source file was downloaded from PubMed using a search for "Mexican flu".

If you'd like to use the software you can download it for free.

If you notice any bugs or have any questions or remarks, please feel free to leave a comment on the site. Thanks!


Written by nitin

June 18th, 2011 at 2:28 pm

Posted in news,scripts

PubMed CSV option added

It looks like things are afoot at PubMed. They've apparently added a CSV option for downloading citations. This is great and will make it easier for people to get citations into spreadsheets. Having this new option sure beats that FLink thing.

I'm not sure it's totally up and running, though: the RSS feed for their News and Noteworthy shows a link to a post called "CSV File Selection" from March 24th, but there's no post there; it just redirects to the home page.

Interestingly, this all seems to have happened a few days after someone from the NIH – according to Google Analytics – spent a good bit of time on this blog looking for information on how to get PubMed citations into a spreadsheet. Boy, would I like to believe that had something to do with it!

Of course, the data in the CSV file is still limited and non-customizable. You still can't get abstracts, it seems. That makes no sense, of course, since they are available in the XML format. If they've already got the data in a granular fashion, why not offer users better options?

For now, if you want abstracts and customized output you can try PubMed2XL. By the way, I'll be uploading an updated version in a few days. I just moved to a new town and I'm not quite settled enough to update some documentation and do some light programming work.


Written by nitin

March 31st, 2011 at 10:04 am

Posted in news

and yet more PubMed to Excel news


I've updated the documentation for PubMed2XL, a Windows application that converts article lists from PubMed into Microsoft Excel files. The documentation isn't incredibly thorough, but I think it's enough to work for now.

Speaking of getting PubMed search results into a spreadsheet, check this out:

Those who search PubMed regularly have often wished for a way to import search results into a program such as Excel. It’s here! A new tool called FLink (Frequency-weighted Links) is now accessible from the NIH National Center for Biotechnology Information (NCBI): FLink allows PubMed search results to be saved as a CSV, or comma-separated value, file which can be imported into a program like Excel.

source: Dragonfly » Blog Archive » FLink: A New Way to Save PubMed Search Results. Retrieved November 13, 2010, from

For instructions, just click here.

Unfortunately, those instructions don't tell the user to import the CSV file with UTF-8 encoding. Importing with the wrong character encoding could cause characters like the accents and umlauts that appear in author names, for example, to show up as strange, nonsensical characters.
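To make the encoding problem concrete, here is a small sketch that reads the same CSV bytes twice, once as UTF-8 and once with a wrong guess. The one-row file is made up, and Latin-1 simply stands in for whatever non-UTF-8 default a spreadsheet program might apply on import.

```python
import csv
import io

# Made-up one-row export, stored as UTF-8 bytes the way a download would be.
raw = "Author,Title\nMüller J,Ein Beispiel\n".encode("utf-8")


def read_rows(data, encoding):
    """Decode CSV bytes with the given encoding and return the rows."""
    text = io.StringIO(data.decode(encoding), newline="")
    return list(csv.reader(text))


good = read_rows(raw, "utf-8")
bad = read_rows(raw, "latin-1")

print(good[1][0])  # Müller J
print(bad[1][0])   # MÃ¼ller J
```

The bytes never change; only the decoding does. That is why the fix belongs in the import step and not in the file itself.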

Also, the output format is fixed – i.e. I don't think the user has any control of what data gets exported to the CSV file. Some data is concatenated together in one spreadsheet cell and that can be a problem for those who need to parse the data at a more granular level. It's more difficult to split data and re-sort it than it is to concatenate data that is already parsed in a granular fashion.

By contrast, the PubMed2XL output can be customized, although that requires some skill with XML. It also places only one value in each cell, and I've never run into any character encoding issues in the tests I've done.

Sure, I'm comparing the two approaches, just a touch, but in the end the best way will be for users to have an easy interface offered directly from PubMed and its related sites. I just hope they soon offer more options and a more user-friendly method.


Written by nitin

November 13th, 2010 at 1:12 pm

Posted in news,scripts

PubMed to Excel: PubMed2XL version 0.9


I've released the first Beta version of PubMed2XL, a Windows application that converts article lists from PubMed into Microsoft Excel files.

If you'd like to use the software you can download it. Yes, it's free.


Here's a little video tutorial on installing and using the software:

PubMed2XL: Basic Installation and Use from nitin arora on Vimeo.

PubMed2XL's documentation is available at:​projects/pubmed2xl/.

The documentation includes a download link to the program files.


Written by nitin

September 19th, 2010 at 7:03 pm

Posted in scripts,XML

PubMed to spreadsheet made easy

Update, September 2010: This post refers to an Alpha version of PubMed2XL. You can get the latest version of the software here.

Some time ago – exactly a year ago, actually! – I shared a post on how to use XSLT to turn a PubMed XML file into an HTML table and in turn paste that into Microsoft Excel or OpenOffice Calc.

That's fine and all but that's still too "techy" for the average bear who just wants to get a list of articles into a spreadsheet. So, I've been working on some software called PubMed2XL to make the job super simple.

PubMed2XL is a GUI program written in Python and it uses PyQt:

… a set of Python bindings for Nokia's Qt application framework and runs on all platforms supported by Qt including Windows, MacOS/X and Linux.

Since the program's still in early stages there's no real documentation but if you want to just play around with it and you use Windows you can get it here. If it doesn't work, it's probably because you need a file called MSVCR71.dll which I can't legally distribute but I think you can find it if you are resourceful.

Basically all you need to do is this:

  1. Conduct searches in PubMed.
  2. Send your articles to the Clipboard.
  3. Send the results to "File" as XML.
  4. Save the file as "pubmed_results.txt" which is the default name – of course, you can call the file something else if you want as long as it ends in ".txt" or ".xml".
  5. Click on the file called PubMed2XL.exe and then choose FILE>SELECT PUBMED FILE as below:


    PubMed2XL screenshot

  6. Then "open" the file you downloaded from PubMed (pubmed_results.txt).
  7. You should now see an XLS (Microsoft Excel) file in the same folder as pubmed_results.txt.

That should pretty much be it. And by the way the Help currently just points your browser to because, um, there's no help documentation yet.

If you're curious how this all works in the very general sense, I'm using a home-grown XML setup file (see below) that tells PubMed2XL which element or attribute value to extract from the pubmed_results.txt file. Then, the script uses the awesome pyExcelerator module to write the data to an XLS file.

By using this XML file advanced users can change the data as well as the spreadsheet column names that are generated in the resultant XLS file. I'm trying to make this software as open and mutable as possible but casual users won't have to worry about anything since the defaults should eventually work just fine.
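As a rough illustration of the idea, and not PubMed2XL's actual code, here is a sketch that reads a cut-down setup file in the same shape as the one listed in this post and pulls the matching values out of a made-up PubMed record. How the real program interprets the xPath attribute is an assumption here, the function names are invented, and writing the XLS file is left out.

```python
import xml.etree.ElementTree as ET

# Cut-down setup file; the linkPrefix attribute is omitted for brevity.
CONFIG = """<config>
  <column xPath="PubmedArticle/MedlineCitation/PMID" type="element">PMID</column>
  <column xPath="PubmedArticle/MedlineCitation" type="attribute" attributeName="Owner">Owner</column>
  <column xPath="PubmedArticle/MedlineCitation/Article/ArticleTitle" type="element">Article Title</column>
</config>"""

# Made-up PubMed record.
PUBMED = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Owner="NLM">
      <PMID>1234567</PMID>
      <Article><ArticleTitle>An example title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def load_columns(config_xml):
    """Return (header, path, kind, attribute name) for each <column>."""
    return [(col.text, col.get("xPath"), col.get("type"), col.get("attributeName"))
            for col in ET.fromstring(config_xml).iter("column")]


def extract_rows(pubmed_xml, columns):
    """Return a header row followed by one row per PubmedArticle."""
    rows = [[header for header, _, _, _ in columns]]
    for article in ET.fromstring(pubmed_xml).iter("PubmedArticle"):
        row = []
        for _, path, kind, attr in columns:
            # Paths in the setup file start at PubmedArticle, so drop that step.
            node = article.find(path.split("/", 1)[1])
            if node is None:
                row.append("")
            elif kind == "attribute":
                row.append(node.get(attr, ""))
            else:
                row.append(node.text or "")
        rows.append(row)
    return rows


for row in extract_rows(PUBMED, load_columns(CONFIG)):
    print(row)
```

Because the column names and paths live in the setup file, changing the spreadsheet layout means editing XML, not code.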

Right now, the main work I have left to do is to overcome one glaring weakness. PubMed2XL can currently only retrieve data from non-repeating XML elements. In other words, elements like an author's <LastName> can't be extracted because there may be more than one author. What I'll eventually do is incorporate something in the setup file that tells PubMed2XL which occurrence of a repeating element to get data from: i.e. the last name of the primary author, etc.
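One possible way to handle repeating elements is to let the setup file name an occurrence number and pick that match. The sketch below shows only the lookup, with made-up author names; the "occurrence" idea and the function name are hypothetical and not something PubMed2XL currently has.

```python
import xml.etree.ElementTree as ET

# Made-up record with a repeating <Author> element.
ARTICLE = """<PubmedArticle><MedlineCitation><Article><AuthorList>
  <Author><LastName>Alpha</LastName></Author>
  <Author><LastName>Beta</LastName></Author>
</AuthorList></Article></MedlineCitation></PubmedArticle>"""


def nth_value(article, path, occurrence=1, default=""):
    """Return the text of the Nth (1-based) element matching path, or default."""
    matches = article.findall(path)
    if 1 <= occurrence <= len(matches):
        return matches[occurrence - 1].text or default
    return default


article = ET.fromstring(ARTICLE)
path = "MedlineCitation/Article/AuthorList/Author/LastName"
print(nth_value(article, path))     # Alpha (primary author)
print(nth_value(article, path, 2))  # Beta
print(nth_value(article, path, 3))  # empty string: there is no third author
```

Returning a default for a missing occurrence keeps every spreadsheet row the same width, whatever the number of authors.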

If you are bored enough to download the zip file containing the program files, you'll notice the main executable file, PubMed2XL.exe, but also another file called PubMed2XL_CL.exe. Now this is exactly the same application but if you click on it you will see an ugly console window pop up in addition to PubMed2XL. The only reason I've included that file is to demonstrate that PubMed2XL can support command line arguments. In other words if you were to go to the command line and type in $ PubMed2XL_CL -h you would see a message pop up on the command line showing you the options for passing arguments to the software via the command line.

Basically what this means is that you can tell PubMed2XL which PubMed file to process and what to call the resultant spreadsheet while bypassing the program's graphical interface. Now if you're working on just one file, the GUI version is definitely the way to go, but by incorporating command line functionality the program becomes instantly usable for batch-processing multiple files and also becomes a viable tool to incorporate on a server. In other words, it could be used on the back end of a website. For example,  users could just upload their PubMed file to a website while having the XLS file emailed to them or something like that.
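For a feel of how batch processing might look, here is a sketch built on Python's argparse. The option names and the default output name are assumptions made for illustration; they are not the actual options that PubMed2XL_CL prints with -h.

```python
import argparse
import os


def build_parser():
    """Build a parser with hypothetical options (not PubMed2XL's real ones)."""
    parser = argparse.ArgumentParser(prog="PubMed2XL_CL")
    parser.add_argument("pubmed_file", help="PubMed XML file to convert")
    parser.add_argument("-o", "--output", help="name of the spreadsheet to write")
    return parser


def plan_job(argv):
    """Return (input file, output file) for one command line."""
    args = build_parser().parse_args(argv)
    # Assumed default: same folder and base name as the input, ending in ".xls".
    output = args.output or os.path.splitext(args.pubmed_file)[0] + ".xls"
    return args.pubmed_file, output


# Batch use: one job per file, no GUI involved.
for name in ["flu_2009.txt", "flu_2010.xml"]:
    print(plan_job([name]))
print(plan_job(["flu_2009.txt", "-o", "mexican_flu.xls"]))
```

A server-side script could run the same loop over uploaded files and hand each resulting spreadsheet to whatever emails it back.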

Anyway, there's still lots to do and when I've taken care of the issues I mentioned I'll release the source code if anyone's interested – or if Linux or Mac users want to get this up and running on their systems.

Ideally, I'd like this to become a nifty tool reference librarians could use to help their patrons with. Now if something like this is already out there, please let me know. No need to re-invent the wheel.


<?xml version="1.0" encoding="UTF-8" ?>
<config xmlns:xsi="" xsi:noNamespaceSchemaLocation="PubMed2XL-0.8.9.xsd">
		<column xPath="PubmedArticle/MedlineCitation/PMID" type="element" linkPrefix="">PMID</column>
		<column xPath="PubmedArticle/MedlineCitation" type="attribute" attributeName="Owner" linkPrefix="none">Owner</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year" type="element" linkPrefix="none">Publication Year</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Month" type="element" linkPrefix="none">Publication Month</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/Title" type="element" linkPrefix="">Journal</column>
		<column xPath="PubmedArticle/MedlineCitation/MedlineJournalInfo/NlmUniqueID" type="element" linkPrefix="none">NLM ID</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/ArticleTitle" type="element" linkPrefix="none">Article Title</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Abstract/AbstractText" type="element" linkPrefix="none">Abstract</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Language" type="element" linkPrefix="none">Language</column>

Written by nitin

August 15th, 2010 at 8:31 pm

Posted in scripts

XSLT: a practical usage example with Pubmed records

Update, December 10, 2010: If you are interested in getting PubMed citations into a spreadsheet application (Excel, etc.) please see PubMed2XL. PubMed2XL is free software that can convert PubMed citations into a Microsoft Excel file.

As part of my coursework for the University of Alabama SLIS program, I took a database class last year. Long story short, one of the assignments was to create a Microsoft Access dbase based on Medline records.

The records were already provided for us, as well as a Java-based script to parse the information into a tab-delimited format prior to import into Access.

For extra credit, we were given another script that would parse records from an Ovid database. If we could find access to an Ovid dbase (I couldn't as they were all password protected, understandably), we could run the script, parse the records and bring them into Access for additional credit.

But there was a way to use a free source, PubMed, and still get the job done.

How? Well, PubMed allows article information to be exported as XML.

Once in XML, there was no need for a script to parse the information. From there it was simple to bring the information into Access. I found it easier to import it into Excel, clean it up, and then import that Excel data source into Access.

But what if you have OpenOffice?

I'm not aware of a simple way to import XML documents into OpenOffice Calc (their spreadsheet app) or Base (their dbase app).

But by using XSLT, there's a way around this issue.

Here are the steps:

  1. Conduct searches in PubMed.
  2. Send your articles to the Clipboard.
  3. Set display to "XML".
  4. Send the results to "File" (see image below).
  5. Save the file as "pubmed_results.txt".
  6. Change the file's extension from "txt" to "xml".
  7. Open the document in a text editor.
  8. Above the DTD (i.e. <!DOCTYPE PubmedArticleSet PUBLIC … ">), add the following line:

<?xml-stylesheet type="text/xsl" href="pubmed_xslt.xsl"?>

  9. Re-save the file.
  10. Then, download this file (the "pubmed_xslt.xsl" stylesheet named in the line you just added) to the same directory as your "pubmed_results.xml" file.
  11. Now click on "pubmed_results.xml"; your browser should now display select data in an HTML tabular format.
  12. From here, simply copy/paste the tabular data into OpenOffice Calc, clean it up as desired, save it as a ".ods" file, hook it up to OpenOffice Base, and design your queries, etc.
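The hand-editing steps (open the file, add the stylesheet line above the DTD, re-save) could also be scripted. Here is a minimal sketch, assuming the downloaded file contains a DOCTYPE declaration as described in the steps; the function name and the tiny sample document are made up.

```python
def add_stylesheet(xml_text, href="pubmed_xslt.xsl"):
    """Insert an xml-stylesheet instruction on its own line above the DOCTYPE."""
    instruction = '<?xml-stylesheet type="text/xsl" href="%s"?>\n' % href
    index = xml_text.find("<!DOCTYPE")
    if index == -1:
        raise ValueError("no DOCTYPE declaration found")
    return xml_text[:index] + instruction + xml_text[index:]


# Tiny stand-in for a downloaded "pubmed_results.xml".
sample = ('<?xml version="1.0"?>\n'
          '<!DOCTYPE PubmedArticleSet>\n'
          '<PubmedArticleSet></PubmedArticleSet>\n')

print(add_stylesheet(sample))
```

Reading "pubmed_results.txt", passing its text through this function, and writing the result out as "pubmed_results.xml" would cover the rename, edit, and re-save steps in one go.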

And now you've got a totally Free (minus the cost of a laptop, internet connexion, etc.) desktop dbase of Medline results.

* Note that the XML stylesheet I provided only displays certain info. You can always open the stylesheet in a text editor and set it to display more information, such as Abstract, etc.



Written by nitin

August 15th, 2009 at 1:48 pm

Posted in XML
