blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

PubMed2XL

40 comments

PubMed2XL 1.0 (Beta)


Table of Contents

Introduction

Download

Using PubMed2XL

Changing Settings

FAQ


Introduction

PubMed2XL can convert PubMed.gov citations to a Microsoft Excel (.xls) file.

PubMed2XL is licensed under the MIT software license.

PubMe2XL includes software developed by Roman V. Kiseliov.


Download

PubMed2XL has been tested on 32-bit versions of Windows XP (SP3), Windows 7, and Xubuntu 9.10.

I hope to have a MAC version one day. If someone can help compile a MAC OSX version, please leave me a note. Thanks!

Click here to download the latest Windows self-installer.

Click here to download the latest ZIP file.

Older versions can be accessed here.


Using PubMed2XL

Currently, the only usage documentation is the video tutorial available below. Please view the video prior to reading the rest of this documentation.

PubMed2XL: Basic Installation and Use from nitin arora on Vimeo.

Advanced users may wish to use command line arguments by using the PubMed2XL-1.0-CL.exe file (Windows) or the Python source file.

To see the arguments that can be passed from the command line do:

$ PubMed2XL-1.0-CL.exe --help

or:

$ python PubMed2XL-1.0.py --help


Changing Settings

Pubmed2XL uses "stylesheets" to determine the output format of the spreadsheet that it creates. In other words, things like column order, column title, and the PubMed data placed in a given column's cell is controlled by a PubMed2XL stylesheet.

Users can specify a different stylesheet by clicking on Tools>Select Stylesheet (see below) prior to using the File>Convert PubMed File command.

stylesheet

Those using the command line options can use the --style option to specify a particular stylesheet.

Users familiar with XML and the specific XML outputted by PubMed can change the format of the outputted Excel file that PubMed2XL creates by creating their own stylesheet. Using an external stylesheet in a well-known markup language allows researchers and librarians, etc. to customize PubMed2XL's output and to easily share stylesheets with each other and their friends and patrons.

More information and instructions for making new stylesheets can be seen in the Pubmed2XL stylesheet XSD located at ./styles/schema/. By studying the schema document and the default stylesheet, advanced users can better customize PubMed2XL to their needs.

As of version 1.0, PubMed2XL uses XSLT 1.0 as the only method of retrieving data from PubMed.gov XML files. This is in contrast to 9.x versions that used a home-grown XML processing language. As such, PubMed2XL 1.0 is not backwards compatible with version 9.x stylesheets. The move to XSLT was to accommodate requests from users for more capabilities.

If you need assistance customizing a stylesheet, please leave a comment below so that we can all work together. Thanks!

Here is an example stylesheet that will make a spreadsheet with on column called "PMID" in which the PubMed article ID value will be placed. The data in the cell is hyperlinked to the article's page on PubMed.gov using the optional <hyperlink> element.

<?xml version="1.0" encoding="UTF-8" ?>
<spreadsheet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="schema/PubMed2XL-1.0.xsd">
  <column>
    <title>PMID</title>
    <cell>{{?xml version="1.0" encoding="UTF-8"?}}
      {{xsl:stylesheet version="1.0" encoding="UTF-8" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"}}
      {{xsl:output method = "text" /}}
      {{xsl:template match="/"}}
        {{xsl:value-of select="//PMID" /}}
      {{/xsl:template}}
      {{/xsl:stylesheet}}
    </cell>
    <hyperlink>{{?xml version="1.0" encoding="UTF-8"?}}
      {{xsl:stylesheet version="1.0" encoding="UTF-8" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"}}
      {{xsl:output method = "text" /}}
      {{xsl:template match="/"}}
        {{xsl:text}}http://www.ncbi.nlm.nih.gov/pubmed/{{/xsl:text}}
        {{xsl:value-of select="//PMID" /}}
      {{/xsl:template}}
      {{/xsl:stylesheet}}
    </hyperlink>
  </column>
</spreadsheet>

FAQ

  1. How many citations can PubMed2XL process?

    • PubMed2XL is not recommended for processing an XML file of PubMed citations for more than a few thousand citations (less than 5k).

      • I've tested PubMed2XL for ~5,000 records. The XML file downloaded from PubMed.gov was roughly 50 megabytes. It took PubMed2XL less than 1 minute to create an 8 megabyte Excel (.xls) file.

      • I also tested nearly 25k records using a development version of PubMed2XL from the command line. That took approximately 1.5 hours to process and it took OpenOffice almost 10 minutes just to open the spreadsheet.

      • Both tests were on my Lenovo T510 (431328U) and used the default Pubmed2XL stylesheet which is verbose as it retrieves a lot of data from the PubMed citations including abstracts.

  2. I don't have Excel. Can I still use PubMed2XL?

    • PubMed2XL creates an Excel (.xls) file that can also be opened in OpenOffice and other competing applications.

  3. I don't want to create an Excel (.xls) file. Can PubMed2XL create ODS (.ods) files?

    • Currently, this can only be done via the command line options using the --out option.

    • If you use the GUI ("graphical user interface") version of PubMed2XL, you can try just changing the extension of the outputted file from .xls to .ods.

  4. Does the software process book-based citations if found within the XML?

    • The current version of the software does not; future versions likely will have this as an option.

--------------

Related Content:

Written by nitin

August 8th, 2010 at 9:12 am

Posted in

Tagged with

40 Responses to 'PubMed2XL'

Subscribe to comments with RSS

  1. Nice work – this is a very handy tool!

    clsnyder

    6 Dec 10 at 8:21 pm

  2. Thanks for leaving a comment! – and for checking out the software.

    nitin

    16 Jan 11 at 4:30 pm

  3. Hi Nitin,
    I tried to download and install your software in VMware Fusion, which is a Mac Virtual MS OS.  The program appears to install ok but will not read the PubMED txt (XML) file.  Do you need Excel installed for your program to work?
    Cheers,
    Jim

    Jim

    16 Feb 11 at 8:34 pm

  4. Hey Jim,
    Excel isn’t required – I have OpenOffice. Can you send me a copy of the XML file you are trying to use?
    You can email it to:
    nitaro74 AT gmail DOT com

    Thanks and sorry for the trouble.

    nitin

    16 Feb 11 at 10:21 pm

  5. Dear Nitin,
    This is AWESOME!!!

    Cardiovascular researchers

    15 May 11 at 2:47 pm

  6. This is awesome and so much faster than the workarounds I was thinking of to extract out the abstracts. I got my CSV file…which is great and then was trying to think of a way to get the abstracts in there for reference. This saved me a lot of time and was very easy to use. The default output is great and it is pretty easy to customize if you would like an alternative output.  Thanks.

    Alison

    28 Jul 11 at 3:39 pm

  7. Thanks Alison. I’m glad it’s useful.

    nitin

    28 Jul 11 at 5:32 pm

  8. Nitin,
    This was just what I needed to round out a set of Excel macros. I could not work out how extract XML info in VBA, so this does the trick wonderfully.
    2Qs though
    1)I have Excel 2010 in WIn7 x64. When I open your Excel output it says “File error: data may have been lost”. Everything seems OK but I don’t know what is happening. Any ideas?
    2) Is there a way to alter the file output name to a generic like “pubmed result”? I would like my macro to identify and open the data directly, but with the variable file name it makes life tricky.

    Thanks again,
    Aidan

    Aidan

    3 Aug 11 at 11:52 am

  9. Hey Aidan,

    I’ve seen that same error on Office 2010, but I can’t see any evidence of a real problem. I don’t know much about the library I’m using to make an Excel 2007 file, but that might have something to do with it.

    As far as the outputted filename, the only way (other than to rename the file after the fact) is to use the command line version:

    $ PubMed2XL-1.0-CL --in="pubmed_result.xml" --out="pubmed_result.xls"

    The example above assumes you downloaded the PubMed XML file to the same folder as the PubMed2XL application and that you want to save the spreadsheet to the same directory.

    nitin

    3 Aug 11 at 4:28 pm

  10. Thanks Nitin.

    Aidan

    3 Aug 11 at 4:42 pm

  11. Hi Nitin,

    I know nothing about XML coding, but have spent several hours trying to figure out how to add to your default stylesheet the additional information for "Affiliation".  Help please!

    Cheers,
    Robert

    Robert

    11 Sep 11 at 8:30 pm

  12. Hey Robert,

    There appears to only be one Affiliation listing per article.

    You can try saving this file: http://blog.humaneguitarist.org/uploads/PubMed2XL/moreStyles/version_1.x/robert_091111.xml and placing it inside PubMed2XL’s “styles” folder.

    Make sure it’s saved as an XML file such as “robert_091111.xml”.

    Then:

    - launch PubMed2XL
    - go to TOOLS>SELECT STYLESHEET
    - select the XML file you just downloaded
    - then choose FILE>CONVERT PUBMED FILE and process the PubMed XML file from pubmed.gov.

    This is for PubMed2XL version 1.0 or higher only. Let me know if this helps.

    Thanks,
    Nitin

    nitin

    11 Sep 11 at 9:37 pm

  13. Thank you for your work on this program, it is perfect for what I need to do.

    Dave

    19 Sep 11 at 1:29 pm

  14. great tool!….. makes my work much easier :)
    thanks

    Prachi

    12 Oct 11 at 4:57 am

  15. Thnx, so much for this great application.
    Been using an old script from your blog for my last project – this solution is perfect! Prehaps a GUI for selecting what columns to put in the output file is in v2.0?
    In the most pubmed listings a "publication type" tag has been added but it often contains several lines – any ideas how to handle this so you can sort in "reviews" "case reports" "RCT's" etc.
     
    A danish user
    Theis

    Theis Itenov

    14 Oct 11 at 5:00 pm

  16. Hey Theis,

    Thanks for the feedback.
    I like the GUI builder idea! I’ve added it to my to-do list.

    As far as Publication Type, I *think* I know what you mean.
    You can try saving this file: http://blog.humaneguitarist.org/uploads/PubMed2XL/moreStyles/version_1.x/theis_101411.xml and placing it inside PubMed2XL’s “styles” folder.

    Make sure it’s saved as an XML file such as “theis_101411.xml”.

    Then:
    - launch PubMed2XL
    - go to TOOLS>SELECT STYLESHEET
    - select the XML file you just downloaded
    - then choose FILE>CONVERT PUBMED FILE and process the PubMed XML file from pubmed.gov.

    There should now be in Column O a “Publication Type(s)” field containing semicolon separate values for each publication type a la:
    “Journal Article; Research Support, Non-U.S. Gov’t”.

    Then you can filter the data with the spreadsheet software’s “Filter” feature by things such as:
    Contains “Journal Article”
    AND Does not Contain “Research Support”, etc.

    I don’t know if that helps. If not, let me know and maybe we can figure something out.

    nitin

    14 Oct 11 at 9:56 pm

  17. GOD BLESS YOU, sir! This saved me bazillions of hours!!

    Aaron

    17 Oct 11 at 1:56 pm

  18. May you help me create a stylesheet that exports only the PMID, Title, First Author's Name, Pub Date, State/Country of 1st author, Journal Name, Abstract? Until then, I will just download the default and delete unneeded columns. MUCH THANKS!

    Aaron

    17 Oct 11 at 2:04 pm

  19. Hey Aaron,

    Thanks for the feedback.

    You’re lucky I’m home sick today, so I took a stab at the stylesheet.
    If it isn’t what you want, I would need more detail from you so you’d need to email me off-list.

    Re: first author’s location, I used the Affiliation field which may or may not have the information.

    You can go here:
    http://blog.humaneguitarist.org/uploads/PubMed2XL/moreStyles/version_1.x/
    And download the file with your name on it. Make sure to save it with an XML extension (“yourFilename.xml”) inside PubMed2XL’s “styles” folder.

    Then:
    - launch PubMed2XL
    - go to TOOLS>SELECT STYLESHEET
    - select the XML file you just downloaded
    - then choose FILE>CONVERT PUBMED FILE and process the PubMed XML file from pubmed.gov.

    thanks,

    nitin

    17 Oct 11 at 4:14 pm

  20. Would you be able to put together style sheet that included all the basics but with PUBLICATION TYPE and AUTHOR Location.  That would be fabulous.
    Many thanks
    Steve

    Steve

    24 Oct 11 at 1:28 pm

  21. Hey Steve,
    If you just need to remove Publication Type and Author from the default output, it’s pretty easy to delete these in Excel, etc. You could even record a macro in your spreadsheet program to do that with one click.

    nitin

    24 Oct 11 at 2:05 pm

  22. Nitin, 
    Sorry if I was not clear.   I was after an output style that included BOTH, PUBLICATION TYPE and AUTHOR LOCATION.
    so a combination of the two styles you have added above.
    Many thanks
    Steve
    p.s. excellent piece of work by the way

    Steve

    26 Oct 11 at 1:14 pm

  23. Thanks, the other program worked great! Sorry for delayed response.
    I was hoping that there would be a way to pull out the country of the first author's affiliation, rather than the name of institution, etc. But that may be more difficult since the country appears in different places in that string of text.
    Much thanks!

    Aaron

    27 Oct 11 at 4:22 pm

  24. Hey Steve,

    My bad.
    Here’s the style sheet (check columns “O” and “S”):
    http://blog.humaneguitarist.org/uploads/PubMed2XL/moreStyles/version_1.x/steve_102711.xml

    Let me know if it works.
    Thanks,

    nitin

    27 Oct 11 at 8:37 pm

  25. Nitin
    Thanks, you have solved a problem I have been strugling with for some time, i have used work arounds through web of sceince which can export to excel however i do prefer Pubmed.
    Is there a style which could let me export all the fields to excel? I can later decide if i use them in my database.
    Thanks

    Oren

    29 Dec 11 at 9:34 am

  26. Hi Oren,

    I don’t have a style which would export everything. One of the problems is that I don’t know all the elements, though one could perhaps generated a generic “catch-all” style by parsing the document type definition for PubMed – something like this: http://www.ncbi.nlm.nih.gov/entrez/query/static/PubMed.dtd. But this would still be sub-optimal given that there are still questions related to how to display multi-valued fields like all the authors, MESH terms, etc. within one cell.

    If you are putting it into a database, you might be able to directly import the XML file into Excel (if you have Excel). I think one could use the import “as a read-only workbook option” – or something. I think you’ll have duplicates that way (i.e. row per author per PMID), but maybe that could be cleaned up with a pivot table or database query. Actually, that might be more desirable than only having one row per article since the latter is less granular.

    nitin

    1 Jan 12 at 8:43 am

  27. Hi Nitin,
    You helped me with a new stylesheet that pulls the first author's affiliation. I used this a few months ago successfully, but now it doesn't seem to be working. I have selected to use that custom style sheet, and restarted the program and my computer, but there is still no affiliation column. Any thoughts? Much thanks!

    Aaron

    26 Feb 12 at 5:50 pm

  28. Hey Aaron,

    I can check it out. I’ll need you to send me the following at my email (nitaro74 AT gmail DOT com):
    - version of PubMed2XL you are using
    - a copy of the XML file you got from PubMed (i.e. email as attachment)
    - a copy of the stylesheet (as email attachement)
    - a copy, if you have it, of the outputted Excel file (as attachment).

    thanks,

    nitin

    26 Feb 12 at 5:53 pm

  29. Ok, I sent you an email. Thanks!

    Aaron

    3 Mar 12 at 7:47 pm

  30. A stylesheet that outputs everything would be ideal. For my situation, and I am guessing many others, the best way to handle multi-value fields (like authors, MeSH, etc) would be to put all the values in one cell with a delimiter like a double verticle pipe (||).

    David

    8 Aug 12 at 1:39 pm

  31. Hey David,

    Thanks for writing. This really isn’t possible with the software as it wasn’t designed to that. Getting everything really would be a job for a totally new script. But there would be semantic issues in regard to naming the column fields per the data that got extracted – which is likely to vary from article to article within the XML. At some point this kind of thing also raises questions about trying to totally flatten out nested data in the first place and would likely point to the need to ingest the XML in a native XML database.

    If you have Excel you could try importing the XML directly though I imagine that would require a lot of cleanup.

    nitin

    8 Aug 12 at 10:34 pm

  32. Thank you very much for this software.
    This will save me many hours of work !!!
    It is perfect for what I need to do (bibliometric analysis) using pubmed data
    Christophe

    boudry

    1 Feb 13 at 9:55 am

  33. Thanks Christophe.

    Keep an eye out for a new version (I’m aiming for early March).

    The new version will still have a GUI interface, but will also be a Python module and a command line tool – the latter two being helpful if you really want to do some serious automation and integrate PubMed2XL into other scripts.

    thanks,
    Nitin

    nitin

    1 Feb 13 at 2:58 pm

  34. Many thanks for your tool, it look very useful. However, even though  I have followed step by step the instructions, I cannot generate the Excel file. To be honest, I don't know what is going on. I have Excel 2010 and Windows 7.

    I'll really appreciate your kind help.

    Oscar

    12 Feb 13 at 4:07 pm

  35. Oscar and I figured out that the most likely reason this wasn’t working was an accent mark and/or whitespace in his Windows user name. When he moved PubMed2XL to C:\PubMed2XL it worked (i.e. bypassing the path with his user name).

    Something for me to keep in mind for version 2.

    nitin

    18 Feb 13 at 10:50 am

  36. What a fabulous tool, thank you so much! I can't believe I have been doing so many lengthy procedures for so long. 
    I have a question if you are able to help please. I would like to have a column in which the URL for the PMID is displayed as a hyperlink that actually shows the text instead of just the PMID number (if that makes sense). I know  I could do this manually in Excel, I am just unsure of how to modify the xml style to do this automatically. 
    Any help would be greatly appreciated.
    Thank you

    Andy

    6 Mar 13 at 12:26 pm

  37. Hey Andy,

    If I understand correctly, you want a column with a link to the article that shows the actual full link in the column?

    If so, try downloading this style:
    http://blog.humaneguitarist.org/uploads/PubMed2XL/moreStyles/version_1.x/andy_030613.xml
    and storing it in the “/styles” folder.

    When you run PubMed2XL, go to “Tools>Select Stylesheet” and select “andy_030613.xml” before selecting “File>Convert PubMed File”.

    nitin

    6 Mar 13 at 2:03 pm

  38. Hi Nitin,
    That is perfect. Thank you so much :-) I am so impressed with this tool and it saves so much time. What a great job!
    Thanks,
    Andy

    Andy

    7 Mar 13 at 4:00 am

  39. Hi, Nitin, it seems fantastic program i've ever seen !  however it doen's work for me T.T
    i am using excel 2007, it doesn't work ..
    if I click "convert pubmed file" and select "PubMedResults.xml", PubMed2XL stops.
    it does not show me any pop-up, and not create any xls files.
    what should i do ?

    Doreen

    28 Apr 13 at 4:07 am

  40. Hi Doreen,

    I’ll need you to send me the following at my email (nitaro74 AT gmail DOT com):
    - version of PubMed2XL you are using,
    - a copy of the XML file you got from PubMed (i.e. email as attachment).

    You can get the version number by using the Help>About menu option.

    thanks,
    Nitin

    nitin

    28 Apr 13 at 11:01 am

Leave a Reply

*

Switch to our mobile site