blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘PubMed2XL’ tag

the serpent, the apple, and Joe

For better or worse, the one application of mine that people actually use is the one I wrote pretty casually with Python over a couple weekends from bed because I was too lazy or hungover to get moving on those days.

That software, PubMed2XL, lets people do a few things with downloaded citations from PubMed.gov that aren't currently offered directly from the site. I've gotten some nice feedback from librarians, researchers, and information-y people at companies that have found it useful.

This post isn't a plug though; it's more an acknowledgement of something that I didn't fully realize at the time: when one writes software that people go on to actually use, one had better be prepared to support it. Now, the software's simple enough that there haven't been real bugs save one, but it does eat at me that I can't offer a simple way for it to work on multiple platforms.

While the Windows version is really easy to set up – thanks to py2exe and Inno Setup – getting it running on Linux is a bit more work, given all the distro variations and dependency installation. But getting it running on a Mac – particularly with an easy-to-use installer – isn't going to be possible unless I can find someone to compile it for a Mac who will also test it and compile future versions. Sure, there's the possibility of using Wine, but that's still asking a lot from end users.

Normally, I wouldn't care. Apple doesn't make it easy for people to develop for Macs unless you fork over the change for a Mac – and I ain't buying a copy of OSX and doing the Hackintosh bit. But, since the software is ultimately about health-related research, I do care.

Unfortunately I made – with the advantage of hindsight – two coding decisions that create problems.

First, I chose PyQt as the GUI toolkit for the software simply because it looks prettier than Python's native Tkinter. My reasoning at the time was that people were more likely to trust better looking software even though it's just a small window with some basic menu options. Eventually, I added a progress bar, too, so downgrading to Tkinter has become less of an option.

Second (and this is the big one), I used lxml since the PubMed2XL setup files employ XSL to tell the software what data to put in a spreadsheet cell. Granted, lxml is freakin' fantastic, but since it's not a pure Python module I can't just distribute it in a folder and import the module locally. Not that I had much of a choice: there's no built-in XSLT-capable module that ships with Python 'far as I know.

So I've been asking myself how to make the serpent (Python) and the apple (OSX) get along.

I've considered just making PubMed2XL a web-app, but that would entail expenses for me that simply offering people a desktop app doesn't.

So, I think the solution lies in a cup of Joe. That's to say that a Java app is the obvious solution, specifically using Jython.

That would leave me to replace PyQt with Swing. I'm fine with that. It's not like PyQt is all that Pythonic in the first place. There's a nice Jython/Swing tutorial here.

And as for the XSLT component, this tutorial on XSLT with Jython and native Java libraries should help immensely.

So, I should be able to use Jython to make a cross-platform version of PubMed2XL. I don't necessarily want to, but given the type of research I'd like to help facilitate (in a very small way, I know), I think I probably should.

--------------

Written by nitin

January 7th, 2012 at 10:22 am

PubMed2XL 1.0 available

I've uploaded a new version of PubMed2XL, a Windows application that converts article lists from PubMed.gov into Microsoft Excel files.

Unlike downloading the CSV directly from PubMed.gov, PubMed2XL gives users (OK … advanced users) the ability to customize the output. Even the default format includes the abstract, a link to each article, and links to related articles and reviews.

Here's an example of a spreadsheet made with PubMed2XL and here's the source file used to make it. The source file was downloaded from PubMed.gov using a search for "Mexican flu".

If you'd like to use the software you can download it for free.

If you notice any bugs or have any questions or remarks, please feel free to leave a comment on the site. Thanks!

--------------

Written by nitin

June 18th, 2011 at 2:28 pm

Posted in news,scripts

PubMed2XL 0.9.1 available

I've uploaded a new version of PubMed2XL, a Windows application that converts article lists from pubmed.gov into Microsoft Excel files.

If you'd like to use the software you can download it for free.

For those who are interested, here's the changelog:

0.9.1
- worked with Björn Carlsson on a few things:
    - added length checker for <getElement> so that abstracts greater than 32k characters would get truncated to the first 30k characters.
        - see: http://blog.humaneguitarist.org/2011/03/16/dealing-with-a-pubmed2xl-bug/
    - added <getAttributeByElementPosition> element.
        - Updated schema.
- removed code that displayed the "aboutMessage" variable on the command line if command line options are used.
    - This is because the diacritic in Mr. Carlsson's name caused encoding errors with the default Windows command prompt.
- added <hyperlinkSuffix> element so that alternate views of PubMed data could be passed via the URL.
    - updated schema.
    - For example, see this: http://www.ncbi.nlm.nih.gov/pubmed/21069543 then this: http://www.ncbi.nlm.nih.gov/pubmed/21069543?report=medline
        - The hyperlink suffix of ?report=medline changes the display!
        - For more information, see:
            - PubMed Help — PubMed Help — NCBI Bookshelf. Retrieved November 13, 2010, from http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helppubmed&part=pubmedhelp&rendertype=table&id=pubmedhelp.T40
            - pm_workbook.pdf. Retrieved November 13, 2010, from http://www.nlm.nih.gov/pubs/manuals/pm_workbook.pdf (see page 135).
- updated py2exe "setup.py" to automatically name the command line/console version correctly (i.e. with the "-CL" suffix).
- removed "src" folder and placed Python files in same folder as .exe's.
_______________________________________________________________________
0.9.0
- this was the first version - that worked!
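As a rough illustration of the <hyperlinkSuffix> change in 0.9.1, here's a minimal Python sketch of how a cell's link could be assembled from a prefix, the cell value, and an optional suffix. The function name build_link and its parameters are my own for this example and aren't taken from the PubMed2XL source; the "none" convention mirrors the linkPrefix="none" value used in the setup file.

```python
def build_link(link_prefix, value, hyperlink_suffix=""):
    """Join prefix + value + suffix into a URL, or return None when no link is wanted."""
    # Assumption: a prefix of "none" means the cell should not be hyperlinked.
    if link_prefix == "none":
        return None
    return link_prefix + value + hyperlink_suffix


# Default view of an article, then the MEDLINE view via the suffix.
plain = build_link("http://www.ncbi.nlm.nih.gov/pubmed/", "21069543")
medline = build_link("http://www.ncbi.nlm.nih.gov/pubmed/", "21069543", "?report=medline")
print(plain)
print(medline)
```

The two printed URLs are the same pair given in the changelog: the second differs only by the ?report=medline suffix.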
--------------

Written by nitin

April 3rd, 2011 at 5:40 pm

Posted in news,scripts

dealing with a PubMed2XL bug

Björn from Sweden has been using PubMed2XL and has suggested some additional features that we are working on. More on that some other time …

But he also found a bug, or rather an oversight on my part. That needs to be dealt with first.

I didn't realize that some data in the PubMed.gov XML elements are insanely long. We encountered an abstract in one article nearly 50,000 characters long. That wasn't breaking PubMed2XL but the resultant spreadsheet had all kinds of problems – values in the wrong column, wrong cell, etc. I guess this is because – as I now know – Excel/OpenOffice don't let cells carry more than about 32k characters. I don't know if this is true of newer versions of MS Excel, but whatever. 32k is enough!

So in a test version of the application, I added a length checking and stoppage feature. This restricts the length of the data placed into a cell to 30,000 characters if the data to be placed is greater than 32,000 characters.

Eventually, I'll make it so that if the data is greater than 32k characters, the cell will contain colored text so the user can know that "Hey, this data is incomplete because it's so darn long!".

Anyway, as a note to myself, here's a code snippet that seems to be a quick patch. I'll upload the fixed version in a week or so. I'm moving and all, so my schedule's a bit wonky.

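# Spreadsheet cells hold only about 32k characters, so trim anything longer.
cell = getElement.text
if len(cell) > 32000:
    # Keep the first 30,000 characters to stay safely under the limit.
    cell = cell[:30000]
writeExcel.write(rowIter, columnIter, cell)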
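For what it's worth, here's the same idea as a small self-contained Python function. This is only a sketch: the name fit_to_cell, the constants, and the returned "was truncated" flag are made up for illustration (the flag is there because it's the kind of thing that could later drive the colored-text warning). The thresholds are the ones from this post; Excel's documented per-cell limit is 32,767 characters, so both sit under it.

```python
CELL_LIMIT = 32000        # threshold from the post: anything longer gets trimmed
TRUNCATED_LENGTH = 30000  # number of characters kept when trimming


def fit_to_cell(text, limit=CELL_LIMIT, keep=TRUNCATED_LENGTH):
    """Return (value, was_truncated) for text headed into a spreadsheet cell."""
    if text is None:
        # Assumption: an empty XML element is written as an empty cell.
        return "", False
    if len(text) > limit:
        return text[:keep], True
    return text, False


value, was_truncated = fit_to_cell("x" * 50000)
print(len(value), was_truncated)
```

Anything at or under the 32,000-character threshold passes through untouched, which matches the behavior of the quick patch.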
--------------

Written by nitin

March 16th, 2011 at 6:26 pm

Posted in scripts

and yet more PubMed to Excel news

I've updated the documentation for PubMed2XL, a Windows application that converts article lists from pubmed.gov into Microsoft Excel files. The documentation isn't incredibly thorough, but I think it's enough to work for now.

Speaking of getting PubMed search results into a spreadsheet check this out:

Those who search PubMed regularly have often wished for a way to import search results into a a program such as Excel. It’s here! A new tool called FLink (Frequency-weighted Links) is now accessible from the NIH National Center for Biotechnology Information (NCBI): http://www.ncbi.nlm.nih.gov/Structure/flink/docs/flink_about.html. FLink allows PubMed search results to be saved as a CSV, or comma-separated value, file which can be imported into a program like Excel.

source: Dragonfly » Blog Archive » FLink: A New Way to Save PubMed Search Results. Retrieved November 13, 2010, from http://nnlm.gov/pnr/dragonfly/2010/11/10/flink-a-new-way-to-save-pubmed-search-results/

For instructions, just click here.

Unfortunately, those instructions don't tell the user to import the CSV file with UTF-8 encoding. Not using the correct character encoding upon import could cause characters like accents and umlauts that might appear in author names, for example, to appear as strange, nonsensical characters.
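To make the encoding point concrete, here's a minimal Python sketch. It isn't based on FLink's actual output: the two-column CSV and the author name are made up. It decodes the same UTF-8 bytes twice, once correctly and once as Windows-1252 (a common default on Windows), to show how an umlaut gets mangled.

```python
import csv
import io

# Made-up sample data standing in for a downloaded CSV file saved as UTF-8.
raw_bytes = "Author,Title\nMüller,An example article\n".encode("utf-8")


def read_rows(data, encoding):
    """Decode CSV bytes with the given encoding and return the rows as lists."""
    text = data.decode(encoding)
    return list(csv.reader(io.StringIO(text)))


good = read_rows(raw_bytes, "utf-8")   # the encoding the file was written in
bad = read_rows(raw_bytes, "cp1252")   # a wrong guess at import time

print(good[1][0])  # Müller
print(bad[1][0])   # MÃ¼ller
```

The two-byte UTF-8 sequence for "ü" is read as two separate Windows-1252 characters, which is exactly the sort of "strange, nonsensical" text that shows up in author names.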

Also, the output format is fixed – i.e. I don't think the user has any control of what data gets exported to the CSV file. Some data is concatenated together in one spreadsheet cell and that can be a problem for those who need to parse the data at a more granular level. It's more difficult to split data and re-sort it than it is to concatenate data that is already parsed in a granular fashion.

In contrast, the PubMed2XL output can be customized – although it requires some skill with XML. It also places only one value in each cell, and I've never experienced any character encoding issues in the tests I've done.

Sure, I'm trying to compare the two approaches – just a touch, but in the end the best way will be for the users to have an easy interface offered directly from PubMed.gov and its related sites. I'm just saying that I hope they soon offer more options and a more user-friendly method for the sake of the user.

--------------

Written by nitin

November 13th, 2010 at 1:12 pm

Posted in news,scripts

PubMed to Excel: PubMed2XL version 0.9

I've released the first Beta version of PubMed2XL, a Windows application that converts article lists from pubmed.gov into Microsoft Excel files.

If you'd like to use the software you can download it. Yes, it's free.

:P

Here's a little video tutorial on installing and using the software:

PubMed2XL: Basic Installation and Use from nitin arora on Vimeo.

PubMed2XL's documentation is available at: blog.humaneguitarist.org/projects/pubmed2xl/.

The documentation includes a download link to the program files.

--------------

Written by nitin

September 19th, 2010 at 7:03 pm

Posted in scripts,XML

PubMed to spreadsheet made easy

Update, September 2010: This post refers to an Alpha version of PubMed2XL. You can get the latest version of the software here.

Some time ago – exactly a year ago, actually! – I shared a post on how to use XSLT to turn a PubMed XML file into an HTML table and in turn paste that into Microsoft Excel or OpenOffice Calc.

That's fine and all but that's still too "techy" for the average bear who just wants to get a list of articles into a spreadsheet. So, I've been working on some software called PubMed2XL to make the job super simple.

PubMed2XL's a GUI program written in Python and it uses PyQt:

… a set of Python bindings for Nokia's Qt application framework and runs on all platforms supported by Qt including Windows, MacOS/X and Linux.

Since the program's still in early stages there's no real documentation but if you want to just play around with it and you use Windows you can get it here. If it doesn't work, it's probably because you need a file called MSVCR71.dll which I can't legally distribute but I think you can find it if you are resourceful.

Basically all you need to do is this:

  1. Conduct searches in PubMed.
  2. Send your articles to the Clipboard.
  3. Send the results to "File" as XML.
  4. Save the file as "pubmed_results.txt" which is the default name – of course, you can call the file something else if you want as long as it ends in ".txt" or ".xml".
  5. Click on the file called PubMed2XL.exe and then choose FILE>SELECT PUBMED FILE as below:

     [Screenshot: PubMed2XL]

  6. Then "open" the file you downloaded from PubMed (pubmed_results.txt).
  7. You should now see an XLS (Microsoft Excel) file in the same folder as pubmed_results.txt.

That should pretty much be it. And by the way the Help currently just points your browser to blog.humaneguitarist.org because, um, there's no help documentation yet.

If you're curious how this all works in the very general sense, I'm using a home-grown XML setup file (see below) that tells PubMed2XL which element or attribute value to extract from the pubmed_results.txt file. Then, the script uses the awesome pyExcelerator module to write the data to an XLS file.

By using this XML file advanced users can change the data as well as the spreadsheet column names that are generated in the resultant XLS file. I'm trying to make this software as open and mutable as possible but casual users won't have to worry about anything since the defaults should eventually work just fine.

Right now, the main work I have left to do is to overcome one glaring weakness. PubMed2XL can currently only retrieve data from non-repeating XML elements. In other words, elements like an author's <LastName> can't be extracted because there may be more than one author. What I'll eventually do is incorporate something in the setup file that tells PubMed2XL which occurrence of a repeating element to get data from: i.e. the last name of the primary author, etc.
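Here's a rough Python sketch of the "which occurrence" idea, using only the standard library's ElementTree. It isn't how PubMed2XL does it: the helper name nth_text and the 1-based position argument are assumptions for illustration, and the author list is a made-up fragment shaped like PubMed XML.

```python
import xml.etree.ElementTree as ET

# Made-up fragment shaped like part of a PubMed record.
ARTICLE_XML = """
<Article>
  <AuthorList>
    <Author><LastName>Smith</LastName></Author>
    <Author><LastName>Jones</LastName></Author>
  </AuthorList>
</Article>
"""


def nth_text(parent, path, position):
    """Return the text of the position-th match for path (1-based), or "" if absent."""
    matches = parent.findall(path)
    if 1 <= position <= len(matches):
        return matches[position - 1].text or ""
    return ""


article = ET.fromstring(ARTICLE_XML)
primary_author = nth_text(article, "AuthorList/Author/LastName", 1)
third_author = nth_text(article, "AuthorList/Author/LastName", 3)
print(primary_author)
```

Returning an empty string when there's no such occurrence keeps every row the same width, so a missing third author just leaves a blank cell.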

If you are bored enough to download the zip file containing the program files, you'll notice the main executable file, PubMed2XL.exe, but also another file called PubMed2XL_CL.exe. Now this is exactly the same application but if you click on it you will see an ugly console window pop up in addition to PubMed2XL. The only reason I've included that file is to demonstrate that PubMed2XL can support command line arguments. In other words if you were to go to the command line and type in $ PubMed2XL_CL -h you would see a message pop up on the command line showing you the options for passing arguments to the software via the command line.

Basically what this means is that you can tell PubMed2XL which PubMed file to process and what to call the resultant spreadsheet while bypassing the program's graphical interface. Now if you're working on just one file, the GUI version is definitely the way to go, but by incorporating command line functionality the program becomes instantly usable for batch-processing multiple files and also becomes a viable tool to incorporate on a server. In other words, it could be used on the back end of a website. For example, users could just upload their PubMed file to a website while having the XLS file emailed to them or something like that.
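As a sketch of what the batch side might look like, here's a bit of Python that only plans the work: it picks out the files PubMed2XL accepts (".txt" or ".xml") and pairs each with a spreadsheet name. The function name plan_batch and the same-name-with-".xls" output convention are assumptions for this example. Each pair would then be handed to PubMed2XL_CL using whatever options its -h message lists, which I'm deliberately not spelling out here.

```python
from pathlib import PurePath

# File extensions PubMed2XL accepts for downloaded PubMed files.
ACCEPTED_SUFFIXES = {".txt", ".xml"}


def plan_batch(filenames):
    """Return (input_file, output_spreadsheet) pairs for the files worth processing."""
    jobs = []
    for name in filenames:
        path = PurePath(name)
        if path.suffix.lower() in ACCEPTED_SUFFIXES:
            # Assumption: the spreadsheet takes the input's name with an .xls extension.
            jobs.append((str(path), str(path.with_suffix(".xls"))))
    return jobs


jobs = plan_batch(["flu_2009.txt", "notes.doc", "reviews.xml"])
for source, target in jobs:
    print(source, "->", target)
```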

Anyway, there's still lots to do and when I've taken care of the issues I mentioned I'll release the source code if anyone's interested – or if Linux or Mac users want to get this up and running on their systems.

Ideally, I'd like this to become a nifty tool reference librarians could use to help their patrons. Now if something like this is already out there, please let me know. No need to re-invent the wheel.

:P

<?xml version="1.0" encoding="UTF-8" ?>
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="PubMed2XL-0.8.9.xsd">
	<spreadsheetHeader>
		<column xPath="PubmedArticle/MedlineCitation/PMID" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed/">PMID</column>
		<column xPath="PubmedArticle/MedlineCitation" type="attribute" attributeName="Owner" linkPrefix="none">Owner</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year" type="element" linkPrefix="none">Publication Year</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Month" type="element" linkPrefix="none">Publication Month</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/Title" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed?term=">Journal</column>
		<column xPath="PubmedArticle/MedlineCitation/MedlineJournalInfo/NlmUniqueID" type="element" linkPrefix="none">NLM ID</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/ArticleTitle" type="element" linkPrefix="none">Article Title</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Abstract/AbstractText" type="element" linkPrefix="none">Abstract</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Language" type="element" linkPrefix="none">Language</column>
	</spreadsheetHeader>
</config>
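To give a feel for how a setup file like this can drive extraction, here's a minimal Python sketch using the standard library's ElementTree. It is not the PubMed2XL source: the function name extract_rows is made up, the setup file is trimmed to three columns, the PubMed record is invented, and the sketch assumes every xPath starts with "PubmedArticle/" as the ones above do. It also stops at building rows in memory rather than writing an XLS file.

```python
import xml.etree.ElementTree as ET

# Trimmed-down setup file in the same shape as the one above.
CONFIG_XML = """<config>
  <spreadsheetHeader>
    <column xPath="PubmedArticle/MedlineCitation/PMID" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed/">PMID</column>
    <column xPath="PubmedArticle/MedlineCitation" type="attribute" attributeName="Owner" linkPrefix="none">Owner</column>
    <column xPath="PubmedArticle/MedlineCitation/Article/ArticleTitle" type="element" linkPrefix="none">Article Title</column>
  </spreadsheetHeader>
</config>"""

# Invented record shaped like PubMed XML, for illustration only.
PUBMED_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Owner="NLM">
      <PMID>12345</PMID>
      <Article><ArticleTitle>An example title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_rows(config_xml, pubmed_xml):
    """Return a header row plus one row per article, as directed by the setup file."""
    columns = ET.fromstring(config_xml).findall("spreadsheetHeader/column")
    rows = [[col.text for col in columns]]
    for article in ET.fromstring(pubmed_xml).findall("PubmedArticle"):
        row = []
        for col in columns:
            # Drop the leading "PubmedArticle/" so the path is relative to one article.
            path = col.get("xPath").split("/", 1)[1]
            node = article.find(path)
            if node is None:
                value = ""
            elif col.get("type") == "attribute":
                value = node.get(col.get("attributeName"), "")
            else:
                value = node.text or ""
            row.append(value)
        rows.append(row)
    return rows


rows = extract_rows(CONFIG_XML, PUBMED_XML)
for row in rows:
    print(row)
```

Because the column names and paths live in the setup file, changing what lands in the spreadsheet means editing XML rather than code, which is the whole point of the design.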
--------------

Written by nitin

August 15th, 2010 at 8:31 pm

Posted in scripts
