blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘Excel’ tag

PubMed to Excel: PubMed2XL version 0.9

3 comments

I've released the first Beta version of PubMed2XL, a Windows application that converts article lists from pubmed.gov into Microsoft Excel files.

If you'd like to use the software you can download it. Yes, it's free.

:P

Here's a little video tutorial on installing and using the software:

PubMed2XL: Basic Installation and Use from nitin arora on Vimeo.

PubMed2XL's documentation is available at: blog.humaneguitarist.org/​projects/pubmed2xl/.

The documentation includes a download link to the program files.

--------------

Related Content:

Written by nitin

September 19th, 2010 at 7:03 pm

Posted in scripts,XML

Tagged with , ,

PubMed to spreadsheet made easy

leave a comment

Update, September 2010: This post refers to an Alpha version of PubMed2XL. You can get the latest version of the software here.

Some time ago - exactly a year ago, actually! – I shared a post on how to use XSLT to turn a PubMed XML file into an HTML table and in turn paste that into Microsoft Excel or OpenOffice Calc.

That's fine and all but that's still too "techy" for the average bear who just wants to get a list of articles into a spreadsheet. So, I've been working on some software called PubMed2XL to make the job super simple.

PubMed2XL's a GUI program written in Python and it uses PyQT:

… a set of Python bindings for Nokia's Qt application framework and runs on all platforms supported by Qt including Windows, MacOS/X and Linux.

Since the program's still in early stages there's no real documentation but if you want to just play around with it and you use Windows you can get it here. If it doesn't work, it's probably because you need a file called MSVCR71.dll which I can't legally distribute but I think you can find it if you are resourceful.

Basically all you need to do is this:

  1. Conduct searches in PubMed.
  2. Send your articles to the Clipboard.
  3. Send the results to "File" as XML.
  4. Save the file as "pubmed_results.txt" which is the default name – of course, you can call the file something else if you want as long as it ends in ".txt" or ".xml".
  5. Click on the file called PubMed2XL.exe and then choose FILE>SELECT PUBMED FILE as below:

     

    PubMed2XL screenshot

  6. Then "open" the file you downloaded from PubMed (pubmed_results.txt).
  7. You should now see an XLS (Microsoft Excel) file in the same folder as pubmed_results.txt.

That should pretty much be it. And by the way the Help currently just points your browser to blog.humaneguitarist.org because, um, there's no help documentation yet.

If you're curious how this all works in the very general sense, I'm using a home-grown XML setup file (see below) that tells PubMed2XL which element or attribute value to extract from the pubmed_results.txt file. Then, the script uses the awesome pyExcelerator module to write the data to an XLS file.

By using this XML file advanced users can change the data as well as the spreadsheet column names that are generated in the resultant XLS file. I'm trying to make this software as open and mutable as possible but casual users won't have to worry about anything since the defaults should eventually work just fine.

Right now, the main work I have left to do is to overcome one glaring weakness. PubMed2XL can currently only retrieve data from non-repeating XML elements. In other words, elements like an author's <LastName> can't be extracted because there may be more than one author. What I'll eventually do is incorporate something in the setup file that tells PubMed2XL which occurrence of a repeating element to get data from: i.e. the last name of the primary author, etc.

If you are bored enough to download the zip file containing the program files, you'll notice the main executable file, PubMed2XL.exe, but also another file called PubMed2XL_CL.exe. Now this is exactly the same application but if you click on it you will see an ugly console window pop up in addition to PubMed2XL. The only reason I've included that file is to demonstrate that PubMed2XL can support command line arguments. In other words if you were to go to the command line and type in $ PubMed2XL_CL -h you would see a message pop up on the command line showing you the options for passing arguments to the software via the command line.

Basically what this means is that you can tell PubMed2XL which PubMed file to process and what to call the resultant spreadsheet while bypassing the program's graphical interface. Now if you're working on just one file, the GUI version is definitely the way to go, but by incorporating command line functionality the program becomes instantly usable for batch-processing multiple files and also becomes a viable tool to incorporate on a server. In other words, it could be used on the back end of a website. For example,  users could just upload their PubMed file to a website while having the XLS file emailed to them or something like that.

Anyway, there's still lots to do and when I've taken care of the issues I mentioned I'll release the source code if anyone's interested – or if Linux or MAC users want to get this up and running on their systems.

Ideally, I'd like this to become a nifty tool reference librarians could use to help their patrons with. Now if something like this is already out there, please let me know. No need to re-invent the wheel.

:P

<?xml version="1.0" encoding="UTF-8" ?>
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="PubMed2XL-0.8.9.xsd">
	<spreadsheetHeader>
		<column xPath="PubmedArticle/MedlineCitation/PMID" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed/">PMID</column>
		<column xPath="PubmedArticle/MedlineCitation" type="attribute" attributeName="Owner" linkPrefix="none">Owner</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year" type="element" linkPrefix="none">Publication Year</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Month" type="element" linkPrefix="none">Publication Month</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Journal/Title" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed?term=">Journal</column>
		<column xPath="PubmedArticle/MedlineCitation/MedlineJournalInfo/NlmUniqueID" type="element" linkPrefix="none">NLM ID</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/ArticleTitle" type="element" linkPrefix="none">Article Title</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Abstract/AbstractText" type="element" linkPrefix="none">Abstract</column>
		<column xPath="PubmedArticle/MedlineCitation/Article/Language" type="element" linkPrefix="none">Language</column>
	</spreadsheetHeader>
</config>
--------------

Related Content:

Written by nitin

August 15th, 2010 at 8:31 pm

Posted in scripts

Tagged with , , ,

Excel & VBA: I cheat to win

leave a comment

A few months ago at work I spearheaded the standardization of our digitization log spreadsheets – simple Excel files that let students track what they digitized, its unique identifier, number of scans per item, date of scan, notes, etc.

The reason for this was twofold:

  1. A standard template would hopefully make the students a little more productive in that they would not have to re-acclimate themselves to a 'new look' whenever they starting working on a different digitization project.
  2. A standard template would allow us to quickly merge all the spreadsheets into a temporary master Excel file and import that into Microsoft Access so we could run queries and make quick monthly reports. We used a Visual Basic for Applications (VBA) script called RDBMerge to make the master file. It's a fantastically well written piece of work.

Of course, it took several days to standardize the numerous existing spreadsheets but we got there. Now, all spreadsheets had the same naming convention (using the collection's unique identifier) and were placed in the same folder – which essentially made it a quasi-database that could be queried with the search functions inherent to the OS. We considered using an actual database, but honestly that causes its own set of problems with training, support, and other stuff – hence the decision to stick with Excel and employ RDBMerge/Access.

Now, over time with standardized spreadsheets one naturally begins to think, "Ah! We should have also added a field for … !"

Well, yes that's a pain. In a true database that's no big deal, but with individual files who wants to go through myriad spreadsheets and manually add a column or a formula, etc?

Not me.

I did it once and it took a few hours even using some minor automation in the form a macro I recorded in Excel.

Once was enough.

But the great – I mean truly great – thing though about recording macros in Microsoft Office is that the resultant VBA code – code that the act of recording a macro generates – is visible. And, thus, editable. Not to mention it's a great way to learn VBA syntax. In other words, if you record yourself doing a task in Excel, you can then actually see/edit the line item code that equates to your actions. Seems like a great tool for teaching some programming skills …

Anyway, I started to ask the lazy question, "How can I iterate a recorded macro/s over all spreadsheets in a given folder?"

After much digging, I found a great template script here. The script basically is a wrapper that can encapsulate the VBA script for one's recorded macros. A little know-how and tweaking is required, but that's better than the alternative: learning from-scratch Visual Basic on the job is something I really just don't have the time for right now. At any rate, I was able to use the template script and automate stuff across all of our digitization log files in no time.

By the way, if you use the script, note the snippet below shows a line break that isn't present in the blogger's post, perhaps due to their blogging software's formatting.

'will start LOOP until all files in folder sPath have been looped through

Set oWbk = Workbooks.Open(sPath & "" & sFil)

p.s. in my experience, batch VBA scripts break on Excel files for which sharing is turned on (simultaneous multi-user editing). Those files need to be identified* and temporarily isolated into another folder or the sharing needs to temporarily be turned off. Also, sheets need to be 'unprotected' for the code to work, but it's quite simple to add a line in your macro that unprotects the sheet prior to performing whatever macro you've created.

* It's a good idea to use some sort of suffix to identify shared files … something like "filename.share.xlsx". In fact, if one does so, a VBA script could be run to turn off sharing for all shared files (*.share.xls*). Unfortunately, trying to turn sharing off for a non-shared Excel file will break the script … which, again, is why it's best to identify shared documents through a filenaming convention.

--------------

Related Content:

Written by nitin

October 24th, 2009 at 2:39 pm

Posted in scripts

Tagged with , , , ,

Switch to our mobile site