blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘OpenOffice’ tag

PubMed to spreadsheet made easy

without comments

I really shouldn't be sharing software for which I still have a ton of work to do. But if it works OK, I figure there's no real harm especially since I had my library friends in mind while doing this.

Having said that: anything I offer up is to be downloaded and run at your own risk.

Some time ago - exactly a year ago, actually! – I shared a post on how to use XSLT to turn a PubMed XML file into an HTML table and in turn paste that into Microsoft Excel or OpenOffice Calc.

That's fine and all but that's still too "techy" for the average bear who just wants to get a list of articles into a spreadsheet. So, I've been working on some software called PubMed2XL to make the job super simple.

PubMed2XL's a GUI program written in Python and it uses PyQT:

… a set of Python bindings for Nokia's Qt application framework and runs on all platforms supported by Qt including Windows, MacOS/X and Linux.

Since the program's still in early stages there's no real documentation but if you want to just play around with it and you use Windows you can get it here. If it doesn't work, it's probably because you need a file called MSVCR71.dll which I can't legally distribute but I think you can find it if you are resourceful.

;)

Basically all you need to do is this:

  1. Conduct searches in PubMed.
  2. Send your articles to the Clipboard.
  3. Send the results to "File" as XML.
  4. Save the file as "pubmed_results.txt" which is the default name – of course, you can call the file something else if you want as long as it ends in ".txt" or ".xml".
  5. Click on the file called PubMed2XL.exe and then choose FILE>SELECT PUBMED FILE as below:

     

    PubMed2XL screenshot

  6. Then "open" the file you downloaded from PubMed (pubmed_results.txt).
  7. You should now see an XLS (Microsoft Excel) file in the same folder as pubmed_results.txt.

That should pretty much be it. And by the way the Help currently just points your browser to blog.humaneguitarist.org because, um, there's no help documentation yet.

If you're curious how this all works in the very general sense, I'm using a home-grown XML setup file (see below) that tells PubMed2XL which element or attribute value to extract from the pubmed_results.txt file. Then, the script uses the awesome pyExcelerator module to write the data to an XLS file.

By using this XML file advanced users can change the data as well as the spreadsheet column names that are generated in the resultant XLS file. I'm trying to make this software as open and mutable as possible but casual users won't have to worry about anything since the defaults should eventually work just fine.

Right now, the main work I have left to do is to overcome one glaring weakness. PubMed2XL can currently only retrieve data from non-repeating XML elements. In other words, elements like an author's <LastName> can't be extracted because there may be more than one author. What I'll eventually do is incorporate something in the setup file that tells PubMed2XL which occurrence of a repeating element to get data from: i.e. the last name of the primary author, etc.

If you are bored enough to download the zip file containing the program files, you'll notice the main executable file, PubMed2XL.exe, but also another file called PubMed2XL_CL.exe. Now this is exactly the same application but if you click on it you will see an ugly console window pop up in addition to PubMed2XL. The only reason I've included that file is to demonstrate that PubMed2XL can support command line arguments. In other words if you were to go to the command line and type in

 $ PubMed2XL_CL -h 

you would see a message pop up on the command line showing you the options for passing arguments to the software via the command line.

Basically what this means is that you can tell PubMed2XL which PubMed file to process and what to call the resultant spreadsheet while bypassing the program's graphical interface. Now if you're working on just one file, the GUI version is definitely the way to go, but by incorporating command line functionality the program becomes instantly usable for batch-processing multiple files and also becomes a viable tool to incorporate on a server. In other words, it could be used on the back end of a website. For example,  users could just upload their PubMed file to a website while having the XLS file emailed to them or something like that.

Anyway, there's still lots to do and when I've taken care of the issues I mentioned I'll release the source code if anyone's interested – or if Linux or MAC users want to get this up and running on their systems.

Ideally, I'd like this to become a nifty tool reference librarians could use to help their patrons with. Now if something like this is already out there, please let me know. No need to re-invent the wheel.

:P

<?xml version="1.0" encoding="UTF-8" ?>
	<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="PubMed2XL-0.8.9.xsd">
	    <spreadsheetHeader>
	        <column xPath="PubmedArticle/MedlineCitation/PMID" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed/">PMID</column>
	        <column xPath="PubmedArticle/MedlineCitation" type="attribute" attributeName="Owner" linkPrefix="none">Owner</column>
	        <column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year" type="element" linkPrefix="none">Publication Year</column>
	        <column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Month" type="element" linkPrefix="none">Publication Month</column>
	        <column xPath="PubmedArticle/MedlineCitation/Article/Journal/Title" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed?term=">Journal</column>
	        <column xPath="PubmedArticle/MedlineCitation/MedlineJournalInfo/NlmUniqueID" type="element" linkPrefix="none">NLM ID</column>
	        <column xPath="PubmedArticle/MedlineCitation/Article/ArticleTitle" type="element" linkPrefix="none">Article Title</column>
	        <column xPath="PubmedArticle/MedlineCitation/Article/Abstract/AbstractText" type="element" linkPrefix="none">Abstract</column>
	        <column xPath="PubmedArticle/MedlineCitation/Article/Language" type="element" linkPrefix="none">Language</column>
	    </spreadsheetHeader>
	</config>

Written by nitin

August 15th, 2010 at 8:31 pm

Posted in scripts

Tagged with , , ,

OpenOffice to HTML trick

with 2 comments

I’ve recently been working on some documentation for a MusicXML platform I wrote called MXMLiszt – I’ll be realizing the files/source in a few weeks.

In the past I’ve used the W3C’s Amaya editor to write the HTML documentation for AudioRegent but that’s a really laborious process and requires a good bit of coding by hand even as I worked with the WYSIWYG environment.

So this time I just decided to use OpenOffice. Problem is when I did a Save As to HTML, the W3C’s Validator was giving me nearly 300 errors. Even worse is that the export was exporting the .html file and all the images in the document to the same directory. Boo. Ideally, the images should be in a subfolder for compartmentalization purposes.

Here’s what worked better: instead of saving to HTML, I used the File>Preview in Web Browser option in OpenOffice. Since Firefox is my default browser, it opened in Firefox. Then I just Firefox’s Save Page As option using the Web Page, complete type. Firefox saved the file (let’s call it "foo") as foo.htm and created a folder called "foo" that contained all the images. Sweet!

This time: only 13 errors from the WC3s’ p.o.v., all very minor errors – who knows where the other ~275 went?!

One way to cut down on errors is to make sure any image you embed in your OpenOffice .odt document has an alternative (alt) text attribute since that attribute is technically required for all images in HTML docs.

I should mention that it’s better to do this from a Linux box rather than Windows as the former uses the UTF-8 encoding and the latter Windows-1252. That’s no huge deal for non-critical documents, but it’s probably better to go with UTF-8 if you can.

Now I’m not going to waste my time hand-correcting a perfectly "valid" HTML doc for things like this. That’s mistaking the cart for the horse. I’m just an average dude trying to share some info with some folks. I’m not an institution charged with a preservation mindset. These WordPress blog entries aren’t valid either BTW …

I realize the importance of standards and sustainability but the Open Document format (foo.odt) is what I would argue is the thing to save, the HTML version being just a convenient manifestation. Secondly, if all major browsers have no trouble with the document, then from a certain p.o.v. the HTML is valid. In a sense, it’s much ado about nothing.

ps: If you’re wondering about exporting to xHTML, don’t. :-)

… that’s a much bigger pain.

Written by nitin

May 9th, 2010 at 10:52 am

Posted in technophilia

Tagged with , ,

Excel & VBA: I cheat to win

without comments

A few months ago at work I spearheaded the standardization of our digitization log spreadsheets – simple Excel files that let students track what they digitized, its unique identifier, number of scans per item, date of scan, notes, etc.

The reason for this was twofold:

  1. A standard template would hopefully make the students a little more productive in that they would not have to re-acclimate themselves to a ‘new look’ whenever they starting working on a different digitization project.
  2. A standard template would allow us to quickly merge all the spreadsheets into a temporary master Excel file and import that into Microsoft Access so we could run queries and make quick monthly reports. We used a Visual Basic for Applications (VBA) script called RDBMerge to make the master file. It’s a fantastically well written piece of work.

Of course, it took several days to standardize the numerous existing spreadsheets but we got there. Now, all spreadsheets had the same naming convention (using the collection’s unique identifier) and were placed in the same folder – which essentially made it a quasi-database that could be queried with the search functions inherent to the OS. We considered using an actual database, but honestly that causes its own set of problems with training, support, and other stuff – hence the decision to stick with Excel and employ RDBMerge/Access.

Now, over time with standardized spreadsheets one naturally begins to think, "Ah! We should have also added a field for … !"

Well, yes that’s a pain. In a true database that’s no big deal, but with individual files who wants to go through myriad spreadsheets and manually add a column or a formula, etc?

Not me.

I did it once and it took a few hours even using some minor automation in the form a macro I recorded in Excel.

Once was enough.

But the great – I mean truly great – thing though about recording macros in Microsoft Office is that the resultant VBA code – code that the act of recording a macro generates – is visible. And, thus, editable. Not to mention it’s a great way to learn VBA syntax. In other words, if you record yourself doing a task in Excel, you can then actually see/edit the line item code that equates to your actions. Seems like a great tool for teaching some programming skills …

Anyway, I started to ask the lazy question, "How can I iterate a recorded macro/s over all spreadsheets in a given folder?"

After much digging, I found a great template script here. The script basically is a wrapper that can encapsulate the VBA script for one’s recorded macros. A little know-how and tweaking is required, but that’s better than the alternative: learning from-scratch Visual Basic on the job is something I really just don’t have the time for right now. At any rate, I was able to use the template script and automate stuff across all of our digitization log files in no time.

By the way, if you use the script, note the snippet below shows a line break that isn’t present in the blogger’s post, perhaps due to their blogging software’s formatting.

‘will start LOOP until all files in folder sPath have been looped through

Set oWbk = Workbooks.Open(sPath & "" & sFil)

p.s. in my experience, batch VBA scripts break on Excel files for which sharing is turned on (simultaneous multi-user editing). Those files need to be identified* and temporarily isolated into another folder or the sharing needs to temporarily be turned off. Also, sheets need to be ‘unprotected’ for the code to work, but it’s quite simple to add a line in your macro that unprotects the sheet prior to performing whatever macro you’ve created.

* It’s a good idea to use some sort of suffix to identify shared files … something like "filename.share.xlsx". In fact, if one does so, a VBA script could be run to turn off sharing for all shared files (*.share.xls*). Unfortunately, trying to turn sharing off for a non-shared Excel file will break the script … which, again, is why it’s best to identify shared documents through a filenaming convention. ;)

Written by nitin

October 24th, 2009 at 2:39 pm

Posted in scripts

Tagged with , , , ,

XSLT: a practical usage example with Pubmed records

without comments

As part of my coursework for the University of Alabama SLIS program, I took a database class last year. Long story short, one of assignments was to create a Microsoft Access dbase based on Medline records.

The records were already provided for us as well as Java-based script to parse the information into a tab-delimited format prior to import into Access.

For extra credit, we were given another script that would parse records from an Ovid database. If we could find access to an Ovid dbase (I couldn't as they were all password protected, understandably), we could run the script, parse the records and bring them into Access for additional credit.

But there was a way to use a free source, Pubmed, and still get the job done.

How? Well, Pubmed allows article information to be exported as XML.

Once in XML, there was no need for a script to parse the information. From there it was simple to bring the information into Access. I found it easier to import it into Excel, clean it up, and then import that Excel data source into Access.

But what if you have OpenOffice?

I'm not aware of a simple way to import XML documents into OpenOffice Calc (their spreadsheet app) or Base (their dbase app).

But by using XSLT, there's a way around this issue.

Here are the steps:

  1. Conduct searches in Pubmed.
  2. Send your articles to the Clipboard.
  3. Set display to "XML".
  4. Send the results to "File" (see image below).
  5. Save the file as "pubmed_results.txt".
  6. Change the file's extension from "txt" to "xml".
  7. Open the document in a text editor.
  8. Above the DTD (i.e. <!DOCTYPE PubmedArticleSet PUBLIC … ">), add the following line:

<?xml-stylesheet type="text/xsl" href="pubmed_xslt.xsl"?>

  1. Re-save the file.
  2. Then, download this file to the same directory as your "pubmed_results.xml" file.
  3. Now click on "pubmed_results.xml" ; your browser should now display select data in an HTML tabular format.
  4. From here, simply copy/paste the tabular data into OpenOffice Calc, clean it up as desired, save it as a ".ods" file, hook it up to OpenOffice Base, and design your queries, etc.

And now you've got a totally Free (minus the cost of a laptop, internet connexion, etc.) desktop dbase of Medline results.

* Note that the XML stylesheet I provided only displays certain info. You can always open the stylesheet in a text editor and set it to display more information, such as Abstract, etc.


 

Written by nitin

August 15th, 2009 at 1:48 pm

Posted in XML

Tagged with , , , ,