Archive for the ‘scripts’ Category
PubMed to spreadsheet made easy
I really shouldn't be sharing software for which I still have a ton of work to do. But if it works OK, I figure there's no real harm especially since I had my library friends in mind while doing this.
Having said that: anything I offer up is to be downloaded and run at your own risk.
Some time ago - exactly a year ago, actually! – I shared a post on how to use XSLT to turn a PubMed XML file into an HTML table and in turn paste that into Microsoft Excel or OpenOffice Calc.
That's fine and all but that's still too "techy" for the average bear who just wants to get a list of articles into a spreadsheet. So, I've been working on some software called PubMed2XL to make the job super simple.
PubMed2XL's a GUI program written in Python and it uses PyQT:
… a set of Python bindings for Nokia's Qt application framework and runs on all platforms supported by Qt including Windows, MacOS/X and Linux.
Since the program's still in early stages there's no real documentation but if you want to just play around with it and you use Windows you can get it here. If it doesn't work, it's probably because you need a file called MSVCR71.dll which I can't legally distribute but I think you can find it if you are resourceful.

Basically all you need to do is this:
- Conduct searches in PubMed.
- Send your articles to the Clipboard.
- Send the results to "File" as XML.
- Save the file as "pubmed_results.txt" which is the default name – of course, you can call the file something else if you want as long as it ends in ".txt" or ".xml".
- Click on the file called PubMed2XL.exe and then choose FILE>SELECT PUBMED FILE as below:

- Then "open" the file you downloaded from PubMed (pubmed_results.txt).
- You should now see an XLS (Microsoft Excel) file in the same folder as pubmed_results.txt.
That should pretty much be it. And by the way the Help currently just points your browser to blog.humaneguitarist.org because, um, there's no help documentation yet.
If you're curious how this all works in the very general sense, I'm using a home-grown XML setup file (see below) that tells PubMed2XL which element or attribute value to extract from the pubmed_results.txt file. Then, the script uses the awesome pyExcelerator module to write the data to an XLS file.
By using this XML file advanced users can change the data as well as the spreadsheet column names that are generated in the resultant XLS file. I'm trying to make this software as open and mutable as possible but casual users won't have to worry about anything since the defaults should eventually work just fine.
Right now, the main work I have left to do is to overcome one glaring weakness. PubMed2XL can currently only retrieve data from non-repeating XML elements. In other words, elements like an author's <LastName> can't be extracted because there may be more than one author. What I'll eventually do is incorporate something in the setup file that tells PubMed2XL which occurrence of a repeating element to get data from: i.e. the last name of the primary author, etc.
If you are bored enough to download the zip file containing the program files, you'll notice the main executable file, PubMed2XL.exe, but also another file called PubMed2XL_CL.exe. Now this is exactly the same application but if you click on it you will see an ugly console window pop up in addition to PubMed2XL. The only reason I've included that file is to demonstrate that PubMed2XL can support command line arguments. In other words if you were to go to the command line and type in
$ PubMed2XL_CL -h
you would see a message pop up on the command line showing you the options for passing arguments to the software via the command line.
Basically what this means is that you can tell PubMed2XL which PubMed file to process and what to call the resultant spreadsheet while bypassing the program's graphical interface. Now if you're working on just one file, the GUI version is definitely the way to go, but by incorporating command line functionality the program becomes instantly usable for batch-processing multiple files and also becomes a viable tool to incorporate on a server. In other words, it could be used on the back end of a website. For example, users could just upload their PubMed file to a website while having the XLS file emailed to them or something like that.
Anyway, there's still lots to do and when I've taken care of the issues I mentioned I'll release the source code if anyone's interested – or if Linux or MAC users want to get this up and running on their systems.
Ideally, I'd like this to become a nifty tool reference librarians could use to help their patrons with. Now if something like this is already out there, please let me know. No need to re-invent the wheel.

<?xml version="1.0" encoding="UTF-8" ?> <config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="PubMed2XL-0.8.9.xsd"> <spreadsheetHeader> <column xPath="PubmedArticle/MedlineCitation/PMID" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed/">PMID</column> <column xPath="PubmedArticle/MedlineCitation" type="attribute" attributeName="Owner" linkPrefix="none">Owner</column> <column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year" type="element" linkPrefix="none">Publication Year</column> <column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Month" type="element" linkPrefix="none">Publication Month</column> <column xPath="PubmedArticle/MedlineCitation/Article/Journal/Title" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed?term=">Journal</column> <column xPath="PubmedArticle/MedlineCitation/MedlineJournalInfo/NlmUniqueID" type="element" linkPrefix="none">NLM ID</column> <column xPath="PubmedArticle/MedlineCitation/Article/ArticleTitle" type="element" linkPrefix="none">Article Title</column> <column xPath="PubmedArticle/MedlineCitation/Article/Abstract/AbstractText" type="element" linkPrefix="none">Abstract</column> <column xPath="PubmedArticle/MedlineCitation/Article/Language" type="element" linkPrefix="none">Language</column> </spreadsheetHeader> </config>
MXMLiszt release 0.9.0
MXMLiszt version 0.9.0 is now available for download.
MXMLiszt is a web-based delivery and search/retrieval environment for MusicXML files and their manifestations.
MXMLiszt was created in order to complete a Master’s in Library and Information Science at the University of Alabama under the direction of Dr. Steven L. MacCall.
The documentation and source-code download links are available here.
The accompanying research paper, “Beyond Images: Encoding Music for Access and Retrieval” can be accessed here.
As of June, 2010 the live demo of MXMLiszt can be accessed at:
http://opensourcelibrarian.org/MXMLiszt
MXMLiszt is licensed under the BSD software license.
AudioRegent Installation (Xubuntu)
I just got a Vimeo account.
AudioRegent Installation (Xubuntu) from nitin arora on Vimeo.
A quick tutorial on installing and running AudioRegent 1.1 in Xubuntu 9.04.
AudioRegent’s documentation is available at:
http://blog.humaneguitarist.org/projects/audioregent/
The documentation includes a download link to the program files.
LS-598 #4: MXMLiszt on Windows Server 2003
Yesterday I wrote about trying out a Windows VPS using KickAssVPS.com to run MXMLiszt, my web demo for my music encoding research project.
Well, so far so good. Sort of …
I use Windows XP Home Edition (SP3, 32-bit) on my laptop, so my software and everything that it utilizes runs well on XP since I naturally made sure everything works.
But to my surprise things didn’t go that great when I uploaded everything to my VPS which uses Windows Server 2003. Here’s what was problematic:
- Musecore version 0.9.5 is what I call to make PDFs and PNG preview files from the MusicXML files. On the VPS, the titles of the pieces weren’t showing up when these image files were getting made, so I went on to version 0.9.6 which is still in beta. It’s working OK. The PDFs are getting made when one transposes a piece, but they look better with the XP version of MXMLiszt on my laptop which uses version 0.9.5 of MuseScore. But, there’s something funny going on with the administrative PHP script that makes PNG preview files for the site. These are small image of the 1st page of each piece so the user can get a feel for the piece, its key and initial texture, etc. Not all the previews are getting made. Weird.
- I also use MuseScore to "normalize" each MusicXML file by opening/saving them in MuseScore so that every file gets "spit out" by the same application before I use an XSL transformation to auto-generate MODS metadata from the MusicXML files. Works great on XP and with MuseScore 0.9.5 but on the VPS with the beta version of MuseScore, the title of each piece is showing up in the MODS as the primary and sub-title – so it’s redundant. If you don’t know what I’m talking about, don’t worry. Bottom line is things are wonky.
- The software that makes the transpositions, MusicXML Library version 2.0, wasn’t working on the VPS. So I went down to version 1.09. Moreover, my PHP script that calls the MusicXML Library to transpose the music per user request looks for the "xmltranspose.exe" executable in version 2. The older version of the software calls that file "transpose.exe". So rather than change my script, I just renamed "transpose.exe" to "xmltranspose.exe" on the VPS.
So in other words, MXMLiszt really isn’t too compatible with Windows Server 2003. It’s working well enough as a live demo now, but I wouldn’t try and run it "for real" on the VPS since the PNG preview and MODS generation are problematic. So the production version is the XP version as on my laptop.
Oh well. The user doesn’t miss out, it’s just the Administrator who has to do the work on a local machine and upload everything to the VPS for the sake of having a live demo.
Such is life.
p.s. it’ll be a few more weeks before I can post the link to the demo. It needs more documentation before I can share it with anyone other than my professor. Feel free to email me off-list if you would like me to send the link as long as you agree not to post the link until the documentation is done.
This blog post is part of a semester-long investigation into digital encoding of symbolic music representation (SMR), its context in libraries, web-based delivery, preservation and metadata, and search and retrieval technologies..
LS-598 #3: Web Demo Update
Before I provide a quick update on the status of the web demo for my MusicXML digital library delivery mock-up, it goes without saying that I haven’t been bloggin per my original intent. That is to say, the initial idea of writing my paper with a modular approach fed by blog posts, has essentially been scrapped. The idea of it all seemed a bit too piece-meal to me. Instead, I’ll write the paper and post it in its entirety when it’s been written.
As for the web demo, it’s more or less completed from a programming p.o.v at this point, but there are a few things I want to mention:
- The software has a name in MXMLiszt, i.e. "MusicXML List". It’s pronounced "maximalist" and happily references my favorite composer in the process.
- It should run "out of the box" (after you install all dependencies) on Windows servers running XAMPP or at least an Apache installation on a Windows server with the same folder configuration as XAMPP.
- It will utilize open-source software only, with the ironic exception of the OS.
- MXMLiszt will be released with a BSD license approximately 6-8 weeks from now after the documentation is complete and my paper is finished.
Eventually, I’ll port MXMLiszt to Linux (in all likelihood for Ubuntu specifically), but that will entail some work I’m not willing to be distracted by just yet. With Linux, I’ll have more homework to do before I can distribute it and ask people to expect it to work.
Other problems are finding a server to support MXMLiszt. Given all the software requirements and that I *do not* want to have to rearrange – for either a Windows or Linux server – all the paths to the various executable I utilize and - for Linux hosting – the fact that I can’t control what Linux distro a web hosting company offers, I have only two options:
- Use my own laptop as a server. The downside of this is that I can’t leave it on 24/7. I’ll only be able to offer it from roughly 9-5 EST Mon-Fri since I’ll need my computer to rest and/or be available for personal use while I’m home.
- Run MXMLiszt from a Virtual Private Server (VPS) – i.e. instead of a typical web hosting plan, I’ll rent out a virtual machine running Windows from a hosting provider. This gives me total control of the VM and I don’t have to make compromises and/or rearrange folders and paths to executables, etc.
I’ve decided to go with the 2nd option for now using KickAssVPS. I’ll report back how that works out, but I want to close with the economic dilemma all this presents.
It’s roughly $35 a month to have a Windows VPS with KickAssVPS. And that’s on the low end of VPS pricing. That’s not that expensive but when one factors in that for that rate I could finance a brand-new really high end laptop, I have to ask myself "Why not just use my current low-end Dell laptop as a server and use the $35 bucks a month to get a really nice laptop?".
Good question. For now I’ll try the VPS, but I have a feeling that my narcissistic need to control everything will lead me back to running a server from home. And I think in the end I just won’t be able to justify the cost of renting (VPS) vs. renting to own (financing a new laptop).
If you’re asking "Why get an expensive laptop? Why not get a really cheap one or better yet a used headless server?"
Well, my friend, it doesn’t get much cheaper than my Dell Vostro. If I buy a low-end laptop, it’ll end up being comparable in price and specs to my Dell. It doesn’t make much sense to essentially have two of the same thing. As for headless servers, that would be better from the p.o.v. of running a server 24/7 but I’d need a monitor at some point and I don’t want to get into buying more peripherals …
… unless someone can tell me how to remotely control a headless server from a laptop without *ever* needing a monitor for the headless.
This blog post is part of a semester-long investigation into digital encoding of symbolic music representation (SMR), its context in libraries, web-based delivery, preservation and metadata, and search and retrieval technologies.
LS-598 #2: XQuery problems and solutions
Just a quick morning post today …
The last 10 days or so I’ve been struggling with some major problems that arose in trying to implement effective XQuer-ies on my web demo.
- Dublin Core doesn’t allow me to differentiate creator "types", so I was limited to searching across the DC:creator element for all creators, be they Composer, Lyricist, or Arranger. MusicXML does differentiate these types, so essentially Dublin Core was making me "dumb down" some information. I want people to be able to search creator specifically by their role: Composer, Lyricist, or Arranger.
- I needed a way to iterate an XQuery over all the MusicXML documents and I needed it to be relatively fast. A demo is a demo, but impatience is impatience and I just can’t accept slow query processing.
- The XQuery processor I was using didn’t support some XQuery functions that would allow a searcher to type in "Bach" and retrieve documents for which the creator was "J.S Bach", "Johann Sebastian Bach", "Bach, J.S", "P.D.Q Bach", etc. This really was limiting the search/query coolness factor and I wasn’t at all happy about it.
Here are my solutions (details to follow in a few days or so):
- Ditch Dublin Core and switch to MODS, which does allow me to specify the role of a creator. Last week, I made a MusicXML to MODS XSL transformation for descriptive metadata and it’s working well.
- Steal an idea from Using XQuery on MusicXML Databases for Musicological Analysis so that rather than iterate one query (say for the number of notes in a piece) across multiple MusicXML docs, I just concatenated all the MusicXML documents. The original files are left alone, but a "super" MusicXML file gets created so that one can just query that one file, hence no need for lengthy iteration. I’m not sure how those fellows did it, but I just automated it via PHP using the following format:
- Switch XQuery processors! I’ll go into the ones that didn’t support the function I needed another time, but I will say that BaseX did the trick. Below is the query that searches for creators with "Bach" somewhere in MusicXML’s <creator> element. For the deliverable demo, I won’t be querying these big MusicXML documents for simple descriptive metadata like Creator, that’s what the MODS is for. But this is just an example. The "ftcontains" syntax is what allows for retrieval of values where "Bach" is somewhere within the element, but isn’t necessarily equivalent to the entire element value.
<hyperMXML>
<hypoMXML file="foo1.xml">
1st MusicXML document
</hypoMXML>
<hypoMXML file="foo2.xml">
2nd MusicXML document
</hypoMXML>
</hyperMXML>
for $i in doc("../temp/concat/concatMXML.xml")/hyperMXML/hypoMXML/score-partwise
where $i/identification/creator ftcontains "Bach"
return ($i/work/work-title)
This blog post is part of a semester-long investigation into digital encoding of symbolic music representation (SMR), its context in libraries, web-based delivery, preservation and metadata, and search and retrieval technologies.
segmenting audio with AudioRegent, SoX and XML
For some reason I feel obligated to point out that I haven’t blogged in a while for a few reasons:
- Christmas break from school/work at the University of Alabama
- the desire not to blog for the sake of blogging
- and …
I’ve been working on something huge – at least for me. It’s a piece of software called AudioRegent that harnesses XML to create derivative "clips" of regions within WAV audio files. A region is simply a user-defined segment within an audio file, like a track on a Compact Disc.
Besides writing the program in Python, which I pretty much finished in December, I had to also develop the XML format which I call SimpleADL (Simple Audio Decision List) that AudioRegent looks at and then makes derivative audio clips by leveraging SoX, the Sound Exchange command line audio editor. AudioRegent and SimpleADL can also be used to sync audio to text, like transcripts.
Actually, the programming and devising SimpleADL were the easy part. The hard stuff was the documentation and deciding on a license for the software.
I tried to find a balance in documenting the software: being thorough without writing a novel. I’m not sure I succeeded, but I can always improve it with time.
I used the W3C’s Amaya editor to write the documentation in XHTML. Sure, you can use OpenOffice to export a document to XHTML, but man is it bloated and messy. Amaya writes really clean XHTML.
As for the license, I chose the BSD license. As I understand it, this allows one to use the source code at will in future open or closed-source applications as long as you maintain the credits for AudioRegent. I was tempted to use the Mozilla Public License (MPL) which, again from what I can tell, is similar to the BSD license except that any source derived from AudioRegent would have to stay open-source though any peripheral code can be closed-source. I absolutely decided against the GNU General Public License which is viral and imposes its philosophy perpetually on all subsequent code, even peripheral code. Some have even argued that it works against its own objectives and is less "open" than the MPL.
Now I realize that, practically speaking, a skilled programmer could write better code from scratch in 30 minutes as opposed to the some 30 hours I needed, but I wanted to go about this quasi-professionally. And I learned more about licensing, which was cool.
Anyway, rather than try and explain the software itself and how to get it, I’d be better off pointing you to the documentation if you have any interest …
South Park Launcher: Python and VB editions
It seems that about a year ago, the creators of South Park started freely offering their episodes online at southparkstudios.com.
The move was made, in part, to offer a legal way for fans without television access or even televisions (that’s me!) to watch the show. As the creators stated, they got:
This was great for me because I admittedly was downloading those shows as well. One of the other great things about it is the Random Episode link which, you guessed it, launches a random episode of the show.
The only thing that makes it less than perfect is that I can’t set it to repeatedly play more random episodes.
Until now.
I decided to sit down and write the most important code I may ever write: a program that leverages the Random Episode feature to launch successive random episodes of the great social commentary, courtesy of Matt and Trey.
Here’s the core of the code, from a Python perspective:
import sys, webbrowser, time
url = ‘http://www.southparkstudios.com/episodes/random.php’
i=1
while i <= 5:
webbrowser.open(url)
time.sleep(1440) # 1440 seconds = 24 minutes, the average length of an episode plus ads.
i=i+1
sys.exit()
Basically this is a loop that keeps launching a episode every 24 minutes while the variable "i" is less than or equal to 5. After every episode, "i" grows by one so when "i" equals 6, the program shuts down. BTW: after the program starts, the user does have to manually initiate playback of the first episode, but the last 4 episodes play automatically.
Now, this works fine and I even embellished the code so that the user can select from 2 to 10 episodes to launch. But I was still thinking, "This would be great to share with my friends, but they aren’t going to run software from the command line!"
So I wrote a version using Microsoft’s Visual Studio Express, a free version of their software that allows one to create GUI applications with a mixture of visual tools and the Visual Basic language. It’s a really great tool that I hope to use more in the future although I do worry about the extra time required just to make software have a graphical interface.
Anyway, here’s a ZIP file that contains both the installer for the Windows version of South Park Launcher as well as the Python version for Windows and other platforms. Make sure to read the ReadMe.txt file first!
You assume all responsibility for use.
UPDATE, 03/14/2010: Looks like the South Park site now requires one to click the "play" button to start each and every episode, so South Park Launcher isn’t very useful anymore.
Oh well. It was fun while it lasted.
running XQuery online
A while ago, I posted about my first experience with XQuery and how I’d used the .NET version of the Saxon processor on my local Windows machine.
Obviously, I want to extend that experience to running XQueries online. So far, I know this can easily be done with a native XML database server like eXist. That is to say, when I installed eXist on my local machine and made it go live: bam! – instant XML server with built-in XQuery functionality, accessible from anywhere.
Another way is to use one of the more popular web-scripting languages to execute XQuery syntax. I nearly killed myself this weekend trying to install the Zorba PHP binding for XQuery (i.e. run XQuery natively from within a PHP script). I just couldn’t get all the dependencies successfully installed on my virtual install of Ubuntu Netbook Remix (BTW: I use Sun’s VirtualBox for virtualization). Perhaps I’ll be able to make it work another time.
Now, even though it makes all the sense in the world to stick with a native XML server like eXist if I want to make a large collection of searchable XML documents online (and I do), I’m feeling non-sensical. What I decided to try was to run my computer as a more traditional server using the X-Apache-MySql-PHP model, specifically WampServer.
From there, I placed my sample document, "books.xml" and my "test.xquery" file from last time:
<ul>
{
for $x in doc("books.xml")/bookstore/book/title
order by $x
return <li>{data($x)}</li>
}
</ul>
in WAMP’s "www" directory – i.e. the directory which is accessible from the browser, the place where one would put all their .html files, etc. for the world to see.
Of course, I was still missing the actual XQuery processor at this point. What I tried is to put the relevant executables for Saxon in the "www" directory as well.
… Now, I’m sure this is totally unsafe or something, but I was just testing and I only make my computer "go live" as a server for short periods of time.
Anyway, from there I used PHP to call the Saxon processor and to display the results in the browser. It actually worked!
Here’s the code:
<?php
echo "Hi. I’m going to use XQuery to list the books alphabetically.";
exec("query.exe test.xquery !indent=yes", $results);
foreach ($results as $value)
{
echo "$value";
}
?>
You can see that the PHP "exec" command called the Saxon executable named "query.exe" and executes the "test.xquery" file.
It saves the results in a variable called "results" and then prints each value of the results in the browser.
…
For now, I’m OK with this, but I need to eventually do the same thing with the Java version of Saxon, I suppose, if I’m ever going to run this on a Linux server. It shouldn’t present any new hurdles, but I need to try it to make sure, of course.
If anyone out there has any thoughts or recommendations on a more elegant method to achieve these results – and what the security risks of the approach I’ve outlined presents (since I didn’t use a CGI-bin), please speak up. I’m all ears.
UPDATE, 11/15/2009: No problems with using the Java version of Saxon. I did place it in a "bin" directory on my local server so that the Java file wouldn’t reside in the directory that users have direct access to.
Excel & VBA: I cheat to win
A few months ago at work I spearheaded the standardization of our digitization log spreadsheets – simple Excel files that let students track what they digitized, its unique identifier, number of scans per item, date of scan, notes, etc.
The reason for this was twofold:
- A standard template would hopefully make the students a little more productive in that they would not have to re-acclimate themselves to a ‘new look’ whenever they starting working on a different digitization project.
- A standard template would allow us to quickly merge all the spreadsheets into a temporary master Excel file and import that into Microsoft Access so we could run queries and make quick monthly reports. We used a Visual Basic for Applications (VBA) script called RDBMerge to make the master file. It’s a fantastically well written piece of work.
Of course, it took several days to standardize the numerous existing spreadsheets but we got there. Now, all spreadsheets had the same naming convention (using the collection’s unique identifier) and were placed in the same folder – which essentially made it a quasi-database that could be queried with the search functions inherent to the OS. We considered using an actual database, but honestly that causes its own set of problems with training, support, and other stuff – hence the decision to stick with Excel and employ RDBMerge/Access.
Now, over time with standardized spreadsheets one naturally begins to think, "Ah! We should have also added a field for … !"
Well, yes that’s a pain. In a true database that’s no big deal, but with individual files who wants to go through myriad spreadsheets and manually add a column or a formula, etc?
Not me.
I did it once and it took a few hours even using some minor automation in the form a macro I recorded in Excel.
Once was enough.
But the great – I mean truly great – thing though about recording macros in Microsoft Office is that the resultant VBA code – code that the act of recording a macro generates – is visible. And, thus, editable. Not to mention it’s a great way to learn VBA syntax. In other words, if you record yourself doing a task in Excel, you can then actually see/edit the line item code that equates to your actions. Seems like a great tool for teaching some programming skills …
Anyway, I started to ask the lazy question, "How can I iterate a recorded macro/s over all spreadsheets in a given folder?"
After much digging, I found a great template script here. The script basically is a wrapper that can encapsulate the VBA script for one’s recorded macros. A little know-how and tweaking is required, but that’s better than the alternative: learning from-scratch Visual Basic on the job is something I really just don’t have the time for right now. At any rate, I was able to use the template script and automate stuff across all of our digitization log files in no time.
By the way, if you use the script, note the snippet below shows a line break that isn’t present in the blogger’s post, perhaps due to their blogging software’s formatting.
‘will start LOOP until all files in folder sPath have been looped through
Set oWbk = Workbooks.Open(sPath & "" & sFil)
p.s. in my experience, batch VBA scripts break on Excel files for which sharing is turned on (simultaneous multi-user editing). Those files need to be identified* and temporarily isolated into another folder or the sharing needs to temporarily be turned off. Also, sheets need to be ‘unprotected’ for the code to work, but it’s quite simple to add a line in your macro that unprotects the sheet prior to performing whatever macro you’ve created.
* It’s a good idea to use some sort of suffix to identify shared files … something like "filename.share.xlsx". In fact, if one does so, a VBA script could be run to turn off sharing for all shared files (*.share.xls*). Unfortunately, trying to turn sharing off for a non-shared Excel file will break the script … which, again, is why it’s best to identify shared documents through a filenaming convention.