blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘information retrieval’ Category

keyword vs. phrase searching of the Soundboard, a GFA publication

As I mentioned before, last summer I went to the Guitar Foundation of America convention in Charleston.

I also mentioned that I'd asked some questions about whether the GFA journal, "Soundboard", was full-text indexed.

Via the FlippingBook software the GFA uses to display current issues online (membership required), there is full-text searching capability because the content is indexed as far as I can tell. But as I was saying, I don't think one can search across *all* online Soundboards simultaneously – i.e. fire off one query and get results across all online Soundboards. I could be wrong about that.

In contrast, the PDF back issues sold on a DVD-ROM are neither full-text indexed nor full-text searchable with Adobe Acrobat Reader as far as I can tell. And I think this is where there's real confusion – perhaps on my part – about what we mean when we use terms like "keyword" searching.

To me, keyword searching means full-text and not a "find" (as in Acrobat Reader). The Webopedia site differentiates these as "keyword" and "phrase" searches, respectively. The GFA is using a different meaning, per the "How to search Soundboard back issues.pdf" file that comes with the DVD, for "keyword" searching:

"These issues have been processed both to reproduce the page-by-page appearance of the originals on your computer screen, and to apply an "optical character recognition" (OCR) process to the text, so that every page of every issue is now keyword searchable."

In my experience, however, the search provided internally via Adobe Acrobat Reader (and Foxit Reader, too) is what I'd just call a "find" (i.e. the same as Ctrl-F on your browser). In fact, in my version of Acrobat Reader and per the screenshot in the "How to search Soundboard back issues.pdf" file, Adobe also uses the phrase "find" and not "search" in their application. Their "Advanced Search" adds options really dealing with what to search (comments, all files in a folder, etc.) but not really how to search (in the algorithmic sense) – so, it's still a "find", though more feature-rich. Now, if you have Acrobat Pro (admittedly I do through work) you apparently can create an index and then actually do a full-text search, but that doesn't help people who don't have the pro version and won't/can't buy it.

Granted, I can index the PDF with my operating system (Windows) and do a full-text search, but I don't really get much useful information other than what files match. I don't get useful information on where the passage exists (page number, etc).

Consider the following passage from Soundboard Volume 1, Number 1, 1974:

"Mr. Llois Mauerhofer, Elizabethstrasse 93, 8010 Graz, Lustria, was reported working on a doctoral dissertation at the University of Graz on Leonard von Call, early 19th c. guitarist active in Vienna who is best remembered for his serenades for guitar and strings."

A "find" won't match that passage if you search for "Graz University" or "University Graz" or "strings Vienna" but a real keyword search likely would.
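To make the difference concrete, here's a minimal Python sketch. It isn't the code behind my demo; the function names `find` and `keyword_search` are just made up for illustration, and I'm assuming a "keyword" search means every query word has to appear somewhere in the text, in any order (real full-text engines are fancier than this).

```python
import re

# Excerpt of the Soundboard passage quoted above.
PASSAGE = ("was reported working on a doctoral dissertation at the "
           "University of Graz on Leonard von Call, early 19th c. guitarist "
           "active in Vienna who is best remembered for his serenades for "
           "guitar and strings.")


def find(text, query):
    """Phrase-style "find": the query must appear as one contiguous string."""
    return query.lower() in text.lower()


def keyword_search(text, query):
    """Keyword-style search: every query word must occur, in any order."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(term in words for term in re.findall(r"\w+", query.lower()))


for query in ("Graz University", "University Graz", "strings Vienna"):
    print(query, "| find:", find(PASSAGE, query),
          "| keyword:", keyword_search(PASSAGE, query))
```

For all three queries the "find" comes back False and the keyword search comes back True, because the words are all in the passage but never side by side in that order.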

Of course, a demonstration is in order. Using a tool called Apache Tika to extract the text from the aforementioned PDF scan of Soundboard v.1, #1, 1974; a little Python script I wrote to output the data to a database-friendly file; and an online database, I indexed the data and made a little API. All that means is that there's a page you can go to, throw some search terms at it, and get the results back as structured data (um, usually not fun to read through).

By the way, I normally use more technical jargon in my posts but I have some guitarist buddies who I want to read this page.

Anyway, here are the three searches mentioned above that don't yield results in Acrobat Reader but do using a full-text search (you can see the search terms in bold in the links below). Don't worry if you can't read the output, just focus on the fact that something comes back (provided my database isn't down at the moment!).

http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=Graz+University
http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=University+Graz
http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=strings+Vienna

For a more user-friendly version, try going here:

http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/soundboard_search.html

Try typing in the three searches mentioned above. Then try some more searches for fun. For simplicity's sake, I hard-coded the system to never return more than 10 results.

Of course, this should all scale to indexing the text of all the PDFs on the DVD, but exposing those openly on the web wouldn't be appropriate.

But my point with this demo is to say that this is more like what I meant by "keyword" searching at the GFA convention. There's probably a way to ingest the old PDFs into the FlippingBook software or at least something else like the Internet Archive book reader. That would probably require re-OCRing the images so that the coordinates of the words could be indexed as well, allowing one to see where on a page the results are, just as with the current issues via FlippingBook.

Ok, if you're still here and are a geek, here's the Python script, "soundboardToTabDelimited.py".

'''
usage example:
  $ python soundboardToTabDelimited.py V01-n1-1974.pdf

This yields "V01-n1-1974.xhtml" and then "V01-n1-1974.txt"
 
Note: you must have the lxml module installed (which isn't always fun).
You can get it here: http://lxml.de/
'''

import codecs, subprocess, sys
from lxml import etree

##### globals
tab = "\t"
br = "\n"


##### run Apache Tika on the file passed via the command line
soundboard = sys.argv[1].replace(".pdf", "")
command_string = "java -jar tika-app-1.2.jar %s > %s" %(soundboard + ".pdf", soundboard + ".xhtml")
command = subprocess.Popen(command_string, shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
command.communicate() #wait until the subprocess finishes; unlike wait(), this also drains the pipes so lots of Tika warnings can't block it.


##### write file headers (this needs to be deleted if you're going to later import the file via PHPMyAdmin).
tab_delimited = codecs.open(soundboard + ".txt", "w", "utf-8") #output file

tab_delimited.write("journal_id" + tab + "volume" + tab + \
                    "issue" + tab + "year" + tab + \
                    "page_id" + tab + "text_id" + tab + "text" + br)


##### extract volume, issue, year from filename
volume = int(soundboard.split("-")[0].replace("V", ""))
issue = int(soundboard.split("-")[1].replace("n", ""))
year = int(soundboard.split("-")[2])
journal_id = "%04d_%04d_%04d" %(volume, issue, year)


##### parse xhtml file
soundboard_parse = etree.parse(soundboard + ".xhtml")
root = soundboard_parse.xpath(".")

div_tags = root[0].xpath("//xhtml:div[@class='page']",
             namespaces={"xhtml":"http://www.w3.org/1999/xhtml"})


##### extract text from each div/p tag and write data to file
page_id = 1
for div_tag in div_tags:
  text_id = 0
  p_tags = div_tag.xpath("xhtml:p",
             namespaces={"xhtml":"http://www.w3.org/1999/xhtml"})

  for p_tag in p_tags:
    p_text = p_tag.text
    if p_text is not None and p_text != "":
      p_text = p_text.replace(br, "")
      p_text = p_text.replace(tab, "  ")
      p_text = p_text.strip()
      if p_text != "":
        tab_delimited.write(str(journal_id) + tab + str(volume) + tab + \
                            str(issue) + tab + str(year) + tab + \
                            str(page_id) + tab + str(text_id) + \
                            tab + p_text + br)
        text_id = text_id + 1
     
  page_id = page_id + 1

tab_delimited.close()
# fin
--------------

Written by nitin

January 5th, 2013 at 12:35 pm

PyEDS: a simple Python starter library for Ebsco’s Discovery Service (EDS)

Before this little vacation I'm on started (sadly, it's almost over!), I was allowed to have access to Ebsco's Discovery Service (EDS) API and its documentation WIKI.

I sent a tiny bit of feedback on some of the things in the documentation that I think are unclear or really need correction and I'm looking to send more when I return to work.

My biggest concern was that – and I think this is true of A LOT of API documentation – it requires a lot of reading on the user's part to figure out what means what, and that reading almost invariably takes more work than actually writing the code to authenticate, make queries, etc.

That's to say that often working through documentation about tying a shoelace is more of a task than actually tying said shoelace.

I *think* developers really just want to start experimenting with code, so clarity and really concise language with examples are really of the utmost importance.

Speaking of examples, I also think that sample code needs to have scope in mind. What I'm getting at is that sample code for a search API shouldn't be a "soup to nuts" thing that entails authenticating, making requests, having a client-side UI/interface and displaying results, etc. That's too much. Again, I think (off the top of my head of course and with nothing more than a gut feeling) that it might be more helpful to simply show how to authenticate and make a request and show the formatting of a sample response. The other stuff – interface, UI, etc, etc. – just convolutes the code and adds noise to the basics. In fact, that confuses API usage implementation vs. the API usage itself.

Better still would be to offer small libraries in popular scripting languages that simplify the basics – again, to make it easier for people to play with one's APIs. The easier and more "fun" it is, the more likely I think (yeah, yeah, I know!) people are to really dream about incorporating the API, etc. into their applications and what-nots.

So along those lines, I've pasted a little sample Python script below that makes it much easier for me to authenticate, open a session, conduct searches, format the JSON response, and close the session. It needs work (what doesn't?) but it does what I mean for it to for now.

I probably shouldn't post a sample response since access to the EDS WIKI is for customers only, but if you aren't a customer or at least aren't interested, why are you even reading this?

:P

#PyEDS.py

'''
This module provides a basic Python binding to Ebsco's EDS API, allowing one to:
  - authenticate with a UserID and Password,
  - open and close a session,
  - perform a search (results are returned as JSON),
  - pretty print the JSON.
 
Thanks,
Nitin Arora; nitaro74@gmail.com
____________________________________________________________________________________________________
#Usage example:
 
  import PyEDS as eds
  
  eds.authenticateUser('USERID_GOES_HERE', 'PASSWORD_GOES_HERE')
  eds.openSession('PROFILE_GOES_HERE', 'GUEST_GOES_HERE', 'ORG_GOES_HERE')
 
  #eds.authenticateFile() #alternative to using authenticateUser() and openSession()
  #uses values in JSON config file argument(default="config.json")
  
  #sample "config.json" file:
  """
  {
    "EDS_config": {
      "UserId": "USERID_GOES_HERE",
      "Password": "PASSWORD_GOES_HERE",
      "Profile": "PROFILE_GOES_HERE",
      "Guest": "GUEST_GOES_HERE",
      "Org": "ORG_GOES_HERE"
    }
  }
  """
 
  kittens = eds.advancedSearch('{"SearchCriteria":{"Queries":[{"Term":"kittens"}],"SearchMode":"smart","IncludeFacets":"y","Sort":"relevance"},"RetrievalCriteria":{"View":"brief","ResultsPerPage":10,"PageNumber":1,"Highlight":"y"},"Actions":null}')
  puppies = eds.advancedSearch('{"SearchCriteria":{"Queries":[{"Term":"puppies"}],"SearchMode":"smart","IncludeFacets":"y","Sort":"relevance"},"RetrievalCriteria":{"View":"brief","ResultsPerPage":10,"PageNumber":1,"Highlight":"y"},"Actions":null}')
  cubs = eds.basicSearch('cubs')
  piglets = eds.basicSearch('piglets', view='brief', offset=1, limit=10, order='relevance')
  
  eds.closeSession()
  
  print 'Some search results with the EDS API ...'
  print '\n"kittens" advanced search as original JSON:'
  print kittens
  print '\n"puppies" advanced search as original JSON:'
  print puppies
  print '\n"kittens" advanced search as JSON with indentation and non-ascii escaping:'
  print eds.prettyPrint(kittens)
  print '\n"cubs" and "piglets" basic searches as original JSON:'
  print cubs, piglets
  print '\nGoodbye.'
____________________________________________________________________________________________________
 
TO DO:
  - add more options to basicSearch() like "facets", "search mode", "fulltext", "thesaurus", etc.
    - can't hurt! :-]
  - consider adding an authenticateIP() function that uses the IP authentication method.
  - deal with expired tokens, etc.; see: http://edswiki.ebscohost.com/API_Reference_Guide:_Appendix
'''
 
import urllib2
_EDS_ = {}
 
 
def authenticateUser(UserId, Password):
  '''Authenticates user with an EDS UserId and Password.'''
  auth_json = '{"UserId":"%s","Password":"%s","InterfaceId":"WSapi"}' %(UserId, Password)
  req = urllib2.Request(url='https://eds-api.ebscohost.com/authservice/rest/UIDAuth',
                        data=auth_json,
                        headers={'Content-Type':'application/json'})
  req_open = urllib2.urlopen(req)
  req_results = req_open.read()
  
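  import json
  req_results_dictionary = json.loads(req_results) #convert JSON to dictionary (safer than eval(), which would run the response as Python code).
  _EDS_['AuthToken'] = str(req_results_dictionary['AuthToken']) #str() keeps the header value a plain string.
  _EDS_['AuthTimeout'] = req_results_dictionary['AuthTimeout']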
 
 
def openSession(Profile, Guest, Org):
  '''Opens the EDS session with an EDS Profile, the Guest value ("y" or "n"), and the Org nickname.'''
  sessionOpen_json = '{"Profile":"%s","Guest":"%s","Org":"%s"}' %(Profile, Guest, Org)
  req = urllib2.Request(url='http://eds-api.ebscohost.com/edsapi/rest/CreateSession',
                        data=sessionOpen_json,
                        headers={'Content-Type':'application/json',
                        'x-authenticationToken':_EDS_['AuthToken']})
  req_open = urllib2.urlopen(req)
  req_results = req_open.read()
 
  import json
  req_results_dictionary = json.loads(req_results) #convert JSON to dictionary (safer than eval()).
  _EDS_['SessionToken'] = str(req_results_dictionary['SessionToken']).replace('\\/', '/')
 
 
def closeSession():
  '''Closes the EDS session.'''
  sessionClose_json = '{"SessionToken":"%s"}' %(_EDS_['SessionToken'])
  req = urllib2.Request(url='http://eds-api.ebscohost.com/edsapi/rest/EndSession',
                        data=sessionClose_json,
                        headers={'Content-Type':'application/json',
                        'x-authenticationToken':_EDS_['AuthToken']})
  urllib2.urlopen(req)
  
  
def authenticateFile(config_file='config.json'):
  '''Uses values in JSON config file to authenticate *and* open a session.'''
  config = open(config_file, 'r').read()
  config = eval(config)
  config = config['EDS_config']
  authenticateUser(config['UserId'], config['Password'])
  openSession(config['Profile'], config['Guest'], config['Org'])
 
 
def basicSearch(query, view='brief', offset=1, limit=10, order='relevance'):
  '''Returns search results using basic arguments.'''
  search_json = '''{"SearchCriteria":{"Queries":[{"Term":"%s"}],"SearchMode":"smart","IncludeFacets":"n","Sort":"%s"},
                   "RetrievalCriteria":{"View":"%s","ResultsPerPage":%d,"PageNumber":%d,"Highlight":"n"},"Actions":null}
                   ''' %(query, order, view, limit, offset)
  return advancedSearch(search_json)
 
         
def advancedSearch(search_json):
  '''Returns search results using the full EDS search syntax (JSON).'''
  req = urllib2.Request(url='http://eds-api.ebscohost.com/edsapi/rest/Search',
                        data=search_json, headers={'Content-Type':'application/json',
                        'x-authenticationToken':_EDS_['AuthToken'],
                        'x-sessionToken':_EDS_['SessionToken']})
  req_open = urllib2.urlopen(req)
  req_results = req_open.read()
  return req_results
 
 
def prettyPrint(json_string):
  '''Returns a pretty-printed, UTF-8 encoded JSON string with escaped non-ASCII characters.'''
  import json
  dictionary = json.loads(json_string, encoding='utf-8')
  return json.dumps(dictionary, ensure_ascii=True, indent=2, encoding='utf-8')
 
 
#fin
--------------

Written by nitin

December 30th, 2012 at 11:23 am

full-text searching of timed text and a farewell to Andy Roddick

It’s been a while since I had one of my “So, I’m home sick today and wrote this silly, little script” things.

Well, here’s another one while the antibiotics take root.

I’ve always wanted to do something with offering full-text search against timed-text files and allowing a user to click on a result and skip to the audio segment matching the returned line of timed-text, etc. Hulu has had a BETA version of this kind of thing for a while and I suspect others do too.

Well, today I just whipped up a little search API using PHP and MySQL. It’s a nice little start and super easy to do.

I made a database table using the timed-text data from my SAVS project, OpenOffice Calc, and phpMyAdmin. The text is from Shakespeare’s Sonnet 130 using a LibriVox recording (version #14, Miller). BTW, parsing DFXP or SRT files and throwing those into a table is easy, but it’s not within the scope of this little mock-up.

If I send a query for “rare love” to the API as such:

http://blog.humaneguitarist.org/uploads/SAVS/currentVersion/search/?q=rare%20love

… I get the following JSON response:

{
  "results":{
    "result":[
      {
        "text":"Than in the breath that from my mistress reeks.",
        "highlighted_text":"Than in the breath that from my <mark>mistress<\/mark> <mark>reeks<\/mark>.",
        "startTime":"34",
        "stopTime":"37",
        "source":"sonnet130_shakespeare_njm",
        "relevance":"4.04993200302124"
      },
      {
        "text":"My mistress, when she walks, treads on the ground:",
        "highlighted_text":"My <mark>mistress<\/mark>, when she walks, treads on the ground:",
        "startTime":"46",
        "stopTime":"49",
        "source":"sonnet130_shakespeare_njm",
        "relevance":"1.62977826595306"
      }
    ]
  }
}

Note that the text is returned in the “text” field and I’m also trying to return a “highlighted_text” field in which search terms are surrounded by the HTML5 “mark” tag. There’s also a relevance score … of sorts (pun!).
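For the curious, here's a rough Python sketch of that highlighting idea. It's not the PHP the API runs; the `highlight` function is a made-up name, and I'm assuming whole-word, case-insensitive matching, with the text HTML-escaped piece by piece so the "mark" tags themselves stay intact.

```python
import html
import re


def highlight(line_text, query):
    """Wrap each whole-word match of a query term in <mark> tags."""
    terms = [re.escape(term) for term in query.split()]
    if not terms:
        return html.escape(line_text)
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    # With one capture group, re.split() alternates between unmatched text
    # (even positions) and matched words (odd positions).
    pieces = pattern.split(line_text)
    return "".join(
        "<mark>" + html.escape(piece) + "</mark>" if i % 2 else html.escape(piece)
        for i, piece in enumerate(pieces)
    )


print(highlight("Than in the breath that from my mistress reeks.",
                "mistress reeks"))
```

Because the matched piece is taken from the line itself, the original capitalization is kept even when the query is typed in lower case.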

It needs a lot of work, but there’s enough data returned to launch an audio segment using some HTML5/JavaScript or some Flash or Silverlight API, etc. Hey, it ain’t too bad for a bad stomach and some sports-entertainment distractions.

Below, I’ll paste the CSV file I used to make the table, the PHP script … and a personal note about the best male American tennis professional of the last decade.

Here’s the CSV file from the spreadsheet application (note the “line_text” field is full-text indexed in the database):

"line_id";"line_text";"start_time";"stop_time";"file_prefix"
"1";"Coral is far more red than her lips' red:";"13";"17";"sonnet130_shakespeare_njm"
"2";"If snow be white, why then her breasts are dun;";"17";"21";"sonnet130_shakespeare_njm"
"3";"If hairs be wires, black wires grow on her head.";"21";"26";"sonnet130_shakespeare_njm"
"4";"I have seen roses damask'd, red and white,";"26";"29";"sonnet130_shakespeare_njm"
"5";"But no such roses see I in her cheeks;";"29";"32";"sonnet130_shakespeare_njm"
"6";"And in some perfumes is there more delight";"32";"34";"sonnet130_shakespeare_njm"
"7";"Than in the breath that from my mistress reeks.";"34";"37";"sonnet130_shakespeare_njm"
"8";"I love to hear her speak, yet well I know";"37";"40";"sonnet130_shakespeare_njm"
"9";"That music hath a far more pleasing sound:";"40";"43";"sonnet130_shakespeare_njm"
"10";"I grant I never saw a goddess go, --";"43";"46";"sonnet130_shakespeare_njm"
"11";"My mistress, when she walks, treads on the ground:";"46";"49";"sonnet130_shakespeare_njm"
"12";"And yet, by heaven, I think my love as rare";"49";"54";"sonnet130_shakespeare_njm"
"13";"As any she belied with false compare.";"54";"56";"sonnet130_shakespeare_njm"

Here’s the PHP script:

<?php
//GET search words from URL parameter
$searchWords = trim($_GET["q"]);

//prepare for highlighting keywords
$search_array= explode(" ", $searchWords);

//prepare for output
$output = array();

//connect to database
include_once("db_setup.php");

//run query
$searchWords = mysql_real_escape_string($searchWords);
$query = "SELECT *, MATCH(line_text) AGAINST(\"$searchWords\") AS relevance
FROM $table WHERE MATCH(line_text) AGAINST(\"$searchWords\" IN BOOLEAN mode)
ORDER BY relevance DESC";
$result = mysql_query($query);

if($result) {
    while($row = mysql_fetch_array($result)) {
      $line_text = $row["line_text"];
      $start_time = $row["start_time"];
      $stop_time = $row["stop_time"];
      $file_prefix = $row["file_prefix"];
      $relevance = $row["relevance"];

      //highlight search words in line_text
      $highlighted_text = $line_text;
      foreach ($search_array as $word) {
        $highlighted_text = str_ireplace($word, "<mark>$word</mark>", $highlighted_text);
      }

      $this_output = array("text" => htmlspecialchars($line_text),
      "highlighted_text" => htmlspecialchars($highlighted_text),
      "startTime" => $start_time,
      "stopTime" => $stop_time,
      "source" => $file_prefix,
      "relevance" => $relevance);
      array_push($output, $this_output);
    }
}

//send JSON results
if (count($output) == 0) {
  $results = array("results" => "No results.");
}

else {
  $result = array("result" => $output);
  $results = array("results" => $result);
}

$response = json_encode($results);
include_once("indent_json.php");
header("Content-type: application/json; charset=UTF-8");
echo(indent_json($response));
?>

And here’s something more important.

As a huge tennis fan, today was a melancholy one for me as Andy Roddick played his last match, having just lost a few moments ago to Juan Martin del Potro. The Wikipedia article on Roddick here already lists him as retired, but the important thing to remember about Roddick is that he achieved more with less than a lot of other players with more talent, and he was entertaining to watch, win or lose, in big matches.

Thanks for the memories!

--------------

Written by nitin

September 5th, 2012 at 6:17 pm

sorta sorting API results with in-memory SQLite

I'll try to keep this short because it's looking like the weather is going to be agreeable enough for a nice, long Saturday walk.
 
So, I've been working on a mockup API at work that could, among other things, drive an in-site federated search across things like our Ebsco databases, the other vendor resources available through Ebsco's API using SRU, and of course our own databases with lists of the resources we offer and their descriptions.
 
Using simple textual similarity libraries it's easy to have the API return a text similarity score (a trick I learned working on HammerFlicks!) comparing the query against the title of each item. This way if someone types in "Wall St. Journal" it's easy to highlight (through an HTML/JavaScript page) the hit for "Wall Street Journal" from our own database because that'll be a good text similarity match.
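Here's a quick Python sketch of the idea, using the standard library's `difflib` rather than whatever the mockup API actually uses. The `similarity_score` name and every title except "Wall Street Journal" are made up for illustration, and the numbers it produces aren't meant to match the API's scores.

```python
from difflib import SequenceMatcher


def similarity_score(query, title):
    """Return a 0-100 text similarity score between a query and a title."""
    return 100 * SequenceMatcher(None, query.lower(), title.lower()).ratio()


# Hypothetical resource titles; only the first comes from the example above.
titles = ["Wall Street Journal", "New York Times",
          "Journal of Wildlife Management"]
query = "Wall St. Journal"

# Pick the title that looks most like what the user typed.
best = max(titles, key=lambda title: similarity_score(query, title))
print(best)
for title in titles:
    print(round(similarity_score(query, title), 1), title)
```

The point is only that a near-miss like "Wall St. Journal" scores far closer to the right title than to anything else, which is enough to flag that hit on the page.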
 
Here's a snippet showing the similarity attribute:
<?xml version="1.0"?>
<nclive_api_response>
  <results source="ncl_resource_titles">
    <result text_similarity_score="86.666666666667">
      <title>Wall Street Journal</title>
      <url>http://www.nclive.org/cgi-bin/nclsm?rsrc=29</url>
      <description>Full articles from the Wall Street Journal (1981-current).</description>
    </result>
	…
  </results>
</nclive_api_response>

As for sorting through results brought in via multiple resources all using their own relevancy rankings – that's a different story. They're using their own relevancy calculations, so there's really no way to present results across multiple sources as the "most relevant".

 
I was toying with the idea, though, of testing what it would be like to – after the fact – index all the returned results on the fly in Solr or something just to get a relevancy ranking for the results the API returns. Now, this isn’t of course arguing that this would be a total relevancy rank across all sources. In other words, if you only pull five items from each "sub-API", each data source mentioned above, then there's no way to say that the first item from Database A is necessarily more relevant than the fifth result of Database B.
 
Anyway, I thought it was stupid to index things behind the scenes in something external just to get an on-the-fly relevancy rank to inject into the API results, when I'd only then have to quickly delete the entire index since I would just be using it to get a score.
 
But what I don't think is too stupid is the idea itself. It's making the argument that "Look, I've asked these different sources to send me their best stuff and now I'll have a way to rank them with my own criteria … because they're mine now." It's like using your own criteria to rank job candidates after asking a few of your industry friends to each send in their five best employees for the job you're hiring for. You're not necessarily going to agree with how they rank their own employees but you do trust that they've sent you five top notch folks.
 
… and so, after a colleague in another department asked if there would be a way to sort items across multiple data sources, I thought to investigate a way to do the indexing and have some kind of ranking/relevancy score done all in memory.
 
Enter SQLite.
 
This is really cool. With SQLite, I can create a full-text-indexed, searchable, on-the-fly database in memory that will let me develop some kind of rank per item. Note, one has to have SQLite with FTS3/FTS4 enabled to do full-text with SQLite.
 
Now, the way I'm doing this is to use SQLite's offsets() function to learn – for each search term/word passed to the API – if it or its Porter-based stem matches in the TITLE field (for which each hit gets, say, 2 points) or the DESCRIPTION field (1 point).
 
After getting the total points, I'm dividing the points by the total number of words within the API's TITLE + DESCRIPTION values to get a scaled result between 0 and 1.
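The arithmetic is simple enough to sketch in plain Python. To be clear about the assumptions: this leaves SQLite out entirely, matches exact words only (no Porter stemming), and the `tokenize` and `sorta_score` names and the sample strings are made up for illustration.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sorta_score(title, description, search_text,
                title_points=2, description_points=1):
    """Score hits in the title and description, scaled to between 0 and 1."""
    title_words = tokenize(title)
    description_words = tokenize(description)
    total_words = len(title_words) + len(description_words)
    if total_words == 0:
        return 0.0
    terms = set(tokenize(search_text))
    # Every occurrence of a search word counts as a hit.
    points = title_points * sum(1 for word in title_words if word in terms)
    points += description_points * sum(
        1 for word in description_words if word in terms)
    # Cap at 1 for very short texts, where points can exceed the word count.
    return min(points / total_words, 1.0)


score = sorta_score("Chronic low back pain",
                    "A trial of walking for back pain",
                    "back pain exercise")
print(score)
```

In this toy case "back" and "pain" each hit once in the title (4 points) and once in the description (2 points), and the two fields hold 11 words in total, so the score is 6/11.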
 
Anyway, I've got a starter function below (PHP) that would return what I'm calling a "sorta" score. It'll be interesting to work it into the mockup API to see how it works in the real world in trying to sort items from across different sources.
 
And just to be clear, I'm doing this per item. That's to say I do these calculations for one item then delete the in-memory database. In other words, I'm not indexing all the API results in memory and then getting this "sorta" rank per item because the calculation is agnostic of the other items. Now, if I changed the calculation to consider the other items as well, then absolutely there would be a need to index all the items first before assigning a "sorta" score per item.
 
BTW. Get it … "sorta"?
… 'cause it's "sort of" a way to sort things from multiple sources. Ha!
 
:P
 
Anyway, the PHP's below followed by another PHP block that uses the function and then an HTML snippet of what gets returned with sample text.
 
And so much for my walk, looks like rain's on the way. Dammit.
<?php
  
//clean out special chars, etc.
function recharacter_this($htmlstring) {
  $htmlstring = htmlspecialchars($htmlstring, ENT_QUOTES);
  $htmlstring = preg_replace("/[^A-Za-z0-9\s]/", " ", $htmlstring); //replace everything except alpha-numerics and whitespace with a space
  $htmlstring = preg_replace("/\s+/", " ", $htmlstring); //replace multiple whitespaces with a single space
  $htmlstring = trim($htmlstring); //trim last so no stray space is left where punctuation was
  return $htmlstring;
}

//get a rank score
function sorta_this($title, $description, $search_text) {
  
  $title = recharacter_this($title);
  $description = recharacter_this($description);
  $search_text = recharacter_this($search_text);
  
  //re: SQLite/PHP fundamentals, see: http://www.if-not-true-then-false.com/2012/php-pdo-sqlite3-example/
  
  //create memory db
  $memory_db = null;
  $memory_db = new PDO('sqlite::memory:');
  
  //errormode set to exceptions
  $memory_db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
  
  //create table
  //you must use "VIRTUAL TABLE" for FTS3/4, see: http://www.sqlite.org/fts3.html#section_1_2
  $memory_db->exec("CREATE VIRTUAL TABLE box using FTS4 (
  id,
  title,
  description,
  tokenize=porter)"); //porter > simple because a search for "tree" matches up against text with "trees", whereas "tokenize=simple" tokenization doesn't seem to do this;
  //granted, Porter stemming has its own problems, but it's better than nothing.
  
  $insert = "INSERT INTO box (id, title, description) VALUES('1', '$title', '$description')";
  $stmt = $memory_db->exec($insert); //insert values per above
  
  $search_text = str_replace(" ", " OR ", $search_text); //making search more liberal
  $query = "SELECT quote(offsets(box)) as rank FROM box WHERE box MATCH '$search_text' ORDER BY rank";
  $result = $memory_db->query($query); //run query per above
  
  $score = 0; //start with initial score of Zero
  $i = 0;  //to use during iteration
  
  //if query yielded anything ...
  if ($result) {
    
    //there's only one row, but still need to loop
    foreach($result as $row) {
      $rank = $row['rank'];
      preg_match_all("/[a-zA-Z0-9]+\ [a-zA-Z0-9]+\ [a-zA-Z0-9]+\ [a-zA-Z0-9]+/", $rank, $matches); //split at every 4th space, i.e. every quartet returned by SQLite offsets(); see: http://stackoverflow.com/questions/10555698/split-string-after-every-five-words
      
      //$matches is a single item array with one array inside it for each quartet; $matches[0] is thus just a plain array
      foreach ($matches[0] as $match) {
        if ($match[0] == 1) {
          //if search hits in TITLE field, get 2 points
          $score = $score + 2;
        }
        else { 
          //if in DESCRIPTION field, get 1 point
          $score = $score + 1;
        }
        $i = $i + 1;
      }
    }
  }
  
  $memory_db->exec("DROP TABLE box");
  $memory_db = null;
  
  $total_words = str_word_count($title) + str_word_count($description);
  $score = ($score/$total_words); //divide $score by total number of words in TITLE + DESCRIPTION
  
  //prevent scores greater than 1, which would only occur with an abnormally small number of total words (essentially <= to the number of words in search terms)
  if ($score > 1) {
    $score = 1;
  }
  return $score;
}
?>
Using the function with TITLE and DESCRIPTION (abstract) from this article …
<?php
//test sorta_this() function
$my_title = ("An aerobic walking programme versus muscle strengthening programme for chronic low back pain: a randomized controlled trial.");
$my_description = ("Objective:To assess the effect of aerobic walking training as compared to active training, which includes muscle strengthening, on functional abilities among patients with chronic low back pain.Design:Randomized controlled clinical trial with blind assessors.Setting:Outpatient clinic.Subjects:Fifty-two sedentary patients, aged 18-65 years with chronic low back pain. Patients who were post surgery, post trauma, with cardiovascular problems, and with oncological disease were excluded.Intervention:Experimental 'walking' group: moderate intense treadmill walking; control 'exercise' group: specific low back exercise; both, twice a week for six weeks.Main measures:Six-minute walking test, Fear-Avoidance Belief Questionnaire, back and abdomen muscle endurance tests, Oswestry Disability Questionnaire, Low Back Pain Functional Scale (LBPFS).Results:Significant improvements were noted in all outcome measures in both groups with non-significant difference between groups. The mean distance in metres covered during 6 minutes increased by 70.7 (95% confidence interval (CI) 12.3-127.7) in the 'walking' group and by 43.8 (95% CI 19.6-68.0) in the 'exercise' group. The trunk flexor endurance test showed significant improvement in both groups, increasing by 0.6 (95% CI 0.0-1.1) in the 'walking' group and by 1.1 (95% CI 0.3-1.8) in the 'exercise' group.Conclusions:A six-week walk training programme was as effective as six weeks of specific strengthening exercises programme for the low back."); 
$my_search_text = ("back pain exercise");
$my_score = sorta_this($my_title, $my_description, $my_search_text);

echo ("Searching for \"$my_search_text\" in <br /><br />TITLE: <em>$my_title</em> <br /><br />and <br /><br />DESCRIPTION: <em>$my_description</em> <br /><br />yields a \"sorta\" relevancy of<strong> ");
echo $my_score . "</strong><br /><br />";
echo ("<hr />Hits for each search word in TITLE get 2 points, hits in DESCRIPTION get 1 point.<br />This number is then divided by the total number of words in the TITLE + DESCRIPTION.");
?>

The results …

Searching for "back pain exercise" in

TITLE: An aerobic walking programme versus muscle strengthening programme for chronic low back pain: a randomized controlled trial.

and

DESCRIPTION: Objective:To assess the effect of aerobic walking training as compared to active training, which includes muscle strengthening, on functional abilities among patients with chronic low back pain.Design:Randomized controlled clinical trial with blind assessors.Setting:Outpatient clinic.Subjects:Fifty-two sedentary patients, aged 18-65 years with chronic low back pain. Patients who were post surgery, post trauma, with cardiovascular problems, and with oncological disease were excluded.Intervention:Experimental 'walking' group: moderate intense treadmill walking; control 'exercise' group: specific low back exercise; both, twice a week for six weeks.Main measures:Six-minute walking test, Fear-Avoidance Belief Questionnaire, back and abdomen muscle endurance tests, Oswestry Disability Questionnaire, Low Back Pain Functional Scale (LBPFS).Results:Significant improvements were noted in all outcome measures in both groups with non-significant difference between groups. The mean distance in metres covered during 6 minutes increased by 70.7 (95% confidence interval (CI) 12.3-127.7) in the 'walking' group and by 43.8 (95% CI 19.6-68.0) in the 'exercise' group. The trunk flexor endurance test showed significant improvement in both groups, increasing by 0.6 (95% CI 0.0-1.1) in the 'walking' group and by 1.1 (95% CI 0.3-1.8) in the 'exercise' group.Conclusions:A six-week walk training programme was as effective as six weeks of specific strengthening exercises programme for the low back.

yields a "sorta" relevancy of 0.065


Hits for each search word in TITLE get 2 points, hits in DESCRIPTION get 1 point.
This number is then divided by the total number of words in TITLE + DESCRIPTION.
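For readers who'd rather see the scoring rule than read it, here's a rough Python sketch of the idea described above: 2 points per title hit, 1 point per description hit, divided by the total word count, capped at 1. The name sorta_score and the tokenizing regex are my assumptions for illustration; the PHP function may split words differently, so this won't necessarily reproduce the 0.065 above.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def sorta_score(title, description, search_text):
    """Hypothetical re-statement of the 'sorta' relevancy idea."""
    title_words = tokenize(title)
    description_words = tokenize(description)
    total_words = len(title_words) + len(description_words)
    if total_words == 0:
        return 0.0
    points = 0
    for term in tokenize(search_text):
        points += 2 * title_words.count(term)   # title hits are worth 2 points
        points += description_words.count(term)  # description hits are worth 1
    # cap at 1, mirroring the guard at the end of the PHP function
    return min(points / total_words, 1.0)

print(sorta_score("low back pain",
                  "walking and exercise for the back",
                  "back pain exercise"))
```

In that toy example there are 9 words in total and 6 points (3 for "back", 2 for "pain", 1 for "exercise"), so the score is 6 divided by 9.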

--------------


Written by nitin

August 11th, 2012 at 12:36 am

MIR newspaper article in the Boston Globe


Just passing along this cool newspaper article about music information retrieval that came through the Music-IR listserv, where a lot of the people mentioned in the article hang out.

:)

Neyfakh, Leon. (2012). When computers listen to music, what do they hear? Retrieved July 14, 2012, from http://articles.boston.com/2012-07-08/ideas/32563331_1_shazam-music-computers.

    Written by nitin

    July 14th, 2012 at 11:00 am

    awesome sauce: augmenting PubMed Central’s OAI response


    Update, 9 pm EST, May 27, 2012: Well, this is interesting. After reading this page, I see that by setting the "metadataPrefix" to "pmc_fm" I can bypass steps #3 and #4 altogether it seems – provided one's OAI harvester/indexer is set to ingest the data in that format instead of Dublin Core or provided the script below transforms the data to Dublin Core before returning it. Anyway … score one for documentation and reading it after-the-fact!

    I saw a post from a Metadata Librarian on the code4lib list about their work with placing article data from PubMed into DSpace. They are doing some metadata additions and cleanup in Excel so I emailed them off-list and let them know about PubMed2XL and we went back and forth on a few things. Among the things I learned from them was that PubMed Central has an OAI feed. Cool!

    But that OAI feed doesn't return all the data they need.

    Here's an example: http://www.pubmedcentral.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
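As a small aside, here's a hedged Python 3 sketch of how a request URL like the one above could be assembled with proper URL-encoding. The helper name build_oai_url is made up for illustration; the base URL is the one from this post, and the rule that a resumptionToken is sent without the other arguments comes from the OAI-PMH protocol. The same helper would accept the "pmc_fm" prefix mentioned in the update.

```python
from urllib.parse import urlencode

OAI_BASE = "http://www.pubmedcentral.gov/oai/oai.cgi"  # base URL as given in the post

def build_oai_url(verb, metadata_prefix=None, set_spec=None, resumption_token=None):
    """Build an OAI-PMH request URL (hypothetical helper, not part of the app)."""
    if resumption_token:
        # a resumptionToken is an exclusive argument: only the verb goes with it
        params = {"verb": verb, "resumptionToken": resumption_token}
    else:
        params = {"verb": verb, "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return OAI_BASE + "?" + urlencode(params)

dc_url = build_oai_url("ListRecords", "oai_dc", "aac")
fm_url = build_oai_url("ListRecords", "pmc_fm", "aac")
print(dc_url)
print(fm_url)
```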

    One of the additional bits of data they wanted was author affiliation which is available from PubMed.gov's XML output. Same for the MESH terms.

    Anyway, besides pushing PubMed2XL, I also mentioned that it would be interesting to make a sauce, if you will, for PubMed Central's OAI feed. In other words, rather than using the OAI link above, one would use a service on top of that a la: http://myPubMedCentralOAI_sauce.com/oai?verb=ListRecords&metadataPrefix=oai_dc&set=aac. And when one went to that URL, the service would fetch the real OAI feed from PubMed Central and then get the additional metadata from the NCBI EFetch APIs. It would then drop the additional metadata into the original OAI response and finally serve it up to the user (e.g. the OAI harvester).

    I went ahead and played with a proof-of-concept using Google App Engine and it's working although it's adding about 20 – 25 seconds to the OAI response time. BTW: it's faster when I run it from localhost and not actually live on App Engine.

    Here's how it's done.

    1. The user goes to http://localhost:8084/oai?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
    2. The app then fetches http://www.pubmedcentral.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
    3. For each record, the app parses out the PubMed Central ID and uses the EFetch API with PubMed Central as the database to get more data about the item.
    4. Unfortunately, the API for PubMed Central doesn't return MESH terms, so in step #3 the app just uses the returned data to translate the PubMed Central ID to the regular PubMed ID.
    5. With the PubMed ID now in hand, the app goes to the EFetch API and specifies PubMed as the database and hands the API the PubMed ID from step #4.
    6. Now the app gets the <Affiliation> value and the MESH terms and adds them to the real OAI response from step #2.
    7. Finally (whew!), the app returns the OAI feed with more metadata than before.
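Step #6 is the only part that doesn't need the network, so it's easy to sketch in isolation. The snippet below is a minimal Python 3 illustration using the standard library's ElementTree rather than lxml; the function name augment_record, the sample record, and the affiliation value are all made up for the example, and only the element names ("contributor.affiliation", "subject.mesh") come from the app.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

def augment_record(record, additions):
    """Append extra dc:* child elements to one oai_dc:dc element.

    `additions` maps an output element name to a list of text values.
    """
    for element_name, values in additions.items():
        for value in values:
            child = ET.SubElement(record, "{%s}%s" % (DC_NS, element_name))
            child.text = value
    return record

# a stripped-down stand-in for one record from the real OAI response
sample = ('<oai_dc:dc xmlns:oai_dc="%s" xmlns:dc="%s">'
          '<dc:title>Antifungal Peptides</dc:title>'
          '</oai_dc:dc>') % (OAI_DC_NS, DC_NS)
record = ET.fromstring(sample)

# in the real app these values would come from the EFetch response (step #5)
augment_record(record, {
    "contributor.affiliation": ["Example Research Center"],
    "subject.mesh": ["Animals", "Fungi"],
})

mesh_terms = [e.text for e in record.findall("{%s}subject.mesh" % DC_NS)]
print(mesh_terms)
```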

    This seems super klunky, so I'd love to hear about more elegant ways to do this … like having more options from PubMed Central without 3rd party hacks!

    But it is working. And it's just a proof-of-concept …

    Below, I've pasted a snippet of the augmented OAI data.

    Below that is the Python code if anyone's interested.

    ps: Python users will notice I used Google App Engine's "urlfetch" instead of "urllib" to request URLs. This is because using the latter was causing 500 errors due to timeouts. I don't think, from what I've read, that you can set the timeout with "urllib" in App Engine, so I used "urlfetch" which lets one set it up to 60 seconds.

    <!--
      This is just a test to use the NCBI EFetch APIs to augment the output of PubMed Central's OAI feed.
      In short, it's a web service that sits on top of the PubMed Central OAI API.
    
      *** DO NOT use this service to harvest OAI records from PubMed Central ... you will probably mess up your repository!
      ... and I haven't verified that the additional data being added to the OAI feed is accurate per the item.
    
      Currently, this supports the following OAI parameters:
     
       - ListRecords
       - set
       - metadataPrefix (must use "oai_dc"/Dublin Core)
       - resumptionToken
     
      Thanks, Nitin Arora (humaneguitarist.org), May 2012.
     
      ps: adding metadata increased the OAI response time by 22.6178297997 seconds.
      -->
    <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
     <responseDate>2012-05-27T13:34:17Z</responseDate>
     <request verb="ListRecords" metadataPrefix="oai_dc" set="aac">http://www.pubmedcentral.nih.gov/oai/oai.cgi</request>
     <ListRecords>
      <record>
       <header>
        <identifier>oai:pubmedcentral.nih.gov:89011</identifier>
        <datestamp>2002-09-12</datestamp>
        <setSpec>aac</setSpec>
       </header>
       <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
         <dc:title>Antifungal Peptides: Novel Therapeutic Compounds against Emerging Pathogens</dc:title>
         <dc:creator>De Lucca, Anthony J.</dc:creator>
         <dc:creator>Walsh, Thomas J.</dc:creator>
         <dc:subject>Minireviews</dc:subject>
         <dc:description/>
         <dc:publisher>American Society for Microbiology</dc:publisher>
         <dc:identifier>http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=89011</dc:identifier>
         <dc:type>Text</dc:type>
         <dc:language>en</dc:language>
         <dc:rights/>
         <dc:contributor.affiliation>Southern Regional Research Center, Agricultural Research Service, U. S. Department of Agriculture, New Orleans, Louisiana 70124, USA. adelucca@nola.srrc.usda.gov</dc:contributor.affiliation>
         <dc:subject.mesh>Animals</dc:subject.mesh>
         <dc:subject.mesh>Anti-Bacterial Agents</dc:subject.mesh>
         <dc:subject.mesh>Antifungal Agents</dc:subject.mesh>
         <dc:subject.mesh>Fungi</dc:subject.mesh>
         <dc:subject.mesh>Humans</dc:subject.mesh>
         <dc:subject.mesh>Mycoses</dc:subject.mesh>
         <dc:subject.mesh>Peptides</dc:subject.mesh>
        </oai_dc:dc>
       </metadata>
      </record>
      <resumptionToken>oai%3Apubmedcentral.nih.gov%3A89061!!!oai_dc!aac</resumptionToken>
     </ListRecords>
    </OAI-PMH>
    

    Python:

    ### pmc-oai-topper.py
    ### 2012, Nitin Arora
    
    ### import modules
    ##import urllib #DELETE
    from google.appengine.api import urlfetch #see: https://developers.google.com/appengine/docs/python/urlfetch/overview
    from lxml import etree
    import time
    import webapp2
    
    ### set what additional metadata to get from the EFetch API
    additions = [('contributor.affiliation', 'Affiliation'),
                 ('subject.mesh', 'DescriptorName')] #(name of element to output to, XPath); eventually needs to be in external config file
                #note: the XPath has to refer to elements in the EFetch XML output for the PubMed database as in "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12654674&retmode=xml"
    
    #####
    class pmctopper(webapp2.RequestHandler):
      def get(self):
    
        #GET OAI parameter values
        verb_value = self.request.get('verb')
        metadataPrefix_value = self.request.get('metadataPrefix')
        set_value = self.request.get('set')
        resumptionToken_value = self.request.get('resumptionToken')
    
        #define the *real* OAI feed URL and read it
        if resumptionToken_value: #if a resumptionToken is being used
          url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&resumptionToken=%s' %(verb_value, resumptionToken_value)
        elif set_value:
          url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&set=%s&metadataPrefix=%s' %(verb_value, set_value, metadataPrefix_value)
        else:
          url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&metadataPrefix=%s' %(verb_value, metadataPrefix_value)
    
    ##    oai_in = urllib.urlopen(url).read() #DELETE
        oai_in = urlfetch.fetch(url=url, deadline=60).content
        time_in = time.time() #tracking how long this takes
    
        #parse OAI response as XML
        oai_parsed = etree.XML(oai_in)
        root = oai_parsed.xpath('.') #root node
        dc = root[0].xpath('//oai_dc:dc',
                                namespaces={'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
                                'dc': 'http://purl.org/dc/elements/1.1/'}) #access dc:* nodes (i.e. each item)
    
        #loop through all items and for each go fetch additional metadata via the EFetch APIs for PubMed Central and PubMed
        #place that additional data into the original OAI feed
        i = 0
        for record in dc:
          identifier = record.xpath('./dc:identifier',
                                namespaces={'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
                                'dc': 'http://purl.org/dc/elements/1.1/'}) #relative path, so only this record's identifier(s) are returned
          pmc_id =(identifier[0].text).replace('http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=','') #get the article's unique ID
    
          #request PubMed ID from Pubmed Central API ... ugh!
          efetch_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=%s' %pmc_id #this is the URL to get metadata about the article per its ID
    ##      efetch_read = urllib.urlopen(efetch_url).read() #DELETE
          efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content #read the API response
          efetch_parsed = etree.XML(efetch_read) #parse as XML
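          pubmed_ids = efetch_parsed.xpath('//article-id[@pub-id-type="pmid"]/text()') #xpath() returns a list of matches
          pubmed_id = pubmed_ids[0] if pubmed_ids else '' #use the first PubMed ID (if any) so the URL below gets a plain string, not a list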
    
          #now(!) get the additional data from the PubMed API
          efetch_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%s&retmode=xml' %pubmed_id
    ##      efetch_read = urllib.urlopen(efetch_url).read() #DELETE
          efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content
          efetch_parsed = etree.XML(efetch_read)
    
          for addition in additions:
            added_element = efetch_parsed.xpath('//%s/text()' %addition[1]) #get data from API XML tree
            for added_value in added_element:
              etree.SubElement(record, '{http://purl.org/dc/elements/1.1/}%s' %addition[0]).text = added_value
    
          i = i + 1
    
        #for reporting how long this all takes
        time_out = time.time()
        time_diff = str(time_out - time_in)
        
        #output the *new* OAI results with the additional metadata
        self.response.headers['Content-Type'] = 'text/xml' #output as XML doc
        disclaimer= '''<!--
        This is just a test to use the NCBI EFetch APIs to augment the output of PubMed Central's OAI feed.
        In short, it's a web service that sits on top of the PubMed Central OAI API.
    
        *** DO NOT use this service to harvest OAI records from PubMed Central ... you will probably mess up your repository!
        ... and I haven't verified that the additional data being added to the OAI feed is accurate per the item.
    
        Currently, this supports the following OAI parameters:
        
          - ListRecords
          - set
          - metadataPrefix (must use "oai_dc"/Dublin Core)
          - resumptionToken
        
        Thanks, Nitin Arora (humaneguitarist.org), May 2012.
        
        ps: adding metadata increased the OAI response time by %s seconds.
        -->''' %time_diff
        self.response.out.write(disclaimer)
        for node in root:
          self.response.out.write(etree.tostring(node))
    
    ### app engine stuff ...
    app = webapp2.WSGIApplication([('/oai', pmctopper)],
                                  debug=True)
    --------------


    Written by nitin

    May 27th, 2012 at 10:11 am

    museline: trying to add support for compressed MusicXML

    4 comments

    Just a quick follow up to the last post about using Google Chart Tools to outline melodic contours from MusicXML files …

    I wanted to add support for compressed MusicXML files in addition to the non-compressed ones. So far, the code I've got seems to be working with the two or three compressed MusicXML files from Wikifonia I tested.
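For anyone who wants the gist without wading through the full app, here's a minimal Python 3 sketch of the heuristic: treat the compressed file as a zip archive and take the .xml file that isn't inside a folder (so not META-INF/container.xml). The function name read_mxl and the tiny in-memory archive are made up for the example. Note that this returns the first match, and that the more robust route is probably to read the rootfile path from META-INF/container.xml rather than guessing.

```python
import io
import zipfile

def read_mxl(mxl_bytes):
    """Return the bytes of the top-level .xml file in a compressed MusicXML archive."""
    with zipfile.ZipFile(io.BytesIO(mxl_bytes)) as archive:
        for name in archive.namelist():
            # skip anything in a subfolder, e.g. META-INF/container.xml
            if name.endswith(".xml") and "/" not in name:
                return archive.read(name)
    raise ValueError("no top-level .xml file found in archive")

# build a tiny stand-in archive in memory so the sketch runs without a download
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("META-INF/container.xml", "<container/>")
    demo.writestr("musicxml.xml", "<score-partwise/>")

score = read_mxl(buffer.getvalue())
print(score)
```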

    Here's a screenshot below of A-Ha's "Take On Me", one of the best songs from the 80's with one of the absolute best videos, too! To make the graph I passed it to the app a la "http://localhost:8083/?mxml=http://static.wikifonia.org/1934/musicxml.mxl".

    museline_aha_screenshot.png

    Here's the video:

    Keep in mind the contour script doesn't take repeats into account and that the entire melody repeats three times in the song.

    Also, I don't like to make code downloadable if I'm still working on it because I don't want to junk up my web directory, but I'll paste everything essential below: the Google App Engine YAML file, the Python code, and the Jinja/HTML template.

    YAML:

    application: museline
    version: 1
    runtime: python27
    api_version: 1
    threadsafe: true
    
    handlers:
    - url: /stylesheets
      static_dir: stylesheets
    - url: /.*
      script: museline.app
     
    libraries:
    - name: jinja2
      version: latest
    - name: lxml
      version: latest
    

    Python:

    ### museline.py
    ### 2012, Nitin Arora
    
    ### import modules
    import urllib
    from lxml import etree
    import math
    import re
    import webapp2
    import jinja2
    import os
         
    jinja_environment = jinja2.Environment(
      loader=jinja2.FileSystemLoader(os.path.dirname(__file__)))
      
    #####
    class museline(webapp2.RequestHandler):
      def get(self):
        
        ### read MusicXML file
        try:
          url = self.request.get('mxml')
    ##      url = 'http://blog.humaneguitarist.org/uploads/i_heart_thee.xml' #test line
          if url[-4:] == '.xml': # uncompressed MusicXML
            readUrl = urllib.urlopen(url).read()
            
          else: # compressed MusicXML
          ### References:
            # http://stackoverflow.com/a/8858735
            # http://stackoverflow.com/questions/1313845/if-i-have-the-contents-of-a-zipfile-in-a-python-string-can-i-decompress-it-with
            from cStringIO import StringIO
            compressed = urllib.urlopen(url)
            compressedString = StringIO(compressed.read())
            import zipfile
            zipped = zipfile.ZipFile(compressedString, "r")
    
            archiveFiles = zipped.namelist()
    ##        self.response.out.write(archiveFiles) # test line
            for archiveFile in archiveFiles:
              if archiveFile[-4:] == ".xml" and "/" not in archiveFile:
                realXML = archiveFile
            extracted = zipped.open(realXML,'r')
            readUrl = extracted.read()
    
    ##      self.response.out.write(readUrl) # test line
                    
        except:
          errorMessage = '''<pre>
    You must pass an "mxml" parameter.
    If you have but still see this message, then there is a problem accessing/reading the MusicXML file.
    </pre>'''
          self.response.out.write(errorMessage)
          return
    
        ### setup pitch values
        notes = ['C','D','E','F','G','A','B']
        i = 0
        noteVals = {}
        for note in notes:
          if note == 'C' or note == 'F':
            noteVals[note] = i + 1
            i = i + 1
          else:
            noteVals[note] = i + 2
            i = i + 2
    
        ### parse MusicXML file
        parsed = etree.XML(readUrl)
    
        ### get basic descriptive metadata
        metadata = []
        elementList = ['work-title',
                       'work-number',
                       'movement-number',
                       'movement-title',
                       'creator[@type="composer"]',
                       'creator[@type="lyricist"]']
        for element in elementList:
          xpath = str(".//%s") %element
          if parsed.find(xpath) !=None:
            found = parsed.find(xpath).text
            att = re.match(r'(.*)type="(.*)\"', element)
            if att:
              element = att.group(2)
            if found:
              metadata.append((element,found))
    ##    self.response.out.write(metadata) # test line
    
        ### access part one tree                       
        part = parsed.find('.//part[@id="P1"]')
        pitches = part.findall('.//pitch')
    ##    self.response.out.write(str(len(pitches)) + " pitches.\n") # test line, number of notes (non-rests)
    ##    self.response.out.write(str(len(pitches)*.618) + " Golden Ratio.\n") # test line, maybe something for the future.
    
        ### put pitch values in a list
        pitchList = []
        i = 1
        for pitch in pitches:
          if pitch.find('.//alter') != None:
            alter = int(pitch.find('.//alter').text)
          else:
            alter = 0
          step = pitch.find('.//step')
          octave = int(pitch.find('.//octave').text)
          pitchPos = str('pitch: ' + str(i))
          pitchClassVal = ((int(noteVals[step.text]) + alter)) * .01
          pitchVal = ((int(noteVals[step.text]) + alter) + (octave * 12)) * .01
          label = (pitchPos, pitchVal, pitchClassVal)
          pitchList.append(label)
          i = i + 1
    
    ##    for pitch in pitchList: # test block
    ##      self.response.out.write(str(pitch)+'<br>')
          
        #data for the Jinja template  
        template_values = {
          'pitchList': pitchList,
          'url': url,
          'metadata': metadata}
    
        template = jinja_environment.get_template('museline.html')
        self.response.out.write(template.render(template_values)) #write data to the html template
      
    app = webapp2.WSGIApplication([('/', museline)],
                                  debug=True)
    

    Template:

    <!DOCTYPE HTML>
    <!-- museline.html -->
    <html>
      <head>
        <title>
          museline
        </title>
        <link type="text/css" rel="stylesheet" href="/stylesheets/style.css" />
        <script type="text/javascript" src="http://www.google.com/jsapi"></script>
        <script type="text/javascript">
          google.load('visualization', '1', {packages: ['corechart']});
        </script>
        <script type="text/javascript">
          function drawVisualization() {
            // Create and populate the data table.
            var data = google.visualization.arrayToDataTable([
            ['pitch position', 'melodic contour'],
            {% for pitch in pitchList %}
              ['{{ pitch[0] }}', {{ pitch[1] }}],
            {% endfor %}
            ]);
           
            // Create and draw the visualization.
            new google.visualization.LineChart(document.getElementById('visualization')).
            draw(data, {curveType: "function",
              width: 800, height: 400,
            vAxis: {maxValue: 1}}
            );
          }
          google.setOnLoadCallback(drawVisualization);
        </script>
      </head>
      <body>
        <div id="visualization"></div>
        <p>Metadata:</p>
        <ul>
        {% for metadatum in metadata %}
          <li>{{ metadatum[0] }} : {{ metadatum[1] }}</li>
        {% endfor %}
          <li>URL: <a href="{{ url }}">{{ url }}</a></li>
        </ul>
      </body>
    </html>
    
    --------------


    Written by nitin

    May 5th, 2012 at 5:36 pm

    museline: charting melodic contours via web service


    In the last post, I mentioned I was playing with Google App Engine and Google Chart Tools.

    Last night, with some silly movie streaming in the background, I was in bed tinkering with a little idea that I'm sure has been done a-thousand times already and that may be built into high end music notation applications. But it hasn't been done by anyone as stoopid as me!

    :P

    What I did was whip up a little App Engine/Python app where one can pass it a partwise MusicXML file and it will use Google Chart Tools to create a little line chart of the melodic contour of the first <part> element.

    Here's a screenshot below of the results using the MusicXML sample file available on the MakeMusic site of Schumann's "Im wunderschönen Monat Mai" from the Dichterliebe. The app has an "mxml" parameter that tells it which MusicXML file to use a la "http://localhost:8083/?mxml=http://downloads2.makemusic.com/musicxml/Dichterliebe01.xml".

     

    I've embedded a really nice performance on YouTube if anyone wants to follow along. The contour graph represents the vocal part only.

     

    Now, this is just a start. There's a lot of work to do if I pursue this. For starters, I'd like to make the chart synced with an audio/video recording. I don't know if I can do that with Chart Tools, but probably with the <canvas> element if nothing else. Also, I haven't tried this yet with any non-homophonic parts. Anyway, it's a start and it's kinda fun.

    I tried to add another line for the actual pitch class contour but it wasn't as interesting to look at as the melodic contour so I disabled that "feature". By pitch class, I mean I was using octave equivalency so that all "C" notes, for example, were plotted at the exact same vertical position as opposed to the screenshot above where two "C" notes an octave apart would have different vertical points on the graph to depict the intervallic difference.

    As far as plotting the notes, I ignored rests and durations. I just plotted the pitches as below, starting with "C" with a value of "1" and with the "B" a seventh up from that "C" receiving a "12".

    • C : 1
    • D : 3
    • E : 5
    • F : 6
    • G : 8
    • A : 10
    • B : 12

    This way a "C-sharp" and "D-flat" receive a score of "2", for example, because they lie between "C as 1" and "D as 3".

    In MusicXML, the <step> element has the note name and the optional <alter> element, which is a number, tells you if it's sharp or flat, etc. The numerical <octave> element tells you what octave range the pitch is in.

    So what I'm doing is pulling out the <step> value and converting it to a number as above, adding the <alter> value (a flat is a negative number), and then adding that sum to 12 times the <octave> value. Then, I multiply the value by ".01" just to reduce the number because I want the graph's vertical limit to be a small number; this scaling shouldn't change the contour itself.
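Here's the arithmetic as a small, self-contained Python sketch. The note values are the ones listed above; the function names are made up for illustration and the ".01" scaling is exposed as a parameter only to make the example easier to read.

```python
# step values from the list above: C=1 up to B=12
NOTE_VALUES = {"C": 1, "D": 3, "E": 5, "F": 6, "G": 8, "A": 10, "B": 12}

def pitch_class_value(step, alter=0):
    """Value of a note ignoring its octave; alter is +1 for a sharp, -1 for a flat."""
    return NOTE_VALUES[step] + alter

def pitch_value(step, octave, alter=0, scale=0.01):
    """Value used for the melodic contour: step + alter + 12 per octave, then scaled."""
    return (pitch_class_value(step, alter) + 12 * octave) * scale

# C-sharp and D-flat land on the same value, between C (1) and D (3)
print(pitch_class_value("C", 1), pitch_class_value("D", -1))

# two C notes an octave apart differ by 12 steps before scaling
print(pitch_value("C", 4), pitch_value("C", 5))
```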

    Last, I'm trying to pull some basic descriptive metadata if they are present in the MusicXML file and show it below the graph.

    Maybe I'll do more with this later. Just goofin' for now.

    --------------


    Written by nitin

    May 3rd, 2012 at 3:55 pm

    North Carolina grants, Google App Engine, and pie … mmm.


    I took April off from blogging after realizing I was over blogging, as opposed to over logging.

    I'll keep this short. Well, I'll try.

    I'm shacked up in the apartment due to some unexpected circumstances and yesterday I decided to try and be a little productive and learn something I could potentially use in the workplace.

    I learned a little about Google App Engine. I was drawn to it because of the Python support and because it gives me a free environment where I can deploy Python apps using the ever-elusive lxml library.

    While I wrote some silly stuff using lxml and data available from the Business.gov API, I ended up uploading a simple app – if you can call it that – that parses a CSV file from North Carolina's (USA) NCOpenBook.

    I didn't use the csv module because the CSV file I used has like three lines at the top that aren't headers (people: don't do that!). I don't know if there's a way to handle that with the csv module (there probably is) but I wasn't interested in digging around. Instead, I used a modified version of this code I wrote previously.
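For what it's worth, one way the csv module could probably cope with junk lines at the top is to consume them before handing the file to the reader. This is a Python 3 sketch under assumptions: the sample data, the column names "Name" and "Total", and the helper names are invented for the example, and I haven't run it against the real NCOpenBook file.

```python
import csv
import io
from itertools import islice

def read_grants(text, skip=3, limit=10):
    """Skip `skip` leading non-header lines, then return up to `limit` rows as dicts."""
    lines = io.StringIO(text)
    for _ in range(skip):
        next(lines, None)            # throw away the junk lines above the header
    reader = csv.DictReader(lines)   # the next line is now treated as the header
    return list(islice(reader, limit))

def to_number(amount):
    """Turn a string like '$1,500.00' into a float."""
    return float(amount.replace("$", "").replace(",", ""))

# made-up stand-in for the real file: three junk lines, a header, two rows
sample = (
    "Some report title\n"
    "Generated on some date\n"
    "\n"
    '"Name","Total"\n'
    '"Org A","$1,500.00"\n'
    '"Org B","$250.00"\n'
)

rows = read_grants(sample)
totals = [(row["Name"], to_number(row["Total"])) for row in rows]
print(totals)
```

A side benefit is that the csv module handles the quoting, so there's no need to split on '","' and strip stray quote characters afterwards.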

    The CSV file lists grantees who've received funding by North Carolina and the app pulls out the top ten since 2007 based on cumulative grant totals. The app uses Google Chart Tools to make a pie chart of the top ten recipients. I'm not so sure about the colors in the pie chart – it's hard to see the difference between some of the colors associated with each grantee – but it's a simple start.

    Here's a screenshot:

    Top Ten NC Grants by Grantee

    ... and here's the link to the app online: http://top-ten-nc-totals-by-grantee.appspot.com.

    I've also pasted the app.yaml file, my Python code, and the Jinja/HTML template below if anyone's interested.

    YAML:

    application: top-ten-nc-totals-by-grantee
    version: 1
    runtime: python27
    api_version: 1
    threadsafe: true
    
    handlers:
    - url: /stylesheets
      static_dir: stylesheets
    - url: /.*
      script: nctotals.app
     
    libraries:
    - name: jinja2
      version: latest

    Python:

    #import modules
    import urllib
    import webapp2
    
    import jinja2
    import os
         
    jinja_environment = jinja2.Environment(
      loader=jinja2.FileSystemLoader(os.path.dirname(__file__)))
    
    #####
    
    #see: http://stackoverflow.com/a/2827664
    class Object(object):
      pass
    
    #my CSV parser
    def csv2dict(fileName, delimiter):
      f = urllib.urlopen(fileName) #open file
      lines = f.read() #read file
    
      rows = lines.split("\n") #put lines in list
    
      #cut out the three non-header rows at top of this particular CSV file
      rows = rows[3:]
    
      #keep only the header row plus the first 10 data rows (there were too many damn rows in the CSV file!)
      rows = rows[:11]
    
      headers = rows[0].split(delimiter) #put header titles in list
      rows.pop(0) #remove header from "rows" list
    
      i = 0
      worksheet = {}
      for header in headers: #for each header, i.e. each column
        columnCells = []
        #print header #test line
        for row in rows: #for each non-header row in delimited file
          if row != "": #!!!you need to also add a test for lines that don't split on the delimiter (i.e. notes)
            rowCells = row.split(delimiter) #get cells in row
            columnCells.append(rowCells[i].strip()) #put column's cells in list
        worksheet[header] = columnCells #set header as KEY and set "columnCells" list as VALUE
        i = i + 1
     
      return worksheet
    
    #####
    
    class MainPage(webapp2.RequestHandler):
      def get(self):
        parsed = csv2dict("http://data.osbm.state.nc.us/openbook/comma_grant_cumulative_awards_and_annual_disbursements_by_grantee.csv", '","') #pass filename and delimiter
        
        topTen = range(0,len(parsed['"Non-Profit Name (*)'])) #i.e. range is 1 to 10, or 0 to 9 depending on your p.o.v.
    
        for i in topTen: #add attributes to each of the ten agencies in the CSV file
          topTen[i] = Object()
          topTen[i].name = parsed['"Non-Profit Name (*)'][i].replace('"','')
          topTen[i].total = parsed['Cumulative Total Award'][i]
          raw_total = parsed['Cumulative Total Award'][i]
          raw_total = raw_total.replace('$','')
          raw_total = raw_total.replace(',','')
          topTen[i].raw_total = raw_total
          
        #data for the Jinja template  
        template_values = {
          'topTen': topTen}
    
        template = jinja_environment.get_template('index.html')
        self.response.out.write(template.render(template_values)) #write data to the index.html template
      
    app = webapp2.WSGIApplication([('/', MainPage)],debug=True)
    

    Template:

    <!DOCTYPE HTML>
    <html>
      <head>
        <title>
          Top Ten NC Grants by Grantee (since 2007)
        </title>
        <link type="text/css" rel="stylesheet" href="/stylesheets/style.css" />
        <script type="text/javascript" src="http://www.google.com/jsapi"></script>
        <script type="text/javascript">
          google.load('visualization', '1', {packages: ['imagepiechart']});
        </script>
        <script type="text/javascript">
          function drawVisualization() {
            // Create and populate the data table.
            var data = new google.visualization.DataTable();
            data.addColumn('string', 'name');
            data.addColumn('number', 'raw_total');
            data.addRows([
              {% for topper in topTen %}
              ["{{ topper.name }} - {{ topper.total }}", {{ topper.raw_total }}]{% if not loop.last %},{% endif %}
              {% endfor %}
            ]);
        
            // Create and draw the visualization.
            new google.visualization.ImagePieChart(document.getElementById('visualization')).
              draw(data, null);
          }
          google.setOnLoadCallback(drawVisualization);
        </script>
      </head>
      <body>
        <h3>Top Ten <a href="http://www.ncopenbook.gov/NCOpenBook/GrantsHome.jsp">NC Grants</a> by Grantee (cumulative totals since 2007)</h3>
        <p>see the source CSV file <a href="http://data.osbm.state.nc.us/openbook/comma_grant_cumulative_awards_and_annual_disbursements_by_grantee.csv">here</a></p>
        <div id="visualization"></div>
        <p>Made with:</p>
        <ul>
          <li><a href="https://developers.google.com/appengine/docs/python/gettingstartedpython27/">Google App Engine (Python 2.7)</a></li>
          <li><a href="https://developers.google.com/chart/">Google Chart Tools</a></li>
        </ul>
        <p>More info (blog post):</p>
        <ul>
          <li><a href="http://blog.humaneguitarist.org/2012/05/01/north-carolina-grants-google-app-engine-and-pie-mmm/">North Carolina grants, Google App Engine, and pie ... mmm.</a></li>
        </ul>
      </body>
    </html>
    --------------

    Written by nitin

    May 1st, 2012 at 10:42 am

    Full Metal Alchemyapi.com or “more term extraction crap and linky data crud”

    As I mentioned before, I'm playing with the idea of using term generating APIs to build facets in a Solr index project that I'm working on with some people.

    The results seem really promising.

    If I weren't in need of a nap before some more college basketball gets underway, I'd say more than I'm about to.

    Instead, I'm going to do three quick things here:

    1. Provide a screenshot of the index UI using Calais "social tags" for facets.
      1. This is a local (my computer) copy of the index using a very small set of item metadata. That's to say we currently have about 37k items in the index, and I'm just using about 1k.
      2. I'm only using Calais tags if the "importance" attribute is equal to "1", so I'm leaving out tags Calais considers less relevant because, well, some of the terms generated were making me think "WTF?".
      3. Some of the terms with underscores like "War_Conflict" appear to be those used in the news industry and are potentially ones to throw out.
    2. Post a small Python script to make a call to Alchemyapi.com, which is similar to, and possibly better than, Calais.
    3. Post the Alchemyapi.com results XML document and talk a little about what I think it can be used for in our project.
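
    The two Calais filtering rules from item 1 (keep only "importance" equal to "1", and drop underscored terms) can be sketched roughly like this. It assumes the tags have already been pulled out of the Calais response into simple dicts with "name" and "importance" keys. That shape, the function name, and the sample values are all made up for illustration, not real Calais output.

```python
def keep_tag(tag):
    """Keep a tag only if Calais marked it most important and it has no underscore."""
    is_top = str(tag.get('importance')) == '1'
    has_underscore = '_' in tag.get('name', '')
    return is_top and not has_underscore

# made-up sample tags, only to exercise the filter
sample_tags = [
    {'name': 'Episcopal Church', 'importance': '1'},
    {'name': 'War_Conflict', 'importance': '1'},
    {'name': 'Postcard', 'importance': '2'},
]

facets = [tag['name'] for tag in sample_tags if keep_tag(tag)]
print(facets)
```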

    So, here's the Calais screenshot (you'll need to view the image at full-resolution to read it):

    Calais Facets

    Here's the Python script to call the Alchemyapi.com API:

    import urllib, urllib2
    
    #set API url and API key
    url = 'http://access.alchemyapi.com/calls/text/TextGetRankedConcepts'
    apikey = '' #your API key goes here
    #get Alchemy API key from: http://www.alchemyapi.com/api/register.html
    
    #set some text for the API
    text = '''
    Episcopal churches
    Churches Cemeteries
    Tombs and sepulchral monuments
    Postcards--North Carolina.
    Flat Rock (N.C.)
    Henderson County (N.C.)
    '''
    
    #send data to API
    params = urllib.urlencode({
      'apikey': apikey,
      'text': text,
      'showSourceText': '1', #shows the original text sent to the API
    })
    alchemyThis = urllib2.urlopen(url, params).read()
    
    #view results
    print alchemyThis
    

    And here's the output for the code above:

    <?xml version="1.0" encoding="UTF-8"?>
    <results>
      <status>OK</status>
      <usage>By accessing AlchemyAPI or using information generated by AlchemyAPI, you are agreeing to be bound by the AlchemyAPI Terms of Use: http://www.alchemyapi.com/company/terms.html</usage>
      <url/>
      <language>english</language>
      <text>Episcopal churches Churches Cemeteries Tombs and sepulchral monuments Postcards--North Carolina. Flat Rock (N.C.) Henderson County (N.C.)</text>
      <concepts>
        <concept>
          <text>North Carolina</text>
          <relevance>0.920839</relevance>
          <website>http://www.nc.gov</website>
          <dbpedia>http://dbpedia.org/resource/North_Carolina</dbpedia>
          <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000002b62d</freebase>
          <opencyc>http://sw.opencyc.org/concept/Mx4rvViyspwpEbGdrcN5Y29ycA</opencyc>
          <yago>http://mpii.de/yago/resource/North_Carolina</yago>
          <geonames>http://sws.geonames.org/4482348/</geonames>
        </concept>
        <concept>
          <text>Tomb</text>
          <relevance>0.837256</relevance>
          <geo>29.855 31.219</geo>
          <dbpedia>http://dbpedia.org/resource/Tomb</dbpedia>
          <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000007ff03</freebase>
          <opencyc>http://sw.opencyc.org/concept/Mx4rwQw2p5wpEbGdrcN5Y29ycA</opencyc>
        </concept>
        <concept>
          <text>Burial monuments and structures</text>
          <relevance>0.773605</relevance>
          <dbpedia>http://dbpedia.org/resource/Burial_monuments_and_structures</dbpedia>
        </concept>
        <concept>
          <text>Flat Rock, Henderson County, North Carolina</text>
          <relevance>0.718415</relevance>
          <geo>35.266666666666666 -82.45333333333333</geo>
          <website>http://villageofflatrock.org/</website>
          <dbpedia>http://dbpedia.org/resource/Flat_Rock,_Henderson_County,_North_Carolina</dbpedia>
          <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000ebc28</freebase>
          <yago>http://mpii.de/yago/resource/Flat_Rock,_Henderson_County,_North_Carolina</yago>
        </concept>
        <concept>
          <text>Henderson County, North Carolina</text>
          <relevance>0.615825</relevance>
          <geo>35.34 -82.48</geo>
          <website>http://www.hendersoncountync.org</website>
          <dbpedia>http://dbpedia.org/resource/Henderson_County,_North_Carolina</dbpedia>
          <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000a10b4</freebase>
          <yago>http://mpii.de/yago/resource/Henderson_County,_North_Carolina</yago>
        </concept>
        <concept>
          <text>Asheville, North Carolina</text>
          <relevance>0.610351</relevance>
          <website>http://www.ashevillenc.gov/</website>
          <dbpedia>http://dbpedia.org/resource/Asheville,_North_Carolina</dbpedia>
          <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000eb2ac</freebase>
          <census>http://www.rdfabout.com/rdf/usgov/geo/us/nc/counties/buncombe_county/asheville</census>
          <yago>http://mpii.de/yago/resource/Asheville,_North_Carolina</yago>
          <geonames>http://sws.geonames.org/4453066/</geonames>
        </concept>
        <concept>
          <text>Episcopal Church in the United States of America</text>
          <relevance>0.610029</relevance>
          <dbpedia>http://dbpedia.org/resource/Episcopal_Church_in_the_United_States_of_America</dbpedia>
          <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000015f1b</freebase>
          <yago>http://mpii.de/yago/resource/Episcopal_Church_in_the_United_States_of_America</yago>
        </concept>
        <concept>
          <text>New York</text>
          <relevance>0.592008</relevance>
          <geo>43.0 -75.0</geo>
          <website>http://www.ny.gov</website>
          <dbpedia>http://dbpedia.org/resource/New_York</dbpedia>
          <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000054dd5d</freebase>
          <opencyc>http://sw.opencyc.org/concept/Mx4rvViNs5wpEbGdrcN5Y29ycA</opencyc>
          <census>http://www.rdfabout.com/rdf/usgov/geo/us/ny</census>
          <yago>http://mpii.de/yago/resource/New_York</yago>
        </concept>
      </concepts>
    </results>
    

    As you can see, "New York" shows up, but its relevance is 0.592008, just under 60%, so maybe 0.6 is a threshold to consider when indexing automated subject terms with Alchemyapi. That's just my theory, and only lots of testing will help determine what the threshold really is, if there's one at all.
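
    A minimal sketch of that threshold idea, assuming the results XML has the same layout as the output above. The function name concepts_above and the 0.6 default are my own placeholders, and the sample string is a trimmed copy of three concepts from that output.

```python
import xml.etree.ElementTree as ET

def concepts_above(xml_text, threshold=0.6):
    """Return (name, relevance) pairs for concepts at or above the threshold."""
    root = ET.fromstring(xml_text)
    kept = []
    for concept in root.findall('concepts/concept'):
        name = concept.findtext('text')
        # treat a missing relevance element as 0 so it never passes
        relevance = float(concept.findtext('relevance', '0'))
        if relevance >= threshold:
            kept.append((name, relevance))
    return kept

# trimmed copy of the results document shown above
sample_xml = '''<results>
  <concepts>
    <concept><text>North Carolina</text><relevance>0.920839</relevance></concept>
    <concept><text>Episcopal Church in the United States of America</text><relevance>0.610029</relevance></concept>
    <concept><text>New York</text><relevance>0.592008</relevance></concept>
  </concepts>
</results>'''

for name, relevance in concepts_above(sample_xml):
    print(name)
```

    With the full response, the string returned by the API call (alchemyThis in the script above) would be passed in place of sample_xml.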

    As you can also see, there's a lot of potential for linked data with this output: links to relevant dbpedia pages, etc. One neat thing would be to make it so that when the user hovers over a facet, the UI pops up more information from these linked data sources: relevant websites, geo-coords mapped with the Google Maps API, definitions of the faceted term, similar concept visualizations, etc.
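
    As a rough sketch of the data side of that hover idea, the links and coordinates for each concept could be gathered into a lookup keyed by facet name. The function name concept_links, the LINK_TAGS tuple, and the dict layout are assumptions of mine. The element names come from the output above, and the sample is a trimmed copy of one concept from it.

```python
import xml.etree.ElementTree as ET

# link-bearing element names seen in the results document above
LINK_TAGS = ('website', 'dbpedia', 'freebase', 'opencyc', 'yago', 'geonames', 'census')

def concept_links(xml_text):
    """Map each concept name to its linked-data URLs and optional (lat, lon)."""
    info = {}
    for concept in ET.fromstring(xml_text).findall('concepts/concept'):
        entry = {'links': {}, 'geo': None}
        for child in concept:
            if child.tag in LINK_TAGS:
                entry['links'][child.tag] = child.text
            elif child.tag == 'geo':
                # geo is "latitude longitude" separated by a space
                lat, lon = child.text.split()
                entry['geo'] = (float(lat), float(lon))
        info[concept.findtext('text')] = entry
    return info

# trimmed copy of one concept from the output above
hover_xml = '''<results>
  <concepts>
    <concept>
      <text>Henderson County, North Carolina</text>
      <relevance>0.615825</relevance>
      <geo>35.34 -82.48</geo>
      <website>http://www.hendersoncountync.org</website>
      <dbpedia>http://dbpedia.org/resource/Henderson_County,_North_Carolina</dbpedia>
    </concept>
  </concepts>
</results>'''

hover_data = concept_links(hover_xml)
print(hover_data['Henderson County, North Carolina']['geo'])
```

    A UI could then serialize an entry to JSON for a facet and use the geo pair for a map marker, but that part is left out here.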

    That's all. Sleepy time and B-ball starts soon …

    --------------

    Written by nitin

    March 25th, 2012 at 4:57 pm
