blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘APIs’ tag

keyword vs. phrase searching of the Soundboard, a GFA publication

As I mentioned before, last summer I went to the Guitar Foundation of America convention in Charleston.

I also mentioned that I'd asked some questions about whether the GFA journal, "Soundboard" was full-text indexed.

Via the FlippingBook software the GFA uses to display current issues online (membership required), there is full-text searching capability because the content is indexed as far as I can tell. But as I was saying, I don't think one can search across *all* online Soundboards simultaneously – i.e. fire off one query and get results across all online Soundboards. I could be wrong about that.

In contrast, the PDF back issues sold on a DVD-ROM are not full-text indexed nor full-text searchable with Adobe Acrobat Reader as far as I can tell. And I think this is where there's real confusion – perhaps on my part – about what we mean when we use terms like "keyword" searching.

To me, keyword searching means full-text and not a "find" (as in Acrobat Reader). The Webopedia site differentiates these as "keyword" and "phrase" searches, respectively. The GFA is using a different meaning, per the "How to search Soundboard back issues.pdf" file that comes with the DVD, for "keyword" searching:

"These issues have been processed both to reproduce the page-by-page appearance of the originals on your computer screen, and to apply an "optical character recognition" (OCR) process to the text, so that every page of every issue is now keyword searchable."

In my experience, however, the search provided internally via Adobe Acrobat Reader (and Foxit Reader, too) is what I'd just call a "find" (i.e. the same as Ctrl-F on your browser). In fact, in my version of Acrobat Reader and per the screenshot in the "How to search Soundboard back issues.pdf" file, Adobe also uses the phrase "find" and not "search" in their application. Their "Advanced Search" adds options really dealing with what to search (comments, all files in a folder, etc.) but not really how to search (in the algorithmic sense) – so, it's still a "find", though more feature-rich. Now, if you have Acrobat Pro (admittedly I do through work) you apparently can create an index and then actually do a full-text search, but that doesn't help people who don't have the pro version and won't/can't buy it.

Granted, I can index the PDF with my operating system (Windows) and do a full-text search, but I don't really get much useful information other than what files match. I don't get useful information on where the passage exists (page number, etc).

Consider the following passage from Soundboard Volume 1, Number 1, 1974:

"Mr. Llois Mauerhofer, Elizabethstrasse 93, 8010 Graz, Lustria, was reported working on a doctoral dissertation at the University of Graz on Leonard von Call, early 19th c. guitarist active in Vienna who is best remembered for his serenades for guitar and strings."

A "find" won't match that passage if you search for "Graz University" or "University Graz" or "strings Vienna" but a real keyword search likely would.
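
To make the distinction concrete, here's a minimal Python 3 sketch. It isn't the code behind my demo, and the function names phrase_find and keyword_search are made up for illustration. It treats a "find" as an exact, contiguous string match and a keyword search as "every query word appears somewhere, in any order". A real full-text engine does more than this (stemming, ranking, etc.).

```python
import re

# The passage quoted above, copied as-is.
PASSAGE = ("Mr. Llois Mauerhofer, Elizabethstrasse 93, 8010 Graz, Lustria, "
           "was reported working on a doctoral dissertation at the University "
           "of Graz on Leonard von Call, early 19th c. guitarist active in "
           "Vienna who is best remembered for his serenades for guitar and "
           "strings.")


def phrase_find(text, query):
    """A "find": the query must appear as one contiguous string."""
    return query.lower() in text.lower()


def keyword_search(text, query):
    """A keyword search: every query word must appear somewhere, in any order."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(term in words for term in re.findall(r"\w+", query.lower()))


for query in ("Graz University", "University Graz", "strings Vienna"):
    print(query, "| find:", phrase_find(PASSAGE, query),
          "| keyword:", keyword_search(PASSAGE, query))
```

For all three queries the "find" comes up empty while the keyword search matches.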

Of course, a demonstration is in order. I used a tool called Apache Tika to extract the text from the aforementioned PDF scan of Soundboard v.1, #1, 1974; a little Python script I wrote to output the data to a database-friendly file; and an online database. With those, I indexed the data and made a little API – all that means is that there's a page you can go to, throw some search terms at it, and get the results back as structured data (um, usually not fun to read through).

By the way, I normally use more technical jargon in my posts but I have some guitarist buddies who I want to read this page.

Anyway, here are the three searches mentioned above that don't yield results in Acrobat Reader but do using a full-text search (you can see the search terms in bold in the links below). Don't worry if you can't read the output, just focus on the fact that something comes back (provided my database isn't down at the moment!).

http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=Graz+University
http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=University+Graz
http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=strings+Vienna
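
If you'd rather build those links in code, here's a small Python 3 sketch. The only thing it assumes is what the links above already show: the search page takes the terms in a "q" parameter. The helper name search_url is mine.

```python
from urllib.parse import urlencode

BASE_URL = "http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/"


def search_url(terms):
    """Return the search link for a string of space-separated terms."""
    # urlencode() escapes the terms and turns spaces into "+".
    return BASE_URL + "?" + urlencode({"q": terms})


for terms in ("Graz University", "University Graz", "strings Vienna"):
    print(search_url(terms))
```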

For a more user-friendly version, try going here:

http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/soundboard_search.html

Try typing in the three searches mentioned above. Then try some more searches for fun. For simplicity's sake, I hard-coded the system to never return more than 10 results.

Of course, this should all scale to indexing the text of all the PDFs on the DVD, but exposing those openly on the web wouldn't be appropriate.

But my point with this demo is to say that this is more like what I meant by "keyword" searching at the GFA convention. There's probably a way to ingest the old PDFs into the FlippingBook software or at least something else like the Internet Archive book reader. That would probably require re-OCRing the images so that the coordinates of the words could be indexed as well, allowing one to see where on a page the results are, just as with the current issues via FlippingBook.

Ok, if you're still here and are a geek, here's the Python script, "soundboardToTabDelimited.py".

'''
usage example:
  $ python soundboardToTabDelimited.py V01-n1-1974.pdf

This yields "V01-n1-1974.xhtml" and then "V01-n1-1974.txt"
 
Note: you must have the lxml module installed (which isn't always fun).
You can get it here: http://lxml.de/
'''

import codecs, subprocess, sys
from lxml import etree

##### globals
tab = "\t"
br = "\n"


##### run Apache Tika on the file passed via the command line
soundboard = sys.argv[1].replace(".pdf", "")
command_string = "java -jar tika-app-1.2.jar %s > %s" %(soundboard + ".pdf", soundboard + ".xhtml")
command = subprocess.Popen(command_string, shell=True) #no PIPEs: the shell redirect already captures the output, and unread pipes can fill up and stall wait().
command.wait() #wait until the subprocess finishes.


##### write file headers (this needs to be deleted if you're going to later import the file via PHPMyAdmin).
tab_delimited = codecs.open(soundboard + ".txt", "w", "utf-8") #output file

tab_delimited.write("journal_id" + tab + "volume" + tab + \
                    "issue" + tab + "year" + tab + \
                    "page_id" + tab + "text_id" + tab + "text" + br)


##### extract volume, issue, year from filename
volume = int(soundboard.split("-")[0].replace("V", ""))
issue = int(soundboard.split("-")[1].replace("n", ""))
year = int(soundboard.split("-")[2])
journal_id = "%04d_%04d_%04d" %(volume, issue, year)


##### parse xhtml file
soundboard_parse = etree.parse(soundboard + ".xhtml")
root = soundboard_parse.xpath(".")

div_tags = root[0].xpath("//xhtml:div[@class='page']",
             namespaces={"xhtml":"http://www.w3.org/1999/xhtml"})


##### extract text from each div/p tag and write data to file
page_id = 1
for div_tag in div_tags:
  text_id = 0
  p_tags = div_tag.xpath("xhtml:p",
             namespaces={"xhtml":"http://www.w3.org/1999/xhtml"})

  for p_tag in p_tags:
    p_text = p_tag.text
    if p_text is not None and p_text != "":
      p_text = p_text.replace(br, "")
      p_text = p_text.replace(tab, "  ")
      p_text = p_text.strip()
      if p_text != "":
        tab_delimited.write(str(journal_id) + tab + str(volume) + tab + \
                            str(issue) + tab + str(year) + tab + \
                            str(page_id) + tab + str(text_id) + \
                            tab + p_text + br)
        text_id = text_id + 1
     
  page_id = page_id + 1

tab_delimited.close()
# fin
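
To give a feel for what the tab-delimited output is good for, here's a minimal Python 3 sketch that reads it back in and lists the pages where a word shows up. The column names match the header row the script writes; the three sample rows and the helper names read_rows and pages_matching are made up for illustration.

```python
import csv
import io

# Made-up rows in the same layout the script writes (header row included).
SAMPLE = (
    "journal_id\tvolume\tissue\tyear\tpage_id\ttext_id\ttext\n"
    "0001_0001_1974\t1\t1\t1974\t1\t0\tA made-up line about guitars.\n"
    "0001_0001_1974\t1\t1\t1974\t2\t0\tAnother made-up line about serenades.\n"
    "0001_0001_1974\t1\t1\t1974\t2\t1\tMore made-up text about strings.\n"
)


def read_rows(handle):
    """Read the tab-delimited file into a list of dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def pages_matching(rows, word):
    """Return the sorted page numbers whose text contains the word."""
    word = word.lower()
    return sorted({int(row["page_id"]) for row in rows
                   if word in row["text"].lower()})


rows = read_rows(io.StringIO(SAMPLE))
print(pages_matching(rows, "made-up"))
```

With a real file, you'd pass something like codecs.open("V01-n1-1974.txt", "r", "utf-8") to read_rows() instead of the StringIO.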
--------------

Written by nitin

January 5th, 2013 at 12:35 pm

PyEDS: a simple Python starter library for Ebsco’s Discovery Service (EDS)

Before this little vacation I'm on started (sadly, it's almost over!), I was allowed to have access to Ebsco's Discovery Service (EDS) API and its documentation WIKI.

I sent a tiny bit of feedback on some of the things in the documentation that I think are unclear or really need correction and I'm looking to send more when I return to work.

My biggest concern was that – and I think this is true of A LOT of API documentation – it requires a lot of reading on the user's part to figure out what means what, which almost invariably exceeds the amount of work to actually write the code to authenticate, make queries, etc.

That's to say that often working through documentation about tying a shoelace is more of a task than actually tying said shoelace.

I *think* developers mostly just want to start experimenting with code, so clarity and concise language with examples are of the utmost importance.

Speaking of examples, I also think that sample code needs to have scope in mind. What I'm getting at is that sample code for a search API shouldn't be a "soup to nuts" thing that entails authenticating, making requests, having a client-side UI/interface and displaying results, etc. That's too much. Again, I think (off the top of my head of course and with nothing more than a gut feeling) that it might be more helpful to simply show how to authenticate and make a request and show the formatting of a sample response. The other stuff – interface, UI, etc, etc. – just convolutes the code and adds noise to the basics. In fact, that confuses API usage implementation vs. the API usage itself.

Better still would be to offer small libraries in popular scripting languages that simplify the basics – again, to facilitate people playing with one's APIs. The easier and more "fun" it is, the more likely I think (yeah, yeah, I know!) people are to really dream about incorporating the API, etc. into their applications and what-nots.

So along those lines, I've pasted a little sample Python script below that makes it much easier for me to authenticate, open a session, conduct searches, format the JSON response, and close the session. It needs work (what doesn't?) but it does what I mean for it to for now.

I probably shouldn't post a sample response since access to the EDS WIKI is for customers only, but if you aren't a customer or at least aren't interested, why are you even reading this?

:P

#PyEDS.py

'''
This module provides a basic Python binding to Ebsco's EDS API, allowing one to:
  - authenticate with a UserID and Password,
  - open and close a session,
  - perform a search (results are returned as JSON),
  - pretty print the JSON.
 
Thanks,
Nitin Arora; nitaro74@gmail.com
____________________________________________________________________________________________________
#Usage example:
 
  import PyEDS as eds
  
  eds.authenticateUser('USERID_GOES_HERE', 'PASSWORD_GOES_HERE')
  eds.openSession('PROFILE_GOES_HERE', 'GUEST_GOES_HERE', 'ORG_GOES_HERE')
 
  #eds.authenticateFile() #alternative to using authenticateUser() and openSession()
  #uses values in JSON config file argument(default="config.json")
  
  #sample "config.json" file:
  """
  {
    "EDS_config": {
      "UserId": "USERID_GOES_HERE",
      "Password": "PASSWORD_GOES_HERE",
      "Profile": "PROFILE_GOES_HERE",
      "Guest": "GUEST_GOES_HERE",
      "Org": "ORG_GOES_HERE"
    }
  }
  """
 
  kittens = eds.advancedSearch('{"SearchCriteria":{"Queries":[{"Term":"kittens"}],"SearchMode":"smart","IncludeFacets":"y","Sort":"relevance"},"RetrievalCriteria":{"View":"brief","ResultsPerPage":10,"PageNumber":1,"Highlight":"y"},"Actions":null}')
  puppies = eds.advancedSearch('{"SearchCriteria":{"Queries":[{"Term":"puppies"}],"SearchMode":"smart","IncludeFacets":"y","Sort":"relevance"},"RetrievalCriteria":{"View":"brief","ResultsPerPage":10,"PageNumber":1,"Highlight":"y"},"Actions":null}')
  cubs = eds.basicSearch('cubs')
  piglets = eds.basicSearch('piglets', view='brief', offset=1, limit=10, order='relevance')
  
  eds.closeSession()
  
  print 'Some search results with the EDS API ...'
  print '\n"kittens" advanced search as original JSON:'
  print kittens
  print '\n"puppies" advanced search as original JSON:'
  print puppies
  print '\n"kittens" advanced search as JSON with indentation and non-ascii escaping:'
  print eds.prettyPrint(kittens)
  print '\n"cubs" and "piglets" basic searches as original JSON:'
  print cubs, piglets
  print '\nGoodbye.'
____________________________________________________________________________________________________
 
TO DO:
  - add more options to basicSearch() like "facets", "search mode", "fulltext", "thesaurus", etc.
    - can't hurt! :-]
  - consider adding an authenticateIP() function that uses the IP authentication method.
  - deal with expired tokens, etc.; see: http://edswiki.ebscohost.com/API_Reference_Guide:_Appendix
'''
 
import urllib2
_EDS_ = {}
 
 
def authenticateUser(UserId, Password):
  '''Authenticates user with an EDS UserId and Password.'''
  auth_json = '{"UserId":"%s","Password":"%s","InterfaceId":"WSapi"}' %(UserId, Password)
  req = urllib2.Request(url='https://eds-api.ebscohost.com/authservice/rest/UIDAuth',
                        data=auth_json,
                        headers={'Content-Type':'application/json'})
  req_open = urllib2.urlopen(req)
  req_results = req_open.read()
  
  import json
  req_results_dictionary = json.loads(req_results) #convert JSON to dictionary; json.loads() is safer than eval() on a network response.
  _EDS_['AuthToken'] = str(req_results_dictionary['AuthToken'])
  _EDS_['AuthTimeout'] = req_results_dictionary['AuthTimeout']
 
 
def openSession(Profile, Guest, Org):
  '''Opens the EDS session with an EDS Profile, the Guest value ("y" or "n"), and the Org nickname.'''
  sessionOpen_json = '{"Profile":"%s","Guest":"%s","Org":"%s"}' %(Profile, Guest, Org)
  req = urllib2.Request(url='http://eds-api.ebscohost.com/edsapi/rest/CreateSession',
                        data=sessionOpen_json,
                        headers={'Content-Type':'application/json',
                        'x-authenticationToken':_EDS_['AuthToken']})
  req_open = urllib2.urlopen(req)
  req_results = req_open.read()
 
  import json
  req_results_dictionary = json.loads(req_results) #json.loads() also unescapes "\/" to "/".
  _EDS_['SessionToken'] = str(req_results_dictionary['SessionToken'])
 
 
def closeSession():
  '''Closes the EDS session.'''
  sessionClose_json = '{"SessionToken":"%s"}' %(_EDS_['SessionToken'])
  req = urllib2.Request(url='http://eds-api.ebscohost.com/edsapi/rest/EndSession',
                        data=sessionClose_json,
                        headers={'Content-Type':'application/json',
                        'x-authenticationToken':_EDS_['AuthToken']})
  urllib2.urlopen(req)
  
  
def authenticateFile(config_file='config.json'):
  '''Uses values in JSON config file to authenticate *and* open a session.'''
  config = open(config_file, 'r').read()
  config = eval(config)
  config = config['EDS_config']
  authenticateUser(config['UserId'], config['Password'])
  openSession(config['Profile'], config['Guest'], config['Org'])
 
 
def basicSearch(query, view='brief', offset=1, limit=10, order='relevance'):
  '''Returns search results using basic arguments.'''
  search_json = '''{"SearchCriteria":{"Queries":[{"Term":"%s"}],"SearchMode":"smart","IncludeFacets":"n","Sort":"%s"},
                   "RetrievalCriteria":{"View":"%s","ResultsPerPage":%d,"PageNumber":%d,"Highlight":"n"},"Actions":null}
                   ''' %(query, order, view, limit, offset)
  return advancedSearch(search_json)
 
         
def advancedSearch(search_json):
  '''Returns search results using the full EDS search syntax (JSON).'''
  req = urllib2.Request(url='http://eds-api.ebscohost.com/edsapi/rest/Search',
                        data=search_json, headers={'Content-Type':'application/json',
                        'x-authenticationToken':_EDS_['AuthToken'],
                        'x-sessionToken':_EDS_['SessionToken']})
  req_open = urllib2.urlopen(req)
  req_results = req_open.read()
  return req_results
 
 
def prettyPrint(json_string):
  '''Returns a pretty-printed, UTF-8 encoded JSON string with escaped non-ASCII characters.'''
  import json
  dictionary = json.loads(json_string, encoding='utf-8')
  return json.dumps(dictionary, ensure_ascii=True, indent=2, encoding='utf-8')
 
 
#fin
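
One thing the string templates in basicSearch() don't handle is a search term with a quotation mark in it, which would break the JSON. Here's a hedged Python 3 sketch of one way around that: build the request as a dictionary and let json.dumps() do the escaping. The field names are the ones used in the script above; the function name build_search_json is made up, and this isn't part of PyEDS.

```python
import json


def build_search_json(query, view="brief", offset=1, limit=10, order="relevance"):
    """Return the search request as a JSON string, with the query safely escaped."""
    payload = {
        "SearchCriteria": {
            "Queries": [{"Term": query}],
            "SearchMode": "smart",
            "IncludeFacets": "n",
            "Sort": order,
        },
        "RetrievalCriteria": {
            "View": view,
            "ResultsPerPage": limit,
            "PageNumber": offset,
            "Highlight": "n",
        },
        "Actions": None,  # becomes null in the JSON output
    }
    return json.dumps(payload)


print(build_search_json('"humane" guitarist'))
```

The resulting string could then be handed to something like advancedSearch() in place of a hand-written JSON string.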
--------------

Written by nitin

December 30th, 2012 at 11:23 am

sorta sorting API results with in-memory SQLite

I'll try to keep this short because it's looking like the weather is going to be agreeable enough for a nice, long Saturday walk.
 
So, I've been working on a mockup API at work that could, among other things, drive an in-site federated search across things like our Ebsco databases, the other vendor resources available through Ebsco's API using SRU, and of course our own databases with lists of the resources we offer and their descriptions.
 
Using simple textual similarity libraries it's easy to have the API return a text similarity score (a trick I learned working on HammerFlicks!) comparing the query against the title of each item. This way if someone types in "Wall St. Journal" it's easy to highlight (through an HTML/JavaScript page) the hit for "Wall Street Journal" from our own database because that'll be a good text similarity match.
 
Here's a snippet showing the similarity attribute:
<?xml version="1.0"?>
<nclive_api_response>
  <results source="ncl_resource_titles">
    <result text_similarity_score="86.666666666667">
      <title>Wall Street Journal</title>
      <url>http://www.nclive.org/cgi-bin/nclsm?rsrc=29</url>
      <description>Full articles from the Wall Street Journal (1981-current).</description>
    </result>
	…
  </results>
</nclive_api_response>
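
For anyone who wants to play with the title-similarity idea, here's a minimal Python 3 sketch. It uses difflib from the standard library as a stand-in, which is not the library behind the score in the snippet above, so the numbers won't match. Apart from "Wall Street Journal", the titles are made up, as are the function names.

```python
from difflib import SequenceMatcher


def text_similarity_score(query, title):
    """Return a 0-100 similarity score between the query and a title."""
    ratio = SequenceMatcher(None, query.lower(), title.lower()).ratio()
    return ratio * 100


def best_title_match(query, titles):
    """Return the title that is most similar to the query."""
    return max(titles, key=lambda title: text_similarity_score(query, title))


# Hypothetical resource titles, apart from the first one.
TITLES = ["Wall Street Journal", "New York Times", "Journal of Made-Up Studies"]

print(best_title_match("Wall St. Journal", TITLES))
```

The highest-scoring title is the one a page could then highlight for the user.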

As for sorting through results brought in via multiple resources all using their own relevancy rankings – that's a different story. They're using their own relevancy calculations, so there's really no way to present results across multiple sources as the "most relevant".

 
I was toying with the idea, though, of testing what it would be like to – after the fact – index all the returned results on the fly in Solr or something just to get a relevancy ranking for the results the API returns. Now, this isn’t of course arguing that this would be a total relevancy rank across all sources. In other words, if you only pull five items from each "sub-API", each data source mentioned above, then there's no way to say that the first item from Database A is necessarily more relevant than the fifth result of Database B.
 
Anyway, I thought it was stupid to index things behind the scenes in something external just to get an on-the-fly relevancy rank to inject into the API results, when I'd only then have to quickly delete the entire index since I would just be using it to get a score.
 
But what I don't think is too stupid is the idea itself. It's making the argument that "Look, I've asked these different sources to send me their best stuff and now I'll have a way to rank them with my own criteria … because they're mine now." It's like using your own criteria to rank job candidates after asking a few of your industry friends to each send in their five best employees for the job you're hiring for. You're not necessarily going to agree with how they rank their own employees but you do trust that they've sent you five top notch folks.
 
… and so, after a colleague in another department asked if there would be a way to sort items across multiple data sources, I thought to investigate a way to do the indexing and have some kind of ranking/relevancy score done all in memory.
 
Enter SQLite.
 
This is really cool. With SQLite, I can create a full-text index/searchable on-the-fly database in memory that will let me develop some kind of rank per item. Note, one has to have SQLite with FTS3/FTS4 enabled to do full-text with SQLite.
 
Now, the way I'm doing this is to use SQLite's offsets() function to learn – for each search term/word passed to the API – if it or its Porter-based stem matches in the TITLE field (for which each hit gets, say, 2 points) or the DESCRIPTION field (1 point).
 
After getting the total points, I'm dividing the points by the total number of words within the API's TITLE + DESCRIPTION values to get a scaled result between 0 and 1.
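
Here's the arithmetic on its own as a minimal Python 3 sketch, separate from SQLite. It assumes you already have the string that offsets() returns: groups of four integers per hit (column number, term number, byte offset, size). It also assumes the title is column 1, as it would be in a table laid out as id, title, description. The function name and the sample string are made up.

```python
def sorta_score(offsets, total_words, title_column=1):
    """Turn an offsets() string into a score between 0 and 1."""
    values = [int(value) for value in offsets.split()]
    score = 0
    # Walk the numbers four at a time; the first of each group is the column.
    for start in range(0, len(values) - len(values) % 4, 4):
        column = values[start]
        score += 2 if column == title_column else 1
    if total_words <= 0:
        return 0.0
    # Cap at 1 for very short texts.
    return min(score / float(total_words), 1.0)


# Made-up example: one hit in the title, two in the description, 16 words total.
print(sorta_score("1 0 4 4 2 0 10 4 2 1 20 4", 16))
```

Here the title hit earns 2 points and the two description hits earn 1 each, so the score is 4 / 16.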
 
Anyway, I've got a starter function below (PHP) that would return what I'm calling a "sorta" score. It'll be interesting to work it into the mockup API to see how it works in the real world in trying to sort items from across different sources.
 
And just to be clear, I'm doing this per item. That's to say I do these calculations for one item then delete the in-memory database. In other words, I'm not indexing all the API results in memory and then getting this "sorta" rank per item because the calculation is agnostic of the other items. Now, if I changed the calculation to consider the other items as well, then absolutely there would be a need to index all the items first before assigning a "sorta" score per item.
 
BTW. Get it … "sorta"?
… 'cause it's "sort of" a way to sort things from multiple sources. Ha!
 
:P
 
Anyway, the PHP's below followed by another PHP block that uses the function and then an HTML snippet of what gets returned with sample text.
 
And so much for my walk, looks like rain's on the way. Dammit.
<?php
  
//clean out special chars, etc.
function recharacter_this($htmlstring) {
  $htmlstring = htmlspecialchars($htmlstring, ENT_QUOTES);
  $htmlstring = trim($htmlstring);
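  $htmlstring = preg_replace("/[^A-Za-z0-9]\s/", "", $htmlstring); //meant to leave only alpha-numerics and whitespace; as written, this removes any non-alphanumeric character that is followed by whitespace, along with that whitespace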
  $htmlstring = preg_replace("/\s+/", " ", $htmlstring); //replace multiple whitespaces with a single space
  return $htmlstring;
}

//get a rank score
function sorta_this($title, $description, $search_text) {
  
  $title = recharacter_this($title);
  $description = recharacter_this($description);
  $search_text = recharacter_this($search_text);
  
  //re: SQLite/PHP fundamentals, see: http://www.if-not-true-then-false.com/2012/php-pdo-sqlite3-example/
  
  //create memory db
  $memory_db = null;
  $memory_db = new PDO('sqlite::memory:');
  
  //errormode set to exceptions
  $memory_db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
  
  //create table
  //you must use "VIRTUAL TABLE" for FTS3/4, see: http://www.sqlite.org/fts3.html#section_1_2
  $memory_db->exec("CREATE VIRTUAL TABLE box using FTS4 (
  id,
  title,
  description,
  tokenize=porter)"); //porter > simple because a search for "tree" matches up against text with "trees" whereas "tokenize=simple" tokenization doesn't seem to do this;
  //granted, Porter stemming has its own problems, but it's better than nothing.
  
  $insert = "INSERT INTO box (id, title, description) VALUES('1', '$title', '$description')";
  $stmt = $memory_db->exec($insert); //insert values per above
  
  $search_text = str_replace(" ", " OR ", $search_text); //making search more liberal
  $query = "SELECT quote(offsets(box)) as rank FROM box WHERE box MATCH '$search_text' ORDER BY rank";
  $result = $memory_db->query($query); //run query per above
  
  $score = 0; //start with initial score of Zero
  $i = 0;  //to use during iteration
  
  //if query yielded anything ...
  if ($result) {
    
    //there's only one row, but still need to loop
    foreach($result as $row) {
      $rank = $row['rank'];
      preg_match_all("/[a-zA-Z0-9]+\ [a-zA-Z0-9]+\ [a-zA-Z0-9]+\ [a-zA-Z0-9]+/", $rank, $matches); //split at every 4th space, i.e. every quartet returned by SQLite offsets(); see: http://stackoverflow.com/questions/10555698/split-string-after-every-five-words
      
      //$matches is a single item array with one array inside it for each quartet; $matches[0] is thus just a plain array
      foreach ($matches[0] as $match) {
        if ($match[0] == 1) {
          //if search hits in TITLE field, get 2 points
          $score = $score + 2;
        }
        else { 
          //if in DESCRIPTION field, get 1 point
          $score = $score + 1;
        }
        $i = $i + 1;
      }
    }
  }
  
  $memory_db->exec("DROP TABLE box");
  $memory_db = null;
  
  $total_words = str_word_count($title) + str_word_count($description);
  $score = ($score/$total_words); //divide $score by total number of words in TITLE + DESCRIPTION
  
  //prevent scores greater than 1, which would only occur with an abnormally small number of total words (essentially <= to the number of words in search terms)
  if ($score > 1) {
    $score = 1;
  }
  return $score;
}
?>
Using the function with TITLE and DESCRIPTION (abstract) from this article …
<?php
//test sorta_this() function
$my_title = ("An aerobic walking programme versus muscle strengthening programme for chronic low back pain: a randomized controlled trial.");
$my_description = ("Objective:To assess the effect of aerobic walking training as compared to active training, which includes muscle strengthening, on functional abilities among patients with chronic low back pain.Design:Randomized controlled clinical trial with blind assessors.Setting:Outpatient clinic.Subjects:Fifty-two sedentary patients, aged 18-65 years with chronic low back pain. Patients who were post surgery, post trauma, with cardiovascular problems, and with oncological disease were excluded.Intervention:Experimental 'walking' group: moderate intense treadmill walking; control 'exercise' group: specific low back exercise; both, twice a week for six weeks.Main measures:Six-minute walking test, Fear-Avoidance Belief Questionnaire, back and abdomen muscle endurance tests, Oswestry Disability Questionnaire, Low Back Pain Functional Scale (LBPFS).Results:Significant improvements were noted in all outcome measures in both groups with non-significant difference between groups. The mean distance in metres covered during 6 minutes increased by 70.7 (95% confidence interval (CI) 12.3-127.7) in the 'walking' group and by 43.8 (95% CI 19.6-68.0) in the 'exercise' group. The trunk flexor endurance test showed significant improvement in both groups, increasing by 0.6 (95% CI 0.0-1.1) in the 'walking' group and by 1.1 (95% CI 0.3-1.8) in the 'exercise' group.Conclusions:A six-week walk training programme was as effective as six weeks of specific strengthening exercises programme for the low back."); 
$my_search_text = ("back pain exercise");
$my_score = sorta_this($my_title, $my_description, $my_search_text);

echo ("Searching for \"$my_search_text\" in <br /><br />TITLE: <em>$my_title</em> <br /><br />and <br /><br />DESCRIPTION: <em>$my_description</em> <br /><br />yields a \"sorta\" relevancy of<strong> ");
echo $my_score . "</strong><br /><br />";
echo ("<hr />Hits for each search word in TITLE get 2 points, hits in DESCRIPTION get 1 point.<br />This number is then divided by the total number of words in the TITLE + DESCRIPTION.");
?>

The results …

Searching for "back pain exercise" in

TITLE: An aerobic walking programme versus muscle strengthening programme for chronic low back pain: a randomized controlled trial.

and

DESCRIPTION: Objective:To assess the effect of aerobic walking training as compared to active training, which includes muscle strengthening, on functional abilities among patients with chronic low back pain.Design:Randomized controlled clinical trial with blind assessors.Setting:Outpatient clinic.Subjects:Fifty-two sedentary patients, aged 18-65 years with chronic low back pain. Patients who were post surgery, post trauma, with cardiovascular problems, and with oncological disease were excluded.Intervention:Experimental 'walking' group: moderate intense treadmill walking; control 'exercise' group: specific low back exercise; both, twice a week for six weeks.Main measures:Six-minute walking test, Fear-Avoidance Belief Questionnaire, back and abdomen muscle endurance tests, Oswestry Disability Questionnaire, Low Back Pain Functional Scale (LBPFS).Results:Significant improvements were noted in all outcome measures in both groups with non-significant difference between groups. The mean distance in metres covered during 6 minutes increased by 70.7 (95% confidence interval (CI) 12.3-127.7) in the 'walking' group and by 43.8 (95% CI 19.6-68.0) in the 'exercise' group. The trunk flexor endurance test showed significant improvement in both groups, increasing by 0.6 (95% CI 0.0-1.1) in the 'walking' group and by 1.1 (95% CI 0.3-1.8) in the 'exercise' group.Conclusions:A six-week walk training programme was as effective as six weeks of specific strengthening exercises programme for the low back.

yields a "sorta" relevancy of 0.065


Hits for each search word in TITLE get 2 points, hits in DESCRIPTION get 1 point.
This number is then divided by the total number of words in TITLE + DESCRIPTION.

--------------

Written by nitin

August 11th, 2012 at 12:36 am

awesome sauce: augmenting PubMed Central’s OAI response

Update, 9 pm EST, May 27, 2012: Well, this is interesting. After reading this page, I see that by setting the "metadataPrefix" to "pmc_fm" I can bypass steps #3 and #4 altogether it seems – provided one's OAI harvester/indexer is set to ingest the data in that format instead of Dublin Core or provided the script below transforms the data to Dublin Core before returning it. Anyway … score one for documentation and reading it after-the-fact!

I saw a post from a Metadata Librarian on the code4lib list about their work with placing article data from PubMed into DSpace. They are doing some metadata additions and cleanup in Excel so I emailed them off-list and let them know about PubMed2XL and we went back and forth on a few things. Among the things I learned from them was that PubMed Central has an OAI feed. Cool!

But that OAI feed doesn't return all the data they need.

Here's an example: http://www.pubmedcentral.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=aac.

One of the additional bits of data they wanted was author affiliation which is available from PubMed.gov's XML output. Same for the MESH terms.

Anyway, besides pushing PubMed2XL, I also mentioned that it would be interesting to make a sauce, if you will, for PubMed Central's OAI feed. In other words, rather than using the OAI link above, one would use a service on top of that a la: http://myPubMedCentralOAI_sauce.com/oai?verb=ListRecords&metadataPrefix=oai_dc&set=aac. And when one went to that URL, the service would fetch the real OAI feed from PubMed Central and then get the additional metadata from the NCBI EFetch APIs. It would then drop the additional metadata into the original OAI response and finally serve it up to the user (e.g. the OAI harvester).

I went ahead and played with a proof-of-concept using Google App Engine and it's working although it's adding about 20 – 25 seconds to the OAI response time. BTW: it's faster when I run it from localhost and not actually live on App Engine.

Here's how it's done.

  1. The user goes to http://localhost:8084/oai?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
  2. The app then fetches http://www.pubmedcentral.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=aac.
  3. For each record, the app parses out the PubMed Central ID and uses the EFetch API with PubMed Central as the database to get more data about the item.
  4. Unfortunately, the API for PubMed Central doesn't return MESH terms, so in step #3 the app just uses the returned data to translate the PubMed Central ID to the regular PubMed ID.
  5. With the PubMed ID now in hand, the app goes to the EFetch API and specifies PubMed as the database and hands the API the PubMed ID from step #4.
  6. Now the app gets the <Affiliation> value and the MESH terms and adds them to the real OAI response from step #2.
  7. Finally (whew!), the app returns the OAI feed with more metadata than before.
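
The lookups in steps #3 and #5 can be sketched in Python 3 as below. This isn't the App Engine code and it makes no network calls; it only pulls the PubMed Central ID out of an OAI identifier and builds the EFetch request URLs. The EFetch address and the "db", "id" and "retmode" parameters are as I understand NCBI's E-utilities, so treat them as assumptions and check NCBI's documentation. The function names are made up.

```python
from urllib.parse import urlencode

# Assumed EFetch endpoint; verify against NCBI's E-utilities documentation.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def pmc_id_from_identifier(oai_identifier):
    """Return the trailing ID from an identifier like "oai:pubmedcentral.nih.gov:89011"."""
    return oai_identifier.rsplit(":", 1)[-1]


def efetch_url(database, record_id):
    """Build an EFetch URL for one record in the given database."""
    params = {"db": database, "id": record_id, "retmode": "xml"}
    return EFETCH_BASE + "?" + urlencode(params)


# Step #3: ask the PubMed Central database about the record.
pmc_id = pmc_id_from_identifier("oai:pubmedcentral.nih.gov:89011")
print(efetch_url("pmc", pmc_id))

# Step #5: ask the PubMed database, using the PubMed ID found in step #4.
print(efetch_url("pubmed", "PUBMED_ID_GOES_HERE"))
```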

This seems super klunky, so I'd love to hear about more elegant ways to do this … like having more options from PubMed Central without 3rd party hacks!

But it is working. And it's just a proof-of-concept …
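
For illustration, steps 3 through 6 above can be sketched offline. This is a minimal sketch, not the service itself: it is written for Python 3 with the standard library's `xml.etree.ElementTree` rather than lxml on App Engine, the three XML strings are heavily simplified stand-ins for the live responses, the PubMed ID `00000000` is a placeholder, and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ARTICLE_URL = "http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid="

# Simplified stand-ins for the live responses (assumptions, not real payloads).
OAI_RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Antifungal Peptides</dc:title>
  <dc:identifier>http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=89011</dc:identifier>
</oai_dc:dc>"""
PMC_EFETCH = """<pmc-articleset><article>
  <article-id pub-id-type="pmid">00000000</article-id>
</article></pmc-articleset>"""
PUBMED_EFETCH = """<PubmedArticleSet><PubmedArticle>
  <Affiliation>Southern Regional Research Center</Affiliation>
  <DescriptorName>Antifungal Agents</DescriptorName>
  <DescriptorName>Peptides</DescriptorName>
</PubmedArticle></PubmedArticleSet>"""


def pmc_id_from_record(record):
    """Step 3: strip the article URL prefix to get the PubMed Central ID."""
    identifier = record.find("{%s}identifier" % DC)
    return identifier.text.replace(ARTICLE_URL, "")


def pubmed_id_from_pmc(efetch_xml):
    """Step 4: translate the PMC response into a regular PubMed ID (or None)."""
    root = ET.fromstring(efetch_xml)
    for article_id in root.iter("article-id"):
        if article_id.get("pub-id-type") == "pmid":
            return article_id.text
    return None


def augment(record, pubmed_xml, additions):
    """Step 6: copy selected PubMed values into the record as new dc:* elements."""
    root = ET.fromstring(pubmed_xml)
    for out_name, source_tag in additions:
        for element in root.iter(source_tag):
            ET.SubElement(record, "{%s}%s" % (DC, out_name)).text = element.text
    return record


additions = [("contributor.affiliation", "Affiliation"),
             ("subject.mesh", "DescriptorName")]
record = ET.fromstring(OAI_RECORD)
pmc_id = pmc_id_from_record(record)
pubmed_id = pubmed_id_from_pmc(PMC_EFETCH)
augment(record, PUBMED_EFETCH, additions)

# The first two children are the original title and identifier; the rest are new.
added = [(child.tag.split("}")[1], child.text) for child in record][2:]
print(pmc_id, pubmed_id, added)
```

In the real service the two EFetch strings would come from HTTP requests, which is where nearly all of the extra response time goes.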

Below, I've pasted a snippet of the augmented OAI data.

Below that is the Python code if anyone's interested.

ps: Python users will notice I used Google App Engine's "urlfetch" instead of "urllib" to request URLs. This is because using the latter was causing 500 errors due to timeouts. I don't think, from what I've read, that you can set the timeout with "urllib" in App Engine, so I used "urlfetch" which lets one set it up to 60 seconds.

<!--
  This is just a test to use the NCBI EFetch APIs to augment the output of PubMed Central's OAI feed.
  In short, it's a web service that sits on top of the PubMed Central OAI API.

  *** DO NOT use this service to harvest OAI records from PubMed Central ... you will probably mess up your repository!
  ... and I haven't verified that the additional data being added to the OAI feed is accurate per the item.

  Currently, this supports the following OAI parameters:
 
   - ListRecords
   - set
   - metadataPrefix (must use "oai_dc"/Dublin Core)
   - resumptionToken
 
  Thanks, Nitin Arora (humaneguitarist.org), May 2012.
 
  ps: adding metadata increased the OAI response time by 22.6178297997 seconds.
  -->
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>2012-05-27T13:34:17Z</responseDate>
 <request verb="ListRecords" metadataPrefix="oai_dc" set="aac">http://www.pubmedcentral.nih.gov/oai/oai.cgi</request>
 <ListRecords>
  <record>
   <header>
    <identifier>oai:pubmedcentral.nih.gov:89011</identifier>
    <datestamp>2002-09-12</datestamp>
    <setSpec>aac</setSpec>
   </header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
     <dc:title>Antifungal Peptides: Novel Therapeutic Compounds against Emerging Pathogens</dc:title>
     <dc:creator>De Lucca, Anthony J.</dc:creator>
     <dc:creator>Walsh, Thomas J.</dc:creator>
     <dc:subject>Minireviews</dc:subject>
     <dc:description/>
     <dc:publisher>American Society for Microbiology</dc:publisher>
     <dc:identifier>http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=89011</dc:identifier>
     <dc:type>Text</dc:type>
     <dc:language>en</dc:language>
     <dc:rights/>
     <dc:contributor.affiliation>Southern Regional Research Center, Agricultural Research Service, U. S. Department of Agriculture, New Orleans, Louisiana 70124, USA. adelucca@nola.srrc.usda.gov</dc:contributor.affiliation>
     <dc:subject.mesh>Animals</dc:subject.mesh>
     <dc:subject.mesh>Anti-Bacterial Agents</dc:subject.mesh>
     <dc:subject.mesh>Antifungal Agents</dc:subject.mesh>
     <dc:subject.mesh>Fungi</dc:subject.mesh>
     <dc:subject.mesh>Humans</dc:subject.mesh>
     <dc:subject.mesh>Mycoses</dc:subject.mesh>
     <dc:subject.mesh>Peptides</dc:subject.mesh>
    </oai_dc:dc>
   </metadata>
  </record>
  <resumptionToken>oai%3Apubmedcentral.nih.gov%3A89061!!!oai_dc!aac</resumptionToken>
 </ListRecords>
</OAI-PMH>

Python:

### pmc-oai-topper.py
### 2012, Nitin Arora

### import modules
##import urllib #DELETE
from google.appengine.api import urlfetch #see: https://developers.google.com/appengine/docs/python/urlfetch/overview
from lxml import etree
import time
import webapp2

### set what additional metadata to get from the EFetch API
additions = [('contributor.affiliation', 'Affiliation'),
             ('subject.mesh', 'DescriptorName')] #(name of element to output to, XPath); eventually needs to be in external config file
            #note: the XPath has to refer to elements in the EFetch XML output for the PubMed database as in "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12654674&retmode=xml"

#####
class pmctopper(webapp2.RequestHandler):
  def get(self):

    #GET OAI parameter values
    verb_value = self.request.get('verb')
    metadataPrefix_value = self.request.get('metadataPrefix')
    set_value = self.request.get('set')
    resumptionToken_value = self.request.get('resumptionToken')

    #define the *real* OAI feed URL and read it
    if resumptionToken_value: #if a resumptionToken is being used
      url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&resumptionToken=%s' %(verb_value, resumptionToken_value)
    elif set_value:
      url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&set=%s&metadataPrefix=%s' %(verb_value, set_value, metadataPrefix_value)
    else:
      url = 'http://www.pubmedcentral.gov/oai/oai.cgi?verb=%s&metadataPrefix=%s' %(verb_value, metadataPrefix_value)

##    oai_in = urllib.urlopen(url).read() #DELETE
    oai_in = urlfetch.fetch(url=url, deadline=60).content
    time_in = time.time() #tracking how long this takes

    #parse OAI response as XML
    oai_parsed = etree.XML(oai_in)
    root = oai_parsed.xpath('.') #root node
    dc = root[0].xpath('//oai_dc:dc',
                            namespaces={'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
                            'dc': 'http://purl.org/dc/elements/1.1/'}) #access dc:* nodes (i.e. each item)

    #loop through all items and for each go fetch additional metadata via the EFetch APIs for PubMed Central and PubMed
    #place that additional data into the original OAI feed
    i = 0
    for record in dc:
      identifier = record.xpath('./dc:identifier',
                            namespaces={'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
                            'dc': 'http://purl.org/dc/elements/1.1/'}) #relative XPath: only this record's own identifier(s)
      pmc_id =(identifier[0].text).replace('http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=','') #get the article's unique ID

      #request PubMed ID from Pubmed Central API ... ugh!
      efetch_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=%s' %pmc_id #this is the URL to get metadata about the article per its ID
##      efetch_read = urllib.urlopen(efetch_url).read() #DELETE
      efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content #read the API response
      efetch_parsed = etree.XML(efetch_read) #parse as XML
      pubmed_id = efetch_parsed.xpath('string(//article-id[@pub-id-type="pmid"])') #pubmed id as a plain string (empty if none) rather than a one-item list

      #now(!) get the additional data from the PubMed API
      efetch_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%s&retmode=xml' %pubmed_id
##      efetch_read = urllib.urlopen(efetch_url).read() #DELETE
      efetch_read = urlfetch.fetch(url=efetch_url, deadline=60).content
      efetch_parsed = etree.XML(efetch_read)

      for addition in additions:
        added_element = efetch_parsed.xpath('//%s/text()' %addition[1]) #get data from API XML tree
        for added_value in added_element:
          etree.SubElement(record, '{http://purl.org/dc/elements/1.1/}%s' %addition[0]).text = added_value

      i = i + 1

    #for reporting how long this all takes
    time_out = time.time()
    time_diff = str(time_out - time_in)
    
    #output the *new* OAI results with the additional metadata
    self.response.headers['Content-Type'] = 'text/xml' #output as XML doc
    disclaimer= '''<!--
    This is just a test to use the NCBI EFetch APIs to augment the output of PubMed Central's OAI feed.
    In short, it's a web service that sits on top of the PubMed Central OAI API.

    *** DO NOT use this service to harvest OAI records from PubMed Central ... you will probably mess up your repository!
    ... and I haven't verified that the additional data being added to the OAI feed is accurate per the item.

    Currently, this supports the following OAI parameters:
    
      - ListRecords
      - set
      - metadataPrefix (must use "oai_dc"/Dublin Core)
      - resumptionToken
    
    Thanks, Nitin Arora (humaneguitarist.org), May 2012.
    
    ps: adding metadata increased the OAI response time by %s seconds.
    -->''' %time_diff
    self.response.out.write(disclaimer)
    for node in root:
      self.response.out.write(etree.tostring(node))

### app engine stuff ...
app = webapp2.WSGIApplication([('/oai', pmctopper)],
                              debug=True)
--------------

Written by nitin

May 27th, 2012 at 10:11 am

easy calls to OpenCalais with Python, daggummit!

2 comments

Yesterday, I wrote this post about using Yahoo's deprecated term extraction web service to generate "subjects" – or whatever you want to call them – for an item based on the metadata housed in a Solr-compatible XML file. I'd also wondered about doing the same thing with OpenCalais.

Before we go any further, I'd just like to say I wrote that post from my hotel room. I'm writing today's from the Denver airport with about 2 hours to kill before my flight departs. And I'd also like to point out that when writing blog posts with spotty Wi-Fi connections, one should not compose their post online through WordPress. I'm using WordPad, and I should probably make that a habit.

Yeah, so anyway there's not that much good documentation on how to make calls on the Calais site. By "good" I mean there's no code sample to rip off. I'm sure it's perfectly fine for people who actually know what they're doing.

Using "The Google" I found this helpful post on making calls to OpenCalais. While I found it very well written and the code very helpful, I didn't want to have "httplib2" as a dependency since it's not available out-of-the-box with Python 2.7, as far as I know. Nor did I want to do anything with JSON. I'm just trying to make a simple POST request to the OpenCalais REST API – is all.

Using that post's code as a starting point, I whipped up some simple Python without "httplib2".

Note that this code passes three parameters to the API through the following variables:

  • "myCalaisAPI_key": this is where to paste your API key once you get it from Calais here.
  • "sampleText": this is a string of plain text to send to Calais for it to analyze and build terms for.
  • "calaisParams": these are the options to pass to the service in XML format. 

Note that I'm specifically requesting what I really want, "social tags", via the following option:

c:enableMetadataType="GenericRelations,SocialTags"

… and I'm specifically requesting a simple result format as follows:

c:outputFormat="Text/Simple"

There are other options, including RDF, that can be requested per the options mentioned on this page.

If you look at the code, you can see I'm asking Calais to analyze some text about Tim Tebow since I was in Denver when the Denver Broncos football team acquired Peyton Manning and traded Tebow to the New York Jets. The text is from a USA Today article from, um, yesterday.

The Jets, I'd like to state, are not worthy of a hyperlink. And that's only part of the reason I'm sad to see Tebow go there. Alas.

Anyway, here's the output below, followed by the code. Note that – as mentioned in the code – I'm using the slightly older REST API. But what do I care right now. I'm just testing.

Here's the output:

<!--Use of the Calais Web Service is governed by the Terms of Service located at http://www.opencalais.com. By using this service or the results of the service you agree to these terms of service.-->
<!--
Company: HBO,
Organization: New York Jets,
Person: Tim Tebow,
TVShow: Hard Knocks,
-->
<OpenCalaisSimple>
  <Description>
    <calaisRequestID>dafa6c80-b4f6-77b1-1363-de96bb7764f4</calaisRequestID>
    <id>http://id.opencalais.com/ODNr1ciDte8wwv0nU3G1jw</id>
    <about>http://d.opencalais.com/dochash-1/895ba8ff-4c32-3ae1-9615-9a9a9a1bcb39</about>
    <docTitle/>
    <docDate>2012-03-23 00:56:09.679</docDate>
    <externalMetadata/>
  </Description>
  <CalaisSimpleOutputFormat>
    <Company count="1" relevance="0.643" normalized="HBO &amp; Company">HBO</Company>
    <Organization count="1" relevance="0.643">New York Jets</Organization>
    <Person count="1" relevance="0.643">Tim Tebow</Person>
    <TVShow count="1" relevance="0.643">Hard Knocks</TVShow>
    <SocialTags>
      <SocialTag importance="2">Training camp<originalValue>Training camp (National Football League)</originalValue>
      </SocialTag>
      <SocialTag importance="2">New York Jets<originalValue>New York Jets</originalValue>
      </SocialTag>
      <SocialTag importance="2">Florida Gators football team<originalValue>2008 Florida Gators football team</originalValue>
      </SocialTag>
      <SocialTag importance="1">Tim Tebow<originalValue>Tim Tebow</originalValue>
      </SocialTag>
      <SocialTag importance="1">HBO<originalValue>HBO</originalValue>
      </SocialTag>
      <SocialTag importance="1">Hard Knocks<originalValue>Hard Knocks (TV series)</originalValue>
      </SocialTag>
      <SocialTag importance="1">Entertainment_Culture</SocialTag>
      <SocialTag importance="1">Sports</SocialTag>
    </SocialTags>
    <Topics>
      <Topic Taxonomy="Calais" Score="1.000">Entertainment_Culture</Topic>
      <Topic Taxonomy="Calais" Score="1.000">Sports</Topic>
    </Topics>
  </CalaisSimpleOutputFormat>
</OpenCalaisSimple>

And the code:

# this code is based on: http://www.flagonwiththedragon.com/2011/06/08/dead-simple-python-calls-to-open-calais-api/

import urllib, urllib2

#########################
##### set API key and REST URL values.

myCalaisAPI_key = '' # your Calais API key.
calaisREST_URL = 'http://api.opencalais.com/enlighten/rest/' # this is the older REST interface.
# info on the newer one: http://www.opencalais.com/documentation/calais-web-service-api/api-invocation/rest

# alert user and shut down if the API key variable is still null.
if myCalaisAPI_key == '':
  print "You need to set your Calais API key in the 'myCalaisAPI_key' variable."
  import sys
  sys.exit()

#########################
##### set the text to ask Calais to analyze.

# text from: http://www.usatoday.com/sports/football/nfl/story/2012-03-22/Tim-Tebow-Jets-hoping-to-avoid-controversy/53717542/1
sampleText = '''
Like millions of football fans, Tim Tebow caught a few training camp glimpses of the New York Jets during the summer of 2010 on HBO's Hard Knocks.
'''

#########################
##### set XML parameters for Calais.

# see "Input Parameters" at: http://www.opencalais.com/documentation/calais-web-service-api/forming-api-calls/input-parameters
calaisParams = '''
<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <c:processingDirectives c:contentType="text/txt"
      c:enableMetadataType="GenericRelations,SocialTags"
      c:outputFormat="Text/Simple"/>
  <c:userDirectives/>
  <c:externalMetadata/>
</c:params>
'''

#########################
##### send data to Calais API.

# see: http://www.opencalais.com/APICalls
dataToSend = urllib.urlencode({
    'licenseID': myCalaisAPI_key,
    'content': sampleText,
    'paramsXML': calaisParams
})

#########################
##### get API results and print them.

results = urllib2.urlopen(calaisREST_URL, dataToSend).read()
print results
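
To pull just the social tags out of a Text/Simple response like the one shown above, something along these lines might do. It's a minimal sketch for Python 3 using only the standard library; the sample string is a trimmed copy of the output above, and the function name `social_tags` is invented for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Text/Simple output shown earlier in this post.
SAMPLE = """<OpenCalaisSimple>
  <CalaisSimpleOutputFormat>
    <Person count="1" relevance="0.643">Tim Tebow</Person>
    <SocialTags>
      <SocialTag importance="2">Training camp<originalValue>Training camp (National Football League)</originalValue>
      </SocialTag>
      <SocialTag importance="1">Tim Tebow<originalValue>Tim Tebow</originalValue>
      </SocialTag>
      <SocialTag importance="1">Sports</SocialTag>
    </SocialTags>
  </CalaisSimpleOutputFormat>
</OpenCalaisSimple>"""


def social_tags(calais_xml):
    """Return (tag, importance) pairs for every <SocialTag> element."""
    root = ET.fromstring(calais_xml)
    # .text is the label that precedes the nested <originalValue> child, if any.
    return [(tag.text.strip(), int(tag.get("importance")))
            for tag in root.iter("SocialTag")]


tags = social_tags(SAMPLE)
print(tags)
```

The importance attribute is kept so that a caller could, for instance, only use tags above some threshold as facets.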
--------------

Written by nitin

March 23rd, 2012 at 1:28 pm

make you some facets, boy!

leave a comment

As I mentioned the other day in this post, I've been working with some awesome people to harvest, index, and make searchable metadata for digital library collections from multiple institutions across the state of North Carolina, USA.

In the post I just linked to, I talked about the problems of inconsistent metadata across institutions and how that negatively impacts browsing via facets with Solr. I also wondered out loud about resolving/aligning small discrepancies via text analysis.

Well, another way to tackle this problem is – after harvesting the metadata but before indexing it – to "make" facet-able terms via some sort of term extraction. While at DrupalCon 2012 in Denver, CO this week I went to a presentation where the presenter mentioned a project he'd worked on pulling in RSS feeds. In passing, he mentioned using OpenCalais to make a tag cloud. I totally forgot I had an API key for OpenCalais!

Anyway, now I see there are lots of similar web services. Which one is best in terms of term extraction and which one allows the most API hits per day is a matter for another day, but today – in my hotel now that the conference has ended – I thought I'd do a little scripting to get me on the path to really testing this.

Using the soon-to-be deprecated Yahoo Term Extraction Web Service I tested taking a sample Solr-compatible XML index file and sending the metadata in it to the service to retrieve new subject terms. While my test script doesn't do it here, the idea is that after retrieving from the API these new terms, the terms could be placed into the Solr-compatible index file. After indexing the updated file, these new terms could be exposed to the user as click-able facets.

I'll have to test this with lots of real-world metadata from across our test-set of metadata to see if the term extraction service can be used to produce nicer facets with disparate metadata than what we currently see, but for now I just needed to write a play/test script.

Below, I've pasted the Python script and the output which explains a little what it's doing.

Actually, I've pasted the output first since people might not need or want to see the code. At the end, I've posted the "social tags" that OpenCalais would seem to generate for the same metadata – for comparison purposes.

The output:

Here's an XML file that can be indexed by Solr (it was generated via harvesting data from the Library of Congress using Python and XSL).

<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="creator">AFI/Kalinowski (Eugene) Collection (Library of Congress)</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Advertising--Electric household appliances--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Trade shows--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Silent films.</field>
    <field name="subject">Pittsburgh (Pa.)--Manufactures.</field>
    <field name="description">Largely graphic commercial for electric refrigerators in general and a refrigerator show, presumably in Pittsburgh, in particular.</field>
  </doc>
 </add>

-----

After using the Yahoo term extraction service we can create more <field> elements.

<field name="yahooTerm">electric household appliances</field>
<field name="yahooTerm">electric refrigerators</field>
<field name="yahooTerm">electric refrigerator</field>
<field name="yahooTerm">library of congress</field>
<field name="yahooTerm">silent films</field>
<field name="yahooTerm">collection library</field>
<field name="yahooTerm">pittsburgh pa</field>
<field name="yahooTerm">pennsylvania</field>

-----

If we place those new terms into the original XML file and reindex the item, we'll have new facets to play with.

This is a *potential* solution for creating practical, useable, and consistent(?) facets for metadata harvested from different institutions that use different subject terms and internal taxonomies, etc.

I think the basic Yahoo term extractor is deprecated(?), but there are other options such as their newer Context Analysis API, OpenCalais, and AlchemyAPI.com, etc.

The script:

#####
## merge all <fields> into one string; place in "context" variable.
SolrXML = '''
<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="creator">AFI/Kalinowski (Eugene) Collection (Library of Congress)</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Advertising--Electric household appliances--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Trade shows--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Silent films.</field>
    <field name="subject">Pittsburgh (Pa.)--Manufactures.</field>
    <field name="description">Largely graphic commercial for electric refrigerators in general and a refrigerator show, presumably in Pittsburgh, in particular.</field>
  </doc>
 </add>
'''

from lxml import etree # see: http://lxml.de/ for this library.

SolrXML_parsed = etree.XML(SolrXML)
SolrXML_combined = SolrXML_parsed.findall(".//field")
SolrXML_combined.pop(0) #remove <field name="identifier"> since we don't want
                        #a term generated from the URL; ideally this should be
                        #removed by having an attribute of "identifier" rather
                        #than by position, but this is just a test.

SolrXML_combinedList = []
for element in SolrXML_combined:
  SolrXML_combinedList.append(element.text)
context = (" ".join(SolrXML_combinedList))
#print context #test line


#####
## send XML example to Yahoo termExtraction service; print generated terms
## reference example: http://developer.yahoo.com/python/python-rest.html#post
import urllib, urllib2

url = 'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction'
appid = 'YahooTermTest'

params = urllib.urlencode({
  'appid': appid,
  'context': context,
})

yahooResultsXML = urllib2.urlopen(url, params).read()
#print yahooResultsXML #test line

yahooResultsXML_parsed = etree.XML(yahooResultsXML)
newSolrTerms = ""
for yahooTerm in yahooResultsXML_parsed:
  newSolrTerms = newSolrTerms + "<field name=\"yahooTerm\">" + yahooTerm.text \
  + "</field>\n"
 
#####
## print what the script is trying to do and the results ...
print "Here's an XML file that can be indexed by Solr\
 (it was generated via harvesting data from the Library of Congress and XSL)."
 
print SolrXML

print "-"*5 + "\n"

print "After using the Yahoo term extraction service we can create more\
 <field> elements.\n"
 
print newSolrTerms

print "-"*5 + "\n"

print "If we place those new terms into the original XML file and reindex the\
 item, we'll have new facets to play with.\n"

print "This is a *potential* solution for creating practical, useable, and\
 consistent(?) facets for metadata harvested from different institutions that use\
 different subject terms and internal taxonomies, etc.\n"

print "I think the basic Yahoo term extractor is deprecated(?), but there are\
 other options such as their newer Context Analysis API, OpenCalais, and\
 AlchemyAPI.com, etc."
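
The two things the script above leaves undone, skipping the identifier by its "name" attribute instead of by position and placing the new terms back into the index file, could be sketched as follows. This is only an illustration for Python 3 with the standard library: the Solr XML is a shortened copy of the sample above, the two terms are hardcoded from the output above rather than fetched from any service, and `build_context` and `add_terms` are names made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample index file used in the script above.
SOLR_XML = """<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Silent films.</field>
  </doc>
</add>"""


def build_context(doc, skip=("identifier",)):
    """Join the text of every <field>, leaving out fields named in `skip`."""
    return " ".join(field.text for field in doc.findall("field")
                    if field.get("name") not in skip)


def add_terms(doc, terms, field_name="yahooTerm"):
    """Append one <field name="yahooTerm"> element per extracted term."""
    for term in terms:
        ET.SubElement(doc, "field", {"name": field_name}).text = term


root = ET.fromstring(SOLR_XML)
doc = root.find("doc")
context = build_context(doc)  # this is what would be sent to the service

# Pretend these came back from the term extraction service.
add_terms(doc, ["electric refrigerators", "silent films"])
new_fields = [f.text for f in doc.findall("field[@name='yahooTerm']")]

print(ET.tostring(root, encoding="unicode"))
```

After this, the serialized XML would be what gets handed to Solr for reindexing.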

And here's what OpenCalais extracted as "social tags":

  • Business Finance
  • Entertainment Culture
  • Food storage
  • Food preservation
  • Home appliances
  • Pittsburgh
  • Refrigerator
--------------

Written by nitin

March 22nd, 2012 at 7:58 pm

on adding a JavaScript API to our Flash player at work

leave a comment

Sometimes being home sick means finding things to work on that I wouldn't have done had I been in the office, but need to be done eventually.

So, today I worked on augmenting a JavaScript API for our Flash player at work. Before I go any further, here's a screenshot below. Note that the JavaScript console for Google's Chrome browser is visible at the bottom.

NC Live Media Player

As you can see, the "movie" is not just the "movie", so to speak. That's to say, there's a very cool bookmarking feature that, if clicked, returns a URL that if visited will start the given video at the point in time at which the user clicked the bookmark. The "movie" also includes some whitespace on the left-hand side where links to "part 2"s of certain videos appear if they exist. That's great, but it totally makes our Flash player inappropriate for providing embed codes, etc. since the "movie" is more than the actual video screen and the controls (play, pause, captions, etc.). And, yes, we have to use our own player given that we have to deal with all kinds of authentication and rights issues, which this player supports via calls to PHP scripts.

Anyway, a few months ago I'd created a basic JavaScript API so that I could make a demo using our player for SAVS, which is completely reliant on two things: being able to receive the current time of the player and also being able to send a new current time to the player.

Today, I expanded the API a little bit. In the image above you can see a call to the function that moves the player, in this case to the 10-second mark. I added support for changing the volume, pausing the video, playing the video if it's paused, turning captions on/off, and getting the total duration of the media file, etc. Basically, I'm trying to add support for anything we'd need to replace the bookmark button and other features with HTML buttons, etc.

There's still lots of work to do, but it's working well provided I embed the SWF file with the <object> tag:

<object
  id="thisMovie"
  data="video2_js.swf"
  style="height: 500px; width: 500px;"
  type="application/x-shockwave-flash">
  <param name="movie" value="video2_js.swf" />
</object>

Then I can use these JavaScript functions on the page that embeds the player …

<script type="text/javascript">
//see: http://kirill-poletaev.blogspot.com/2011/02/exchange-data-between-actionscript-3.html
function getFlashMovie(movieName) {
  var isIE = navigator.appName.indexOf("Microsoft") != -1;
  return (isIE) ? window[movieName] : document[movieName];
}

// ... more functions were here, but there's too much for a blog post. :-]

function ncl_getCurrentTime() {
  var callResult = getFlashMovie("thisMovie").getCurrentTime("");
  return callResult;
}

function ncl_getTotalTime() {
  var callResult = getFlashMovie("thisMovie").getTotalTime("");
  return callResult;
}
</script>

… provided I make sure the ActionScript in the Flash player is prepared for those callbacks …

//see: http://kirill-poletaev.blogspot.com/2011/02/exchange-data-between-actionscript-3.html


//send current time value to JavaScript function
function sendCurrentTimeToJS(name:Number):Number
{
    var now:int = cfp.playheadTime;
    return now;
}
ExternalInterface.addCallback("getCurrentTime", sendCurrentTimeToJS);

//send total time value to JavaScript function
function sendTotalTimeToJS(name:Number):Number
{
    var total:int = cfp.totalTime;
    return total;
}
ExternalInterface.addCallback("getTotalTime", sendTotalTimeToJS);
--------------

Written by nitin

February 23rd, 2012 at 6:14 pm

Posted in scripts

a HammerFlix update

leave a comment

Update, November 27, 2011: If you're looking for a live list of Hammer Films streaming on Netflix you can see it here.

To read more about the HammerFlicks project, click here.

Principal photography has begun on HammerFlix – a small project to use the Netflix API to discover which Hammer Films movies are available on Netflix's Watch Instantly.

So far, I've got a PHP file that has two functions.

The first one, named Igor, takes two arguments: a movie title and its release year. Igor then sends this to the Netflix API and retrieves only the first result for searching against that given title. Then Igor sends the XML version of the API results to another function, Master.

Master, aka Dr. Frankenstein, then evaluates the result. If the Netflix release date for the movie returned by the API matches the year value sent to Igor, then Master will display the link to that movie on Netflix. If the movie is available via Watch Instantly, Master will also display the link to the streaming movie. If the year doesn't match, Master reports that no results were found.

Testing against just the first match and using release year as the only qualifier might not be the best, but I think it might work pretty decently. If not, I'll have Igor retrieve more results and then Master can evaluate more results and use more test criteria before assuming the movie isn't on Netflix.
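
The "retrieve more results and use more test criteria" idea could look roughly like this. It's a sketch in Python 3 rather than PHP, it assumes the API results have already been parsed into a list of dictionaries, and everything in it (the dictionary keys, the `normalize` and `pick_match` names, and the candidate data) is invented for illustration.

```python
def normalize(title):
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return " ".join(title.lower().split())


def pick_match(candidates, title, year):
    """Return the first candidate matching both title and release year, else None."""
    for candidate in candidates:
        same_title = normalize(candidate["title"]) == normalize(title)
        if same_title and candidate["year"] == year:
            return candidate
    return None


# Made-up results standing in for several parsed API hits for one search.
candidates = [
    {"title": "Brides of Dracula Revisited", "year": 1999, "instant": False},
    {"title": "The Brides of Dracula", "year": 1960, "instant": True},
]

match = pick_match(candidates, "the brides of  dracula", 1960)
no_match = pick_match(candidates, "The Brides of Dracula", 1961)
print(match, no_match)
```

With only the first result examined, the 1960 film in this made-up list would have been missed because it isn't the top hit.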

The next step is to get all the Hammer titles from the Hammer Filmography on Wikipedia and send each title and release year to HammerFlix. There might be some open/linked data opportunities later down the road with dbPedia, but that's not important for now.

You can see this very basic test of HammerFlix 0.01 here.

Anyway, here's the code for the development version of 0.01.

<?php

//Igor does the hard work of hitting up the Netflix API for movies matching $title.
//The code is mainly from: http://developer.netflix.com/page/resources/sample_php
function Igor($title, $year) {

    include ('../authentication/myAPI.php'); //this includes my Netflix API key and shared secret as $apiKey and $sharedSecret.

    //build stuff to send to API.
    $arguments = Array(
        'term' => $title,
        'expand' => 'formats',
        'max_results' => '1',
        'output' => 'xml'
    );

    $path = "http://api.netflix.com/catalog/titles";
    $oauth = new OAuthSimple();
    $signed = $oauth->sign(Array('path' => $path,
                'parameters' => $arguments,
                'signatures' => Array('consumer_key' => $apiKey,
                    'shared_secret' => $sharedSecret
                    )));

    //hit up API via CURL.
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $signed['signed_url']);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    //curl_setopt($curl, CURLOPT_SETTIMEOUT, 2); //Nitin commented this out on 2/5/2011 to prevent a PHP error message.
    $buffer = curl_exec($curl);
    if (curl_errno($curl)) {
        die("An error occurred:" . curl_error($curl));
    }

    Master($buffer, $title, $year); //send XML results to Master().
}

//Master (Dr. Frankenstein) parses/returns the Netflix XML results retrieved by Igor.
function Master($buffer, $title, $year) {

    $xml = simplexml_load_string($buffer);
    $movieInfo = ($xml->catalog_title);
    $short = "short";
    $movieTitle_short = ($movieInfo->title->attributes()->$short);
    $regular = "regular";
    $movieTitle_regular = ($movieInfo->title->attributes()->$regular);
    $movieLink = ($movieInfo->id);
    $movieId = str_replace("http://api.netflix.com/catalog/titles/movies/", "", $movieLink);
    $movieYear = ($movieInfo->release_year);

    //test if movie is available for Watch Instantly/streaming.
    $streaming = $xml->xpath('//availability/category/@label');
    foreach ($streaming as $instantTest) {
        if ($instantTest == 'instant') {
            $streams = '';
        }
    }

    //output findings.
    echo "<li><p>Testing:<em> " . $title . " </em>from the year:<em> " . $year . "</em><br />";
    if ($movieYear == $year) {
        echo "<a href='http://movies.netflix.com/WiMovie/" . $movieId . "'>" . $movieTitle_short . "</a>";

        //IT'S ALIVE!!!
        //aka: show user the Watch Instantly link if it exists.
        if (isset($streams)) {
            echo "<br /><strong><a href='http://movies.netflix.com/WiPlayer?movieid=" . $movieId . "'>Watch Instantly</a></strong>";
        }
    } else {
        echo "No match.";
    }
    echo "</p></li>";
}

//Create life!
//aka: start doing things.
include ('../authentication/OAuthSimple.php');

echo "<ul>"; //put results in unordered list; send arguments to Igor().
Igor("The Brides of Dracula", "1960");
Igor("The Brides of Dracula", "1961");
Igor("Dracula Has Risen from the Grave", "1968");
Igor("Vampire Circus", "1972");
echo "</ul>";
?>
--------------

Written by nitin

September 10th, 2011 at 1:16 pm

Posted in scripts

HammerFlix 2: Terror of the lost API Keys

leave a comment

Update, November 27, 2011: If you're looking for a live list of Hammer Films streaming on Netflix you can see it here.

To read more about the HammerFlicks project, click here.

So, like six months ago I posted about an idea involving the Netflix API and trying to get a list of Hammer horror films available for streaming.

The good news is I've decided to carve out some time each Saturday and actually make this into a little pet project.

The bad news is that while editing the information about my app, I accidentally deleted my API keys and have to wait … and wait … for the new ones to be approved, a process that's apparently been experiencing some delays lately.

Bummer.

Anyway, there are some interesting things that have happened since last time.

  1. The Netflix API is now totally about the Watch Instantly (streaming) catalog. See here.
  2. This PHP code example of hitting up the API, which used to work out of the box, now seems to require that I surround array keys like "term" and "output" with quotation marks, a la:
$arguments = Array(
'term'=>'fargo',
'expand'=>'formats,synopsis',
'max_results'=> '1',
'output'=>'json'
);
  3. Netflix now has this open OData API in the works, too. So I might not even need to use the old API that requires a key …
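
A note on the quoting in item 2: in PHP an unquoted array key such as term is read as a constant name rather than a string. Older PHP versions fell back to the string with a notice or warning, and PHP 8 treats it as an error. Here's a minimal sketch of the quoted form. The http_build_query() call is only my way of showing what the array holds; how OAuthSimple itself consumes the array isn't shown here.

```php
<?php
// Quoted keys are plain strings; a bare `term` would be looked up as a constant.
$arguments = array(
    'term'        => 'fargo',
    'expand'      => 'formats,synopsis',
    'max_results' => '1',
    'output'      => 'json',
);

// Illustration only: serialize the arguments as a URL query string.
$query = http_build_query($arguments);
echo $query, "\n";
```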

That's to say that while I need to read more, even if the OData API also covers only streaming titles, I can still retrieve the Netflix movie identifier from it for a given title. What this means is that I should still be able to pass a list of Hammer film titles to the OData API and scrape the results for each movie's identifier. This in turn means I can create a hyperlink that starts streaming the movie with one click.

Here's an OData example for "The Name of the Rose" taken from the Netflix developer site:

http://odata.netflix.com/Catalog/Titles?$filter=Name%20eq%20'The%20Name%20of%20The%20Rose'
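
A URL like the one above could be built for any title with something along these lines. This is a rough sketch, not tested against the live service: odata_title_url() is a made-up helper name, and the only OData rule it relies on is that a single quote inside a string literal is escaped by doubling it.

```php
<?php
// Hypothetical helper: build an OData title query for an exact film name.
function odata_title_url($title)
{
    // OData string literals escape a single quote by doubling it.
    $literal = str_replace("'", "''", $title);
    $filter  = "Name eq '" . $literal . "'";

    // rawurlencode() emits %20 for spaces and %27 for the quotes.
    return 'http://odata.netflix.com/Catalog/Titles?$filter=' . rawurlencode($filter);
}

echo odata_title_url('The Name of The Rose'), "\n";
```

The quotes come out percent-encoded as %27 here, where the example URL has them literal; that should be equivalent as far as the URL goes.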

Looking at the source for the results reveals the Netflix movie identifier like so:

<d:Url>http://www.netflix.com/Movie/The_Name_of_the_Rose/70000552</d:Url>
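
Pulling the identifier out of that element might look like the sketch below. The $sample string is a cut-down stand-in I made up for one feed entry, and its namespace URI is a placeholder, so the XPath matches on local-name() instead of a registered prefix. movie_id_from_feed() is likewise a hypothetical name.

```php
<?php
// Stand-in for one entry of the feed; the real response carries far more fields.
// The namespace URI is a placeholder, hence the local-name() match below.
$sample = '<entry xmlns:d="urn:example:d">'
        . '<d:Url>http://www.netflix.com/Movie/The_Name_of_the_Rose/70000552</d:Url>'
        . '</entry>';

function movie_id_from_feed($xmlString)
{
    $xml = simplexml_load_string($xmlString);
    if ($xml === false) {
        return null; // not parseable as XML
    }
    $urls = $xml->xpath("//*[local-name()='Url']");
    if (!$urls) {
        return null; // no Url element found
    }
    // The identifier is the trailing run of digits in the URL path.
    if (preg_match('~/(\d+)/?$~', (string) $urls[0], $m)) {
        return $m[1];
    }
    return null;
}

echo movie_id_from_feed($sample), "\n";
```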

With that identifier, I should be able to build a link to the streaming page like this (you must be logged into Netflix):

http://movies.netflix.com/WiPlayer?movieid=70000552
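
Generating that link for each identifier could be as simple as this sketch. watch_instantly_link() is a hypothetical helper; the WiPlayer URL pattern is just the one shown above, and I'm assuming identifiers are always purely numeric.

```php
<?php
// Hypothetical helper: turn a Netflix movie identifier into a Watch Instantly anchor.
function watch_instantly_link($movieId, $label = 'Watch Instantly')
{
    // Assumption: identifiers are purely numeric, like 70000552.
    if (!preg_match('/^\d+$/', (string) $movieId)) {
        return '';
    }
    $href = 'http://movies.netflix.com/WiPlayer?' . http_build_query(array('movieid' => $movieId));
    return '<a href="' . htmlspecialchars($href) . '">' . htmlspecialchars($label) . '</a>';
}

echo watch_instantly_link('70000552'), "\n";
```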

Ok, I just clicked on that and now there's a good possibility I'll be watching "The Name of the Rose" tonight … or not.

;(

--------------

Written by nitin

August 28th, 2011 at 8:20 pm

library APIs

Here's a cool list of library-related APIs … aka "toys".

:)

http://techessence.info/apis/

--------------

Written by nitin

June 18th, 2011 at 4:09 pm

Posted in technophilia
