blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘scripts’ Category

HammerFlicks and pOAIndexter source codes available

Quickie:

I've made the source code available for HammerFlicks (click on the "Source Code" link) and "pOAIndexter" (scroll down to "pOAIndexter").

The "pOAIndexter" script is used to drive the metadata harvesting for NC ECHO.

I seriously doubt anyone else will download/use these, but making them downloadable forces me to do a decent job – I hope – of being organized.

--------------

Written by nitin

March 31st, 2013 at 2:52 pm

Posted in news,scripts

getting real-time values from imported modules with a Python GUI

Situation: Over a year ago I wrote a Python script to allow one to convert XML citations from Pubmed.gov to a Microsoft spreadsheet.

I know some people are using it here and there, so I wanted to make it better.

The main problem with the old script is it's just sloppy (I knew even less back then than I do now!). It's also a script that wraps everything into one: non-GUI and GUI. If you don't pass command line options, then it launches the GUI, etc. Anyway, that makes the code hard to read for me since it intermixes data parsing with command line option stuff and GUI stuff.

So, for the next version I'm working on, I started with the premise of writing it as a Python library so that it can be imported and its function used to make a spreadsheet inside another Python script a la:

import pubmed2xl
pubmed2xl.makeSheet("pubmed.xml", "pubmed.xls") #pass input and output (Excel file to be written)

It's also set up to make it easy to use command line options a la:

python pubmed2xl.py pubmed.xml pubmed.xls

The function and command line options also support showing the progress of completion while the spreadsheet is being made. This can be called as such:

pubmed2xl.makeSheet("pubmed.xml", "pubmed.xls", showProgress=True)

or

python pubmed2xl.py pubmed.xml pubmed.xls --verbose

The problem for me, then, was how to show the progress inside a GUI application. Essentially, I needed the value of the progress counter "variable" that was created inside a loop and updated on each pass through the loop. But I couldn't figure out how to retrieve the value of the progress counter variable in real time as the loop ran. And I needed it in real time so my GUI could show the progress update to the user – in real time!

I spent way too much time following leads that got me nowhere. I tried threads, running the Python script as a sub-process, etc., but I could never access the variable "progressValue" that equates to the percentage of task completion as citations are getting processed into a spreadsheet.

So, somehow I found my way to realizing that if my original script had a class and my second script added a method to the class then I could get the value of "progressValue" in real time.

Anyway, I've got two scripts below. The "first.py" script emulates a progress calculator by simply counting to 100. The script also has a class, "callback" and a global dictionary "_CALLBACK_DICT" into which I can place key/value pairs for whatever variables I want to retrieve during the loop.

The function "canYouSeeMe()" inside "first.py" also tries to execute the method "_CALLBACK.callback()" during the loop. In other words, if the method's there, run it, otherwise just ignore it.

The second script, "second.py", is a little Tkinter GUI app. It imports the first module and the instantiated class ("_CALLBACK"). It also has a function called "getCallback()" that does what I want: i.e. retrieve the progress count in real time and show it in the GUI in real time. I then assign "getCallback()" to "_CALLBACK.callback". So now, when I run the "second.py" script, the loop in "first.py" can give me the data I want to show in "second.py" in real time. Make sense? I hope so because it seems to be working OK.

Here's a screenshot from running "second.py" and below are the scripts themselves. I'd love any feedback on better ways of doing this, by the way.

[Screenshot: Tkinter callback example]

first.py

##### "first.py"

class callback():
  pass
_CALLBACK = callback()
_CALLBACK_DICT = {}

rangers = range(0, 101, 10)
def canYouSeeMe():
  for ranger in rangers:
    _CALLBACK_DICT["this_ranger"] = str(ranger)
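    try:
      _CALLBACK.callback() #run the callback if another script attached one
    except AttributeError:
      pass #no callback attached, so just keep looping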

second.py

##### "second.py"

#import first module
import first
from first import _CALLBACK

#import Tkinter
from Tkinter import *

#create function and add as method to class "_CALLBACK"
def getCallback():
    importedValue = first._CALLBACK_DICT["this_ranger"]
    t.insert(END, importedValue + "%\n")
    if importedValue == "100":
        t.insert(END, "\nDone.")
    t.see(END)
    t.update_idletasks()
_CALLBACK.callback = getCallback #attach the function to the imported "_CALLBACK" instance

#create GUI buttons
class buttons():
   
    def __init__(self, root):
 
        #make frame/button
        frame = Frame(root)
        frame.pack()
       
        buttonText = "go"
        buttonAction = self.go
        self.makeButton = Button(frame, text=buttonText, command=buttonAction)
        self.makeButton.pack()
       
    #run go()
    def go(self):
      first.canYouSeeMe()

#create GUI
root = Tk()
buttons = buttons(root)
t = Text(root, background="black", foreground="blue")
t.pack()
geo = ("150x250")
root.geometry(geo)
root.mainloop()
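
For comparison, a common alternative to attaching a method to a shared object is to hand the callback in as an argument. This is only a minimal sketch in Python 3 syntax, not the pubmed2xl code; the names "count_with_progress" and "on_progress" are made up for illustration. A GUI script could pass a function like "getCallback()" in the same way, taking the value as a parameter instead of reading it from a global dictionary.

```python
def count_with_progress(on_progress=None, step=10):
    """Count from 0 to 100 and report each value to an optional callback."""
    for value in range(0, 101, step):
        if on_progress is not None:
            # The caller decides what "showing progress" means.
            on_progress(value)


# Stand-in for a GUI: collect the reported values in a list.
seen = []
count_with_progress(on_progress=seen.append)
print(seen)
```

The loop never needs to know whether a GUI, a log file, or nothing at all is listening.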
--------------

Written by nitin

February 7th, 2013 at 3:00 pm

Posted in scripts

questionable questions and a lazy way to add command line support to a Python module

Ugh.

I just took this Myers-Briggs questionnaire and I just "love" (sarcasm) some of these questions:

Do you usually get along better with imaginative people, or realistic people?

What? Truly imaginative people have to be "realistic" otherwise they're just spewing out pipe-dreams. People who can think of the new and make it a reality have to have a large degree of realism.
   
In reading for pleasure, do you enjoy odd or original ways of saying things, or like writers to say exactly what they mean?

First of all, I question why the question is making some kind of statement about the writer, not the book or the "message", as the basis for selection.

Secondly, show me a writer who thinks they are saying "exactly" what they mean and I'll show you a "writer" who knows not one thing about interpretation of the self let alone the words of others.

Anyway …

Ok, on to the second thing I wanted to say before I jump in the shower past noon (I'm home … say it with me … sick).

I have a Python module with lots of functions. I want them to be importable in other Python scripts AND callable via the command line.

And I want them all callable via the command line without taking the time to write out command line option support.

:P

For example, in a script with a function "echo" that prints an argument, "add" that adds two integers, and "times" that multiplies two integers, I just want to do this:

$ python cl.py echo('hello world')
hello world

$ python cl.py echo(add(1,2))
3

$ python cl.py echo((add(100, times(2,10))))
120

instead of stuff like this:

$ python cl.py --function=echo arg="hello world"

etc. …

Using the built-in "eval" function and "sys.argv" seems to be working:

#cl.py

def echo(s):
  print s

def add(x, y):
  return x + y

def times(x, y):
  return x * y

def main():
  import sys
  try:
    funks = sys.argv[1:] #user must pass strings in single quotes. 
    funks = " ".join(funks)
    eval(funks)
  except:
    pass

if __name__ == "__main__":
  main()
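
Since "eval" will happily run anything it's handed, here's a hedged variation that only exposes a chosen set of functions to the evaluated string. It's a sketch in Python 3 syntax; "ALLOWED" and "run_expression" are names invented for this example, and emptying the builtins like this narrows what an expression can reach but is not a real sandbox.

```python
def echo(s):
    print(s)
    return s


def add(x, y):
    return x + y


def times(x, y):
    return x * y


# Only these names are visible to the evaluated expression.
ALLOWED = {"echo": echo, "add": add, "times": times}


def run_expression(expression):
    """Evaluate a command-line style expression using only ALLOWED names."""
    return eval(expression, {"__builtins__": {}}, dict(ALLOWED))


result = run_expression("echo(add(100, times(2, 10)))")
```

Anything outside the dictionary, such as "open", raises a NameError instead of running.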
--------------

Written by nitin

January 31st, 2013 at 12:30 pm

Posted in scripts

keyword vs. phrase searching of the Soundboard, a GFA publication

As I mentioned before, last summer I went to the Guitar Foundation of America convention in Charleston.

I also mentioned that I'd asked some questions about whether the GFA journal, "Soundboard" was full-text indexed.

Via the FlippingBook software the GFA uses to display current issues online (membership required), there is full-text searching capability because the content is indexed as far as I can tell. But as I was saying, I don't think one can search across *all* online Soundboards simultaneously – i.e. fire off one query and get results across all online Soundboards. I could be wrong about that.

In contrast, the PDF back issues sold on a DVD-ROM are not full-text indexed nor full-text searchable with Adobe Acrobat Reader as far as I can tell. And I think this is where there's real confusion – perhaps on my part – about what we mean when we use terms like "keyword" searching.

To me, keyword searching means full-text and not a "find" (as in Acrobat Reader). The Webopedia site differentiates these as "keyword" and "phrase" searches, respectively. The GFA is using a different meaning, per the "How to search Soundboard back issues.pdf" file that comes with the DVD, for "keyword" searching:

"These issues have been processed both to reproduce the page-by-page appearance of the originals on your computer screen, and to apply an "optical character recognition" (OCR) process to the text, so that every page of every issue is now keyword searchable."

In my experience, however, the search provided internally via Adobe Acrobat Reader (and Foxit Reader, too) is what I'd just call a "find" (i.e. the same as Ctrl-F on your browser). In fact, in my version of Acrobat Reader and per the screenshot in the "How to search Soundboard back issues.pdf" file, Adobe also uses the phrase "find" and not "search" in their application. Their "Advanced Search" adds options really dealing with what to search (comments, all files in a folder, etc.) but not really how to search (in the algorithmic sense) – so, it's still a "find", though more feature-rich. Now, if you have Acrobat Pro (admittedly I do through work) you apparently can create an index and then actually do a full-text search, but that doesn't help people who don't have the pro version and won't/can't buy it.

Granted, I can index the PDF with my operating system (Windows) and do a full-text search, but I don't really get much useful information other than what files match. I don't get useful information on where the passage exists (page number, etc).

Consider the following passage from Soundboard Volume 1, Number 1, 1974:

"Mr. Llois Mauerhofer, Elizabethstrasse 93, 8010 Graz, Lustria, was reported working on a doctoral dissertation at the University of Graz on Leonard von Call, early 19th c. guitarist active in Vienna who is best remembered for his serenades for guitar and strings."

A "find" won't match that passage if you search for "Graz University" or "University Graz" or "strings Vienna" but a real keyword search likely would.
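
To make the difference concrete, here is a toy comparison in Python 3 syntax. It is only a sketch: "find" and "keyword_match" are names made up for this example, and a real full-text engine does much more (ranking, stemming, stop words) than this all-terms-present check.

```python
import re

PASSAGE = (
    "Mr. Llois Mauerhofer, Elizabethstrasse 93, 8010 Graz, Lustria, was "
    "reported working on a doctoral dissertation at the University of Graz "
    "on Leonard von Call, early 19th c. guitarist active in Vienna who is "
    "best remembered for his serenades for guitar and strings."
)


def find(text, query):
    """Ctrl-F style: the query must appear as one contiguous string."""
    return query.lower() in text.lower()


def keyword_match(text, query):
    """Keyword style: every query word must appear somewhere, in any order."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(term in words for term in re.findall(r"\w+", query.lower()))


for query in ("Graz University", "University Graz", "strings Vienna"):
    print(query, "| find:", find(PASSAGE, query),
          "| keyword:", keyword_match(PASSAGE, query))
```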

Of course, a demonstration is in order. I used a tool called Apache Tika to extract the text from the aforementioned PDF scan of Soundboard v.1, #1, 1974; a little Python script I wrote to output the data to a database-friendly file; and an online database. With those, I indexed the data and made a little API – all that means is that there's a page you can go to, throw some search terms at it, and get the results back as structured data (um, usually not fun to read through).

By the way, I normally use more technical jargon in my posts but I have some guitarist buddies who I want to read this page.

Anyway, here are the three searches mentioned above that don't yield results in Acrobat Reader but do using a full-text search (you can see the search terms in bold in the links below). Don't worry if you can't read the output, just focus on the fact that something comes back (provided my database isn't down at the moment!).

http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=Graz+University
http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=University+Graz
http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=strings+Vienna

For a more user-friendly version, try going here:

http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/soundboard_search.html

Try typing in the three searches mentioned above. Then try some more searches for fun. For simplicity's sake, I hard-coded the system to never return more than 10 results.

Of course, this should all scale to indexing the text of all the PDFs on the DVD, but exposing those openly on the web wouldn't be appropriate.

But my point with this demo is to say that this is more like what I meant by "keyword" searching at the GFA convention. There's probably a way to ingest the old PDFs into the FlippingBook software or at least something else like the Internet Archive book reader. That would probably require re-OCRing the images so that the coordinates of the words could be indexed as well, allowing one to see where on a page the results are, just as with the current issues via FlippingBook.

Ok, if you're still here and are a geek, here's the Python script, "soundboardToTabDelimited.py".

'''
usage example:
  $ python soundboardToTabDelimited.py V01-n1-1974.pdf

This yields "V01-n1-1974.xhtml" and then "V01-n1-1974.txt"
 
Note: you must have the lxml module installed (which isn't always fun).
You can get it here: http://lxml.de/
'''

import codecs, subprocess, sys
from lxml import etree

##### globals
tab = "\t"
br = "\n"


##### run Apache Tika on the file passed via the command line
soundboard = sys.argv[1].replace(".pdf", "")
command_string = "java -jar tika-app-1.2.jar %s > %s" %(soundboard + ".pdf", soundboard + ".xhtml")
command = subprocess.Popen(command_string, shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
command.wait() #wait until the subprocess finishes.


##### write file headers (this needs to be deleted if you're going to later import the file via PHPMyAdmin).
tab_delimited = codecs.open(soundboard + ".txt", "w", "utf-8") #output file

tab_delimited.write("journal_id" + tab + "volume" + tab + \
                    "issue" + tab + "year" + tab + \
                    "page_id" + tab + "text_id" + tab + "text" + br)


##### extract volume, issue, year from filename
volume = int(soundboard.split("-")[0].replace("V", ""))
issue = int(soundboard.split("-")[1].replace("n", ""))
year = int(soundboard.split("-")[2])
journal_id = "%04d_%04d_%04d" %(volume, issue, year)


##### parse xhtml file
soundboard_parse = etree.parse(soundboard + ".xhtml")
root = soundboard_parse.xpath(".")

div_tags = root[0].xpath("//xhtml:div[@class='page']",
             namespaces={"xhtml":"http://www.w3.org/1999/xhtml"})


##### extract text from each div/p tag and write data to file
page_id = 1
for div_tag in div_tags:
  text_id = 0
  p_tags = div_tag.xpath("xhtml:p",
             namespaces={"xhtml":"http://www.w3.org/1999/xhtml"})

  for p_tag in p_tags:
    p_text = p_tag.text
    if p_text is not None and p_text != "":
      p_text = p_text.replace(br, "")
      p_text = p_text.replace(tab, "  ")
      p_text = p_text.strip()
      if p_text != "":
        tab_delimited.write(str(journal_id) + tab + str(volume) + tab + \
                            str(issue) + tab + str(year) + tab + \
                            str(page_id) + tab + str(text_id) + \
                            tab + p_text + br)
        text_id = text_id + 1
     
  page_id = page_id + 1

tab_delimited.close()
# fin
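
The filename handling in the script can also be pulled out into a small function that's easy to try on its own. This is a sketch in Python 3 syntax; "parse_issue_filename" is a name invented here, and it assumes files are always named like "V01-n1-1974.pdf".

```python
import os


def parse_issue_filename(filename):
    """Split a name like 'V01-n1-1974.pdf' into (journal_id, volume, issue, year)."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    volume_part, issue_part, year_part = stem.split("-")
    volume = int(volume_part.lstrip("Vv"))
    issue = int(issue_part.lstrip("Nn"))
    year = int(year_part)
    # Same zero-padded ID format as the script above.
    journal_id = "%04d_%04d_%04d" % (volume, issue, year)
    return journal_id, volume, issue, year


print(parse_issue_filename("V01-n1-1974.pdf"))
```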
--------------

Written by nitin

January 5th, 2013 at 12:35 pm

PyEDS: a simple Python starter library for Ebsco’s Discovery Service (EDS)

Before this little vacation I'm on started (sadly, it's almost over!), I was allowed to have access to Ebsco's Discovery Service (EDS) API and its documentation WIKI.

I sent a tiny bit of feedback on some of the things in the documentation that I think are unclear or really need correction and I'm looking to send more when I return to work.

My biggest concern was that – and I think this is true of A LOT of API documentation – it requires a lot of reading on the user's part to figure out what means what, which almost invariably exceeds the amount of work needed to actually write the code to authenticate, make queries, etc.

That's to say that often working through documentation about tying a shoelace is more of a task than actually tying said shoelace.

I *think* developers really just want to start experimenting with code, so clarity and really concise language with examples are really of the utmost importance.

Speaking of examples, I also think that sample code needs to have scope in mind. What I'm getting at is that sample code for a search API shouldn't be a "soup to nuts" thing that entails authenticating, making requests, having a client-side UI/interface and displaying results, etc. That's too much. Again, I think (off the top of my head of course and with nothing more than a gut feeling) that it might be more helpful to simply show how to authenticate and make a request and show the formatting of a sample response. The other stuff – interface, UI, etc, etc. – just convolutes the code and adds noise to the basics. In fact, that confuses API usage implementation vs. the API usage itself.

Better still would be to offer small libraries in popular scripting languages that simplify the basics – again, to make it easy for people to play with one's APIs. The easier and more "fun" it is, the more likely I think (yeah, yeah, I know!) people are to really dream about incorporating the API, etc. into their applications and what-nots.

So along those lines, I've pasted a little sample Python script below that makes it much easier for me to authenticate, open a session, conduct searches, format the JSON response, and close the session. It needs work (what doesn't?) but it does what I mean for it to do for now.

I probably shouldn't post a sample response since access to the EDS WIKI is for customers only, but if you aren't a customer or at least aren't interested, why are you even reading this?

:P

#PyEDS.py

'''
This module provides a basic Python binding to Ebsco's EDS API, allowing one to:
  - authenticate with a UserID and Password,
  - open and close a session,
  - perform a search (results are returned as JSON),
  - pretty print the JSON.
 
Thanks,
Nitin Arora; nitaro74@gmail.com
____________________________________________________________________________________________________
#Usage example:
 
  import PyEDS as eds
  
  eds.authenticateUser('USERID_GOES_HERE', 'PASSWORD_GOES_HERE')
  eds.openSession('PROFILE_GOES_HERE', 'GUEST_GOES_HERE', 'ORG_GOES_HERE')
 
  #eds.authenticateFile() #alternative to using authenticateUser() and openSession()
  #uses values in JSON config file argument(default="config.json")
  
  #sample "config.json" file:
  """
  {
    "EDS_config": {
      "UserId": "USERID_GOES_HERE",
      "Password": "PASSWORD_GOES_HERE",
      "Profile": "PROFILE_GOES_HERE",
      "Guest": "GUEST_GOES_HERE",
      "Org": "ORG_GOES_HERE"
    }
  }
  """
 
  kittens = eds.advancedSearch('{"SearchCriteria":{"Queries":[{"Term":"kittens"}],"SearchMode":"smart","IncludeFacets":"y","Sort":"relevance"},"RetrievalCriteria":{"View":"brief","ResultsPerPage":10,"PageNumber":1,"Highlight":"y"},"Actions":null}')
  puppies = eds.advancedSearch('{"SearchCriteria":{"Queries":[{"Term":"puppies"}],"SearchMode":"smart","IncludeFacets":"y","Sort":"relevance"},"RetrievalCriteria":{"View":"brief","ResultsPerPage":10,"PageNumber":1,"Highlight":"y"},"Actions":null}')
  cubs = eds.basicSearch('cubs')
  piglets = eds.basicSearch('piglets', view='brief', offset=1, limit=10, order='relevance')
  
  eds.closeSession()
  
  print 'Some search results with the EDS API ...'
  print '\n"kittens" advanced search as original JSON:'
  print kittens
  print '\n"puppies" advanced search as original JSON:'
  print puppies
  print '\n"kittens" advanced search as JSON with indentation and non-ascii escaping:'
  print eds.prettyPrint(kittens)
  print '\n"cubs" and "piglets" basic searches as original JSON:'
  print cubs, piglets
  print '\nGoodbye.'
____________________________________________________________________________________________________
 
TO DO:
  - add more options to basicSearch() like "facets", "search mode", "fulltext", "thesaurus", etc.
    - can't hurt! :-]
  - consider adding an authenticateIP() function that uses the IP authentication method.
  - deal with expired tokens, etc.; see: http://edswiki.ebscohost.com/API_Reference_Guide:_Appendix
'''
 
import urllib2
_EDS_ = {}
 
 
def authenticateUser(UserId, Password):
  '''Authenticates user with an EDS UserId and Password.'''
  auth_json = '{"UserId":"%s","Password":"%s","InterfaceId":"WSapi"}' %(UserId, Password)
  req = urllib2.Request(url='https://eds-api.ebscohost.com/authservice/rest/UIDAuth',
                        data=auth_json,
                        headers={'Content-Type':'application/json'})
  req_open = urllib2.urlopen(req)
  req_results = req_open.read()
  
  req_results_dictionary = eval(req_results) #convert JSON to dictionary.
  _EDS_['AuthToken'] = req_results_dictionary['AuthToken']
  _EDS_['AuthTimeout'] = req_results_dictionary['AuthTimeout']
 
 
def openSession(Profile, Guest, Org):
  '''Opens the EDS session with an EDS Profile, the Guest value ("y" or "n"), and the Org nickname.'''
  sessionOpen_json = '{"Profile":"%s","Guest":"%s","Org":"%s"}' %(Profile, Guest, Org)
  req = urllib2.Request(url='http://eds-api.ebscohost.com/edsapi/rest/CreateSession',
                        data=sessionOpen_json,
                        headers={'Content-Type':'application/json',
                        'x-authenticationToken':_EDS_['AuthToken']})
  req_open = urllib2.urlopen(req)
  req_results = req_open.read()
 
  req_results_dictionary = eval(req_results)
  _EDS_['SessionToken'] = req_results_dictionary['SessionToken'].replace('\\/', '/')
 
 
def closeSession():
  '''Closes the EDS session.'''
  sessionClose_json = '{"SessionToken":"%s"}' %(_EDS_['SessionToken'])
  req = urllib2.Request(url='http://eds-api.ebscohost.com/edsapi/rest/EndSession',
                        data=sessionClose_json,
                        headers={'Content-Type':'application/json',
                        'x-authenticationToken':_EDS_['AuthToken']})
  urllib2.urlopen(req)
  
  
def authenticateFile(config_file='config.json'):
  '''Uses values in JSON config file to authenticate *and* open a session.'''
  config = open(config_file, 'r').read()
  config = eval(config)
  config = config['EDS_config']
  authenticateUser(config['UserId'], config['Password'])
  openSession(config['Profile'], config['Guest'], config['Org'])
 
 
def basicSearch(query, view='brief', offset=1, limit=10, order='relevance'):
  '''Returns search results using basic arguments.'''
  search_json = '''{"SearchCriteria":{"Queries":[{"Term":"%s"}],"SearchMode":"smart","IncludeFacets":"n","Sort":"%s"},
                   "RetrievalCriteria":{"View":"%s","ResultsPerPage":%d,"PageNumber":%d,"Highlight":"n"},"Actions":null}
                   ''' %(query, order, view, limit, offset)
  return advancedSearch(search_json)
 
         
def advancedSearch(search_json):
  '''Returns search results using the full EDS search syntax (JSON).'''
  req = urllib2.Request(url='http://eds-api.ebscohost.com/edsapi/rest/Search',
                        data=search_json, headers={'Content-Type':'application/json',
                        'x-authenticationToken':_EDS_['AuthToken'],
                        'x-sessionToken':_EDS_['SessionToken']})
  req_open = urllib2.urlopen(req)
  req_results = req_open.read()
  return req_results
 
 
def prettyPrint(json_string):
  '''Returns a pretty-printed, UTF-8 encoded JSON string with escaped non-ASCII characters.'''
  import json
  dictionary = json.loads(json_string, encoding='utf-8')
  return json.dumps(dictionary, ensure_ascii=True, indent=2, encoding='utf-8')
 
 
#fin
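
One thing that could be tightened up: the module builds JSON with string formatting and reads it back with "eval". A hedged alternative, sketched here in Python 3 syntax, is to let the standard "json" module do both. The field names are copied from the code above; "build_search_body", "parse_response", and the sample response string are made up for illustration and nothing here talks to the real service.

```python
import json


def build_search_body(query, view="brief", offset=1, limit=10, order="relevance"):
    """Build the search request body; json.dumps escapes quotes in the query."""
    body = {
        "SearchCriteria": {
            "Queries": [{"Term": query}],
            "SearchMode": "smart",
            "IncludeFacets": "n",
            "Sort": order,
        },
        "RetrievalCriteria": {
            "View": view,
            "ResultsPerPage": limit,
            "PageNumber": offset,
            "Highlight": "n",
        },
        "Actions": None,  # becomes JSON null
    }
    return json.dumps(body)


def parse_response(response_text):
    """Turn a JSON response into a dictionary without executing it as code."""
    return json.loads(response_text)


# Made-up response text, only to show the round trip.
sample = '{"AuthToken": "TOKEN_GOES_HERE", "AuthTimeout": "1800"}'
print(parse_response(sample)["AuthToken"])
print(build_search_body('kittens "and" puppies'))
```

Unlike "eval", "json.loads" also copes with JSON values such as null, true, and false.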
--------------

Written by nitin

December 30th, 2012 at 11:23 am

pixelation: custom XSLT functions with Python and lxml

I'll be brief.

Because the Python "lxml" module doesn't support XSLT 2.0 functions, I was looking at support for EXSLT

… but then stumbled on how to write my own functions and call them from stylesheets.

Freakin' cool.

I like calling it "pxslt" for "Python XSLT" and pronouncing it like "pixelate".

:P

Below are the "module" I made, the script that calls it, and the results.

Told you I'd be brief.

Module:

#pxslt.py

def underscore(context, word):
  '''Replace whitespace with underscore.'''
  out = word[0].replace(' ', '_')
  return out

def multiply(context, int_val, int2_val):
  '''Multiply two integers.'''
  int_val, int2_val = int(int_val[0]), int(int2_val[0])
  return int_val * int2_val

def libraryThing(context, isbn):
  '''Get language for a work based on ISBN using LibraryThing API.'''
  isbn = isbn[0]
  import urllib
  res = urllib.urlopen('http://www.librarything.com/api/thingLang.php?isbn=' + isbn)
  res_r = res.read()
  return res_r

##### DO NOT EDIT
##### makes it possible to call the above functions with XSLT
def pxslt():
  myFunctions = []
  gbs = globals()
  from inspect import isfunction
  for gb in gbs:
    if isfunction(gbs[gb]) and gb != 'pxslt':
      #print gb
      myFunctions.append(gbs[gb])

  from lxml import etree
  #see: http://lxml.de/extensions.html
  ns = etree.FunctionNamespace('file://libs/pxslt.py')
  ns.prefix = 'pxslt'
  for myFunction in myFunctions:
    name = str(myFunction.func_name)
    ns[name] = myFunction
  return ns

Usage example:

from lxml import etree

#####
myXML = etree.XML('''\
<a>
  <b>Hello. This will appear with whitespaces replaced by underscores.</b>
  <c>3</c>
</a>''')

myXSL = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:pxslt="file://libs/pxslt.py">
  <xsl:output method="text" version="1.0" />
  <xsl:template match="a">
    <xsl:variable name="isbn">9955081260</xsl:variable>
    <xsl:value-of select="pxslt:libraryThing($isbn)" />
    <xsl:text>\n</xsl:text> <!-- Python will line break here -->
    <xsl:value-of select="pxslt:underscore(b/text())" />
    <xsl:text>\n</xsl:text> <!-- Python will line break here -->
    <xsl:call-template name="mathFunc">
    </xsl:call-template>
  </xsl:template>
  <xsl:template name="mathFunc">
    <xsl:variable name="myNum">10</xsl:variable>
    <xsl:value-of select="pxslt:multiply(c/text(), $myNum)" />
  </xsl:template>
</xsl:stylesheet>'''))

import pxslt
pxslt.pxslt() #get all set up with namespaces and function stuff

print(myXSL(myXML))

#myXSL_file = etree.XSLT(etree.parse('foo.xsl')) #for testing with a real XSL file
#print(myXSL_file(myXML))

Output:

>>>
lit
Hello._This_will_appear_with_whitespaces_replaced_by_underscores.
30
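
The "DO NOT EDIT" part of the module boils down to: find every plain function in a namespace and register it by name. Here's that idea on its own, as a sketch in Python 3 syntax with no lxml involved. "collect_functions" is a name invented for this example, and the two sample functions leave out the "context" argument that lxml passes to real extension functions.

```python
from inspect import isfunction


def collect_functions(namespace, exclude=()):
    """Return {name: function} for every plain function in a namespace dict."""
    return {
        name: obj
        for name, obj in namespace.items()
        if isfunction(obj) and name not in exclude
    }


def underscore(word):
    return word.replace(" ", "_")


def multiply(a, b):
    return int(a) * int(b)


# Non-functions (like the integer here) are skipped.
registry = collect_functions(
    {"underscore": underscore, "multiply": multiply, "answer": 42}
)
print(sorted(registry))
print(registry["multiply"]("3", "10"))
```

Note that Python 3 spells the function name attribute "__name__" rather than "func_name".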

--------------

Written by nitin

November 2nd, 2012 at 5:28 pm

Python, lxml, and xsl:include

Keeping this short because yes, dammit, I'm home sick.

I needed/wanted to do some XSL transformations with Python using an <xsl:include> statement. But I kept getting errors along the lines of "lxml cannot resolve URI string".

So anyway after deciding I didn't want to read through all the crap on the lxml site about this, I fumbled my way through to what appears to work.

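It seems the include statements work fine when I DO NOT read() the XSL file before using it for a transformation. My guess is that parsing from a string leaves lxml with no file location to resolve the relative "a.xsl" path against, which would fit the "string://__STRING__XSLT__/a.xsl" in the error further down.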

In the interest of really keeping it short like I said, here's some code and the results below.

from lxml import etree
                
def works(someXML):
  #don't even open() the XSL file ...
  xslt_tree = etree.parse(xslFile)
  transform = etree.XSLT(xslt_tree)
  result = transform(someXML)
  return result

def also_works(someXML):
  #open() the XSL file, but don't read() it ...
  xsl_opened = open(xslFile, "r")
  xslt_tree = etree.parse(xsl_opened)
  transform = etree.XSLT(xslt_tree)
  result = transform(someXML)
  return result

def fails(someXML):
  #open() and read() the XSL file ...
  xsl_opened = open(xslFile, "r")
  xsl_read = xsl_opened.read()
  xsl_parsed = etree.XML(xsl_read)
  transform = etree.XSLT(xsl_parsed)
  result = transform(someXML)
  return result

#####
xslFile = "b.xsl"

myXML = etree.XML('''\
<a>
  <b>b-val</b>
  <c>c-val</c>
  <d>d-val</d>
</a>''')

print "Trying works() ..."
print works(myXML)

print "Trying also_works() ..."
print also_works(myXML)

print "Trying fails() ..."
print fails(myXML)

Here's what the code spits out …

Trying works() ...
<?xml version="1.0" encoding="iso-8859-1"?>
<div>
  <p>I'm from a.xsl.</p>
  <p>I'm from b.xsl.</p>
  <p>b-val c-val d-val</p>
</div>

Trying also_works() ...
<?xml version="1.0" encoding="iso-8859-1"?>
<div>
  <p>I'm from a.xsl.</p>
  <p>I'm from b.xsl.</p>
  <p>b-val c-val d-val</p>
</div>

Trying fails() ...

Traceback (most recent call last):
  File "C:\Users\nitaro\Dropbox\lxml_include\inc.py", line 44, in <module>
    print fails(myXML)
  File "C:\Users\nitaro\Dropbox\lxml_include\inc.py", line 23, in fails
    style = etree.XSLT(xsl_parsed)
  File "xslt.pxi", line 399, in lxml.etree.XSLT.__init__ (src/lxml/lxml.etree.c:118852)
  File "lxml.etree.pyx", line 280, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:7959)
XSLTParseError: Cannot resolve URI string://__STRING__XSLT__/a.xsl

Oh and here are the XSL files, "a.xsl" and "b.xsl" …

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" version="1.0" encoding="iso-8859-1" indent="yes"/>  
  <xsl:template match="a">
    <div>     
      <p>I'm from a.xsl.</p>    
      <xsl:call-template name="canUCme">
        <xsl:with-param name="name" select="/" />
      </xsl:call-template>  
    </div>
  </xsl:template>
</xsl:stylesheet>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:include href="a.xsl"/>
  <xsl:template name="canUCme">
    <xsl:param name="name" />
    <p>I'm from b.xsl.</p> 
    <p><xsl:value-of select="normalize-space($name)" /></p>
  </xsl:template>
</xsl:stylesheet>
--------------

Written by nitin

October 25th, 2012 at 12:27 pm

Posted in scripts,XML

parsing command line options kinda like GET variables in Python

I think – OK I know – I had a drink too many last night.

… and tomorrow I'll be working from home to clean up this script that we'll be using to harvest metadata feeds. Currently, just OAI/Simple Dublin Core but the script should support any XML feed as long as it's parse-able via XSLT 1.0 and as long as the paging of items can be facilitated via being in the XML itself (i.e. OAI's brilliant Resumption Token) or via numeric GET variable values a la:

&page=1 ... &p=3 ... &nextPage=4 ... etc.

Anyway, I need a simple way to pass arguments via the command line and for whatever reason I don't want to use the deprecated "optparse" module (well, I guess the reason is it's deprecated) or the "argparse" module.

I think I'll just use the following which lets me pass arguments kinda like GET variables in a URL. Sure, there's no error checking or messaging, but I really don't care. Maybe I'll care more tomorrow when my head's a little clearer.

def args2dict():
  import sys
  args = sys.argv
  args = args[1].split("$")
  args_dict = {}
  for arg in args:
    keyVal = arg.split("=")
    key = keyVal[0]
    val = keyVal[1]
    args_dict[key] = val
  return args_dict
# this returns a dictionary with each variable name as a key; the variable's value is the key's value

Say this is in a script called "test.py" which calls the function a la:

myArgs = args2dict()
print myArgs
print myArgs["a"]
if myArgs.has_key("foo"):
  print "yes"

If I run the following:

$ test.py a=1$b=2$c=3$foo=bar

I get this:

{'a': '1', 'c': '3', 'b': '2', 'foo': 'bar'}
1
yes
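
For the day the missing error checking starts to matter, here is a hedged sketch of a stricter variant, written for Python 3. The name parse_arg_string and its separator parameter are hypothetical. It takes the argument string directly, rather than reading sys.argv, so that it is easy to try out.

```python
def parse_arg_string(arg_string, separator="$"):
    """Parse 'a=1$b=2' into {'a': '1', 'b': '2'}.

    Raises ValueError when a piece is not of the form key=value.
    """
    parsed = {}
    for pair in arg_string.split(separator):
        if not pair:
            continue  # tolerate a doubled or trailing separator
        # partition() splits on the first "=" only, so values may contain "=".
        key, equals, value = pair.partition("=")
        if not key or not equals:
            raise ValueError("expected key=value, got %r" % pair)
        parsed[key] = value
    return parsed

print(parse_arg_string("a=1$b=2$c=3$foo=bar"))
```

In a script this would be called as parse_arg_string(sys.argv[1]) after checking that an argument was actually passed.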

--------------

Written by nitin

October 7th, 2012 at 12:19 pm

Posted in scripts

full-text searching of timed text and a farewell to Andy Roddick

It’s been a while since I had one of my “So, I’m home sick today and wrote this silly, little script” things.

Well, here’s another one while the antibiotics take root.

I’ve always wanted to do something with offering full-text search against timed-text files and allowing a user to click on a result and skip to the audio segment matching the returned line of timed-text, etc. Hulu has had a BETA version of this kind of thing for a while and I suspect others do too.

Well, today I just whipped up a little search API using PHP and MySQL. It’s a nice little start and super easy to do.

I made a database table using the timed-text data from my SAVS project, OpenOffice Calc, and phpMyAdmin. The text is from Shakespeare’s Sonnet 130 using a LibriVox recording (version #14, Miller). BTW, parsing DFXP or SRT files and throwing those into a table is easy, but it’s not within the scope of this little mock-up.

If I send a query for “mistress reeks” to the API as such:

http://blog.humaneguitarist.org/uploads/SAVS/currentVersion/search/?q=mistress%20reeks

… I get the following JSON response:

{
  "results":{
    "result":[
      {
        "text":"Than in the breath that from my mistress reeks.",
        "highlighted_text":"Than in the breath that from my &lt;mark&gt;mistress&lt;\/mark&gt; &lt;mark&gt;reeks&lt;\/mark&gt;.",
        "startTime":"34",
        "stopTime":"37",
        "source":"sonnet130_shakespeare_njm",
        "relevance":"4.04993200302124"
      },
      {
        "text":"My mistress, when she walks, treads on the ground:",
        "highlighted_text":"My &lt;mark&gt;mistress&lt;\/mark&gt;, when she walks, treads on the ground:",
        "startTime":"46",
        "stopTime":"49",
        "source":"sonnet130_shakespeare_njm",
        "relevance":"1.62977826595306"
      }
    ]
  }
}

Note that the text is returned in the “text” field and I’m also trying to return a “highlighted_text” field in which search terms are surrounded by the HTML5 “mark” tag. There’s also a relevance score … of sorts (pun!).
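
The highlighting idea can be sketched in Python 3 as follows. This is an illustration under assumptions, not a port of the API: the function name highlight is made up, and unlike the JSON above it leaves the "mark" tags themselves unescaped while escaping the rest of the line.

```python
import html
import re

def highlight(line_text, search_words):
    """Wrap each case-insensitive hit for a search word in <mark> tags."""
    words = [re.escape(word) for word in search_words.split()]
    if not words:
        return html.escape(line_text)
    # One capturing group: re.split() then alternates plain text and hits.
    pattern = re.compile("(" + "|".join(words) + ")", re.IGNORECASE)
    pieces = pattern.split(line_text)
    return "".join(
        "<mark>" + html.escape(piece) + "</mark>" if index % 2 else html.escape(piece)
        for index, piece in enumerate(pieces)
    )

print(highlight("My mistress, when she walks, treads on the ground:", "MISTRESS reeks"))
```

Matching against the raw text and escaping each piece afterwards keeps a search word from ever matching inside an HTML entity, and the hit keeps the capitalization it has in the original line.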

It needs a lot of work, but there’s enough data returned to launch an audio segment using some HTML5/JavaScript or some Flash or Silverlight API, etc. Hey, it ain’t too bad for a bad stomach and some sports-entertainment distractions.
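
As one hedged illustration of the audio-segment idea, a result's start and stop times could be turned into a W3C Media Fragments URL (the "#t=start,stop" suffix), which HTML5 audio elements in common browsers generally honor. The base URL and the ".mp3" extension below are assumptions made up for the sketch; only the field names come from the JSON response.

```python
def segment_url(result, base_url="http://example.org/audio/", extension=".mp3"):
    """Build a media-fragment URL covering one search result's time span.

    base_url and extension are hypothetical; the API response only
    supplies "source", "startTime" and "stopTime".
    """
    return "%s%s%s#t=%s,%s" % (
        base_url,
        result["source"],
        extension,
        result["startTime"],
        result["stopTime"],
    )

hit = {
    "source": "sonnet130_shakespeare_njm",
    "startTime": "34",
    "stopTime": "37",
}
print(segment_url(hit))
```

Setting that URL as the "src" of an audio element should start playback at second 34 and pause at second 37 where the browser supports temporal fragments.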

Below, I’ll paste the CSV file I used to make the table, the PHP script … and a personal note about the best male American tennis professional of the last decade.

Here’s the CSV file from the spreadsheet application (note the “line_text” field is full-text indexed in the database):

"line_id";"line_text";"start_time";"stop_time";"file_prefix"
"1";"Coral is far more red than her lips' red:";"13";"17";"sonnet130_shakespeare_njm"
"2";"If snow be white, why then her breasts are dun;";"17";"21";"sonnet130_shakespeare_njm"
"3";"If hairs be wires, black wires grow on her head.";"21";"26";"sonnet130_shakespeare_njm"
"4";"I have seen roses damask'd, red and white,";"26";"29";"sonnet130_shakespeare_njm"
"5";"But no such roses see I in her cheeks;";"29";"32";"sonnet130_shakespeare_njm"
"6";"And in some perfumes is there more delight";"32";"34";"sonnet130_shakespeare_njm"
"7";"Than in the breath that from my mistress reeks.";"34";"37";"sonnet130_shakespeare_njm"
"8";"I love to hear her speak, yet well I know";"37";"40";"sonnet130_shakespeare_njm"
"9";"That music hath a far more pleasing sound:";"40";"43";"sonnet130_shakespeare_njm"
"10";"I grant I never saw a goddess go, --";"43";"46";"sonnet130_shakespeare_njm"
"11";"My mistress, when she walks, treads on the ground:";"46";"49";"sonnet130_shakespeare_njm"
"12";"And yet, by heaven, I think my love as rare";"49";"54";"sonnet130_shakespeare_njm"
"13";"As any she belied with false compare.";"54";"56";"sonnet130_shakespeare_njm"
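
For anyone who would rather load such a file with a script than through phpMyAdmin, here is a small Python 3 sketch of reading the semicolon-delimited format above. The function name read_lines is made up, and the sample holds only the first two data rows.

```python
import csv
import io

SAMPLE = '''"line_id";"line_text";"start_time";"stop_time";"file_prefix"
"1";"Coral is far more red than her lips' red:";"13";"17";"sonnet130_shakespeare_njm"
"2";"If snow be white, why then her breasts are dun;";"17";"21";"sonnet130_shakespeare_njm"
'''

def read_lines(csv_file):
    """Return one dict per timed-text line, with numeric fields as integers."""
    reader = csv.DictReader(csv_file, delimiter=";", quotechar='"')
    return [
        {
            "line_id": int(row["line_id"]),
            "line_text": row["line_text"],
            "start_time": int(row["start_time"]),
            "stop_time": int(row["stop_time"]),
            "file_prefix": row["file_prefix"],
        }
        for row in reader
    ]

rows = read_lines(io.StringIO(SAMPLE))
print(rows[1]["line_text"], rows[1]["start_time"], rows[1]["stop_time"])
```

Because the fields are quoted, the semicolon at the end of the second line's text stays part of the text instead of starting a new column.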

Here’s the PHP script:

<?php
//GET search words from URL parameter
$searchWords = trim($_GET["q"]);

//prepare for highlighting keywords
$search_array= explode(" ", $searchWords);

//prepare for output
$output = array();

//connect to database
include_once("db_setup.php");

//run query
$searchWords = mysql_real_escape_string($searchWords);
$query = "SELECT *, MATCH(line_text) AGAINST(\"$searchWords\") AS relevance
FROM $table WHERE MATCH(line_text) AGAINST(\"$searchWords\" IN BOOLEAN mode)
ORDER BY relevance DESC";
$result = mysql_query($query);

if($result) {
    while($row = mysql_fetch_array($result)) {
      $line_text = $row["line_text"];
      $start_time = $row["start_time"];
      $stop_time = $row["stop_time"];
      $file_prefix = $row["file_prefix"];
      $relevance = $row["relevance"];

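      //highlight search words in line_text, keeping the original capitalization of each hit
      $highlighted_text = $line_text;
      foreach ($search_array as $word) {
        if ($word === "") {
          continue; //skip empty strings left by repeated spaces in the query
        }
        $highlighted_text = preg_replace("/" . preg_quote($word, "/") . "/i", '<mark>$0</mark>', $highlighted_text);
      }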

      $this_output = array("text" => htmlspecialchars($line_text),
      "highlighted_text" => htmlspecialchars($highlighted_text),
      "startTime" => $start_time,
      "stopTime" => $stop_time,
      "source" => $file_prefix,
      "relevance" => $relevance);
      array_push($output, $this_output);
    }
}

//send JSON results
if (count($output) == 0) {
  $results = array("results" => "No results.");
}

else {
  $result = array("result" => $output);
  $results = array("results" => $result);
}

$response = json_encode($results);
include_once("indent_json.php");
header("Content-type: application/json; charset=UTF-8");
echo(indent_json($response));
?>

And here’s something more important.

As a huge tennis fan, I found today a melancholy one: Andy Roddick played his last match, losing a few moments ago to Juan Martin del Potro. The Wikipedia article on Roddick here already lists him as retired, but the important thing to remember about Roddick is that he achieved more with less than a lot of other, more talented players, and he was entertaining to watch, win or lose, in big matches.

Thanks for the memories!

--------------

Written by nitin

September 5th, 2012 at 6:17 pm

sorta sorting API results with in-memory SQLite

I'll try to keep this short because it's looking like the weather is going to be agreeable enough for a nice, long Saturday walk.
 
So, I've been working on a mockup API at work that could, among other things, drive an in-site federated search across things like our Ebsco databases, the other vendor resources available through Ebsco's API using SRU, and of course our own databases with lists of the resources we offer and their descriptions.
 
Using simple textual similarity libraries it's easy to have the API return a text similarity score (a trick I learned working on HammerFlicks!) comparing the query against the title of each item. This way if someone types in "Wall St. Journal" it's easy to highlight (through an HTML/JavaScript page) the hit for "Wall Street Journal" from our own database because that'll be a good text similarity match.
 
Here's a snippet showing the similarity attribute:
<?xml version="1.0"?>
<nclive_api_response>
  <results source="ncl_resource_titles">
    <result text_similarity_score="86.666666666667">
      <title>Wall Street Journal</title>
      <url>http://www.nclive.org/cgi-bin/nclsm?rsrc=29</url>
      <description>Full articles from the Wall Street Journal (1981-current).</description>
    </result>
	…
  </results>
</nclive_api_response>
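
A rough Python 3 sketch of this kind of score, using the standard library's difflib. This is an assumption for illustration: the API uses a different similarity library, so the numbers will not match the 86.67 shown above. The function name text_similarity_score is borrowed from the XML attribute.

```python
from difflib import SequenceMatcher

def text_similarity_score(query, title):
    """Return a 0-100 similarity score between a query and an item title."""
    # Lowercase both sides so capitalization doesn't count against a match.
    ratio = SequenceMatcher(None, query.lower(), title.lower()).ratio()
    return ratio * 100

for candidate in ("Wall Street Journal", "Journal of Wildlife Management"):
    print(candidate, round(text_similarity_score("Wall St. Journal", candidate), 1))
```

A page could then highlight any item whose score clears some threshold, say 80, as the likely intended title.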

As for sorting through results brought in via multiple resources all using their own relevancy rankings – that's a different story. They're using their own relevancy calculations, so there's really no way to present results across multiple sources as the "most relevant".

 
I was toying with the idea, though, of testing what it would be like to – after the fact – index all the returned results on the fly in Solr or something just to get a relevancy ranking for the results the API returns. Now, this isn’t of course arguing that this would be a total relevancy rank across all sources. In other words, if you only pull five items from each "sub-API", each data source mentioned above, then there's no way to say that the first item from Database A is necessarily more relevant than the fifth result of Database B.
 
Anyway, I thought it was stupid to index things behind the scenes in something external just to get an on-the-fly relevancy rank to inject into the API results, when I'd only then have to quickly delete the entire index since I would just be using it to get a score.
 
But what I don't think is too stupid is the idea itself. It's making the argument that "Look, I've asked these different sources to send me their best stuff and now I'll have a way to rank them with my own criteria … because they're mine now." It's like using your own criteria to rank job candidates after asking a few of your industry friends to each send in their five best employees for the job you're hiring for. You're not necessarily going to agree with how they rank their own employees but you do trust that they've sent you five top notch folks.
 
… and so, after a colleague in another department asked if there would be a way to sort items across multiple data sources, I thought to investigate a way to do the indexing and have some kind of ranking/relevancy score done all in memory.
 
Enter SQLite.
 
This is really cool. With SQLite, I can create an in-memory, full-text-searchable database on the fly that will let me develop some kind of rank per item. Note that SQLite has to be built with FTS3/FTS4 enabled to do full-text searching.
 
Now, the way I'm doing this is to use SQLite's offsets() function to learn – for each search term/word passed to the API – if it or its Porter-based stem matches in the TITLE field (for which each hit gets, say, 2 points) or the DESCRIPTION field (1 point).
 
After getting the total points, I'm dividing the points by the total number of words within the API's TITLE + DESCRIPTION values to get a scaled result between 0 and 1.
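
The scoring just described can be sketched in Python 3 with the standard sqlite3 module. This is a simplified illustration, assuming the local SQLite build has FTS4 enabled. The name sorta_score is made up, and the table here has only two columns, so TITLE is column 0 and DESCRIPTION is column 1.

```python
import re
import sqlite3

def clean(text):
    """Keep only letters, digits and single spaces."""
    return " ".join(re.sub(r"[^A-Za-z0-9]+", " ", text).split())

def sorta_score(title, description, search_text):
    title, description = clean(title), clean(description)
    terms = clean(search_text).split()
    total_words = len(title.split()) + len(description.split())
    if not terms or total_words == 0:
        return 0.0

    db = sqlite3.connect(":memory:")
    try:
        db.execute("CREATE VIRTUAL TABLE box USING fts4(title, description, tokenize=porter)")
        db.execute("INSERT INTO box (title, description) VALUES (?, ?)", (title, description))
        # Quote each term and OR them together so any one term can match.
        query = " OR ".join('"%s"' % term for term in terms)
        row = db.execute("SELECT offsets(box) FROM box WHERE box MATCH ?", (query,)).fetchone()
    finally:
        db.close()
    if row is None:
        return 0.0

    # offsets() returns integers in groups of four:
    # column number, term number, byte offset, size in bytes.
    numbers = [int(number) for number in row[0].split()]
    score = sum(2 if column == 0 else 1 for column in numbers[0::4])
    return min(score / total_words, 1.0)

print(sorta_score("Trees of the world", "A field guide", "tree"))
```

Here "tree" matches "Trees" in the title through Porter stemming, which gives 2 points over 7 words.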
 
Anyway, I've got a starter function below (PHP) that would return what I'm calling a "sorta" score. It'll be interesting to work it into the mockup API to see how it works in the real world in trying to sort items from across different sources.
 
And just to be clear, I'm doing this per item. That's to say I do these calculations for one item then delete the in-memory database. In other words, I'm not indexing all the API results in memory and then getting this "sorta" rank per item because the calculation is agnostic of the other items. Now, if I changed the calculation to consider the other items as well, then absolutely there would be a need to index all the items first before assigning a "sorta" score per item.
 
BTW. Get it … "sorta"?
… 'cause it's "sort of" a way to sort things from multiple sources. Ha!
 
:P
 
Anyway, the PHP's below followed by another PHP block that uses the function and then an HTML snippet of what gets returned with sample text.
 
And so much for my walk, looks like rain's on the way. Dammit.
<?php
  
//clean out special chars, etc.
function recharacter_this($htmlstring) {
  $htmlstring = htmlspecialchars($htmlstring, ENT_QUOTES);
  $htmlstring = trim($htmlstring);
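  $htmlstring = preg_replace("/[^A-Za-z0-9]\s/", "", $htmlstring); //removes each non-alphanumeric character that is followed by whitespace, along with that whitespace (e.g. "pain: a" becomes "paina"); keeping only alpha-numerics and whitespace would need the class [^A-Za-z0-9\s] instead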
  $htmlstring = preg_replace("/\s+/", " ", $htmlstring); //replace multiple whitespaces with a single space
  return $htmlstring;
}

//get a rank score
function sorta_this($title, $description, $search_text) {
  
  $title = recharacter_this($title);
  $description = recharacter_this($description);
  $search_text = recharacter_this($search_text);
  
  //re: SQLite/PHP fundamentals, see: http://www.if-not-true-then-false.com/2012/php-pdo-sqlite3-example/
  
  //create memory db
  $memory_db = null;
  $memory_db = new PDO('sqlite::memory:');
  
  //errormode set to exceptions
  $memory_db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
  
  //create table
  //you must use "VIRTUAL TABLE" for FTS3/4, see: http://www.sqlite.org/fts3.html#section_1_2
  $memory_db->exec("CREATE VIRTUAL TABLE box using FTS4 (
  id,
  title,
  description,
  tokenize=porter)"); //porter > simple because a search for "tree" matches text containing "trees", whereas "tokenize=simple" doesn't seem to do this;
  //granted, Porter stemming has its own problems, but it's better than nothing.
  
  $insert = "INSERT INTO box (id, title, description) VALUES('1', '$title', '$description')";
  $stmt = $memory_db->exec($insert); //insert values per above
  
  $search_text = str_replace(" ", " OR ", $search_text); //making search more liberal
  $query = "SELECT quote(offsets(box)) as rank FROM box WHERE box MATCH '$search_text' ORDER BY rank";
  $result = $memory_db->query($query); //run query per above
  
  $score = 0; //start with initial score of Zero
  $i = 0;  //to use during iteration
  
  //if query yielded anything ...
  if ($result) {
    
    //there's only one row, but still need to loop
    foreach($result as $row) {
      $rank = $row['rank'];
      preg_match_all("/[a-zA-Z0-9]+\ [a-zA-Z0-9]+\ [a-zA-Z0-9]+\ [a-zA-Z0-9]+/", $rank, $matches); //split at every 4th space, i.e. every quartet returned by SQLite offsets(); see: http://stackoverflow.com/questions/10555698/split-string-after-every-five-words
      
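      //$matches is a single item array with one array inside it for each quartet; $matches[0] is thus just a plain array
      //each $match is a string such as "1 0 10 4" (column, term, byte offset, size), so $match[0] is its first character: the column number (0 = id, 1 = title, 2 = description)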
      foreach ($matches[0] as $match) {
        if ($match[0] == 1) {
          //if search hits in TITLE field, get 2 points
          $score = $score + 2;
        }
        else { 
          //if in DESCRIPTION field, get 1 point
          $score = $score + 1;
        }
        $i = $i + 1;
      }
    }
  }
  
  $memory_db->exec("DROP TABLE box");
  $memory_db = null;
  
  $total_words = str_word_count($title) + str_word_count($description);
  $score = ($total_words > 0) ? ($score/$total_words) : 0; //divide $score by total number of words in TITLE + DESCRIPTION, avoiding a division by zero when both are empty
  
  //prevent scores greater than 1, which would only occur with an abnormally small number of total words (essentially <= to the number of words in search terms)
  if ($score > 1) {
    $score = 1;
  }
  return $score;
}
?>
Using the function with TITLE and DESCRIPTION (abstract) from this article …
<?php
//test sorta_this() function
$my_title = ("An aerobic walking programme versus muscle strengthening programme for chronic low back pain: a randomized controlled trial.");
$my_description = ("Objective:To assess the effect of aerobic walking training as compared to active training, which includes muscle strengthening, on functional abilities among patients with chronic low back pain.Design:Randomized controlled clinical trial with blind assessors.Setting:Outpatient clinic.Subjects:Fifty-two sedentary patients, aged 18-65 years with chronic low back pain. Patients who were post surgery, post trauma, with cardiovascular problems, and with oncological disease were excluded.Intervention:Experimental 'walking' group: moderate intense treadmill walking; control 'exercise' group: specific low back exercise; both, twice a week for six weeks.Main measures:Six-minute walking test, Fear-Avoidance Belief Questionnaire, back and abdomen muscle endurance tests, Oswestry Disability Questionnaire, Low Back Pain Functional Scale (LBPFS).Results:Significant improvements were noted in all outcome measures in both groups with non-significant difference between groups. The mean distance in metres covered during 6 minutes increased by 70.7 (95% confidence interval (CI) 12.3-127.7) in the 'walking' group and by 43.8 (95% CI 19.6-68.0) in the 'exercise' group. The trunk flexor endurance test showed significant improvement in both groups, increasing by 0.6 (95% CI 0.0-1.1) in the 'walking' group and by 1.1 (95% CI 0.3-1.8) in the 'exercise' group.Conclusions:A six-week walk training programme was as effective as six weeks of specific strengthening exercises programme for the low back."); 
$my_search_text = ("back pain exercise");
$my_score = sorta_this($my_title, $my_description, $my_search_text);

echo ("Searching for \"$my_search_text\" in <br /><br />TITLE: <em>$my_title</em> <br /><br />and <br /><br />DESCRIPTION: <em>$my_description</em> <br /><br />yields a \"sorta\" relevancy of<strong> ");
echo $my_score . "</strong><br /><br />";
echo ("<hr />Hits for each search word in TITLE get 2 points, hits in DESCRIPTION get 1 point.<br />This number is then divided by the total number of words in the TITLE + DESCRIPTION.");
?>

The results …

Searching for "back pain exercise" in

TITLE: An aerobic walking programme versus muscle strengthening programme for chronic low back pain: a randomized controlled trial.

and

DESCRIPTION: Objective:To assess the effect of aerobic walking training as compared to active training, which includes muscle strengthening, on functional abilities among patients with chronic low back pain.Design:Randomized controlled clinical trial with blind assessors.Setting:Outpatient clinic.Subjects:Fifty-two sedentary patients, aged 18-65 years with chronic low back pain. Patients who were post surgery, post trauma, with cardiovascular problems, and with oncological disease were excluded.Intervention:Experimental 'walking' group: moderate intense treadmill walking; control 'exercise' group: specific low back exercise; both, twice a week for six weeks.Main measures:Six-minute walking test, Fear-Avoidance Belief Questionnaire, back and abdomen muscle endurance tests, Oswestry Disability Questionnaire, Low Back Pain Functional Scale (LBPFS).Results:Significant improvements were noted in all outcome measures in both groups with non-significant difference between groups. The mean distance in metres covered during 6 minutes increased by 70.7 (95% confidence interval (CI) 12.3-127.7) in the 'walking' group and by 43.8 (95% CI 19.6-68.0) in the 'exercise' group. The trunk flexor endurance test showed significant improvement in both groups, increasing by 0.6 (95% CI 0.0-1.1) in the 'walking' group and by 1.1 (95% CI 0.3-1.8) in the 'exercise' group.Conclusions:A six-week walk training programme was as effective as six weeks of specific strengthening exercises programme for the low back.

yields a "sorta" relevancy of 0.065


Hits for each search word in TITLE get 2 points, hits in DESCRIPTION get 1 point.
This number is then divided by the total number of words in TITLE + DESCRIPTION.

--------------

Written by nitin

August 11th, 2012 at 12:36 am
