blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘technophilia’ Category

on the brain: audio + ocr/hocr, “did you mean”, and “there are no ebooks”


Lightning talk style 'cause I'm home sick and need to get a few things out there …

Audio + OCR/HOCR

Some time ago I wrote this post on OCR/HOCR and making searchable pages. I recently did some tests generating audio with Festival and using simple HTML5 audio to add audio to the page. I only used Festival on the OCR output, but by using the HOCR output it's no big thing to make audio for every line that Tesseract "detects" and incorporate it with SAVS or something.

"Did You Mean?"

Google doesn't seem to offer a "did you mean" API, but you can get around it. Other options might be to use Wikipedia's API or Google's own search suggestion API (i.e. taking the first suggestion). In both API links, I've sent "disese" instead of "disease".
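For comparison, here's a minimal local sketch of the same "did you mean" idea using Python's standard difflib module. It is only a stand-in: it is not how Google's or Wikipedia's services work, and the VOCABULARY list and the did_you_mean() name are made up for illustration.

```python
import difflib

# Hypothetical local vocabulary; a real service would draw on a far larger corpus.
VOCABULARY = ["disease", "diesel", "decease", "dessert"]

def did_you_mean(term, vocabulary=VOCABULARY):
    """Return the vocabulary word closest to `term`, or None if nothing is close."""
    matches = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(did_you_mean("disese"))
```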

There are no eBooks

In digital, why the hell are we still thinking of "eBOOKS" and "digital AUDIO BOOKS", etc.?

Why can't we just think of them as web applications? And instead of having "ebook reader software" and "audio book software", why can't we just use, say, a Python/PyQt-based application that uses Webkit? That's to say, I get that for monetary reasons not everything can be open, but why can't I just download a compiled script that runs Webkit and prevents me from doing things like viewing the source, saving the page, etc.? If the files need to be downloaded, they can be saved in password-secured ZIP files (which can only be downloaded, say, with a username/password). The Webkit app would be the only thing that could talk to a centralized DB and determine if the user still has rights to view the material. If so, the DB could hand the application the password to read the contents into memory from the ZIP file, show them through the browser, etc., without making the contents readable otherwise.

What am I missing here? A lot, I'm sure. But there's got to be a better way and we need to stop thinking of digital as a bits/bytes rendering of the physical world. After all, is this an eBook or a digital audio book? I can, of course, both read and/or hear it.

In early tests, making "eBooks" readable via a Python/Webkit app is working. HTML5 audio works on my computer but not at work, and video works on neither. But I *think*, based on what I've read, that there are some bugs with the PyQt Webkit, so maybe it's just a temp thing.

Either way, why keep inventing new software when a secure browser lets us read, listen, and watch? Adding stuff like bookmarks is, I'd think, a simple matter of storing data in a centralized DB for that given user account (or even a local SQLite db).
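As a rough sketch of the local SQLite option, something like the following could hold one bookmark per user and work. The table layout, the function names, and the idea of storing the position as free text (a page number, a timestamp, an element id) are all assumptions for illustration, not part of any existing reader.

```python
import sqlite3

def open_bookmark_db(path=":memory:"):
    """Open (or create) the bookmark store; ':memory:' is handy for testing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookmarks ("
        " user TEXT NOT NULL,"
        " work TEXT NOT NULL,"
        " position TEXT NOT NULL,"
        " PRIMARY KEY (user, work))"
    )
    return conn

def save_bookmark(conn, user, work, position):
    """Store the latest position, overwriting any earlier bookmark."""
    conn.execute(
        "INSERT OR REPLACE INTO bookmarks (user, work, position) VALUES (?, ?, ?)",
        (user, work, position),
    )
    conn.commit()

def load_bookmark(conn, user, work):
    """Return the saved position, or None if the user has no bookmark."""
    row = conn.execute(
        "SELECT position FROM bookmarks WHERE user = ? AND work = ?",
        (user, work),
    ).fetchone()
    return row[0] if row else None

conn = open_bookmark_db()
save_bookmark(conn, "nitin", "some-work-id", "chapter-3")
print(load_bookmark(conn, "nitin", "some-work-id"))
```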

Moreover, will there even be "e-readers" and "mp3 players" in the near future? Won't it all just be a "device" of a given size that just runs a web browser?

--------------


Written by nitin

March 7th, 2013 at 12:11 pm

pixelation: custom XSLT functions with Python and lxml


I'll be brief.

Because the Python "lxml" module doesn't support XSLT 2.0 functions, I was looking at support for EXSLT

… but then stumbled on how to write my own functions and call them from stylesheets.

Freakin' cool.

I like calling it "pxslt" for "Python XSLT" and pronouncing it like "pixelate".

:P

Examples below: the "module" I made, the script that calls it, and the results.

Told you I'd be brief.

Module:

#pxslt.py

def underscore(context, word):
  '''Replace whitespace with underscore.'''
  out = word[0].replace(' ', '_')
  return out

def multiply(context, int_val, int2_val):
  '''Multiply two integers.'''
  int_val, int2_val = int(int_val[0]), int(int2_val[0])
  return int_val * int2_val

def libraryThing(context, isbn):
  '''Get language for a work based on ISBN using LibraryThing API.'''
  isbn = isbn[0]
  import urllib
  res = urllib.urlopen('http://www.librarything.com/api/thingLang.php?isbn=' + isbn)
  res_r = res.read()
  return res_r

##### DO NOT EDIT
##### makes it possible to call the above functions with XSLT
def pxslt():
  myFunctions = []
  gbs = globals()
  from inspect import isfunction
  for gb in gbs:
    if isfunction(gbs[gb]) and gb != 'pxslt':
      #print gb
      myFunctions.append(gbs[gb])

  from lxml import etree
  #see: http://lxml.de/extensions.html
  ns = etree.FunctionNamespace('file://libs/pxslt.py')
  ns.prefix = 'pxslt' #same prefix the stylesheet below uses
  for myFunction in myFunctions:
    name = str(myFunction.func_name)
    ns[name] = myFunction
  return ns

Usage example:

from lxml import etree

#####
myXML = etree.XML('''\
<a>
  <b>Hello. This will appear with whitespaces replaced by underscores.</b>
  <c>3</c>
</a>''')

myXSL = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:pxslt="file://libs/pxslt.py">
  <xsl:output method="text" version="1.0" />
  <xsl:template match="a">
    <xsl:variable name="isbn">9955081260</xsl:variable>
    <xsl:value-of select="pxslt:libraryThing($isbn)" />
    <xsl:text>\n</xsl:text> <!-- Python will line break here -->
    <xsl:value-of select="pxslt:underscore(b/text())" />
    <xsl:text>\n</xsl:text> <!-- Python will line break here -->
    <xsl:call-template name="mathFunc">
    </xsl:call-template>
  </xsl:template>
  <xsl:template name="mathFunc">
    <xsl:variable name="myNum">10</xsl:variable>
    <xsl:value-of select="pxslt:multiply(c/text(), $myNum)" />
  </xsl:template>
</xsl:stylesheet>'''))

import pxslt
pxslt.pxslt() #get all set up with namespaces and function stuff

print(myXSL(myXML))

#myXSL_file = etree.XSLT(etree.parse('foo.xsl')) #for testing with a real XSL file
#print(myXSL_file(myXML))

Output:

>>>
lit
Hello._This_will_appear_with_whitespaces_replaced_by_underscores.
30

--------------


Written by nitin

November 2nd, 2012 at 5:28 pm

launching Google Navigation from email directions I send to myself


I have a Droid Incredible phone I've had for a few years.

It's OK.

I'm not too big on phones in-and-of-themselves (seriously people, if you identify with your freaking phone you're a loser) but it is useful to be able to take high quality photos of musical compositions I'm working on and email them to myself, to make low quality audio recordings of ideas I have when I'm too tired/lazy to notate the ideas, and …

… to not get lost.

Google Navigation is pretty helpful. What I usually do is email myself directions, then when I get in my car I click on the address I'm headed to and just follow the voice telling me where to go.

The problem is that at some point clicking on the message body's link on my phone (circled in red in the image below) stopped launching Navigation and would just launch Google Maps in my browser, meaning I've had to type/paste the address in the Navigation bar before I could get voice directions. My guess is that some update caused this to stop working.

But it seems like if I click on the street address in the email subject line itself (yellow highlighted in the image below), the actual Maps app launches from where it's easy to click on the Navigation icon and start hearing the directions.

Google Navigation screenshot

--------------


Written by nitin

October 26th, 2012 at 7:59 pm

Posted in technophilia


sorta sorting API results with in-memory SQLite


I'll try to keep this short because it's looking like the weather is going to be agreeable enough for a nice, long Saturday walk.
 
So, I've been working on a mockup API at work that could, among other things, drive an in-site federated search across things like our Ebsco databases, the other vendor resources available through Ebsco's API using SRU, and of course our own databases with lists of the resources we offer and their descriptions.
 
Using simple textual similarity libraries it's easy to have the API return a text similarity score (a trick I learned working on HammerFlicks!) comparing the query against the title of each item. This way if someone types in "Wall St. Journal" it's easy to highlight (through an HTML/JavaScript page) the hit for "Wall Street Journal" from our own database because that'll be a good text similarity match.
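To make the idea concrete, here's a minimal sketch of scoring a query against titles with Python's standard difflib. It is a stand-in for whatever similarity library the API actually uses, so it won't necessarily produce the same numbers; the function names are hypothetical.

```python
from difflib import SequenceMatcher

def text_similarity_score(query, title):
    """Return a 0-100 similarity between two strings, ignoring case."""
    return 100.0 * SequenceMatcher(None, query.lower(), title.lower()).ratio()

def best_title_match(query, titles):
    """Pick the title that looks most like the query."""
    return max(titles, key=lambda title: text_similarity_score(query, title))

titles = ["Wall Street Journal", "Washington Post", "New York Times"]
print(best_title_match("Wall St. Journal", titles))
```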
 
Here's a snippet showing the similarity attribute:
<?xml version="1.0"?>
<nclive_api_response>
  <results source="ncl_resource_titles">
    <result text_similarity_score="86.666666666667">
      <title>Wall Street Journal</title>
      <url>http://www.nclive.org/cgi-bin/nclsm?rsrc=29</url>
      <description>Full articles from the Wall Street Journal (1981-current).</description>
    </result>
	…
  </results>
</nclive_api_response>

As for sorting through results brought in via multiple resources all using their own relevancy rankings – that's a different story. They're using their own relevancy calculations, so there's really no way to present results across multiple sources as the "most relevant".

 
I was toying with the idea, though, of testing what it would be like to – after the fact – index all the returned results on the fly in Solr or something just to get a relevancy ranking for the results the API returns. Now, this isn’t of course arguing that this would be a total relevancy rank across all sources. In other words, if you only pull five items from each "sub-API", each data source mentioned above, then there's no way to say that the first item from Database A is necessarily more relevant than the fifth result of Database B.
 
Anyway, I thought it was stupid to index things behind the scenes in something external just to get an on-the-fly relevancy rank to inject into the API results, when I'd only then have to quickly delete the entire index since I would just be using it to get a score.
 
But what I don't think is too stupid is the idea itself. It's making the argument that "Look, I've asked these different sources to send me their best stuff and now I'll have a way to rank them with my own criteria … because they're mine now." It's like using your own criteria to rank job candidates after asking a few of your industry friends to each send in their five best employees for the job you're hiring for. You're not necessarily going to agree with how they rank their own employees but you do trust that they've sent you five top notch folks.
 
… and so, after a colleague in another department asked if there would be a way to sort items across multiple data sources, I thought to investigate a way to do the indexing and have some kind of ranking/relevancy score done all in memory.
 
Enter SQLite.
 
This is really cool. With SQLite, I can create a searchable, full-text-indexed database in memory on the fly that will let me develop some kind of rank per item. Note that SQLite has to be built with FTS3/FTS4 enabled to do full-text searching.
 
Now, the way I'm doing this is to use SQLite's offsets() function to learn – for each search term/word passed to the API – if it or its Porter-based stem matches in the TITLE field (for which each hit gets, say, 2 points) or the DESCRIPTION field (1 point).
 
After getting the total points, I'm dividing the points by the total number of words within the API's TITLE + DESCRIPTION values to get a scaled result between 0 and 1.
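The arithmetic can be sketched without a database at all. SQLite's offsets() returns space-separated integers in groups of four (column number, term number, byte offset, size in bytes), so the sketch below just walks those groups. The function name and default values are illustrative; title_column=1 assumes a table laid out as id, title, description.

```python
def sorta_score(offsets, total_words, title_column=1,
                title_points=2, description_points=1):
    """Score one item from an FTS offsets() string, scaled to between 0 and 1."""
    numbers = [int(n) for n in offsets.split()]
    # Walk the integers four at a time, ignoring any incomplete trailing group.
    quartets = [numbers[i:i + 4] for i in range(0, len(numbers) - 3, 4)]
    points = 0
    for column, _term, _byte_offset, _size in quartets:
        points += title_points if column == title_column else description_points
    if total_words <= 0:
        return 0.0
    return min(points / total_words, 1.0)

# One hit in the title (column 1) and two in the description (column 2),
# for an item with 40 words in total: (2 + 1 + 1) / 40
print(sorta_score("1 0 10 4 2 1 30 4 2 0 50 4", 40))
```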
 
Anyway, I've got a starter function below (PHP) that would return what I'm calling a "sorta" score. It'll be interesting to work it into the mockup API to see how it works in the real world in trying to sort items from across different sources.
 
And just to be clear, I'm doing this per item. That's to say I do these calculations for one item then delete the in-memory database. In other words, I'm not indexing all the API results in memory and then getting this "sorta" rank per item because the calculation is agnostic of the other items. Now, if I changed the calculation to consider the other items as well, then absolutely there would be a need to index all the items first before assigning a "sorta" score per item.
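Because each score is independent of the other items, sorting across sources comes down to scoring every item and sorting the merged list. Here's a rough sketch of that step. The simple_sorta() scorer is a simplified stand-in with no SQLite and no Porter stemming, and the dictionary layout for results is an assumption.

```python
import re

def simple_sorta(title, description, search_text):
    """Stand-in scorer: 2 points per title hit, 1 per description hit."""
    terms = set(re.findall(r"[a-z0-9]+", search_text.lower()))
    title_words = re.findall(r"[a-z0-9]+", title.lower())
    description_words = re.findall(r"[a-z0-9]+", description.lower())
    points = 2 * sum(word in terms for word in title_words)
    points += sum(word in terms for word in description_words)
    total_words = len(title_words) + len(description_words)
    return min(points / total_words, 1.0) if total_words else 0.0

def sort_across_sources(results_by_source, search_text):
    """Merge every source's items and order them by their own 'sorta' score."""
    merged = []
    for source, items in results_by_source.items():
        for item in items:
            score = simple_sorta(item["title"], item["description"], search_text)
            merged.append(dict(item, source=source, sorta=score))
    return sorted(merged, key=lambda item: item["sorta"], reverse=True)
```

Swapping in a different scorer only means changing the one call inside sort_across_sources().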
 
BTW. Get it … "sorta"?
… 'cause it's "sort of" a way to sort things from multiple sources. Ha!
 
:P
 
Anyway, the PHP's below followed by another PHP block that uses the function and then an HTML snippet of what gets returned with sample text.
 
And so much for my walk, looks like rain's on the way. Dammit.
<?php
  
//clean out special chars, etc.
function recharacter_this($htmlstring) {
  $htmlstring = htmlspecialchars($htmlstring, ENT_QUOTES);
  $htmlstring = trim($htmlstring);
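  $htmlstring = preg_replace("/[^A-Za-z0-9]\s/", "", $htmlstring); //as written, this removes any non-alphanumeric character that is directly followed by whitespace (together with that whitespace); to leave only alpha-numerics and whitespace the pattern would need to be "/[^A-Za-z0-9\s]/"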
  $htmlstring = preg_replace("/\s+/", " ", $htmlstring); //replace multiple whitespaces with a single space
  return $htmlstring;
}

//get a rank score
function sorta_this($title, $description, $search_text) {
  
  $title = recharacter_this($title);
  $description = recharacter_this($description);
  $search_text = recharacter_this($search_text);
  
  //re: SQLite/PHP fundamentals, see: http://www.if-not-true-then-false.com/2012/php-pdo-sqlite3-example/
  
  //create memory db
  $memory_db = null;
  $memory_db = new PDO('sqlite::memory:');
  
  //errormode set to exceptions
  $memory_db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
  
  //create table
  //you must use "VIRTUAL TABLE" for FTS3/4, see: http://www.sqlite.org/fts3.html#section_1_2
  $memory_db->exec("CREATE VIRTUAL TABLE box using FTS4 (
  id,
  title,
  description,
  tokenize=porter)"); //porter > simple because a search for "tree" matches up against text with "trees" whereas "tokenize=simple" tokenization doesn't seem to do this;
  //granted, Porter stemming has its own problems, but it's better than nothing.
  
  $insert = "INSERT INTO box (id, title, description) VALUES('1', '$title', '$description')";
  $stmt = $memory_db->exec($insert); //insert values per above
  
  $search_text = str_replace(" ", " OR ", $search_text); //making search more liberal
  $query = "SELECT quote(offsets(box)) as rank FROM box WHERE box MATCH '$search_text' ORDER BY rank";
  $result = $memory_db->query($query); //run query per above
  
  $score = 0; //start with initial score of Zero
  $i = 0;  //to use during iteration
  
  //if query yielded anything ...
  if ($result) {
    
    //there's only one row, but still need to loop
    foreach($result as $row) {
      $rank = $row['rank'];
      preg_match_all("/[a-zA-Z0-9]+\ [a-zA-Z0-9]+\ [a-zA-Z0-9]+\ [a-zA-Z0-9]+/", $rank, $matches); //split at every 4th space, i.e. every quartet returned by SQLite offsets(); see: http://stackoverflow.com/questions/10555698/split-string-after-every-five-words
      
      //$matches is a single item array with one array inside it for each quartet; $matches[0] is thus just a plain array
      foreach ($matches[0] as $match) {
        if ($match[0] == 1) {
          //if search hits in TITLE field, get 2 points
          $score = $score + 2;
        }
        else { 
          //if in DESCRIPTION field, get 1 point
          $score = $score + 1;
        }
        $i = $i + 1;
      }
    }
  }
  
  $memory_db->exec("DROP TABLE box");
  $memory_db = null;
  
  $total_words = str_word_count($title) + str_word_count($description);
  $score = ($total_words > 0) ? ($score/$total_words) : 0; //divide $score by total number of words in TITLE + DESCRIPTION (guarding against an empty TITLE + DESCRIPTION)
  
  //prevent scores greater than 1, which would only occur with an abnormally small number of total words (essentially <= to the number of words in search terms)
  if ($score > 1) {
    $score = 1;
  }
  return $score;
}
?>
Using the function with TITLE and DESCRIPTION (abstract) from this article …
<?php
//test sorta_this() function
$my_title = ("An aerobic walking programme versus muscle strengthening programme for chronic low back pain: a randomized controlled trial.");
$my_description = ("Objective:To assess the effect of aerobic walking training as compared to active training, which includes muscle strengthening, on functional abilities among patients with chronic low back pain.Design:Randomized controlled clinical trial with blind assessors.Setting:Outpatient clinic.Subjects:Fifty-two sedentary patients, aged 18-65 years with chronic low back pain. Patients who were post surgery, post trauma, with cardiovascular problems, and with oncological disease were excluded.Intervention:Experimental 'walking' group: moderate intense treadmill walking; control 'exercise' group: specific low back exercise; both, twice a week for six weeks.Main measures:Six-minute walking test, Fear-Avoidance Belief Questionnaire, back and abdomen muscle endurance tests, Oswestry Disability Questionnaire, Low Back Pain Functional Scale (LBPFS).Results:Significant improvements were noted in all outcome measures in both groups with non-significant difference between groups. The mean distance in metres covered during 6 minutes increased by 70.7 (95% confidence interval (CI) 12.3-127.7) in the 'walking' group and by 43.8 (95% CI 19.6-68.0) in the 'exercise' group. The trunk flexor endurance test showed significant improvement in both groups, increasing by 0.6 (95% CI 0.0-1.1) in the 'walking' group and by 1.1 (95% CI 0.3-1.8) in the 'exercise' group.Conclusions:A six-week walk training programme was as effective as six weeks of specific strengthening exercises programme for the low back."); 
$my_search_text = ("back pain exercise");
$my_score = sorta_this($my_title, $my_description, $my_search_text);

echo ("Searching for \"$my_search_text\" in <br /><br />TITLE: <em>$my_title</em> <br /><br />and <br /><br />DESCRIPTION: <em>$my_description</em> <br /><br />yields a \"sorta\" relevancy of<strong> ");
echo $my_score . "</strong><br /><br />";
echo ("<hr />Hits for each search word in TITLE get 2 points, hits in DESCRIPTION get 1 point.<br />This number is then divided by the total number of words in the TITLE + DESCRIPTION.");
?>

The results …

Searching for "back pain exercise" in

TITLE: An aerobic walking programme versus muscle strengthening programme for chronic low back pain: a randomized controlled trial.

and

DESCRIPTION: Objective:To assess the effect of aerobic walking training as compared to active training, which includes muscle strengthening, on functional abilities among patients with chronic low back pain.Design:Randomized controlled clinical trial with blind assessors.Setting:Outpatient clinic.Subjects:Fifty-two sedentary patients, aged 18-65 years with chronic low back pain. Patients who were post surgery, post trauma, with cardiovascular problems, and with oncological disease were excluded.Intervention:Experimental 'walking' group: moderate intense treadmill walking; control 'exercise' group: specific low back exercise; both, twice a week for six weeks.Main measures:Six-minute walking test, Fear-Avoidance Belief Questionnaire, back and abdomen muscle endurance tests, Oswestry Disability Questionnaire, Low Back Pain Functional Scale (LBPFS).Results:Significant improvements were noted in all outcome measures in both groups with non-significant difference between groups. The mean distance in metres covered during 6 minutes increased by 70.7 (95% confidence interval (CI) 12.3-127.7) in the 'walking' group and by 43.8 (95% CI 19.6-68.0) in the 'exercise' group. The trunk flexor endurance test showed significant improvement in both groups, increasing by 0.6 (95% CI 0.0-1.1) in the 'walking' group and by 1.1 (95% CI 0.3-1.8) in the 'exercise' group.Conclusions:A six-week walk training programme was as effective as six weeks of specific strengthening exercises programme for the low back.

yields a "sorta" relevancy of 0.065


Hits for each search word in TITLE get 2 points, hits in DESCRIPTION get 1 point.
This number is then divided by the total number of words in TITLE + DESCRIPTION.

--------------


Written by nitin

August 11th, 2012 at 12:36 am

okra pie: some simple ocr/hocr tests


A couple of years ago while at the University of Alabama, we were using tesseract-ocr to OCR images of old printed texts. With that version of tesseract, there didn't seem to be a way to get the actual coordinates of the words without editing the code.

This week I kind of got re-interested in seeing if there was a simple way to use tesseract to get the bounding box info, i.e. where the words are located on the image. With the newer (3+) version of tesseract I initially learned that one can get the box coordinates by passing the "makebox" argument a la:

$ tesseract foo.tif foo.txt makebox

This actually outputs the coordinates of each character, so I wrote a little Python script to take the text OCR output and compare it against the character coordinates to give me the location of each word. In turn, the script would use ImageMagick to make a PNG file from the TIFF and then dump out an HTML file that placed the words over the images, though the words were transparent. This allowed me to just use the browser's native Find feature (CTRL-F) to search and highlight HTML words as they rested on top of their respective image-based words.
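The word-matching step might look roughly like the sketch below. It assumes each box-file line reads "character left bottom right top page" and that the boxes come in the same order as the characters in the text output, which won't always hold (ligatures, dropped characters). The function names are made up, and this is not the original Okra Pie script.

```python
def parse_box_lines(box_text):
    """Parse box-file lines of the assumed form 'char left bottom right top page'."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append((parts[0], left, bottom, right, top))
    return boxes

def word_boxes(ocr_text, boxes):
    """Give each word the smallest box that covers all of its characters."""
    words = []
    position = 0
    for word in ocr_text.split():
        chunk = boxes[position:position + len(word)]
        if len(chunk) < len(word):
            break  # ran out of character boxes
        position += len(word)
        words.append((
            word,
            min(b[1] for b in chunk),  # left
            min(b[2] for b in chunk),  # bottom
            max(b[3] for b in chunk),  # right
            max(b[4] for b in chunk),  # top
        ))
    return words
```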

But, of course, then I learned you can just do this:

$ tesseract foo.tif foo hocr

to create an HTML (4.0) file with the coordinates of each word, eliminating the script's need to compare whole words against character coordinates.
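Pulling the words and their coordinates back out of that file takes only the standard library. The sketch below assumes word spans carry a class of "ocrx_word" (or "ocr_word") and a title attribute containing "bbox x0 y0 x1 y1"; the exact markup can vary between tesseract versions, so treat the class names as assumptions.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for each word element that has a bbox."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any
        self._text = []
        self._depth = 0    # nesting depth inside the current word element

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1  # a nested tag such as <strong> inside a word
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX.search(attrs.get("title") or "")
        if match and ("ocrx_word" in classes or "ocr_word" in classes):
            self._bbox = tuple(int(n) for n in match.groups())
            self._text = []
            self._depth = 1

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the word element itself just closed
            text = "".join(self._text).strip()
            if text:
                self.words.append((text,) + self._bbox)
            self._bbox = None

def hocr_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Each tuple could then be written out as an absolutely positioned, transparent HTML element over the page image.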

Anyway, there's lots of work to do if I want to pursue this. I need to investigate more about some of the weird things happening with text like newspapers with multiple columns (text for some columns is severely offset from the image itself, etc.) but it's a nice little start.

I also want to see if there's a way to map the tesseract output to an ABBYY FineReader-like XML output; maybe that way tesseract could be used in conjunction with the Internet Archive's fine eBook reader. I'm sure someone's already done that, so a little research would be step #1. I think the IA's reader uses ABBYY output(?) and ABBYY's not open or free, if I understand correctly.

I'd also like to think about doing this for OCR-ed images of audio transcriptions and synchronizing the image and/or the HTML text with the media.

Anywho, here's a link to a sample HTML file. You'll probably want to zoom out given that I didn't resize the image – yeah, so it loads slowly, too. Might not want to use IE, because I don't think it lets you search for more than one word like "Mulberry Tree". Also, there are two JS functions in the page called "hideImage()" and "showImage()" if anyone wants to play a little and see what the text looks like without the image in the background.

By the way the image I tested on is from the Library of Congress' awesome American Memory collection. You can see it here.

… and I almost forgot the best part. The script is called "Okra Pie" because "ocr" is like "okra" and "pie" is for Python.

:P

--------------


Written by nitin

July 14th, 2012 at 11:10 am

Posted in scripts,technophilia


North Carolina grants, Google App Engine, and pie … mmm.


I took April off from blogging after realizing I was over blogging, as opposed to over logging.

I'll keep this short. Well, I'll try.

I'm shacked up in the apartment due to some unexpected circumstances and yesterday I decided to try and be a little productive and learn something I could potentially use in the workplace.

I learned a little about Google App Engine. I was drawn to it because of the Python support and because it gives me a free environment where I can deploy Python apps using the ever-elusive lxml library.

While I wrote some silly stuff using lxml and data available from the Business.gov API, I ended up uploading a simple app – if you can call it that – that parses a CSV file from North Carolina's (USA) NCOpenBook.

I didn't use the csv module because the CSV file I used has like three lines at the top that aren't headers (people: don't do that!). I don't know if there's a way to handle that with the csv module (there probably is) but I wasn't interested in digging around. Instead, I used a modified version of this code I wrote previously.
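For what it's worth, the csv module can cope with leading junk lines if they are skipped before the reader sees them. Here's a minimal sketch written for Python 3, unlike the Python 2.7 app below; the function name and the sample data are made up.

```python
import csv
import io
from itertools import islice

def read_csv_skipping(lines, skip=3):
    """Parse CSV lines into dicts, ignoring the first `skip` lines before the header."""
    return list(csv.DictReader(islice(lines, skip, None)))

# Made-up sample standing in for the real file: three junk lines, then the header.
sample = io.StringIO(
    'Some report title\n'
    'Generated on some date\n'
    '\n'
    '"Name","Total"\n'
    '"Acme, Inc.","$1,000.00"\n'
)
for row in read_csv_skipping(sample):
    print(row["Name"], row["Total"])
```

Quoted commas such as the one in "Acme, Inc." are handled by the reader, so there's no need to split on '","' by hand.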

The CSV file lists grantees who've received funding from North Carolina, and the app pulls out the top ten since 2007 based on cumulative grant totals. The app uses Google Chart Tools to make a pie chart of the top ten recipients. I'm not so sure about the colors in the pie chart – it's hard to see the difference between some of the colors associated with each grantee – but it's a simple start.
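The app's code below simply keeps the first ten data rows of the file. If the ordering of the file couldn't be relied on, the top ten could be picked explicitly, as in this Python 3 sketch. The "Name" and "Total" keys are placeholders rather than the real column headers.

```python
import heapq

def parse_dollars(amount):
    """Turn a string like '$1,234.50' into a float."""
    return float(amount.replace("$", "").replace(",", ""))

def top_grantees(rows, n=10, name_key="Name", total_key="Total"):
    """Return the n (name, total) pairs with the largest totals, biggest first."""
    totals = [(row[name_key], parse_dollars(row[total_key])) for row in rows]
    return heapq.nlargest(n, totals, key=lambda pair: pair[1])

rows = [
    {"Name": "A", "Total": "$1,000.00"},
    {"Name": "B", "Total": "$2,500.75"},
    {"Name": "C", "Total": "$300"},
]
print(top_grantees(rows, n=2))
```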

Here's a screenshot:

Top Ten NC Grants by Grantee

.. and here's the link to the app online: http://top-ten-nc-totals-by-grantee.appspot.com.

I've also pasted the app.yaml file, my Python code, and the Jinja/HTML template below if anyone's interested.

YAML:

application: top-ten-nc-totals-by-grantee
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /stylesheets
  static_dir: stylesheets
- url: /.*
  script: nctotals.app
 
libraries:
- name: jinja2
  version: latest

Python:

#import modules
import urllib
import webapp2

import jinja2
import os
     
jinja_environment = jinja2.Environment(
  loader=jinja2.FileSystemLoader(os.path.dirname(__file__)))

#####

#see: http://stackoverflow.com/a/2827664
class Object(object):
  pass

#my CSV parser
def csv2dict(fileName, delimiter):
  f = urllib.urlopen(fileName) #open file
  lines = f.read() #read file

  rows = lines.split("\n") #put lines in list

  #cut out non-header rows at top of this particular CSV file
  for i in range(0,3):
    rows.pop(0)

  #shorten the CSV data to 10 rows (there were too many damn rows in the CSV file!)
  for i in range(12,len(rows)+1):
    rows.pop(-1)

  headers = rows[0].split(delimiter) #put header titles in list
  rows.pop(0) #remove header from "rows" list

  i = 0
  worksheet = {}
  for header in headers: #for each header, i.e. each column
    columnCells = []
    #print header #test line
    for row in rows: #for each non-header row in delimited file
      if row != "": #!!!you need to also add a test for lines that don't split on the delimiter (i.e. notes)
        rowCells = row.split(delimiter) #get cells in row
        columnCells.append(rowCells[i].strip()) #put column's cells in list
    worksheet[header] = columnCells #set header as KEY and set "columnCells" list as VALUE
    i = i + 1
 
  return worksheet

#####

class MainPage(webapp2.RequestHandler):
  def get(self):
    parsed = csv2dict("http://data.osbm.state.nc.us/openbook/comma_grant_cumulative_awards_and_annual_disbursements_by_grantee.csv", '","') #pass filename and delimiter
    
    topTen = range(0,len(parsed['"Non-Profit Name (*)'])) #i.e. range is 1 to 10, or 0 to 9 depending on your p.o.v.

    for i in topTen: #add attributes to each of the ten agencies in the CSV file
      topTen[i] = Object()
      topTen[i].name = parsed['"Non-Profit Name (*)'][i].replace('"','')
      topTen[i].total = parsed['Cumulative Total Award'][i]
      raw_total = parsed['Cumulative Total Award'][i]
      raw_total = raw_total.replace('$','')
      raw_total = raw_total.replace(',','')
      topTen[i].raw_total = raw_total
      
    #data for the Jinja template  
    template_values = {
      'topTen': topTen}

    template = jinja_environment.get_template('index.html')
    self.response.out.write(template.render(template_values)) #write data to the index.html template
  
app = webapp2.WSGIApplication([('/', MainPage)],debug=True)

Template:

<!DOCTYPE HTML>
<html>
  <head>
    <title>
      Top Ten NC Grants by Grantee (since 2007)
    </title>
    <link type="text/css" rel="stylesheet" href="/stylesheets/style.css" />
    <script type="text/javascript" src="http://www.google.com/jsapi"></script>
    <script type="text/javascript">
      google.load('visualization', '1', {packages: ['imagepiechart']});
    </script>
    <script type="text/javascript">
      function drawVisualization() {
        // Create and populate the data table.
        var data = new google.visualization.DataTable();
        data.addColumn('string', 'name');
        data.addColumn('number', 'raw_total');
        data.addRows([
          {% for topper in topTen %}
          ["{{ topper.name }} - {{ topper.total }}", {{ topper.raw_total }}],
          {% endfor %}
        ]);
    
        // Create and draw the visualization.
        new google.visualization.ImagePieChart(document.getElementById('visualization')).
          draw(data, null);
      }
      google.setOnLoadCallback(drawVisualization);
    </script>
  </head>
  <body>
    <h3>Top Ten <a href="http://www.ncopenbook.gov/NCOpenBook/GrantsHome.jsp">NC Grants</a> by Grantee (cumulative totals since 2007)</h3>
    <p>see the source CSV file <a href="http://data.osbm.state.nc.us/openbook/comma_grant_cumulative_awards_and_annual_disbursements_by_grantee.csv">here</a></p>
    <div id="visualization"></div>
    <p>Made with:</p>
    <ul>
      <li><a href="https://developers.google.com/appengine/docs/python/gettingstartedpython27/">Google App Engine (Python 2.7)</a></li>
      <li><a href="https://developers.google.com/chart/">Google Chart Tools</a></li>
    </ul>
    <p>More info (blog post):</p>
    <ul>
      <li><a href="http://blog.humaneguitarist.org/2012/05/01/north-carolina-grants-google-app-engine-and-pie-mmm/">North Carolina grants, Google App Engine, and pie ... mmm.</a></li>
    </ul>
  </body>
</html>
--------------


Written by nitin

May 1st, 2012 at 10:42 am

Full Metal Alchemyapi.com or “more term extraction crap and linky data crud”


As I mentioned before, I'm playing with the idea of using term generating APIs to build facets in a Solr index project that I'm working on with some people.

The results seem really promising.

If I wasn't in need of a nap before some more college basketball gets underway, I'd say more than I'm about to.

Instead, I'm going to do three quick things here:

  1. Provide a screenshot of the index UI using Calais "social tags" for facets.
    1. This is a local (my computer) copy of the index using a very small set of item metadata. That's to say we currently have about 37k items in the index, and I'm just using about 1k.
    2. I'm only using Calais tags if the "importance" attribute is equal to "1", so I'm leaving out tags Calais considers less relevant because, well, some of the terms generated were making me think "WTF?".
    3. Some of the terms with underscores like "War_Conflict" appear to be those used in the news industry and are potentially ones to throw out.
  2. Post a small Python script to make a call to Alchemyapi.com, which is similar to – and possibly better than – Calais.
  3. Post the Alchemyapi.com results XML document and talk a little about what I think it can be used for in our project.

So, here's the Calais screenshot (you'll need to view the image at full-resolution to read it):

Calais Facets

Here's the Python script to call the Alchemyapi.com API:

import urllib, urllib2

#set API url and API key
url = 'http://access.alchemyapi.com/calls/text/TextGetRankedConcepts'
apikey = '' #your API key goes here
#get Alchemy API key from: http://www.alchemyapi.com/api/register.html

#set some text for the API
text = '''
Episcopal churches
Churches Cemeteries
Tombs and sepulchral monuments
Postcards--North Carolina.
Flat Rock (N.C.)
Henderson County (N.C.)
'''

#send data to API
params = urllib.urlencode({
  'apikey': apikey,
  'text': text,
  'showSourceText': '1', #shows the original text sent to the API
})
alchemyThis = urllib2.urlopen(url, params).read()

#view results
print alchemyThis

And here's the output for the code above:

<?xml version="1.0" encoding="UTF-8"?>
<results>
  <status>OK</status>
  <usage>By accessing AlchemyAPI or using information generated by AlchemyAPI, you are agreeing to be bound by the AlchemyAPI Terms of Use: http://www.alchemyapi.com/company/terms.html</usage>
  <url/>
  <language>english</language>
  <text>Episcopal churches Churches Cemeteries Tombs and sepulchral monuments Postcards--North Carolina. Flat Rock (N.C.) Henderson County (N.C.)</text>
  <concepts>
    <concept>
      <text>North Carolina</text>
      <relevance>0.920839</relevance>
      <website>http://www.nc.gov</website>
      <dbpedia>http://dbpedia.org/resource/North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000002b62d</freebase>
      <opencyc>http://sw.opencyc.org/concept/Mx4rvViyspwpEbGdrcN5Y29ycA</opencyc>
      <yago>http://mpii.de/yago/resource/North_Carolina</yago>
      <geonames>http://sws.geonames.org/4482348/</geonames>
    </concept>
    <concept>
      <text>Tomb</text>
      <relevance>0.837256</relevance>
      <geo>29.855 31.219</geo>
      <dbpedia>http://dbpedia.org/resource/Tomb</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000007ff03</freebase>
      <opencyc>http://sw.opencyc.org/concept/Mx4rwQw2p5wpEbGdrcN5Y29ycA</opencyc>
    </concept>
    <concept>
      <text>Burial monuments and structures</text>
      <relevance>0.773605</relevance>
      <dbpedia>http://dbpedia.org/resource/Burial_monuments_and_structures</dbpedia>
    </concept>
    <concept>
      <text>Flat Rock, Henderson County, North Carolina</text>
      <relevance>0.718415</relevance>
      <geo>35.266666666666666 -82.45333333333333</geo>
      <website>http://villageofflatrock.org/</website>
      <dbpedia>http://dbpedia.org/resource/Flat_Rock,_Henderson_County,_North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000ebc28</freebase>
      <yago>http://mpii.de/yago/resource/Flat_Rock,_Henderson_County,_North_Carolina</yago>
    </concept>
    <concept>
      <text>Henderson County, North Carolina</text>
      <relevance>0.615825</relevance>
      <geo>35.34 -82.48</geo>
      <website>http://www.hendersoncountync.org</website>
      <dbpedia>http://dbpedia.org/resource/Henderson_County,_North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000a10b4</freebase>
      <yago>http://mpii.de/yago/resource/Henderson_County,_North_Carolina</yago>
    </concept>
    <concept>
      <text>Asheville, North Carolina</text>
      <relevance>0.610351</relevance>
      <website>http://www.ashevillenc.gov/</website>
      <dbpedia>http://dbpedia.org/resource/Asheville,_North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000eb2ac</freebase>
      <census>http://www.rdfabout.com/rdf/usgov/geo/us/nc/counties/buncombe_county/asheville</census>
      <yago>http://mpii.de/yago/resource/Asheville,_North_Carolina</yago>
      <geonames>http://sws.geonames.org/4453066/</geonames>
    </concept>
    <concept>
      <text>Episcopal Church in the United States of America</text>
      <relevance>0.610029</relevance>
      <dbpedia>http://dbpedia.org/resource/Episcopal_Church_in_the_United_States_of_America</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000015f1b</freebase>
      <yago>http://mpii.de/yago/resource/Episcopal_Church_in_the_United_States_of_America</yago>
    </concept>
    <concept>
      <text>New York</text>
      <relevance>0.592008</relevance>
      <geo>43.0 -75.0</geo>
      <website>http://www.ny.gov</website>
      <dbpedia>http://dbpedia.org/resource/New_York</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000054dd5d</freebase>
      <opencyc>http://sw.opencyc.org/concept/Mx4rvViNs5wpEbGdrcN5Y29ycA</opencyc>
      <census>http://www.rdfabout.com/rdf/usgov/geo/us/ny</census>
      <yago>http://mpii.de/yago/resource/New_York</yago>
    </concept>
  </concepts>
</results>

As you can see, "New York" shows up but it has less than 60% relevance, so maybe that's a threshold to consider when indexing automated subject terms with Alchemyapi. That's just my theory and only lots of testing will help determine what the threshold really is – if there's one at all.
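If I wanted to try that threshold idea, a minimal sketch might look like the following. It's Python 3 and standard library only, the XML is a trimmed copy of the output above, and both the function name `concepts_above` and the 0.6 default are just my assumptions for illustration, not anything Alchemyapi recommends.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the response shown above (only a few concepts kept).
SAMPLE_XML = """<results>
  <status>OK</status>
  <concepts>
    <concept>
      <text>North Carolina</text>
      <relevance>0.920839</relevance>
    </concept>
    <concept>
      <text>Tomb</text>
      <relevance>0.837256</relevance>
    </concept>
    <concept>
      <text>Asheville, North Carolina</text>
      <relevance>0.610351</relevance>
    </concept>
    <concept>
      <text>New York</text>
      <relevance>0.592008</relevance>
    </concept>
  </concepts>
</results>
"""


def concepts_above(xml_text, threshold=0.6):
    """Return (label, relevance) pairs at or above the threshold.

    The 0.6 default is a guess, not a tested value.
    """
    root = ET.fromstring(xml_text)
    kept = []
    for concept in root.findall("./concepts/concept"):
        label = concept.findtext("text")
        relevance = float(concept.findtext("relevance"))
        if relevance >= threshold:
            kept.append((label, relevance))
    return kept


for label, relevance in concepts_above(SAMPLE_XML):
    print("%s (%.2f)" % (label, relevance))
```

With the sample above, "New York" is dropped and the other three are kept. Raising or lowering `threshold` is the knob to play with during testing.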

As you can also see, there's a lot of potential for linked data with this output: links to relevant dbpedia pages, etc. One neat thing would be to make it so that when the user hovers over a facet, the UI pops up more information from these linked data sources: relevant websites, geo-coords mapped with the Google Maps API, definitions of the faceted term, similar concept visualizations, etc.
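As a rough sketch of the data such a hover pop-up could draw on, here's one way to pull the links and coordinates out of a single concept element (Python 3, standard library only). The function name `facet_details` and the shape of the dictionary it returns are made up for illustration. I'm also assuming the "geo" value is "latitude longitude", which is what the sample values suggest but not something I've confirmed against the API docs.

```python
import xml.etree.ElementTree as ET

# One <concept> element copied from the output above (a few links omitted).
CONCEPT_XML = """<concept>
  <text>Flat Rock, Henderson County, North Carolina</text>
  <relevance>0.718415</relevance>
  <geo>35.266666666666666 -82.45333333333333</geo>
  <website>http://villageofflatrock.org/</website>
  <dbpedia>http://dbpedia.org/resource/Flat_Rock,_Henderson_County,_North_Carolina</dbpedia>
</concept>"""


def facet_details(concept):
    """Collect the label, coordinates (if any) and linked data URLs for a concept."""
    details = {"label": concept.findtext("text"), "links": {}}
    for child in concept:
        if child.tag in ("text", "relevance"):
            continue
        if child.tag == "geo":
            # Assumption: the value is "latitude longitude", separated by a space.
            lat, lon = child.text.split()
            details["geo"] = (float(lat), float(lon))
        else:
            # Everything else (website, dbpedia, freebase, ...) is treated as a link.
            details["links"][child.tag] = child.text
    return details


details = facet_details(ET.fromstring(CONCEPT_XML))
print(details["label"])
print(details.get("geo"))
for source in sorted(details["links"]):
    print("%s: %s" % (source, details["links"][source]))
```

The "geo" pair is what would get handed to a mapping API, and the "links" entries are what the pop-up could list or fetch more data from.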

That's all. Sleepy time and B-ball starts soon …

--------------

Written by nitin

March 25th, 2012 at 4:57 pm

audio transcription and the undead

leave a comment

Let's forget the fact I've blogged more this month than I intend to in a whole year …

What I really want to mention is that I'm reading Dracula by Bram Stoker and noticed these very interesting bits (or should I say 'bites'?) in Chapter 17.

In this chapter the character of Mina Harker is becoming acquainted with a friend of her now-dead friend, Lucy. This friend, Dr. Seward, uses a phonograph to record his patient notes, much as my dad used to use a micro-cassette back in the late 1970s and 1980s. Mina, on the other hand, uses her cutting-edge writing tool, the typewriter, to make her diary entries easily readable.

The funny thing is that Seward confesses to Mina that he doesn't have a way to get to specific points within each recording, i.e. he doesn't have a way to denote and retrieve audio at a specific time with advance knowledge of what passages exist at those points. Um, sound familiar?

:P

MINA HARKER'S JOURNAL

29 September.

Again he paused, and I could see that he was trying to invent an excuse. At length, he stammered out, "You see, I do not know how to pick out any particular part of the diary."

I could not but smile, at which he grimaced. "I gave myself away that time!" he said. "But do you know that, although I have kept the diary for months past, it never once struck me how I was going to find any particular part of it in case I wanted to look it up?"

Mina goes on to transcribe his recordings so that the text can be compared with other diary entries by principal characters as they try to formulate the totality of Dracula's agenda.

DR. SEWARD'S DIARY

30 September.

Harker has gone back, and is again collecting material. He says that by dinner time they will be able to show a whole connected narrative. He thinks that in the meantime I should see Renfield, as hitherto he has been a sort of index to the coming and going of the Count. I hardly see this yet, but when I get at the dates I suppose I shall. What a good thing that Mrs. Harker put my cylinders into type! We never could have found the dates otherwise.

Update (or "later" as in the novel): It might be a nice homage to sync the transcript to the audio of Orson Welles' radio play based on the book.

    Written by nitin

    January 31st, 2012 at 11:22 pm

    geo this, geo that: easy acquisition of KML files with BatchGeo

    leave a comment

    Geolocation/geocoding is so "hip" these days. Everyone's so obsessed with where they and other things are. There's almost a comparison with 3-D filmmaking …

    Funny. Not too many folks seem all that concerned with when things are.

    Anyway …

    At work, we have a database with all the libraries we serve and their addresses. And the other week we needed to quickly make a map with all their locations.

    If necessity is the mother of invention, laziness is its favorite uncle.

    Enter BatchGeo. We were able to take those values from our database and get a map generated in minutes. But it gets better.

    One of the nice things about this process is that in addition to a map, you also get a KML file download option. Taking this little XML file, it's a simple process (via XSL or other) to make a delimited file containing the inputted names of institutions and their latitude and longitude (altitude is also available).
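We went the XSL route, but as a sketch of the "or other" option, here's roughly how the same conversion could look in Python 3 with just the standard library. The sample KML is a made-up stand-in (not a real BatchGeo export), the function names are mine, and I'm assuming each Placemark carries a "name" and a Point with "coordinates". The code ignores XML namespaces since I'm not certain which KML version a given export declares.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a KML export; names and coordinates are made up.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example Library A</name>
      <Point><coordinates>-78.64,35.78,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Example Library B</name>
      <Point><coordinates>-82.55,35.60,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def placemark_rows(kml_text):
    """Return (name, latitude, longitude) tuples, one per Placemark."""
    rows = []
    for element in ET.fromstring(kml_text).iter():
        if local_name(element.tag) != "Placemark":
            continue
        name, coordinates = None, None
        for child in element.iter():
            if local_name(child.tag) == "name":
                name = (child.text or "").strip()
            elif local_name(child.tag) == "coordinates":
                coordinates = (child.text or "").strip()
        if name is None or not coordinates:
            continue
        # KML stores longitude first, then latitude, then optional altitude.
        parts = coordinates.split(",")
        rows.append((name, float(parts[1]), float(parts[0])))
    return rows


def to_delimited(rows, delimiter="\t"):
    """Write the rows out as delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["name", "latitude", "longitude"])
    writer.writerows(rows)
    return buffer.getvalue()


print(to_delimited(placemark_rows(SAMPLE_KML)))
```

The one thing to watch is the coordinate order: KML puts longitude before latitude, so it's easy to end up with points on the wrong side of the planet.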

    From there, it's not brain surgery to get those coordinates into a database and use an SQL JOIN to push out an institution's name and now its coordinates, too, whenever.
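A minimal sketch of that JOIN, using Python 3's built-in sqlite3 module with an in-memory database so it runs anywhere. The table names, column names and rows are all hypothetical; our real database looks nothing like this. I used a LEFT JOIN so that an institution that hasn't been geocoded yet still shows up, just without coordinates.

```python
import sqlite3

# In-memory database with made-up tables and rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE institutions (name TEXT PRIMARY KEY, address TEXT);
CREATE TABLE coordinates (name TEXT PRIMARY KEY, latitude REAL, longitude REAL);
""")
conn.executemany(
    "INSERT INTO institutions VALUES (?, ?)",
    [
        ("Example Library A", "1 Main St"),
        ("Example Library B", "2 Oak Ave"),
        ("Example Library C", "3 Elm Rd"),  # not geocoded yet
    ],
)
conn.executemany(
    "INSERT INTO coordinates VALUES (?, ?, ?)",
    [
        ("Example Library A", 35.78, -78.64),
        ("Example Library B", 35.6, -82.55),
    ],
)


def institutions_with_coordinates(connection):
    """Return each institution with its coordinates (None where missing)."""
    query = """
        SELECT i.name, i.address, c.latitude, c.longitude
        FROM institutions AS i
        LEFT JOIN coordinates AS c ON c.name = i.name
        ORDER BY i.name
    """
    return connection.execute(query).fetchall()


for row in institutions_with_coordinates(conn):
    print(row)
```

Joining on the name only works if the names in the delimited file match the database exactly, so in practice it's probably safer to carry an ID through the geocoding step and join on that.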

    Just in case someone wants/needs to do something similar with an address book or a list of businesses, etc.

    --------------

    Written by nitin

    January 28th, 2012 at 9:52 am

    Posted in technophilia,XML

    installing lxml on my Amazon Linux instance

    2 comments

    Last night I installed lxml on my Amazon Linux AMI (ami-31814f58) and it was just as not-straightforward as when my co-worker and I put it on our CentOS server a few weeks ago.

    So by referring to the yum log, I think the following covers what I needed to install with yum:

    gcc-4.4.5-6.35.amzn1.i686 #for building lxml
    python26-devel-2.6.7-1.36.amzn1.i686
    libxslt-1.1.26-2.6.amzn1.i686 #this can't be necessary given the line below, right?
    libxslt-devel-1.1.26-2.6.amzn1.i686
    libxml2-devel-2.7.6-1.9.amzn1.i686

    … and then I used easy_install (which already was on the system 'far as I know) to install lxml a la: easy_install lxml.

    For whatever reason, the easy_install part took several, several minutes. But I was watching "Season of the Witch" so I didn't mind.
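A quick way to check that the build actually worked is to ask lxml which library versions it was compiled against. This is a small sketch (the function name `lxml_versions` is mine) that returns None instead of blowing up if lxml isn't importable. It's written to run on Python 3 but should behave the same on the Python 2.6 that was on the instance.

```python
def lxml_versions():
    """Return the versions lxml reports, or None if lxml cannot be imported."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,        # version of lxml itself
        "libxml2": etree.LIBXML_VERSION,   # libxml2 version in use
        "libxslt": etree.LIBXSLT_VERSION,  # libxslt version in use
    }


versions = lxml_versions()
if versions is None:
    print("lxml is not installed (or failed to build)")
else:
    for library in sorted(versions):
        print("%s %s" % (library, ".".join(str(part) for part in versions[library])))
```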

    --------------

    Written by nitin

    January 16th, 2012 at 10:39 am

    Posted in technophilia
