blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘Semantic Web’ tag

easy calls to OpenCalais with Python, daggummit!

Yesterday, I wrote this post about using Yahoo's deprecated term extraction web service to generate "subjects" – or whatever you want to call them – for an item based on the metadata housed in a Solr-compatible XML file. I'd also wondered about doing the same thing with OpenCalais.

Before we go any further, I'd just like to say I wrote that post from my hotel room. I'm writing today's from the Denver airport with about 2 hours to kill before my flight departs. And I'd also like to point out that when writing blog posts with spotty Wi-Fi connections, one should not compose their post online through WordPress. I'm using WordPad, and I should probably make that a habit.

Yeah, so anyway, the Calais site doesn't have much good documentation on how to make calls. By "good" I mean there's no code sample to rip off. I'm sure it's perfectly fine for people who actually know what they're doing.

Using "The Google" I found this helpful post on making calls to OpenCalais. While I found it very well written and the code very helpful, I didn't want to have "httplib2" as a dependency since it's not available out-of-the-box with Python 2.7, as far as I know. Nor did I want to do anything with JSON. I'm just trying to make a simple POST request to the OpenCalais REST API – is all.

Using that post's code as a starting point, I whipped up some simple Python without "httplib2".

Note that this code passes three parameters to the API through the following variables:

  • "myCalaisAPI_key": this is where to paste your API key once you get it from Calais here.
  • "sampleText": this is a string of plain text to send to Calais for it to analyze and build terms for.
  • "calaisParams": these are the options to pass to the service in XML format. 

Note that I'm specifically requesting what I really want, "social tags", via the following option:

c:enableMetadataType="GenericRelations,SocialTags"

… and I'm specifically requesting a simple result format as follows:

c:outputFormat="Text/Simple"

There are other options, including RDF, that can be requested per the options mentioned on this page.
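If you end up switching those options around a lot, the parameter XML could be built by a small helper instead of edited by hand. This is only a sketch: "build_calais_params" is a made-up name, not something Calais provides, and the attribute values are simply the ones used in this post.

```python
from xml.sax.saxutils import quoteattr


def build_calais_params(metadata_types, output_format="Text/Simple",
                        content_type="text/txt"):
    """Return a paramsXML string like the one used in this post.

    Hypothetical helper: the element and attribute names are copied from
    the post's own XML, not looked up in the Calais documentation.
    """
    # quoteattr() adds the surrounding quotes and escapes special characters.
    directives = (
        "c:contentType=%s\n"
        "      c:enableMetadataType=%s\n"
        "      c:outputFormat=%s"
        % (quoteattr(content_type),
           quoteattr(",".join(metadata_types)),
           quoteattr(output_format))
    )
    return (
        '<c:params xmlns:c="http://s.opencalais.com/1/pred/" '
        'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n'
        '  <c:processingDirectives %s/>\n'
        '  <c:userDirectives/>\n'
        '  <c:externalMetadata/>\n'
        '</c:params>' % directives
    )


params = build_calais_params(["GenericRelations", "SocialTags"])
print(params)
```

Asking for a different result format would then just mean passing another value for "output_format", assuming the service accepts it.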

If you look at the code, you can see I'm asking Calais to analyze some text about Tim Tebow since I was in Denver when the Denver Broncos football team acquired Peyton Manning and traded Tebow to the New York Jets. The text is from a USA Today article from, um, yesterday.

The Jets, I'd like to state, are not worthy of a hyperlink. And that's only part of the reason I'm sad to see Tebow go there. Alas.

Anyway, here's the output below, followed by the code. Note that – as mentioned in the code – I'm using the slightly older REST API. But what do I care right now? I'm just testing.

Here's the output:

<!--Use of the Calais Web Service is governed by the Terms of Service located at http://www.opencalais.com. By using this service or the results of the service you agree to these terms of service.-->
<!--
Company: HBO,
Organization: New York Jets,
Person: Tim Tebow,
TVShow: Hard Knocks,
-->
<OpenCalaisSimple>
  <Description>
    <calaisRequestID>dafa6c80-b4f6-77b1-1363-de96bb7764f4</calaisRequestID>
    <id>http://id.opencalais.com/ODNr1ciDte8wwv0nU3G1jw</id>
    <about>http://d.opencalais.com/dochash-1/895ba8ff-4c32-3ae1-9615-9a9a9a1bcb39</about>
    <docTitle/>
    <docDate>2012-03-23 00:56:09.679</docDate>
    <externalMetadata/>
  </Description>
  <CalaisSimpleOutputFormat>
    <Company count="1" relevance="0.643" normalized="HBO &amp; Company">HBO</Company>
    <Organization count="1" relevance="0.643">New York Jets</Organization>
    <Person count="1" relevance="0.643">Tim Tebow</Person>
    <TVShow count="1" relevance="0.643">Hard Knocks</TVShow>
    <SocialTags>
      <SocialTag importance="2">Training camp<originalValue>Training camp (National Football League)</originalValue>
      </SocialTag>
      <SocialTag importance="2">New York Jets<originalValue>New York Jets</originalValue>
      </SocialTag>
      <SocialTag importance="2">Florida Gators football team<originalValue>2008 Florida Gators football team</originalValue>
      </SocialTag>
      <SocialTag importance="1">Tim Tebow<originalValue>Tim Tebow</originalValue>
      </SocialTag>
      <SocialTag importance="1">HBO<originalValue>HBO</originalValue>
      </SocialTag>
      <SocialTag importance="1">Hard Knocks<originalValue>Hard Knocks (TV series)</originalValue>
      </SocialTag>
      <SocialTag importance="1">Entertainment_Culture</SocialTag>
      <SocialTag importance="1">Sports</SocialTag>
    </SocialTags>
    <Topics>
      <Topic Taxonomy="Calais" Score="1.000">Entertainment_Culture</Topic>
      <Topic Taxonomy="Calais" Score="1.000">Sports</Topic>
    </Topics>
  </CalaisSimpleOutputFormat>
</OpenCalaisSimple>
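Since the social tags are what I'm really after, here's a rough sketch of pulling just those out of a response like the one above. It uses the standard library's xml.etree.ElementTree; "extract_social_tags" is a made-up helper name, and the sample string is a trimmed copy of the output shown here, not a fresh API response.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the Text/Simple response shown above.
sample_response = '''<OpenCalaisSimple>
  <CalaisSimpleOutputFormat>
    <Person count="1" relevance="0.643">Tim Tebow</Person>
    <SocialTags>
      <SocialTag importance="2">Training camp<originalValue>Training camp (National Football League)</originalValue>
      </SocialTag>
      <SocialTag importance="1">Tim Tebow<originalValue>Tim Tebow</originalValue>
      </SocialTag>
      <SocialTag importance="1">Sports</SocialTag>
    </SocialTags>
  </CalaisSimpleOutputFormat>
</OpenCalaisSimple>'''


def extract_social_tags(response_xml):
    """Return a list of (tag name, importance) tuples. Hypothetical helper."""
    root = ET.fromstring(response_xml)
    tags = []
    for tag in root.iter("SocialTag"):
        # tag.text is the text that comes before any <originalValue> child.
        name = (tag.text or "").strip()
        tags.append((name, tag.get("importance")))
    return tags


social_tags = extract_social_tags(sample_response)
for name, importance in social_tags:
    print("%s (importance %s)" % (name, importance))
```

The "importance" values come back as strings, so they'd need converting before any sorting or filtering by number.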

And here's the code that made the call:

# this code is based on: http://www.flagonwiththedragon.com/2011/06/08/dead-simple-python-calls-to-open-calais-api/

import urllib, urllib2

#########################
##### set API key and REST URL values.

myCalaisAPI_key = '' # your Calais API key.
calaisREST_URL = 'http://api.opencalais.com/enlighten/rest/' # this is the older REST interface.
# info on the newer one: http://www.opencalais.com/documentation/calais-web-service-api/api-invocation/rest

# alert user and shut down if the API key variable is still empty.
if myCalaisAPI_key == '':
  print "You need to set your Calais API key in the 'myCalaisAPI_key' variable."
  import sys
  sys.exit()

#########################
##### set the text to ask Calais to analyze.

# text from: http://www.usatoday.com/sports/football/nfl/story/2012-03-22/Tim-Tebow-Jets-hoping-to-avoid-controversy/53717542/1
sampleText = '''
Like millions of football fans, Tim Tebow caught a few training camp glimpses of the New York Jets during the summer of 2010 on HBO's Hard Knocks.
'''

#########################
##### set XML parameters for Calais.

# see "Input Parameters" at: http://www.opencalais.com/documentation/calais-web-service-api/forming-api-calls/input-parameters
calaisParams = '''
<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <c:processingDirectives c:contentType="text/txt"
      c:enableMetadataType="GenericRelations,SocialTags"
      c:outputFormat="Text/Simple"/>
  <c:userDirectives/>
  <c:externalMetadata/>
</c:params>
'''

#########################
##### send data to Calais API.

# see: http://www.opencalais.com/APICalls
dataToSend = urllib.urlencode({
    'licenseID': myCalaisAPI_key,
    'content': sampleText,
    'paramsXML': calaisParams
})

#########################
##### get API results and print them.

results = urllib2.urlopen(calaisREST_URL, dataToSend).read()
print results
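
The script above is Python 2. Under Python 3, "urllib" and "urllib2" were reorganized, so the same POST would need to be built a little differently. Here's a sketch of that, still with no third-party dependencies. It only constructs the request and doesn't send it; "build_calais_request" is a made-up name, the key and text are placeholders, and the URL is the one from the script, which may or may not still answer.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Same URL as in the Python 2 script above (the older REST interface).
CALAIS_REST_URL = "http://api.opencalais.com/enlighten/rest/"


def build_calais_request(api_key, text, params_xml, url=CALAIS_REST_URL):
    """Build (but do not send) a form-encoded POST request. Hypothetical helper."""
    body = urlencode({
        "licenseID": api_key,
        "content": text,
        "paramsXML": params_xml,
    }).encode("utf-8")  # Python 3 wants bytes for POST data
    # Supplying data makes urllib treat this as a POST.
    return Request(url, data=body)


request = build_calais_request("YOUR-API-KEY", "some text to analyze",
                               "<c:params/>")
print(request.get_method())

# To actually send it, something like:
#   from urllib.request import urlopen
#   results = urlopen(request).read().decode("utf-8")
```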
--------------

Written by nitin

March 23rd, 2012 at 1:28 pm

make you some facets, boy!

As I mentioned the other day in this post, I've been working with some awesome people to harvest, index, and make searchable metadata for digital library collections from multiple institutions across the state of North Carolina, USA.

In the post I just linked to, I talked about the problems of inconsistent metadata across institutions and how that negatively impacts browsing via facets with Solr. I also wondered out loud about resolving/aligning small discrepancies via text analysis.

Well, another way to tackle this problem is – after harvesting the metadata but before indexing it – to "make" facet-able terms via some sort of term extraction. While at DrupalCon 2012 in Denver, CO this week, I went to a presentation where the presenter mentioned a project he'd worked on that pulled in RSS feeds. In passing, he mentioned using OpenCalais to make a tag cloud. I totally forgot I had an API key for OpenCalais!

Anyway, now I see there are lots of similar web services. Which one is best in terms of term extraction and which one allows the most API hits per day is a matter for another day, but today – in my hotel now that the conference has ended – I thought I'd do a little scripting to get me on the path to really testing this.

Using the soon-to-be deprecated Yahoo Term Extraction Web Service, I tested taking a sample Solr-compatible XML index file and sending the metadata in it to the service to retrieve new subject terms. While my test script doesn't do it here, the idea is that after retrieving these new terms from the API, the terms could be placed into the Solr-compatible index file. After indexing the updated file, these new terms could be exposed to the user as click-able facets.
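
A minimal sketch of that "place the terms into the index file" step might look like the following. It's Python 3 and uses the standard library's xml.etree.ElementTree rather than lxml. The helper name "add_term_fields" is made up, the sample XML is cut down to two fields, the two terms are just sample values, and it assumes there's a single <doc> element in the file.

```python
import xml.etree.ElementTree as ET

# A cut-down Solr-compatible document, for illustration only.
solr_xml = '''<add>
  <doc>
    <field name="title">Buy an electric refrigerator</field>
    <field name="subject">Refrigerators.</field>
  </doc>
</add>'''


def add_term_fields(solr_xml_text, terms, field_name="yahooTerm"):
    """Append one <field> per term to the <doc> element. Hypothetical helper.

    Assumes exactly one <doc> under <add>.
    """
    root = ET.fromstring(solr_xml_text)
    doc = root.find("doc")
    for term in terms:
        field = ET.SubElement(doc, "field", {"name": field_name})
        # ElementTree escapes &, < and > when the tree is serialized.
        field.text = term
    return ET.tostring(root, encoding="unicode")


updated = add_term_fields(solr_xml, ["electric refrigerators", "silent films"])
print(updated)
```

Building the elements through a parser, rather than gluing strings together, means a term containing "&" or "<" won't produce a broken index file.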

I'll have to test this with lots of real-world metadata from across our test set to see if the term extraction service can be used to produce nicer facets with disparate metadata than what we currently see, but for now I just needed to write a play/test script.

Below, I've pasted the Python script and the output, which explains a little of what it's doing.

Actually, I've pasted the output first since people might not need or want to see the code. At the end, I've posted the "social tags" that OpenCalais would seem to generate for the same metadata – for comparison purposes.

The output:

Here's an XML file that can be indexed by Solr (it was generated via harvesting data from the Library of Congress using Python and XSL).

<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="creator">AFI/Kalinowski (Eugene) Collection (Library of Congress)</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Advertising--Electric household appliances--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Trade shows--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Silent films.</field>
    <field name="subject">Pittsburgh (Pa.)--Manufactures.</field>
    <field name="description">Largely graphic commercial for electric refrigerators in general and a refrigerator show, presumably in Pittsburgh, in particular.</field>
  </doc>
 </add>

-----

After using the Yahoo term extraction service we can create more <field> elements.

<field name="yahooTerm">electric household appliances</field>
<field name="yahooTerm">electric refrigerators</field>
<field name="yahooTerm">electric refrigerator</field>
<field name="yahooTerm">library of congress</field>
<field name="yahooTerm">silent films</field>
<field name="yahooTerm">collection library</field>
<field name="yahooTerm">pittsburgh pa</field>
<field name="yahooTerm">pennsylvania</field>

-----

If we place those new terms into the original XML file and reindex the item, we'll have new facets to play with.

This is a *potential* solution for creating practical, useable, and consistent(?) facets for metadata harvested from different institutions that use different subject terms and internal taxonomies, etc.

I think the basic Yahoo term extractor is deprecated(?), but there are other options such as their newer Context Analysis API, OpenCalais, and AlchemyAPI.com, etc.

The script:

#####
## merge all <fields> into one string; place in "context" variable.
SolrXML = '''
<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="creator">AFI/Kalinowski (Eugene) Collection (Library of Congress)</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Advertising--Electric household appliances--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Trade shows--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Silent films.</field>
    <field name="subject">Pittsburgh (Pa.)--Manufactures.</field>
    <field name="description">Largely graphic commercial for electric refrigerators in general and a refrigerator show, presumably in Pittsburgh, in particular.</field>
  </doc>
 </add>
'''

from lxml import etree # see: http://lxml.de/ for this library.

SolrXML_parsed = etree.XML(SolrXML)
SolrXML_combined = SolrXML_parsed.findall(".//field")
SolrXML_combined.pop(0) #remove <field name="identifier"> since we don't want
                        #a term generated from the URL; ideally this should be
                        #removed by checking for a name attribute of "identifier"
                        #rather than by position, but this is just a test.

SolrXML_combinedList = []
for element in SolrXML_combined:
  SolrXML_combinedList.append(element.text)
context = (" ".join(SolrXML_combinedList))
#print context #test line


#####
## send XML example to Yahoo termExtraction service; print generated terms
## reference example: http://developer.yahoo.com/python/python-rest.html#post
import urllib, urllib2

url = 'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction'
appid = 'YahooTermTest'

params = urllib.urlencode({
  'appid': appid,
  'context': context,
})

yahooResultsXML = urllib2.urlopen(url, params).read()
#print yahooResultsXML #test line

yahooResultsXML_parsed = etree.XML(yahooResultsXML)
newSolrTerms = ""
for yahooTerm in yahooResultsXML_parsed:
  newSolrTerms = newSolrTerms + "<field name=\"yahooTerm\">" + yahooTerm.text \
  + "</field>\n"
 
#####
## print what the script is trying to do and the results ...
print "Here's an XML file that can be indexed by Solr\
 (it was generated via harvesting data from the Library of Congress using Python and XSL)."
 
print SolrXML

print "-"*5 + "\n"

print "After using the Yahoo term extraction service we can create more\
 <field> elements.\n"
 
print newSolrTerms

print "-"*5 + "\n"

print "If we place those new terms into the original XML file and reindex the\
 item, we'll have new facets to play with.\n"

print "This is a *potential* solution for creating practical, useable, and\
 consistent(?) facets for metadata harvested from different institutions that use\
 different subject terms and internal taxonomies, etc.\n"

print "I think the basic Yahoo term extractor is deprecated(?), but there are\
 other options such as their newer Context Analysis API, OpenCalais, and\
 AlchemyAPI.com, etc."
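
The comment in the script admits that popping the first <field> by position is a shortcut. A sketch of the alternative it describes, skipping fields by their "name" attribute, could look like this. It uses the standard library's xml.etree.ElementTree in place of lxml, "build_context" is a made-up helper name, and the sample XML is a shortened version of the one in the script.

```python
import xml.etree.ElementTree as ET

# Shortened version of the Solr-compatible XML used in the script.
solr_xml = '''<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Silent films.</field>
  </doc>
</add>'''


def build_context(solr_xml_text, skip_names=("identifier",)):
    """Join the text of every <field> except those named in skip_names.

    Hypothetical helper; it also skips empty fields so join() never sees None.
    """
    root = ET.fromstring(solr_xml_text)
    texts = [
        field.text
        for field in root.findall(".//field")
        if field.get("name") not in skip_names and field.text
    ]
    return " ".join(texts)


context = build_context(solr_xml)
print(context)
```

This way the identifier is left out no matter where it appears in the document, and other fields can be excluded by adding their names to "skip_names".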

And here's what OpenCalais extracted as "social tags":

  • Business Finance
  • Entertainment Culture
  • Food storage
  • Food preservation
  • Home appliances
  • Pittsburgh
  • Refrigerator
--------------

Written by nitin

March 22nd, 2012 at 7:58 pm
