blog.humaneguitarist.org

Full Metal Alchemyapi.com or "more term extraction crap and linky data crud"

[Sun, 25 Mar 2012 20:57:34 +0000]
As I mentioned before, I'm playing with the idea of using term-generating APIs to build facets in a Solr index project that I'm working on with some people. The results seem really promising. If I weren't in need of a nap before some more college basketball gets underway, I'd say more than I'm about to. Instead, I'm going to do three quick things here:

1. Provide a screenshot of the index UI using Calais [http://www.opencalais.com/] "social tags" for facets.
   1. This is a local (my computer) copy of the index using a very small set of item metadata. That's to say we currently have about 37k items in the index, and I'm just using about 1k.
   2. I'm only using Calais tags if the "importance" attribute is equal to "1", so I'm leaving out tags Calais considers less relevant because, well, some of the terms generated were making me think "WTF?".
   3. Some of the terms with underscores like "War_Conflict" appear to be those used in the news industry and are potentially ones to throw out.
2. Post a small Python script to make a call to Alchemyapi.com [http://www.alchemyapi.com/], which is similar to - and possibly better than - Calais.
3. Post the Alchemyapi.com results XML document and talk a little about what I think it can be used for in our project.

So, here's the Calais screenshot (you'll need to view the image [http://blog.humaneguitarist.org/uploads/calaisfication.png] at full resolution to read it):

IMAGE: "Calais Facets"[http://blog.humaneguitarist.org/uploads/calaisfication.png]

Here's the Python script to call the Alchemyapi.com API:

import urllib, urllib2

#set API url and API key
url = 'http://access.alchemyapi.com/calls/text/TextGetRankedConcepts'
apikey = '' #your API key goes here
#get Alchemy API key from: http://www.alchemyapi.com/api/register.html

#set some text for the API
text = '''
Episcopal churches Churches Cemeteries Tombs and sepulchral monuments Postcards--North Carolina. Flat Rock (N.C.) Henderson County (N.C.)
'''

#send data to API
params = urllib.urlencode({
    'apikey': apikey,
    'text': text,
    'showSourceText': '1', #shows the original text sent to the API
})
alchemyThis = urllib2.urlopen(url, params).read()

#view results
print alchemyThis

And here's the output for the code above:

<?xml version="1.0" encoding="UTF-8"?>
<results>
  <status>OK</status>
  <usage>By accessing AlchemyAPI or using information generated by AlchemyAPI, you are agreeing to be bound by the AlchemyAPI Terms of Use: http://www.alchemyapi.com/company/terms.html</usage>
  <url/>
  <language>english</language>
  <text>Episcopal churches Churches Cemeteries Tombs and sepulchral monuments Postcards--North Carolina. Flat Rock (N.C.) Henderson County (N.C.)</text>
  <concepts>
    <concept>
      <text>North Carolina</text>
      <relevance>0.920839</relevance>
      <website>http://www.nc.gov</website>
      <dbpedia>http://dbpedia.org/resource/North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000002b62d</freebase>
      <opencyc>http://sw.opencyc.org/concept/Mx4rvViyspwpEbGdrcN5Y29ycA</opencyc>
      <yago>http://mpii.de/yago/resource/North_Carolina</yago>
      <geonames>http://sws.geonames.org/4482348/</geonames>
    </concept>
    <concept>
      <text>Tomb</text>
      <relevance>0.837256</relevance>
      <geo>29.855 31.219</geo>
      <dbpedia>http://dbpedia.org/resource/Tomb</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000007ff03</freebase>
      <opencyc>http://sw.opencyc.org/concept/Mx4rwQw2p5wpEbGdrcN5Y29ycA</opencyc>
    </concept>
    <concept>
      <text>Burial monuments and structures</text>
      <relevance>0.773605</relevance>
      <dbpedia>http://dbpedia.org/resource/Burial_monuments_and_structures</dbpedia>
    </concept>
    <concept>
      <text>Flat Rock, Henderson County, North Carolina</text>
      <relevance>0.718415</relevance>
      <geo>35.266666666666666 -82.45333333333333</geo>
      <website>http://villageofflatrock.org/</website>
      <dbpedia>http://dbpedia.org/resource/Flat_Rock,_Henderson_County,_North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000ebc28</freebase>
      <yago>http://mpii.de/yago/resource/Flat_Rock,_Henderson_County,_North_Carolina</yago>
    </concept>
    <concept>
      <text>Henderson County, North Carolina</text>
      <relevance>0.615825</relevance>
      <geo>35.34 -82.48</geo>
      <website>http://www.hendersoncountync.org</website>
      <dbpedia>http://dbpedia.org/resource/Henderson_County,_North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000a10b4</freebase>
      <yago>http://mpii.de/yago/resource/Henderson_County,_North_Carolina</yago>
    </concept>
    <concept>
      <text>Asheville, North Carolina</text>
      <relevance>0.610351</relevance>
      <website>http://www.ashevillenc.gov/</website>
      <dbpedia>http://dbpedia.org/resource/Asheville,_North_Carolina</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000eb2ac</freebase>
      <census>http://www.rdfabout.com/rdf/usgov/geo/us/nc/counties/buncombe_county/asheville</census>
      <yago>http://mpii.de/yago/resource/Asheville,_North_Carolina</yago>
      <geonames>http://sws.geonames.org/4453066/</geonames>
    </concept>
    <concept>
      <text>Episcopal Church in the United States of America</text>
      <relevance>0.610029</relevance>
      <dbpedia>http://dbpedia.org/resource/Episcopal_Church_in_the_United_States_of_America</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000015f1b</freebase>
      <yago>http://mpii.de/yago/resource/Episcopal_Church_in_the_United_States_of_America</yago>
    </concept>
    <concept>
      <text>New York</text>
      <relevance>0.592008</relevance>
      <geo>43.0 -75.0</geo>
      <website>http://www.ny.gov</website>
      <dbpedia>http://dbpedia.org/resource/New_York</dbpedia>
      <freebase>http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000054dd5d</freebase>
      <opencyc>http://sw.opencyc.org/concept/Mx4rvViNs5wpEbGdrcN5Y29ycA</opencyc>
      <census>http://www.rdfabout.com/rdf/usgov/geo/us/ny</census>
      <yago>http://mpii.de/yago/resource/New_York</yago>
    </concept>
  </concepts>
</results>

As you can see, "New York" shows up, but its relevance (0.592008) is less than 60%, so maybe that's a threshold to consider when indexing automated subject terms with Alchemyapi. That's just my theory, and only lots of testing will help determine what the threshold really is - if there's one at all.

As you can also see, there's a lot of potential for linked data with this output: links to data from relevant dbpedia [http://dbpedia.org/About] pages, etc. One neat thing would be to make it so that if the user hovers over a facet, the UI pops up more information from these linked data sources: relevant websites, geo-coords mapped with the Google Maps API, definitions of the faceted term, similar concept visualizations, etc.

That's all. Sleepy time and B-ball starts soon ...
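
P.S. To make the threshold idea above a bit more concrete, here's a rough sketch of how the results XML could be filtered before anything goes into the index. This is not what the project actually does; it's only an illustration. It's written for Python 3 (unlike the Python 2 script above), the function name filter_concepts and the 0.6 cutoff are my own assumptions, and the sample XML is a trimmed copy of the output above with just two concepts. Everything other than the label and relevance (dbpedia, geonames, geo, etc.) gets kept in an "extras" dictionary so a UI could use it later for hover pop-ups.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the AlchemyAPI response shown above (two concepts only).
SAMPLE_XML = """<results>
  <status>OK</status>
  <concepts>
    <concept>
      <text>North Carolina</text>
      <relevance>0.920839</relevance>
      <dbpedia>http://dbpedia.org/resource/North_Carolina</dbpedia>
      <geonames>http://sws.geonames.org/4482348/</geonames>
    </concept>
    <concept>
      <text>New York</text>
      <relevance>0.592008</relevance>
      <geo>43.0 -75.0</geo>
      <dbpedia>http://dbpedia.org/resource/New_York</dbpedia>
    </concept>
  </concepts>
</results>"""


def filter_concepts(xml_text, threshold=0.6):
    """Return concepts whose relevance is at or above the threshold.

    The 0.6 default is only the tentative guess from the post,
    not a tested value.
    """
    root = ET.fromstring(xml_text)
    kept = []
    for concept in root.iter("concept"):
        # Collect every child element of this concept as tag -> text.
        fields = {child.tag: (child.text or "").strip() for child in concept}
        relevance = float(fields.pop("relevance", "0"))
        if relevance < threshold:
            continue  # skip low-relevance concepts
        label = fields.pop("text", "")
        # Whatever is left (dbpedia, geonames, geo, ...) rides along as extras.
        kept.append({"label": label, "relevance": relevance, "extras": fields})
    return kept


if __name__ == "__main__":
    for facet in filter_concepts(SAMPLE_XML):
        print(facet["label"], facet["relevance"], facet["extras"].get("dbpedia"))
```

With the default cutoff, "New York" is dropped and "North Carolina" is kept. Lowering the threshold argument lets the weaker concepts back in, which is the knob that would need all that testing.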