make you some facets, boy!

As I mentioned the other day in this post, I've been working with some awesome people to harvest, index, and make searchable metadata for digital library collections from multiple institutions across the state of North Carolina, USA.

In the post I just linked to, I talked about the problems of inconsistent metadata across institutions and how that negatively impacts browsing via facets with Solr. I also wondered out loud about resolving/aligning small discrepancies via text analysis.

Well, another way to tackle this problem is – after harvesting the metadata but before indexing it – to "make" facet-able terms via some sort of term extraction. While at DrupalCon 2012 in Denver, CO this week I went to a presentation where the presenter mentioned a project he'd worked on that pulled in RSS feeds. In passing, he mentioned using OpenCalais to make a tag cloud. I totally forgot I had an API key for OpenCalais!

Anyway, now I see there are lots of similar web services. Which one is best at term extraction and which one allows the most API hits per day is a matter for another day, but today – in my hotel now that the conference has ended – I thought I'd do a little scripting to get me on the path to really testing this.

Using the soon-to-be-deprecated Yahoo Term Extraction Web Service, I tested taking a sample Solr-compatible XML index file and sending its metadata to the service to retrieve new subject terms. While my test script doesn't do it here, the idea is that after retrieving these new terms from the API, they could be placed into the Solr-compatible index file. After indexing the updated file, the new terms could be exposed to the user as click-able facets.
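That "place the terms into the index file" step could look something like the sketch below. I'm using Python 3's stdlib `xml.etree.ElementTree` here (lxml's `etree` offers the same `SubElement` call); the sample document and term list are trimmed-down stand-ins for the real harvested data.

```python
import xml.etree.ElementTree as ET

# A trimmed-down Solr-compatible record, standing in for the harvested file.
solr_xml = """<add>
  <doc>
    <field name="title">Buy an electric refrigerator</field>
    <field name="subject">Refrigerators.</field>
  </doc>
</add>"""

# Terms as they might come back from a term extraction service.
new_terms = ["electric refrigerators", "silent films", "pennsylvania"]

root = ET.fromstring(solr_xml)
doc = root.find("doc")
for term in new_terms:
    # Append one <field name="yahooTerm"> element per extracted term.
    field = ET.SubElement(doc, "field")
    field.set("name", "yahooTerm")
    field.text = term

updated = ET.tostring(root, encoding="unicode")
print(updated)
```

Reindexing the resulting file would then make the new `yahooTerm` fields available to facet on.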

I'll have to test this with lots of real-world metadata from across our test set to see if the term extraction service can produce nicer facets from disparate metadata than what we currently see, but for now I just needed to write a play/test script.
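One rough way to judge that later test: facets only help cross-institution browsing when the extracted terms are shared across records from different sources. A tiny sketch of that measure, with made-up terms and hypothetical record IDs:

```python
from collections import Counter

# Hypothetical extracted terms for records from two different institutions.
records = {
    "inst_a/rec1": ["electric refrigerators", "pittsburgh pa", "silent films"],
    "inst_b/rec9": ["electric refrigerators", "advertising", "pennsylvania"],
}

# Count how often each term appears across all records; terms appearing in
# more than one record are the ones that would make useful shared facets.
term_counts = Counter(t for terms in records.values() for t in terms)
shared = [t for t, n in term_counts.items() if n > 1]
print(shared)
```

Running the real test set through something like this would show whether extraction actually converges disparate subject vocabularies.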

Below, I've pasted the Python script and the output, which explains a little of what it's doing.

Actually, I've pasted the output first since people might not need or want to see the code. At the end, I've posted the "social tags" that OpenCalais would seem to generate for the same metadata – for comparison purposes.

The output:

Here's an XML file that can be indexed by Solr (it was generated via harvesting data from the Library of Congress using Python and XSL).

<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="creator">AFI/Kalinowski (Eugene) Collection (Library of Congress)</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Advertising--Electric household appliances--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Trade shows--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Silent films.</field>
    <field name="subject">Pittsburgh (Pa.)--Manufactures.</field>
    <field name="description">Largely graphic commercial for electric refrigerators in general and a refrigerator show, presumably in Pittsburgh, in particular.</field>
  </doc>
 </add>

-----

After using the Yahoo term extraction service we can create more <field> elements.

<field name="yahooTerm">electric household appliances</field>
<field name="yahooTerm">electric refrigerators</field>
<field name="yahooTerm">electric refrigerator</field>
<field name="yahooTerm">library of congress</field>
<field name="yahooTerm">silent films</field>
<field name="yahooTerm">collection library</field>
<field name="yahooTerm">pittsburgh pa</field>
<field name="yahooTerm">pennsylvania</field>

-----

If we place those new terms into the original XML file and reindex the item, we'll have new facets to play with.
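The reindex itself would just be a POST of the updated file to Solr's XML update handler, followed by a commit. A Python 3 sketch, assuming a default local Solr install at `http://localhost:8983/solr` (adjust for your setup); the request is built but not actually sent here:

```python
import urllib.request

# Default local Solr update handler; an assumption -- adjust for your install.
solr_update_url = "http://localhost:8983/solr/update"

# The original record with one of the new yahooTerm fields merged in.
updated_xml = """<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="yahooTerm">electric refrigerators</field>
  </doc>
</add>"""

# Solr's XML update handler expects the <add> document as the POST body;
# a separate "<commit/>" POST makes the change visible to searchers.
req = urllib.request.Request(
    solr_update_url,
    data=updated_xml.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8"},
)
print(req.get_full_url())
# urllib.request.urlopen(req)  # uncomment to actually send, then POST "<commit/>"
```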

This is a *potential* solution for creating practical, usable, and consistent(?) facets for metadata harvested from different institutions that use different subject terms and internal taxonomies, etc.

I think the basic Yahoo term extractor is deprecated(?), but there are other options such as their newer Context Analysis API, OpenCalais, and AlchemyAPI.com, etc.

The script:

#####
## merge all <fields> into one string; place in "context" variable.
SolrXML = '''
<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="creator">AFI/Kalinowski (Eugene) Collection (Library of Congress)</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Advertising--Electric household appliances--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Trade shows--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Silent films.</field>
    <field name="subject">Pittsburgh (Pa.)--Manufactures.</field>
    <field name="description">Largely graphic commercial for electric refrigerators in general and a refrigerator show, presumably in Pittsburgh, in particular.</field>
  </doc>
 </add>
'''

from lxml import etree # see: http://lxml.de/ for this library.

SolrXML_parsed = etree.XML(SolrXML)
SolrXML_combined = SolrXML_parsed.findall(".//field")
SolrXML_combined.pop(0) #remove <field name="identifier"> since we don't want
                        #a term generated from the URL; ideally this should be
                        #removed by matching the "name" attribute rather
                        #than by position, but this is just a test.

SolrXML_combinedList = []
for element in SolrXML_combined:
  SolrXML_combinedList.append(element.text)
context = (" ".join(SolrXML_combinedList))
#print context #test line


#####
## send XML example to Yahoo termExtraction service; print generated terms
## reference example: http://developer.yahoo.com/python/python-rest.html#post
import urllib, urllib2

url = 'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction'
appid = 'YahooTermTest'

params = urllib.urlencode({
  'appid': appid,
  'context': context,
})

yahooResultsXML = urllib2.urlopen(url, params).read()
#print yahooResultsXML #test line

yahooResultsXML_parsed = etree.XML(yahooResultsXML)
newSolrTerms = ""
for yahooTerm in yahooResultsXML_parsed:
  newSolrTerms = newSolrTerms + "<field name=\"yahooTerm\">" + yahooTerm.text \
  + "</field>\n"
 
#####
## print what the script is trying to do and the results ...
print "Here's an XML file that can be indexed by Solr\
 (it was generated via harvesting data from the Library of Congress using\
 Python and XSL)."
 
print SolrXML

print "-"*5 + "\n"

print "After using the Yahoo term extraction service we can create more\
 <field> elements.\n"
 
print newSolrTerms

print "-"*5 + "\n"

print "If we place those new terms into the original XML file and reindex the\
 item, we'll have new facets to play with.\n"

print "This is a *potential* solution for creating practical, usable, and\
 consistent(?) facets for metadata harvested from different institutions that use\
 different subject terms and internal taxonomies, etc.\n"

print "I think the basic Yahoo term extractor is deprecated(?), but there are\
 other options such as their newer Context Analysis API, OpenCalais, and\
 AlchemyAPI.com, etc."

And here's what OpenCalais extracted as "social tags":

  • Business Finance
  • Entertainment Culture
  • Food storage
  • Food preservation
  • Home appliances
  • Pittsburgh
  • Refrigerator