blog.humaneguitarist.org

discoveries in digital audio, music notation, and information encoding

Archive for the ‘Solr’ tag

trying to easily format Solr results as HTML with Python

Just a quick Saturday morning post …

One of the nice things about Solr is the ability to pass parameters that will return the results in various formats, including a Python dictionary.

I wanted to see if I could whip up a little function that would let me pass to it both the name of a Solr element (like "title") and then the HTML element I want it mapped to.

It doesn't seem that bad, and it's a good reminder that building UIs is in large part about parsing data into HTML, which things like CSS and JavaScript can then act on for display and interaction.

Anyway, so here's an example of some code that gets five results from a Solr instance and then uses the function I wrote to output some HTML elements:

import ast
import urllib2 as urllib

### first, get 5 Solr results formatted as a Python dictionary
query_url = 'http://data.twigkit.com/solr-gutenberg/select/?q=poe&version=2.2&start=0&rows=5&wt=python&explainOther'
solr = urllib.urlopen(query_url).read() #read the results
print type(solr) #returns that it's a string :-[
#turns the string into a dictionary. yay.
#(ast.literal_eval only accepts literals, so unlike eval() it won't run code sent by the server)
solr = ast.literal_eval(solr)
print type(solr) #returns that it's now a dictionary!

### second, write a function that converts stuff to HTML
def pysolr2html(doc, tagIn, tagOut):
    tagVal = doc[tagIn] #the value of the Solr field for this one result
    htmlVars = (tagOut, tagIn, tagVal, tagOut)
    return '<%s class="solr_%s">%s</%s>' %htmlVars

### third, iterate over the response
for doc in solr['response']['docs']:
    print pysolr2html(doc, 'id', 'span')
    print pysolr2html(doc, 'title', 'p')

And here's the output from IDLE:

>>>
<type 'str'>
<type 'dict'>
<span class="solr_id">etext8893</span>
<p class="solr_title">Selections from Poe</p>
<span class="solr_id">etext9511</span>
<p class="solr_title">Several Works by Edgar Allan Poe</p>
<span class="solr_id">etext9512</span>
<p class="solr_title">The Works of Edgar Allan Poe, Volume 1</p>
<span class="solr_id">etext9516</span>
<p class="solr_title">The Works of Edgar Allan Poe, Volume 5</p>
<span class="solr_id">etext9513</span>
<p class="solr_title">The Works of Edgar Allan Poe, Volume 2</p>
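For anyone on Python 3, here's a minimal sketch of the same idea that runs without a live Solr instance. The `raw` string is a hand-made stand-in shaped like Solr's `wt=python` output (trimmed to two of the results above), `solr_to_html` is a hypothetical name, and the `html.escape` call is an addition that the function above doesn't have.

```python
import ast
import html

# Hand-made stand-in for a wt=python response (an assumption: trimmed to
# two docs and to the two fields used in this post).
raw = ("{'response': {'docs': ["
       "{'id': 'etext8893', 'title': 'Selections from Poe'}, "
       "{'id': 'etext9511', 'title': 'Several Works by Edgar Allan Poe'}]}}")

# literal_eval only accepts Python literals, so it can't run arbitrary code.
solr = ast.literal_eval(raw)

def solr_to_html(doc, tag_in, tag_out):
    """Wrap the value of one Solr field in an HTML element."""
    value = html.escape(str(doc[tag_in]))  # escape <, > and & in the data
    return '<%s class="solr_%s">%s</%s>' % (tag_out, tag_in, value, tag_out)

lines = []
for doc in solr['response']['docs']:
    lines.append(solr_to_html(doc, 'id', 'span'))
    lines.append(solr_to_html(doc, 'title', 'p'))

print('\n'.join(lines))
```

Passing the `doc` in explicitly means the function no longer depends on a global counter, and escaping keeps a stray `<` or `&` in a title from breaking the page.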

I'll probably do this with PHP in the end and see how easy it might be to make a small Solr wrapper, kind of like Tempo which is super light-weight. But for now, I need to remind myself it's the weekend.

:P

--------------

Written by nitin

January 21st, 2012 at 12:03 pm

indexing and searching timed text with Solr

I'm still learning about Solr so maybe this post is much ado about nothing. But according to this nabble.com thread, one can't index a source XML document in Solr with its native XML structure intact and then in turn search that structure as one can in an XML database like BaseX.

For most things, that's fine. I mean for indexing titles, creators, and descriptions, etc. I just need to index the value of a given element like <title> so that I can search for that element's value.

But for timed text, it's different. Or at least, it can be.

Say I have this DFXP snippet for an audio file with an "id" value of "XYZ".

<p begin="10.0s" end="30.0s">Hello world!</p>

I would need the user to search for the string "Hello world!" or part of it but I would also need to index at least the value of the "begin" attribute so that I can pass that to a page that will play the file "XYZ" starting at the 10 second mark – if the user clicks on the "Hello world!" line in their search result. And I don't want the "10" second value to be something they search against since they might be searching for the string "10" within the text itself.

So I'm wondering how to do that with Solr.

Maybe when I learn more I'll discover a better way to do this, but for now I'm thinking I could do the following:

First, I would pretty much index the timed text twice in Solr.

<doc>
  <field name="id">XYZ</field>
...
  <field name="timedText-stripped">Hello world!</field>
  <field name="timedText">Hello World! {10}</field>
</doc>

After indexing the "id" of the audio file this would index:

  • just the text "Hello world!"
  • the text of "Hello world!" with the "begin" attribute value in curly braces.

I guess this way the user could be made to search across the "timedText-stripped" field but, via the XSL that can be passed to Solr to display results, the "timedText" field could be displayed in a manner that would make the text "Hello World!" link to whatever page will play file "XYZ" starting at the 10 second mark. Basically, by planting the "begin" value in curly braces, I can parse the string for the text and the "begin" value as separate things.

So, here's a really crappy XSL snippet that would do something like that. It assumes a variable "$id" exists that equals "XYZ", the identifier for the example audio file.

<xsl:for-each select="//field[@name='timedText']">
  <xsl:variable name="whole">
    <xsl:value-of select="."/>
    <!-- Gets entire element string -->
  </xsl:variable>
  <xsl:variable name="text">
    <xsl:value-of select="substring-before($whole,'{')"/>
    <!-- Gets text prior to seconds -->
  </xsl:variable>
  <xsl:variable name="begin">
    <xsl:value-of select="substring-before(substring-after($whole,'{'),'}')"/>
    <!-- Gets seconds value from end of string -->
  </xsl:variable>
  <a href="someMediaPlayer.php?id={$id}&amp;begin={$begin}">
    <xsl:value-of select="$text"/>
  </a>
  <!-- So, I'm saying that
  "someMediaPlayer.php?id=XYZ&begin=10"
  would launch a player that would start file XYZ at the 10 seconds mark.
  -->
</xsl:for-each>

The search output would be some HTML code like so:

<a href="someMediaPlayer.php?id=XYZ&amp;begin=10">Hello World!</a>
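The same split-on-the-braces idea can be sketched in Python, in case the parsing ends up in a script rather than in XSL. This is just an illustration: `timed_text_link` and the `TIMED` pattern are hypothetical names, and unlike the XSL it trims the space before the brace.

```python
import re
from html import escape
from urllib.parse import urlencode

# Text, optional whitespace, then the "begin" value in curly braces at the end.
TIMED = re.compile(r'^(?P<text>.*?)\s*\{(?P<begin>[^{}]+)\}\s*$')

def timed_text_link(file_id, whole):
    """Turn 'Hello World! {10}' into a link that starts playback at 10."""
    match = TIMED.match(whole)
    if match is None:
        # No {begin} marker, so there is nothing to link to.
        return escape(whole)
    query = urlencode({'id': file_id, 'begin': match.group('begin')})
    href = 'someMediaPlayer.php?' + query
    return '<a href="%s">%s</a>' % (escape(href), escape(match.group('text')))

link = timed_text_link('XYZ', 'Hello World! {10}')
print(link)
```

Anchoring the pattern to the end of the string means braces that happen to appear earlier in the spoken text stay part of the text.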

It seems weird to index something twice, more or less, but as user Erick says in the nabble.com thread, "You've gotta take off your DB hat and not worry about duplicating data."

But now as I write this, I'm wondering if I can't just index as follows:

  <field name="text">Hello world!</field>
  <field name="begin">10</field>

and trust that for each "text" field, there will be a matching "begin" field, so that the two can just be used in tandem to create the same HTML link as above. Sounds like I need to play around some more.

:)
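Here's a rough sketch of what "in tandem" could look like once the results are in Python. It assumes both fields are multi-valued and that Solr hands the values back in the order they were indexed, which is something I'd want to verify. The `doc` dictionary and `links_for` are hypothetical, and only the first line of text comes from the example above.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical Solr doc with two parallel multi-valued fields. Only the first
# line comes from the post; the second is made up for illustration.
doc = {
    'id': 'XYZ',
    'text': ['Hello world!', 'Goodbye world!'],
    'begin': ['10', '30'],
}

def links_for(doc):
    """Pair each line of text with the 'begin' value at the same position."""
    texts, begins = doc['text'], doc['begin']
    if len(texts) != len(begins):
        # The whole approach depends on the two fields staying in step.
        raise ValueError('text and begin fields are out of step')
    links = []
    for text, begin in zip(texts, begins):
        query = urlencode({'id': doc['id'], 'begin': begin})
        href = 'someMediaPlayer.php?' + query
        links.append('<a href="%s">%s</a>' % (escape(href), escape(text)))
    return links

for link in links_for(doc):
    print(link)
```

The length check is the fragile part: if one line is ever indexed without a "begin" value, every link after it points at the wrong time.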

Update, September 6, 2012: I wrote a related post to this yesterday in terms of searching across timed text with MySQL and in doing so I realized that the way I was thinking of doing it in Solr was off. Rather than doing it the way I outlined in the original post content (above) in which I was thinking to index all the timed text for a given recording in one Solr "doc" element, I think it makes much more sense to index each line in its own "doc" element as such:

<doc>
  <field name="id">someMediaPlayer.php?source=someFile.mp3&amp;begin=10&amp;end=30</field>
  ...
  <field name="startTime">10</field>
  <field name="stopTime">30</field> 
  <field name="timedText">Hello world!</field>
  <field name="source">someFile.mp3</field> 
</doc>

That way there's no need to post-parse any data fields to get the start and stop time. And, moreover, rather than constructing the URL to launch that segment of audio at display time, you can just put the URL directly in the "id" field. You can always use Solr's built-in support for facets to facet off of the "source" field or some descriptive metadata like "title".
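As a sketch of how those per-line "doc" elements might be generated, here's some Python that reads the DFXP line from earlier and writes a Solr `<add>` message. It's a simplification: real DFXP files are namespaced and allow other time formats, and `seconds` and `solr_add_xml` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# The DFXP line from this post, wrapped in a bare <div> so it parses on its
# own. Real DFXP files are namespaced; that is left out to keep this short.
dfxp = '<div><p begin="10.0s" end="30.0s">Hello world!</p></div>'

def seconds(value):
    """'10.0s' -> '10', '12.5s' -> '12.5' (only handles plain seconds)."""
    number = float(value.rstrip('s'))
    return str(int(number)) if number.is_integer() else str(number)

def solr_add_xml(dfxp_xml, source):
    """Build a Solr <add> message with one <doc> per timed-text line."""
    add = ET.Element('add')
    for p in ET.fromstring(dfxp_xml).iter('p'):
        start, stop = seconds(p.get('begin')), seconds(p.get('end'))
        query = urlencode({'source': source, 'begin': start, 'end': stop})
        fields = [
            ('id', 'someMediaPlayer.php?' + query),
            ('startTime', start),
            ('stopTime', stop),
            ('timedText', (p.text or '').strip()),
            ('source', source),
        ]
        doc = ET.SubElement(add, 'doc')
        for name, value in fields:
            ET.SubElement(doc, 'field', name=name).text = value
    return ET.tostring(add, encoding='unicode')

xml_out = solr_add_xml(dfxp, 'someFile.mp3')
print(xml_out)
```

ElementTree escapes the `&` in the "id" URL on the way out, so the serialized field matches the `&amp;` form shown above.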

I'll file the original post under the "thinking out loud yet poorly" category.

--------------

Written by nitin

October 16th, 2011 at 10:54 am

pOAIndexter: grabbing and indexing online metadata

As per usual, a good bit of my computer-y stuff at home relates to something that's come up at work. And as usual, I'm pretty ignorant of what I'm getting myself into, but I don't mind.

The other week, my boss and I met with some great people at digitalnc.org and we started talking about the idea of having a super simple, lightweight approach to providing a one-stop-shop search interface for collections across the state – provided those collections expose their metadata somehow. For now, we talked about limiting this to people who do so with an OAI feed and grabbing that metadata. But eventually, this thing should be metadata agnostic – in the sense that it isn't about a metadata format, but just the data itself.

By the way, I guess "grabbing" and "feed" aren't what I typically see with OAI – about which I admittedly don't know much – but I don't care. Same difference.

Of course, there's nothing new to this. I guess one could use Blacklight or VuFind to do this kind of thing, but even though those are existing open source projects, I'm not sure that doing so isn't overkill and won't in turn increase dependencies and maintenance overhead.

Actually, that's a topic for another time – I mean the idea that just because part of something is capable of doing what you want doesn't necessarily make it a better option than rolling one's own if using and updating said something entails more cost in the long run. Paved roads often get you there faster, but a willingness to get lost now and then is how you learn where all the really cool local bars are …

;)

Anyway, here's what I'm thinking. A small script would simply look at an XML setup file from which it would know which places to go grab metadata from, the type of feed, the last time the metadata was requested, and stuff like the resumptionToken if applicable. It would also store the appropriate XSL file to process the metadata with so that the metadata could be passed into Solr to be indexed and searchable. Anyone whose site doesn't provide metadata as XML could simply create a web service that does so, e.g. a RESTful MySQL to XML thingamajig. The outputted XML just needs to have an XSL that will facilitate passing it to Solr for that data to be part of the shared metadata store. And since XSL is the universal translator in this context, other metadata types such as RSS/Atom feeds could be grabbed, too. All one needs to do is add to the XML config file so the script knows to retrieve metadata from that site and make sure there's an XSL file that can be used to facilitate passing the data into Solr. So in the end all this should take in terms of coding is a small script, one XML config file, and as many XSL files as needed.
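To make the setup-file idea a bit more concrete, here's a minimal sketch in Python. Everything in `CONFIG` is hypothetical (element names, attributes, URLs, dates, and the token), and `harvest_url` is a made-up helper. The one thing taken from the OAI-PMH protocol itself is that a request carrying a resumptionToken can't carry other arguments besides the verb.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical setup file: every element name, attribute, and URL is made up.
CONFIG = """
<sources>
  <source type="oai" xsl="oai_dc-to-solr.xsl">
    <url>http://example.org/oai</url>
    <lastHarvest>2011-09-25</lastHarvest>
    <resumptionToken></resumptionToken>
  </source>
  <source type="oai" xsl="oai_dc-to-solr.xsl">
    <url>http://example.com/oai</url>
    <lastHarvest>2011-09-25</lastHarvest>
    <resumptionToken>abc123</resumptionToken>
  </source>
</sources>
"""

def harvest_url(source):
    """Build the next OAI-PMH ListRecords request for one <source>."""
    base = source.findtext('url')
    token = (source.findtext('resumptionToken') or '').strip()
    if token:
        # OAI-PMH: resumptionToken is exclusive, so no other arguments.
        params = {'verb': 'ListRecords', 'resumptionToken': token}
    else:
        params = {'verb': 'ListRecords', 'metadataPrefix': 'oai_dc'}
        last = (source.findtext('lastHarvest') or '').strip()
        if last:
            params['from'] = last  # only ask for records changed since then
    return base + '?' + urlencode(params)

sources = ET.fromstring(CONFIG).iter('source')
jobs = [(harvest_url(s), s.get('xsl')) for s in sources]
for url, xsl in jobs:
    print(url, '->', xsl)
```

Each job pairs a URL to fetch with the XSL file to run the response through, which is about all the script would need to know before handing the result to Solr.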

For fun and to start learning about Solr, I just manually grabbed some OAI metadata from Caltech yesterday – it was for some oral histories. Then I ran it through an XSL file and posted the result to Solr. In no time I had a searchable, local metadata store to play around with (screenshot below). Since I was using all the defaults from the Solr tutorial I had to map the <dc:creator> field to things like manufacturer, since the default is set up for an electronics store.

Solr screenshot
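I did the mapping with XSL, but the same step can be sketched in Python for comparison. Treat all of this as assumption: the record is made up (it is not real Caltech data), and the field names in `FIELD_MAP` are my guess at the tutorial schema's names rather than something to copy blindly.

```python
import xml.etree.ElementTree as ET

DC = 'http://purl.org/dc/elements/1.1/'

# Made-up record in the shape of oai_dc metadata (not real Caltech data).
record = (
    '<record xmlns:dc="%s">'
    '<dc:identifier>oh-0001</dc:identifier>'
    '<dc:title>Interview with Jane Doe</dc:title>'
    '<dc:creator>Doe, Jane</dc:creator>'
    '</record>' % DC
)

# Assumed mapping onto the tutorial's electronics-store fields.
FIELD_MAP = {'identifier': 'id', 'title': 'name', 'creator': 'manu'}

def dc_to_solr_doc(record_xml):
    """Copy the mapped Dublin Core elements into a Solr <doc>."""
    doc = ET.Element('doc')
    for element in ET.fromstring(record_xml):
        # ElementTree tags look like '{namespace}local'.
        namespace, _, local = element.tag[1:].partition('}')
        if namespace == DC and local in FIELD_MAP:
            field = ET.SubElement(doc, 'field', name=FIELD_MAP[local])
            field.text = (element.text or '').strip()
    return ET.tostring(doc, encoding='unicode')

doc_xml = dc_to_solr_doc(record)
print(doc_xml)
```

With a schema of one's own, `FIELD_MAP` is where the electronics-store names would be swapped for sensible ones like "creator".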

BTW if we use this, at some point I won't be able to call it "pOAIndexter" but for now I can.

Since I don't know if I'll do this in Python or PHP and since OAI is what we'll work on first, I guess it stands for "Python or PHP OAI Indexer".

Yes, I'm a dork.

--------------

Written by nitin

October 2nd, 2011 at 11:20 am
