blog.humaneguitarist.org

musings on metasearch, part 2

[Mon, 21 Sep 2015 18:26:03 +0000]
A few months ago, I posted this [http://blog.humaneguitarist.org/2015/04/11/musings-on-metasearch-part-1/] with regard to how we handled some things with our "metasearch" federated search tool while I was at NC LIVE. I'd been meaning to do a follow-up, and since so many months have gone by, I thought I had better get on that.

As I mentioned in the previous post, the metasearch API returned a standardized response schema for all the search targets so as to streamline parsing of the data. Even though the schema was normalized, we had specific search strategies for different search targets because each search target was its own animal, returning a different type of "thing". Not unlike how chocolate protein powders look the same though the ingredients are diverse.

I've got to say, I've seen a few real misfires with some library vendor products in regard to search. Sometimes people seem to assume that using a tool (Solr comes to mind from a recent conversation) equates to knowing how to use both it and metadata. Effective - or even decent - search is a reflection of understanding metadata, having informed workflows, and devising strategy far more than it is about the tools you use. Owning a cookbook doesn't make you a chef ...

Anyway, I wanted to try and give a better overview of the structure of the API under the hood and some "goodies" as well - albeit in what will surely be a loose, almost stream-of-consciousness style. I also wanted to talk about how we used text similarity scores to highlight search results that users were perchance looking for, and why we used it as the primary sort method for some modules.

Eventually, I'll put the code up, though I won't include all the NC LIVE-specific modules that queried our custom MySQL database. There are no passwords, etc. in the code since all that got imported from other scripts, but I still wouldn't feel right about it since database, table, and field names are present in the queries.

Overview of metasearch() function and modules

When I said previously that there was an API "under the hood", what I meant was that there was a simple interface - a single, high-level PHP function called "metasearch()" that was called by our UI, though it could certainly be called via the command line as well or serve as the basis for a public API, a "metaAPI", that could have allowed our member libraries to use its functionality.

It took basic parameters: the search phrase, the offset and limit of the results, the sort style, and the array of modules to search against. Each module equated to one search target at whatever level of granularity we assigned. That's to say a module could have been all of ProQuest's Summon index, a subset of that index such as all the Summon items deemed "articles" or "e-books" or whatever, a local SQL database, or any third-party web service, and on and on. The only real requirements were that a third-party search service was A) permissible for us to use and B) compatible with our result schema.

Here's an example search below using the "metasearch()" function for the term "kittens" against the pretend modules "foo", "bar", and "baz".

Example search:

$results = metasearch(array("foo", "bar", "baz"), term="kittens");

For reading ease, this example omits the optional "offset" (default = 0), "limit" (default = 10), and "sort" parameters.
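Just to give a sense of what a fully parameterized call might have looked like, here's a minimal sketch. The post doesn't show the actual signature of "metasearch()", so the parameter order and names used below are my assumptions; the meaning of the sort flag is explained next.

<?php
// Hypothetical, fully parameterized call to metasearch(); the real signature
// isn't shown in this post, so parameter order and names are assumptions.
$results = metasearch(
    array("foo", "bar", "baz"), // modules to search against
    "kittens",                  // search phrase
    0,                          // offset (default = 0)
    10,                         // limit (default = 10)
    false                       // sort flag (see below)
);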
The "sort" parameter is a Boolean, where [DEL: "true" :DEL] "false" sorts the module results by whatever the default sort method is for that module ("relevance" for Summon results) while [DEL: "false" :DEL] "true" sorted the results by whatever the alternative sort method was for that particular module ("date, newest to oldest" for Summon results). Many modules built on local data didn't even have an alternate search method and would have ignored this parameter if [DEL: "false" :DEL] "true". The reason the function took an array of modules was, of course, that it was a federated search tool that needed to query multiple targets. In the end, however, I believe (can't remember) that our web developer called the function once for each module via the UI so that asynchronous requests could be made. She placed the results for each primary module into its own "box" in a bento-box style UI. So the advantage of making asynchronous requests was so that each module box could "fill up" with results as soon as they came in from the call to the "metasearch()" function for a given module. Search Results Schema Our result schema was a simple one as our goal was to return (for a given search) basic metadata: Title, Author, Description, Link, etc. Since these were simple, non-repeatable text fields for reading by an end-user, the response was not designed to further be parsed programmatically, hence non-extensible and non-repeatable. That's to say if the authors for a given item were "Alvin", "Simon", "Theodore", and "Dave", the Author field would return a string like "Alvin, Simon, Theodore, and others" instead of returning four distinct author fields for each as that would be useless for the end user and would offload work unnecessarily to the web developer. And removing these types of burdens from the web development was one of the primary rationales for the API and a normalized schema in the first place. Here's a real result using the Medline Plus module showing the schema. Because our web developer didn't need to call the "metasearch()" function over the web, I didn't make a webservice interface for it. She just imported the API natively in her PHP code. That's why the function returns a PHP associative array and not JSON, etc. Array ( [0] => Array ( [medline_plus] => Array ( [_error] => [_total_results] => 959 [results] => Array ( [0] => Array ( [_match_score] => 0.22 [title] => Evaluating Health Information [image] => [image_m] => [link] => http://www.nlm.nih.gov/medlineplus/evaluatinghealthinformation.html [link_m] => http://www.nlm.nih.gov/medlineplus/evaluatinghealthinformation.html [date] => [description] => ... current, unbiased information based on research. NIH: National Library of Medicine [author] => [source] => National Library of Medicine ) ) ) ) ) In the example above, the module used is called "medline_plus" hence the value of the first array is "medline_plus". For a module named "foo", this value would be returned as "foo". Also, note that the "_error" field will only display a value should an error be caught. More important than the message itself is the presence (bad) or absence (good) of data in that field. The "_match_score" refers to the text similarity score that I'll talk about later in this post. As with the "_total_results" and "_match_score" fields, fields that are not metadata about the item returned but rather metadata about the item metadata are prefaced with an underscore. Most of the fields are self-explanatory, though a few explanations are in order. 
* Field names ending in "_m" are meant to hold mobile-specific data, such as links specific to mobile devices or images optimized for smaller devices and those with limited data plans. In practice, we didn't use mobile-specific data even if the module returned something.
* If a field is not applicable to a module, it's returned anyway with a blank value for the sake of consistency and explicitness.
  + For example, in the results above, no "image", "image_m", "date", or "author" values are extracted from the Medline Plus API, but the fields are returned with no values nonetheless in order to adhere to the schema.

Dynamic Modules

Aside from taking an array of existing modules to search against, the "metasearch()" function also allowed for a notation for what I called "dynamic modules" to be searched. Let me explain by example. First, here's the previous example of searching three modules for the term "kittens".

$results = metasearch(array("foo", "bar", "baz"), term="kittens");

But what if I wanted to combine "bar" and "baz" in the response schema as if there were really a single module with the combined data from "bar" and "baz"? I could just do this:

$results = metasearch(array("foo", array("bar", "baz")), term="kittens");

Now, instead of three distinct module arrays ("foo", "bar", "baz") in the response, there would be two ("foo", "bar+baz"), the latter being a concatenation of the "bar" and "baz" modules. If one requested, say, the default limit of 10 results and there were 10 results returned by the "bar" module for the term "kittens", then there would be no visible results from the "baz" module. Like I said, it's a concatenation, so the only way results from the "baz" module would appear is if the "bar" module returned fewer than 10 results and "baz" returned at least one.

Video search is an example of how we used this in the real world. As I said, we used a bento-box style display for results and were really calling the "metasearch()" function per module. And for the video box, we really had two distinct targets: the NC LIVE video platform, whose metadata was in a local MySQL database, and the Summon search for items deemed videos. Since we wanted to rank the NC LIVE videos above the Summon ones, using a dynamic module combining the NC LIVE video module and the Summon video module worked. So, our video search would have looked like this on the backend:

$results = metasearch(array(array("ncl_videos", "summon_videos")), term="kittens");

Module Structure

Now I'll dig a little more into modules, specifically how they are set up and how they work. Each module was, of course, associated with a given search target: eBooks from the Summon API, articles from the Summon API, the list of databases NC LIVE subscribed to, the list of full-text journals subscribed to, our Drupal-based website FAQ, etc. Each module resided in its own subfolder and consisted of 2 mandatory PHP scripts:

1. [ModuleName]/index__[ModuleName].php
2. [ModuleName]/search__[ModuleName].php

The "index__" script did whatever needed to be done to prepare data for searching by the "search__" script. The "search__" script did whatever it needed to in order to search that module's data appropriately and to return the response per our normalized response schema.
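To make the module layout a little more concrete, here's a minimal sketch of what a module's "search__" script might have looked like for a hypothetical local module, "foo". The field names come from the schema shown earlier; the function name, table, columns, and error handling are assumptions for illustration, not the actual NC LIVE code.

<?php
// Hypothetical [ModuleName]/search__foo.php: a sketch of a module "search__"
// script that queries the module's SQLite file and returns results in the
// normalized schema. Names and queries are assumptions, not the actual code.
function search__foo($term, $offset = 0, $limit = 10, $sort = false) {
    $response = array("_error" => "", "_total_results" => 0, "results" => array());
    try {
        $db = new PDO("sqlite:" . __DIR__ . "/foo.db");
        $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

        // Total hit count (the "_total_results" field).
        $count = $db->prepare("SELECT COUNT(*) FROM foo WHERE keyword LIKE :term");
        $count->execute(array(":term" => "%$term%"));
        $response["_total_results"] = (int) $count->fetchColumn();

        // This sketch ignores $sort, as many local modules did.
        $limit = (int) $limit;
        $offset = (int) $offset;
        $stmt = $db->prepare("SELECT title, link, date, description, author, source
                              FROM foo WHERE keyword LIKE :term LIMIT $limit OFFSET $offset");
        $stmt->execute(array(":term" => "%$term%"));

        foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
            $response["results"][] = array(
                "_match_score" => 0, // filled in later by the text similarity code
                "title" => $row["title"],
                "image" => "", // blank fields are still returned, per the schema
                "image_m" => "",
                "link" => $row["link"],
                "link_m" => $row["link"],
                "date" => $row["date"],
                "description" => $row["description"],
                "author" => $row["author"],
                "source" => $row["source"],
            );
        }
    } catch (Exception $e) {
        // Per the schema, any value in "_error" signals that something went wrong.
        $response["_error"] = $e->getMessage();
    }
    return $response;
}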
When one called "metasearch()" on a given array of modules, the "metasearch()" function actually called the "search__" scripts internally for each module in the array and wrapped the search result data in our response schema. When one called the other big, higher-level function in the system, "metaindex()", on a given array of modules, it executed the "index__" script for each module in the array. That's what got used to set up overnight indexing via CRON.

Module indexing of local data

In the case of querying an external API such as MedlinePlus or Summon, the index scripts did nothing because the data was already prepared off-site by a third party on its own servers. But these "index__" scripts still existed so that all the modules were structured the same way and so that higher-level functions wouldn't have to guess whether a module had an "index__" script to execute. For local data, the "index__" script typically extracted data from our MySQL databases for things like:

* what databases we subscribed to (the subscription database module)
* our website's FAQ (the website module)
* the libraries we served and their contact information (the libraries module)
* our locally hosted video content
* etc.

In those cases, the "index__" script would retrieve only the MySQL data needed to drive search and display sufficient metadata. It would then save the data in a delimited file, which in turn was auto-dumped into a stand-alone SQLite database file with the naming convention "[ModuleName]/[ModuleName].db".
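And here's a rough sketch of the second half of that "index__" step for the hypothetical "foo" module: read a tab-delimited file (with a header row) and load it into the module's SQLite file. The MySQL extraction that would produce the delimited file is omitted, and the file, table, and column names are assumptions for illustration.

<?php
// Hypothetical [ModuleName]/index__foo.php: load foo/foo.txt (tab-delimited)
// into foo/foo.db. File names, table name, and columns are assumptions.
$columns = array("keyword", "title", "image", "image_m", "link", "link_m",
                 "date", "description", "author", "source");

$db = new PDO("sqlite:" . __DIR__ . "/foo.db");
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("DROP TABLE IF EXISTS foo");
$db->exec("CREATE TABLE foo (" . implode(" TEXT, ", $columns) . " TEXT)");

$insert = $db->prepare("INSERT INTO foo (" . implode(", ", $columns) . ")
                        VALUES (" . rtrim(str_repeat("?, ", count($columns)), ", ") . ")");

$handle = fopen(__DIR__ . "/foo.txt", "r");
if ($handle === false) {
    die("Could not open foo.txt");
}
fgets($handle); // skip the header row
while (($line = fgets($handle)) !== false) {
    $fields = explode("\t", rtrim($line, "\r\n"));
    $fields = array_pad($fields, count($columns), ""); // blank out any missing fields
    $insert->execute(array_slice($fields, 0, count($columns)));
}
fclose($handle);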

While I said that data was typically extracted from our MySQL database, there were some notable exceptions. For our list of journals (the journals module), we'd received a spreadsheet from ProQuest and hadn't yet gotten it into our own MySQL database, so we cleaned up the sheet and exported it to a delimited file with OpenOffice Calc. Then the "index__" script read the delimited file and placed the data into the SQLite file.

We also had a "triggers module". This allowed us to create "triggers" for special search phrases such as "download ebooks" or "library card number", etc. When a user typed in a trigger phrase (or something that would match well against the SQLite data), the search tool would return the given Title, Description, URL, etc. that we specified. So, if a user typed in "download ebooks", we could return prepared links to some page that would explain to the user how they could download eBooks, etc. By the way, I think the triggers for "download ebooks" and "library card number" have since been removed.

For the triggers module we hand-maintained a delimited file which got placed into a SQLite file via the "index__" script. The delimited file contained a keyword column that specified the phrase we wanted the search tool to match against, as well as the metadata that would be displayed if the tool returned data for that row. As a simple demo of a "trigger", I'd made a trigger for "prrr" (https://www.nclive.org/search-everything?search=prrr) that, to date, still returns a YouTube video of kittens in the top left "Best Bet" box. Note that the term "prrr" is not displayed via the response. Rather, the YouTube embed code for the kitten video I selected exists in the response's Title field, as does a simple "Oh hai!" string.

Here's what the row for the "prrr" trigger looks like in the delimited file (every field is blank except "title" and "keyword"):

title: [YouTube embed code] Oh hai!
image:
image_m:
link:
link_m:
date:
description:
author:
source:
keyword: prrr

What I mean to demonstrate is that, with this module (and some others), we were matching queries against data that itself was not returned in the response. Yes, there was high overlap between the metadata that was searched against and what was displayed, but ultimately they were separate concepts.

By the way, I'm really hoping that nobody at NC LIVE deletes that "prrr" trigger anytime soon. For you CSS-ers: I never changed the YouTube embed code to fit the video, which is why the YouTube embed "hangs" a little out of its given box. That's to say, our UI styling changed over the course of the project and I never updated the embed code styling to match the UI updates after I'd created the "prrr" trigger. But do you really care? It's kittens, for crying out loud, man. Let's keep our priorities straight, shall we? See also this [http://www.nclive.org/search-everything?search=meow].

Why delimited files?

From the above, one can see that whether the local data canonically existed in a spreadsheet, MySQL, or a hand-maintained delimited file, all the data was first extracted to a delimited file prior to being placed in a SQLite database. Specifically, it was a tab-delimited text file. For modules where the data resided in MySQL, writing the data to a delimited file first and then placing that data into a SQLite database allowed for some quality control, since I could easily open the delimited file in a text editor and see the data for debugging purposes. Also, it allowed the overall workflow for all the modules based on our local data to be consistent. In other words, SQLite was the last step and before it there always had to be a delimited file. If that delimited file was created by hand, fine. If it was created by querying MySQL, that was fine too. But for the sake of not having exceptions, there was always a delimited file first.

The other reason for using a delimited file is that in some cases where we made the delimited file from a MySQL query, we still wanted to set up some synonyms. Let's say that our MySQL database had a row for the resource "SimplyMap". We also wanted to support a match for that resource if someone typed in, understandably, "Simply Map", because "Simply Map", if queried for, won't match at all on "SimplyMap" in a SQLite database using vanilla relevance. Because the extracted delimited file was created via overnight automation, we didn't want to edit it after the fact by hand. I also didn't want to add too many data alteration shenanigans to the "index__" code itself, although I could have added an exception like "If you see 'SimplyMap', also index it as 'Simply Map'". Too many exceptions in code seem to me to say that the code is not itself exceptional. I think it's generally better to set up rules than exceptions.

So what we did was manually maintain a second delimited file in the module folder containing the same display metadata for the item called "SimplyMap" but with the alternative spelling "Simply Map" as the phrase to be searched against. That way we could maintain the second file as needed without any data-overwrite or code exception nonsense. The "index__" script simply read both delimited files and placed the data into the SQLite database. This allowed searching for "SimplyMap" or "Simply Map" to return to the user the exact same metadata as if they'd searched for the canonically spelled "SimplyMap". Since the display title for both was "SimplyMap", the search script simply deduped the results based on the display title so as to avoid any possibility of duplicate results. That way, if someone searched for "SimplyMap Simply Map" they wouldn't see two results, both for "SimplyMap".
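Here's a tiny sketch of that dedup step, assuming the "search__" script has gathered its candidate rows (canonical and synonym alike) into an array of schema-shaped results; the function name and array handling are mine, not the actual code.

<?php
// Hypothetical dedup step: keep only the first result for each display title,
// so a canonical row and its synonym row never both appear in the response.
function dedupe_by_title($results) {
    $seen = array();
    $deduped = array();
    foreach ($results as $result) {
        $key = strtolower(trim($result["title"]));
        if (!isset($seen[$key])) {
            $seen[$key] = true;
            $deduped[] = $result;
        }
    }
    return $deduped;
}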
Sure, if we got rid of "SimplyMap" from our database or changed some data, we would have had to manually delete or alter the synonymous row in our second delimited file. This could have been avoided with a little more work in the "index__" script, but I'm a strong believer that forcing some level of manual work can be an effective means of enforcing some quality control. I mean, I, for instance, brush my teeth much more effectively with a manual toothbrush than an electric one because I have to think more about my technique.

Garsh, I'm not explaining this well, so here's the delimited row as extracted from our MySQL database for "SimplyMap". Note that the "keyword" field is what's searched against, though the "title" field is what's shown to the user.

keyword: SimplyMap
title: SimplyMap
image:
image_m:
link: http://nclive.org/cgi-bin/nclsm?rsrc=247
link_m: http://nclive.org/cgi-bin/nclsm?rsrc=247
date:
description: Web-based mapping application that lets you create professional-quality maps and reports using demographic, business, and marketing data.
author:
source: Geographic Research

And here's the hand-maintained row for the synonym "Simply Map". Note that the "title" field still reads "SimplyMap".

keyword: Simply Map
title: SimplyMap
image:
image_m:
link: http://nclive.org/cgi-bin/nclsm?rsrc=247
link_m: http://nclive.org/cgi-bin/nclsm?rsrc=247
date:
description: Web-based mapping application that lets you create professional-quality maps and reports using demographic, business, and marketing data.
author:
source: Geographic Research

That "explanation" still sucked, didn't it? Sorry.

Search strategies for local data

Getting back to my dig at the top about how some folks just throw metadata around in search tools like Solr, etc. without thinking about search strategies ... let me say that I specifically didn't go with Solr for the local data modules because I didn't think it would let us tailor our search strategies to each module as we wanted. Even if it could, we were on a crazy deadline and I didn't have time to find out. I knew SQLite would work. Well, I was pretty damn sure, at least. Plus, I liked how the SQLite database file would reside inside each module's own folder, so everything was nice and neat and we didn't even need to use another port for a query server.

By "search strategies" I'm getting at the idea that a federated search searches against existing search interfaces, where each search is theoretically tailored to effectively search that particular data set. That's to say that, with local modules, we approached it more from the perspective that the data just happened to be local. So we went about devising indexing and search strategies for some modules very differently - almost as if they'd each been developed by completely independent third parties.

For our journals module, we indexed coverage data and considered coverage dates when sorting search results. Of course, the response schema didn't have a discrete field denoting the length of a resource's coverage or its coverage start and stop dates. If we needed to display that data to the user, it would have ended up somewhere in something like the Description field in the response. Again, that's because the response schema was designed to be read by end-users to help them make a selection. The concept of coverage just wasn't there for our website FAQ or our list of databases, etc.
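Before moving on, here's a hedged sketch of the kind of coverage-aware ordering the journals module might have used. The table and column names are invented for illustration; the point is just that coverage lives only in the module's SQLite file and influences ranking without ever becoming a discrete field in the response.

<?php
// Hypothetical journals query: rank title matches, preferring journals with
// current coverage and then longer coverage. Names are assumptions.
$db = new PDO("sqlite:" . __DIR__ . "/journals.db");
$stmt = $db->prepare(
    "SELECT title, link, description, source
     FROM journals
     WHERE title LIKE :term
     ORDER BY (coverage_end IS NULL OR coverage_end = '') DESC, -- current coverage first
              (julianday(coverage_end) - julianday(coverage_start)) DESC -- then longest coverage
     LIMIT 10");
$stmt->execute(array(":term" => "%music theory%"));
$journals = $stmt->fetchAll(PDO::FETCH_ASSOC);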
For our website, we went with more generic keyword relevance. For our list of databases, we used text similarity scores (which I'll touch on later) as the primary sorting criteria. This is because we thought searches for databases might be more like what I call "name brand searches", where people are searching for a specific thing rather than a specific type of thing. Library databases and platforms have such odd names that don't necessarily do a great job of revealing their purpose, so we thought that if someone typed in "NoveList", etc., then they were likely looking for that product. Contrast that with journal titles, where real-world topical phrases or "subjects" often exist in the title itself. For example, the concept of "music theory" exists in the journal title itself when considering the Journal of Music Theory. So we employed more vanilla relevance in searching the journal module titles, as opposed to the list of databases, where we relied a little more on text similarity.

For local modules, we even used different queries depending on whether we detected the query as "basic" or "advanced", as mentioned in my first post on this project. And we also, depending on the module's needs, often tiered results (what I'm calling "tiered searching") so that an exact phrase match's relevancy always outweighed that of AND-separated matches, which in turn always outweighed that of OR-separated matches. I'd done something like this before [http://blog.humaneguitarist.org/2013/12/07/search-and-auto-complete-suggestions-with-a-little-solr-and-lots-of-sqlite/] with auto-complete suggestions and it worked well, so we employed it as needed in some of our modules, too.

The bottom line is that different modules seemed to work better with strategies tailor-made for them, as opposed to using the same approach for everything. Barring the need to create a federated search simply because one isn't permitted to locally index all the metadata from all the resources one wants to query, I don't see the logic in creating a federated search if one is going to use a cookie-cutter approach to search without considering the inherent differences between different data sets.
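As a rough illustration of the "tiered searching" idea, here's a sketch of a SQLite query that ranks an exact phrase match above an AND match, and an AND match above an OR match, for a hypothetical two-word query, "music theory". The table and column names are made up, and the real modules would have built this kind of query dynamically from the parsed search phrase; this just shows the ranking trick.

<?php
// Hypothetical tiered search: tier 1 = exact phrase, tier 2 = all words (AND),
// tier 3 = any word (OR). Each row keeps its best (lowest) tier via GROUP BY.
$db = new PDO("sqlite:" . __DIR__ . "/databases.db");
$sql = "
    SELECT MIN(tier) AS tier, title, link, description FROM (
        SELECT 1 AS tier, rowid, title, link, description FROM resources
            WHERE keyword LIKE '%music theory%'
        UNION ALL
        SELECT 2, rowid, title, link, description FROM resources
            WHERE keyword LIKE '%music%' AND keyword LIKE '%theory%'
        UNION ALL
        SELECT 3, rowid, title, link, description FROM resources
            WHERE keyword LIKE '%music%' OR keyword LIKE '%theory%'
    )
    GROUP BY rowid
    ORDER BY tier ASC
    LIMIT 10";
$rows = $db->query($sql)->fetchAll(PDO::FETCH_ASSOC);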
Text Similarity and "guessing" what the user might be looking for

I've mentioned a few times that we used text similarity scores to highlight some search results and sometimes even to help sort a module's results. In terms of highlighting results, we wanted people to be able to type in "Wall Street Journal" and get a link to that resource prominently displayed at the top. The link was displayed in a box called "Best Bet" which was only visible if we had one of these prominent links to show. We stole the term "best bet" from our NC State colleagues upstairs. While we'd seen the "best bet" kind of thing before in federated tools, we didn't want to rely only on our "triggers module" and manually set up phrases as the criteria for showing these results. In other words, we wanted the system to be proactive enough to detect that "Wall Street Journal" or maybe something close like "Wall St Jounal" (I intentionally misspelled "journal") might have been an attempt to find the resource called "Wall Street Journal online" by ProQuest, as in the screenshot below.

IMAGE: "metasearch screenshot" [http://blog.humaneguitarist.org/uploads/ncl_metasearch_similarity.png]

Now, relying on automation to guess what the user might be looking for will occasionally message something to the user that might not make much sense to them (Um, why are you recommending this to me?), but we figured a user can ignore things easily, as they're used to doing with, say, spelling/grammar suggestions that aren't what they want. We chose that as the lesser evil vs. the more limited approach of setting everything up manually. I mean, how many spelling variations would we have to set up for even just one resource, "Wall Street Journal"? At least "Wall St. Journal" and "Wall St Journal". Both of those, by the way, already got handled automatically by our text similarity technique. It's not perfect. Sometimes it over-reaches, sometimes it under-reaches. But it's grander in scope than trying to manage everything by hand. And both the algorithm and the display criteria could always be improved.

By the way, the reason "Wall St Jounal" results in a "Best Bet" box for "Wall Street Journal online" is that we'd used our synonym technique to equate "Wall Street Journal" to "Wall Street Journal online" (the canonical ProQuest title). So, in fact, "Wall St Jounal" isn't a great text similarity match against "Wall Street Journal online", but it is against our synonym "Wall Street Journal".

Anyway, in terms of the threshold for when to display a special message: if we got a similarity score of at least 80 percent (if I remember correctly), we'd message the user with a "Best Bet" box. That threshold just naturally came about from some testing/messing around I'd done about a year prior. Now, if the user selected a search term from the typeahead.js [https://twitter.github.io/typeahead.js/] autocomplete suggestions powered by a ProQuest auto-complete API, then we used a higher threshold of, I think, 90 percent. The idea was that since we'd actually provided the user a controlled term, it seemed to make sense to use a higher threshold, since we were the ones feeding them their query.

By the way, I've also seen some people suggest using usage/click data to know what to recommend to users: they typed in X, and since most people who type in X go to resource Y, we can guess they want resource Y and message accordingly. First, I think you need sloads [http://www.urbandictionary.com/define.php?term=shitload&defid=558049] of click data to justify this, but there's also the self-fulfilling prophecy issue. If you recommend that someone click something and then they click on it, you risk feeding your own argument. Did they click on it because they wanted to or because you asked them to? How do you know? Hell, maybe the user doesn't even know why.

At the heart of the text similarity calculator is the PHP "similar_text [http://php.net/manual/en/function.similar-text.php]" function. Our actual function took four parameters: the two strings to compare, of course, but also whether word order should matter and whether there were any stopwords to eliminate. Here's the start of the function so you can see what it does:

function similarity_tools__calculate_similarity($string_1, $string_2, $sorted=0, $stopwords_csv=Null) {
    /* Returns a text similarity score between two strings; ignores word order
    if $sorted = 1; weighs comparison with/without optional comma-separated
    words in $stopwords_csv and uses the higher score. */

So the following calls would all report back that the strings are identical:

similarity_tools__calculate_similarity("foo bar", "foo bar");
similarity_tools__calculate_similarity("foo bar", "bar foo", 1); // discounts word order.
similarity_tools__calculate_similarity("foo bar", "bar foo baz", 1, array("baz")); // discounts word order; removes the word "baz" from consideration.
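The rest of the function isn't shown here, so the following is my own guess at a minimal version built around similar_text(), just to make the approach concrete; it's a sketch, not the actual NC LIVE code. (It accepts the stopwords as either a comma-separated string, per the docblock, or an array, per the third example call above.)

<?php
// Hypothetical re-implementation of the similarity calculator for illustration.
function my_calculate_similarity($string_1, $string_2, $sorted = 0, $stopwords = null) {
    // Accept the stopwords as a comma-separated string or an array.
    if (is_string($stopwords)) {
        $stopwords = explode(",", $stopwords);
    }
    $stopwords = array_map("trim", (array) $stopwords);

    // Normalize a string: lowercase, split into words, optionally drop
    // stopwords, and optionally sort the words to ignore word order.
    $prepare = function ($string, $drop_stopwords) use ($sorted, $stopwords) {
        $words = preg_split("/\s+/", strtolower(trim($string)));
        if ($drop_stopwords) {
            $words = array_values(array_diff($words, $stopwords));
        }
        if ($sorted == 1) {
            sort($words);
        }
        return implode(" ", $words);
    };

    // Score the comparison with and without the stopwords removed and keep the
    // higher score, per the docblock above. similar_text() reports the
    // percentage of similarity via its third (by-reference) argument.
    similar_text($prepare($string_1, false), $prepare($string_2, false), $with);
    similar_text($prepare($string_1, true), $prepare($string_2, true), $without);
    return max($with, $without);
}

A caller could then do something like if (my_calculate_similarity($query, $candidate_title, 1) >= 80) { ... } to decide whether to show the "Best Bet" box, with the threshold bumped to 90 when the query came from the autocomplete suggestions, as described above.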
Depending on the module, sometimes we discounted word order and sometimes we discounted certain stopwords. As an example of the former, in our "triggers module", if we'd set up a trigger for "download ebooks" and a user typed in "ebook download", the similarity score would discount word order and compare "download ebooks" against the reordered "download ebook", which would be similar enough to show a link in the "Best Bet" box. Again, this gets back to trying to have strategies aligned with each module.

Conclusion

Looking back - and even looking back through the code - it's almost laughable (a bit foolish?) how much we decided to take on, given a tight deadline and my leaving just days after all the code for the backend API got finalized. My colleague, Sarah, who was our web developer and my partner in crime on this, has also subsequently left NC LIVE for another opportunity. That's part of the reason for this post: we were going to write a paper to explain a few things we tried, but we both moved on to other things, so I wanted to document a little bit here. It would have been much better if she'd been involved in the writing, though!

This post took me hours and hours to write and I'm not happy with it at all, honestly. Our search tool and its logic are, I think, much better than this confused post of mine makes them seem. And while it's really fodder for another post, that's always the risk of building tools - people leaving, plans changing, etc. But that's why you document, have a clear vision of what you're trying to do, and take steps to try and keep things going even if you think you'll be gone soon. In other words: create with intention ... and fear.

In the end, I think we did a good job and ended up with an effective search tool that did some things I think people should think more about. I really haven't seen a whole heck of a lot - not that I've looked hard - about the role of query parsing, "tiered searching", and text similarity in searching library data, but I sure as hell have seen some platforms that I think need to be thinking more about these kinds of things. By no means am I suggesting we created some cutting-edge piece of awesomeness; I'm just saying that maybe we threw a few things out there to think a bit more about. You know the big search engines are doing this kind of stuff (and a lot more, of course). I'm just saying libraries and library vendors can do more than most do in regard to search. That's all.