blog.humaneguitarist.org

search and auto-complete suggestions with a little Solr and lots of SQLite

[Sat, 07 Dec 2013 16:08:53 +0000]
Over the last few days, I've been working on how to throw back search and auto-complete suggestions against a Solr index for an eBook project at work. The idea is simple: a user starts typing and matches appear below the search box as the user types. The matches would "suggest" the best matching "title" and "author" values against the user query. Conceptually, it's simple and quite easy to do with an SQL database. Problem is, we've decided, for now at least, to use Solr because it offers a far better fulltext search experience than anything else we currently have experience with. We're in a bit of a time crunch, so exploring other options isn't much of, well, an option. For now.

Anyway, to do this I was just going to create a JSON API so our web developer could send the user's typed text to the API and get back the "best" matches. She's using typeahead.js [http://twitter.github.io/typeahead.js/] in the mockup we have and it seems to be working well against a very small test index. I should say that typeahead.js seems to support pre-loading a JSON file and does a good job of showing suggestions against that file. The problem is that I don't see that scaling to potentially thousands of items, nor will it allow the most relevant matches to appear at the top. Pre-loading the data seems to simply follow the order of the items in the JSON file, and it's impossible to know what the "best" order is for one user at any one time, let alone all users. So, the API has to decide what to return and how to order it.

Anyway, I wasn't getting far with Solr's own "suggest" method; it really didn't seem to be what we wanted. I also saw some tutorials that, honestly, I didn't have the time to explore and test with. The search-suggestion/auto-complete thing is, in the end, low priority. Very much unlike our deadline. :P

One tutorial here [http://solr.pl/en/2010/10/18/solr-and-autocomplete-part-1/] that was easy to work with and implement was based on using facets to return what I call "perfect" matches. These are matches against the user string plus a post-fixed wildcard. In other words, if the user types "the cat", a perfect match would be "the cat in the hat (title)", etc. At the end of the tutorial, note the warning about "the load caused by the use of faceting mechanism."

Anyway, the perfect matches alone weren't really doing it for us; we also wanted relevant matches, so that if one typed in "cat in the hat" the API would likely return "the cat in the hat (title)" because, while not a perfect match, it should have a high relevance score.

Well, in moments of crisis I like to remind myself of a little thing called SQLite and in-memory databases. So, here's what appears to be working. And it should scale, limited only by the speed of Solr but not directly by index size.

What I did was simply set the API to query against the "title" and "author" fields in Solr (with post-fixed wildcards). I'm using the default of 10 rows for a search, so the maximum number of matches returned is 20, i.e. 10 * (# of fields). We can always bump that up if we think it's needed. So after Solr returns the results, we have 20 total field values. Those are then fed into an in-memory SQLite database. If the user typed in "the cat in th", the query would be a union of a post-fixed wildcard search and a fulltext search (with SQLite's built-in support for Porter stemming!) against the 20 values. For the fulltext search, each word in the user query is separated by " OR ".
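To make that concrete, here's roughly the SQL the API ends up building for a typed query of "the cat in th" (the table and column names match the code further down; how the rank values get assigned is explained next):

    SELECT suggestion, suggestion_type, 99999 as rank FROM box WHERE suggestion LIKE 'the cat in th%' -- "perfect" matches.
    UNION SELECT suggestion, suggestion_type, length(offsets(box)) as rank FROM box WHERE suggestion MATCH 'the OR cat OR in OR th' -- relevant matches.
    ORDER BY rank DESC;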
Since we need the results ranked by relevance, I hard-code the "perfect" matches to have a relevance rank of "99999" and then use the length of the string returned by SQLite's offsets() function as the relevance rank for the remaining values. After SQLite is done, the query results are simply placed into an array, with the field name added parenthetically, i.e. "the cat in the hat (title)". Then the array is JSON encoded and shipped out the door. By the way, I'm also making all the values lowercase for reading ease and to normalize things like "JANE DOE" and "Jane Doe".

p.s. just because 20 values are returned from Solr, that doesn't mean SQLite would return 20 results, because many of those values are, of course, neither perfect matches nor hits against the fulltext search. The remaining values could be returned as well, but so far I'm not thinking we want that, as some of the results won't make sense to the user. That's to say, if one types in "the cat in the hat", the value of "dr. seuss (author)" could come back if we wanted it to; it would simply be among the rest of the values obtained by adding another "UNION" clause, i.e. a query that returns everything in the database. But the question is not "how?" but rather "why?".

Anyway, when/if a user clicks on a suggestion, the suggestion will be fed to a "search" method in the API that will sniff the field name out of the parenthesis and return results by using an " AND " separated search against that field in the index (there's a sketch of that at the end of this post).

My only outstanding question at this point is whether I should do what I do now, insert each value into the SQLite database one at a time, OR build one big INSERT statement and insert all the data into SQLite in one fell swoop (see the aside in the code below). Anyway, the code is below. Ultimately the meat of the logic here is Solr-agnostic, since any data store with key/value pairs could be made to work with this. That's good because, as I initially stated, we're only using Solr out of time constraints, not necessarily because we think it's the best way to go.

<?php
function return_suggestions($term, $min_len=2) {
    /* Returns JSON auto-complete suggestions against Solr index of "author"
    and "title" fields for $term. Suggestions are lowercase and contain the
    field name within parenthesis. */

    // force $term to be at least equal to $min_len.
    if (strlen($term) < $min_len) {
        exit;
    }

    // !!! these will eventually be handled by a common library (query_tools.php).
    $term = trim($term);
    $term = str_replace("%20", " ", $term);

    // query Solr.
    include_once("includes/solr_tools.php");
    $params = "&q=author:$term*+OR+title:$term*&fl=author,title";
    $response = query_solr($params, "select", "+OR+");

    // get "docs" field from Solr.
    $docs = $response["response"]["docs"];
    //print_r($docs); //test line.
    if (!$docs) {
        exit;
    }

    // create SQLite database.
    $memory_db = new SQLite3(":memory:");

    // create table; you must use "VIRTUAL TABLE" for fulltext (FTS3/4);
    // see: http://www.sqlite.org/fts3.html#section_1_2.
    $memory_db->exec("CREATE VIRTUAL TABLE box USING FTS4 (id INTEGER PRIMARY KEY AUTOINCREMENT, suggestion, suggestion_type, tokenize=porter)");

    // get the "suggestion" and the field it came from; insert values into database.
    $fields = array_keys($docs[0]); //the fields are "author" and "title" because those are the ones requested per $params/URL query.
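    // aside, re: the one-at-a-time vs. one-big-INSERT question above: the loop
    // below could be collapsed into a single multi-row INSERT, something like
    // this untested sketch (multi-row VALUES needs SQLite 3.7.11+):
    //
    //   $values = array();
    //   foreach ($docs as $doc) {
    //       foreach ($fields as $field) {
    //           $values[] = "(\"" . strtolower($doc[$field]) . "\", \"$field\")";
    //       }
    //   }
    //   $memory_db->exec("INSERT INTO box (suggestion, suggestion_type) VALUES " . implode(", ", $values));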
    foreach ($docs as $doc) {
        foreach ($fields as $field) {
            $suggestion_value = strtolower($doc[$field]);
            $insert = "INSERT INTO box (suggestion, suggestion_type) VALUES (\"$suggestion_value\", \"$field\")";
            //echo $insert . "\n"; //test line.
            $stmt = $memory_db->exec($insert);
        }
    }

    // prepare for query; make database query; run query.
    $term_OR_separated = str_replace(" ", " OR ", $term); //for fulltext, need to separate words by "OR".
    $query = "SELECT suggestion, suggestion_type, 99999 as rank FROM box WHERE suggestion LIKE '$term%'" //"perfect" matches.
           . " UNION SELECT suggestion, suggestion_type, length(offsets(box)) as rank FROM box WHERE suggestion MATCH '$term_OR_separated'" //relevant matches.
           . " ORDER BY rank DESC";
    //echo $query; //test line.
    $results = $memory_db->query($query);

    // append matches to $suggestions.
    $suggestions = array();
    while ($row = $results->fetchArray()) {
        //print_r($row); //test line.
        $suggestion = $row["suggestion"] . " (" . $row["suggestion_type"] . ")"; //i.e. "jane doe (author)" instead of "jane doe".
        if (!in_array($suggestion, $suggestions)) {
            array_push($suggestions, $suggestion);
        }
    }
    $memory_db->close();

    // exit if $suggestions is empty.
    if (count($suggestions) < 1) {
        exit;
    }

    // write and output JSON.
    include_once("includes/make_json.php");
    $output = array("suggestions"=>$suggestions);
    echo make_json($output);
}

// execute return_suggestions().
if (isset($_GET["q"])) {
    return_suggestions($_GET["q"]);
}
?>

Update, later in the day: So, I thought more about augmenting the query, not to return all the remaining results, but just the ones for which the last word (or word fragment) entered by the user, surrounded by wildcard characters, yields something. Here's the part of the code I changed compared to that above (the "prepare for query" section).

    // prepare for query; make database query; run query.
    $term_OR_separated = str_replace(" ", " OR ", $term); //for fulltext, need to separate words by "OR".
    $term_last = explode(" ", $term); //getting last word in query; see: http://stackoverflow.com/a/11029470.
    $term_last = $term_last[count($term_last)-1];
    //echo $term_last; //test line.
    $query = "SELECT suggestion, suggestion_type, 99999 as rank FROM box WHERE suggestion LIKE '$term%'" //"perfect" matches.
           . " UNION SELECT suggestion, suggestion_type, length(offsets(box)) as rank FROM box WHERE suggestion MATCH '$term_OR_separated'" //relevant matches.
           . " UNION SELECT suggestion, suggestion_type, 0 as rank FROM box WHERE suggestion LIKE '%$term_last%'" //wildcard against last word only.
           . " ORDER BY rank DESC";
    //echo $query; //test line.
    $results = $memory_db->query($query);
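As promised above, here's roughly what the field-sniffing in the "search" method could look like. The function name is made up, and I'm assuming query_solr() takes the same arguments as in the suggestion code above; treat this as an untested outline rather than the real method.

<?php
// rough sketch (hypothetical function name) of the "search" method's field
// sniffing: pull the field name out of the trailing parenthesis of a clicked
// suggestion, i.e. "the cat in the hat (title)", and run an " AND " separated
// search against that field.
function search_from_suggestion($suggestion) {
    if (!preg_match('/^(.*) \(([^)]+)\)$/', trim($suggestion), $matches)) {
        exit;
    }
    $term = $matches[1];  //i.e. "the cat in the hat".
    $field = $matches[2]; //i.e. "title".
    // separate the words by " AND " for the field search.
    $term_AND_separated = str_replace(" ", "+AND+", $term);
    include_once("includes/solr_tools.php");
    $params = "&q=$field:($term_AND_separated)";
    return query_solr($params, "select", "+AND+");
}
?>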