musings on metasearch, part 1

You know you're desperate for content when you're posting about older things ... Background Before I left NC LIVE [https://www.nclive.org/] in November 2014, our web and UX person Sarah and I decided around September/October that, despite the crazy timeline with my leaving, we'd go ahead with a project we'd always talked about. Internally, we called it "metasearch". Basically, it was/is a federated search that took a user query and ran it across our local data (which databases and journals we provided, etc.) and external data sources, namely our Summon API (based on Solr). Rather than build a soup-to-nuts UI that talked directly to our local and external data, I'd always held that there needed to be a single, centralized API under the hood that queried each resource and then returned the response to the UI with a normalized schema. The reason for this was twofold. First, it allowed our web person to worry about design and not metadata/data - separation of concerns and all. As a bonus, if we ever decided to built something else on top of the API we could ... for example something mobile-specific or even open the API up to our member libraries. Secondly, there were a few things I wanted and needed to do on the backend, a couple of which I'll discuss below. Backend Thing #1: Indexing and Searching Local Data As I mentioned, we had to search across local and external data. Our local data was/is masterfully maintained by my former co-worker and all-around-cool-dude, Scott. And, in large part, because I didn't want to add to his plate, I needed to get around one big problem. See, we were using an older version of MySQL for which InnoDB tables couldn't be full-text searched. And I wasn't about to ask if we could switch things to MyISAM just to facilitate a full-text search. And, yes, full-text was required as far as we were concerned. Instead, I opted to query the MySQL tables for the data we needed, place the data into a tab-separated file (TSV), and then index the TSV files into a standalone SQLite database. Besides having a simple, standalone SQLite database file, SQLite also gave me out-of-the-box support for Porter stemming which our version of MySQL didn't support. I kinda love SQLite. Not in the way I love my bartenders from Raleigh, NC, but still. As for why I didn't go straight from MySQL to SQLite, let me just say that using the TSV file gave me a stop-gap in terms of making sure my queries were working. I'm hoping to mention the other reasons I went with SQLite and TSV in a follow-up post. ... If you're still with me you can now see that we had a single search API that was going to query SQLite databases for local data - regardless of whether their source was MySQL or even a native TSV file, etc. - and it was going to also query external data, namely the Summon API although it could certainly work with any external API that was compatible with our response schema. As a proof to staff, I created a search target for Medline Plus using its API. Backend Thing #2: Query Parsing One of the problems with a single API is that the different search targets can use different syntax to support user attempts at passing advanced query syntax in their query. Another is that targets have different levels of tolerance for certain characters, etc. So we needed a query parser that could return an associative array (this was coded in PHP) that outlines the following: 1. whether the query was Basic or Advanced 2. if the query was Basic then: 1. what an Exact Phrase search looked like 2. what an "AND" separated search looked like 3. what an "OR" separated search looked like For example, for a search for "academic search" (without the quotation marks) the parser would return the following array: Array ( [type] => basic [solr] => Array ( [exact] => academic+search [and] => academic+AND+search [or] => academic+OR+search ) [sqlite] => Array ( [exact] => academic search [and] => academic AND search [or] => academic OR search ) ) In order to determine whether the query was Basic or Advanced the parser would ask things like: 1. Are there capital "AND", "OR", or "NOT" strings in the query? If so, this might be and Advanced search (after some more checking). 2. Are there an even number of double-quotes in the query? If so, this might be [DEL: an Advanced :DEL] a phrase search. 3. Are there an odd number of double-quotes? If so, this might be a misguided attempt at [DEL: an advanced :DEL] a phrase search. 4. etc. Besides all that, the parser would cleanse the query of certain characters and stop words, etc. Creating phrase, "AND" separated, and "OR" separated strings was especially important for the targets using SQLite because once these query strings were created, I actually ran, for Basic searches, an SQL UNION of three queries: one for the phrase, one for the "AND" separated search, and one for the "OR" separated search. I'd adjust the relevancy score for the results so that phrase matches were always more relevant than "AND" separated searches which, in turn, were always more relevant than "OR" separated searches. That way having an extraneous or misspelled word in the query wouldn't punish the user who still types in some useful keywords. After all, why should a user who types in "Wall Street Jounal" be penalized completely because they misspelled the last word? An "OR" separated search would still retrieve an item called "Wall Street Journal" whereas an "AND" separated search would not. I'm pasting the query parser script below. It's not truly complete in that it called some external functions, but I hope the main ideas get across. I'm hoping to write a follow-up post that explains the other thing we wanted to do and how we did it. And that has to do with using text similarity scores to try and highlight results that might be "just" what the user was looking for. I'll probably post all the code at some point, but while there are no database passwords, etc. in the code, I still want to remove the NC LIVE-specific target scripts, because you could see what the databases are called and what the table fields are. <?php function query_tools__query_cleaner($text) { /* Returns an array of sanitized queries and query type (advanced|basic) upon analysis of $text. */ // start with the assumption of a basic query. $query_type = "basic"; // remove empty double quotes and parenthesis. $text = str_replace('""', "", $text); $text = str_replace("()", "", $text); // if the number of double quotes is not even, then penalize the query by lower-casing the query // and removing all double quotes. $test_double_quotes = (substr_count($text, '"')) % 2; if ($test_double_quotes > 0) { $text = strtolower($text); $text = str_replace('"', "", $text); }; // if the number of opening and closing parenthesis don't match, then penalize the query by // lower-casing the query and removing all parenthesis. if (substr_count($text, "(") !== substr_count($text, ")")) { $text = strtolower($text); $text = str_replace(array("(", ")"), "", $text); } // trim query; remove duplicate whitespace. // this must be done after replacing extraneous double quotes and parenthesis so that a query // starting with either would get trimmed. $text = trim($text); $text = preg_replace("/\s+/", " ", $text); // if $text is all CAPS then penalize the query by lower-casing the query. if (strtoupper($text) == $text) { $text = strtolower($text); }; // determine if first and/or last word in query hints at a faux-advanced query. $operators = array("AND", "OR", "NOT"); $words = explode(" ", $text); // lower-case first word if in $operators. if (in_array($words[0], $operators)) { $words[0] = strtolower($words[0]); } // lower-case last word if in $operators. if (in_array(end($words), $operators)) { $words[count($words)-1] = strtolower(end($words)); } // put $text back together again ... all the King's horses and all the King's men. $text = implode($words, " "); // determine if query is "advanced" or "basic". $operators = array(" AND ", " OR ", " NOT "); // if words in query are in $operators; then query is advanced. foreach ($operators as $operator) { $pos = strpos($text, $operator); if ($pos !== False) { $query_type = "advanced"; break; } } // create empty array for all parts of query. $query = array(); // split query at whitespace with double quote enclosure; add query parts to $query. $splits = (str_getcsv($text, " ", '"')); foreach ($splits as $split) { if (strpos($split, " ") !== False) { $split = '"' . $split . '"'; } array_push($query, $split); } // create empty output array. $query_array= array(); // add query type to output along with keys for Solr and SQLite search engines. $query_array["type"] = $query_type; $query_array["solr"] = Null; $query_array["sqlite"] = Null; // prepare and escape "exact" query string for different search engines. $query_string = implode(" ", $query); $query_array["solr"]["exact"] = str_replace(" ", "+", $query_string); $query_array["solr"]["exact"] = string_tools__escape_solr_chars($query_array["solr"]["exact"]); $query_array["sqlite"]["exact"] = str_replace("'", "''", $query_string); $query_array["sqlite"]["exact"] = string_tools__escape_sqlite_chars($query_array["sqlite"]["exact"]); // create AND and OR separated versions of query string if query type is "basic". if ($query_type == "basic") { // create Solr versions of query. $query_solr = str_replace(" ", "+", $query); $query_solr = array_map('strtolower', $query_solr); $query_solr = array_map('string_tools__escape_solr_chars', $query_solr); $query_array["solr"]["and"] = implode("+AND+", $query_solr); $query_array["solr"]["or"] = implode("+OR+", $query_solr); // create SQLite versions of query. // AND/OR/NOT operators require support for Enhanced Query Syntax; see: http://www.sqlite.org/fts3.html#section_3_1 $query_sqlite = str_replace("'", "''", $query); $query_sqlite = array_map('strtolower', $query_sqlite); $query_sqlite = array_map('string_tools__escape_sqlite_chars', $query_sqlite); $query_array["sqlite"]["and"] = string_tools__implode(" AND ", $query_sqlite); $query_array["sqlite"]["or"] = string_tools__implode(" OR ", $query_sqlite); } //print_r($query_array); //test line. return $query_array; } ?>