blog.humaneguitarist.org

keyword vs. phrase searching of the Soundboard, a GFA publication

[Sat, 05 Jan 2013 17:35:54 +0000]
As I mentioned before [http://blog.humaneguitarist.org/2012/07/04/the-gfa-2012-charleston-and-me/], last summer I went to the Guitar Foundation of America [http://www.guitarfoundation.org/] convention in Charleston. I also mentioned [http://blog.humaneguitarist.org/2012/07/04/the-gfa-2012-charleston-and-me/#Friday_June_29] that I'd asked some questions about whether the GFA journal, "Soundboard," was full-text indexed. Via the FlippingBook [http://flippingbook.com/] software the GFA uses to display current issues online (membership required), there is full-text searching capability because, as far as I can tell, the content is indexed. But as I was saying [http://blog.humaneguitarist.org/2012/07/04/the-gfa-2012-charleston-and-me/#Friday_June_29], I don't think one can search across *all* online Soundboards simultaneously - i.e. fire off one query and get results across all online Soundboards. I could be wrong about that.

In contrast, the PDF back issues [https://guitarfoundation.site-ym.com/store/view_product.asp?id=926298] sold on a DVD-ROM are neither full-text indexed nor full-text searchable with Adobe Acrobat Reader, as far as I can tell. And I think this is where there's real confusion - perhaps on my part - about what we mean when we use terms like "keyword" searching. To me, keyword searching means full-text and not a "find" (as in Acrobat Reader). The Webopedia site differentiates these as "keyword [http://www.webopedia.com/TERM/K/keyword_search.html]" and "phrase [http://www.webopedia.com/TERM/P/phrase_search.html]" searches, respectively. The GFA uses a different meaning of "keyword" searching, per the "How to search Soundboard back issues.pdf" file that comes with the DVD: "These issues have been processed both to reproduce the page-by-page appearance of the originals on your computer screen, and to apply an "optical character recognition" (OCR) process to the text, so that every page of every issue is now keyword searchable."

In my experience, however, the search provided internally via Adobe Acrobat Reader (and Foxit Reader, too) is what I'd just call a "find" (i.e. the same as Ctrl-F in your browser). In fact, in my version of Acrobat Reader and per the screenshot in the "How to search Soundboard back issues.pdf" file, Adobe also uses the word "find," not "search," in its application. Their "Advanced Search" adds options dealing with what to search (comments, all files in a folder, etc.) but not really how to search (in the algorithmic sense) - so it's still a "find," though more feature-rich. Now, if you have Acrobat Pro (admittedly, I do through work) you apparently can create an index [http://help.adobe.com/en_US/acrobat/pro/using/WSC28D4DBB-6A78-4027-9E04-F50FE411CFB9.w.html] and then actually do a full-text search, but that doesn't help people who don't have the Pro version and won't/can't buy it. Granted, I can index the PDF with my operating system (Windows) and do a full-text search, but I don't get much useful information other than which files match. I don't get useful information on where the passage exists (page number, etc.).

Consider the following passage from Soundboard Volume 1, Number 1, 1974: "Mr. Llois Mauerhofer, Elizabethstrasse 93, 8010 Graz, Lustria, was reported working on a doctoral dissertation at the University of Graz on Leonard von Call, early 19th c. guitarist active in Vienna who is best remembered for his serenades for guitar and strings."

A "find" won't match that passage if you search for "Graz University" or "University Graz" or "strings Vienna," but a real keyword search likely would.
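To make the difference concrete, here's a toy sketch in Python - just the idea, not how Acrobat, FlippingBook, or any real search engine is actually implemented: a "find" demands that the literal phrase occur verbatim, while a keyword search only asks that every term appear somewhere in the text.

# Toy illustration of "find" vs. keyword searching (a sketch of the
# concept only; real search engines use inverted indexes, stemming,
# ranking, and so on).
import re

passage = ("Mr. Llois Mauerhofer, Elizabethstrasse 93, 8010 Graz, Lustria, "
           "was reported working on a doctoral dissertation at the University "
           "of Graz on Leonard von Call, early 19th c. guitarist active in "
           "Vienna who is best remembered for his serenades for guitar and "
           "strings.")

def find(query, text):
    # "find" (Ctrl-F): the query must occur verbatim as one phrase.
    return query.lower() in text.lower()

def keyword_search(query, text):
    # keyword search: every term must occur somewhere, in any order.
    words = set(re.findall(r"\w+", text.lower()))
    return all(term in words for term in re.findall(r"\w+", query.lower()))

for query in ["Graz University", "University Graz", "strings Vienna"]:
    print(query, "| find:", find(query, passage),
          "| keyword:", keyword_search(query, passage))

# find: False for all three queries; keyword: True for all three.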
A "find" won't match that passage if you search for "Graz University" or "University Graz" or "strings Vienna" but a real keyword search likely would. Of course, a demonstration is in order, so using a tool called Apache Tika [http://tika.apache.org/] to extract the text from the aformentioned PDF scan of Soundboard v.1, #1, 1974; a little Python software script I wrote to output the data to a database-friendly file; and an online database, I indexed the data and made a little API - all that means is that there's page you can go to, throw some search terms at it, and get the results back as structured data (um, usually not fun to read through). By the way, I normally use more technical jargon in my posts but I have some guitarist buddies who I want to read this page. Anyway, here are the three searches mentioned above that don't yield results in Acrobat Reader but do using a full-text search (you can see the search terms in bold in the links below). Don't worry if you can't read the output, just focus on the fact that something comes back (provided my database isn't down at the moment!). http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=Graz+University http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=University+Graz http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/search/?q=strings+Vienna For a more user-friendly version, try going here: http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/soundboard_search.html [http://blog.humaneguitarist.org/uploads/Soundboard/currentVersion/soundboard_search.html] Try typing in the three searches mentioned above. Then try some more searches for fun. For simplicity's sake, I hard-coded the system to never return more than 10 results. Of course, this should all scale to indexing the text of all the PDFs on the DVD, but exposing those openly on the web wouldn't be appropriate. But my point with this demo is to say that this is more like what I meant by "keyword" searching at the GFA convention. There's probably a way to ingest the old PDFs into the FlippingBook software or at least something else like the Internet Archive book reader [http://archive.org/details/BookReader]. That would probably require re-OCRing the images so that the coordinates of the words could be indexed as well, allowing one to see where on a page the results are, just as with the current issues via FlippingBook. Ok, if you're still here and are a geek, here's the Python script, "soundboardToTabDelimited.py". ''' usage example: $ python soundboardToTabDelimited.py V01-n1-1974.pdf This yields "V01-n1-1974.xhtml" and then "V01-n1-1974.txt" Note: you must have the lxml module installed (which isn't always fun). You can get it here: http://lxml.de/ ''' import codecs, subprocess, sys from lxml import etree ##### globals tab = "\t" br = "\n" ##### run Apache Tika on the file passed via the command line soundboard = sys.argv[1].replace(".pdf", "") command_string = "java -jar tika-app-1.2.jar %s > %s" %(soundboard + ".pdf", soundboard + ".xhtml") command = subprocess.Popen(command_string, shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE) command.wait() #wait until the subprocess finishes. ##### write file headers (this needs to be deleted if you're going to later import the file via PHPMyAdmin). 
tab_delimited = codecs.open(soundboard + ".txt", "w", "utf-8") # output file
tab_delimited.write("journal_id" + tab + "volume" + tab + \
                    "issue" + tab + "year" + tab + \
                    "page_id" + tab + "text_id" + tab + "text" + br)

##### extract volume, issue, year from the filename (e.g. "V01-n1-1974")
volume = int(soundboard.split("-")[0].replace("V", ""))
issue = int(soundboard.split("-")[1].replace("n", ""))
year = int(soundboard.split("-")[2])
journal_id = "%04d_%04d_%04d" %(volume, issue, year)

##### parse the xhtml file Tika produced
soundboard_parse = etree.parse(soundboard + ".xhtml")
root = soundboard_parse.xpath(".")
div_tags = root[0].xpath("//xhtml:div[@class='page']",
                         namespaces={"xhtml":"http://www.w3.org/1999/xhtml"})

##### extract text from each div/p tag and write data to file
page_id = 1
for div_tag in div_tags:
    text_id = 0
    p_tags = div_tag.xpath("xhtml:p",
                           namespaces={"xhtml":"http://www.w3.org/1999/xhtml"})
    for p_tag in p_tags:
        p_text = p_tag.text
        if p_text is not None and p_text != "":
            p_text = p_text.replace(br, "")
            p_text = p_text.replace(tab, " ")
            p_text = p_text.strip()
            if p_text != "":
                tab_delimited.write(str(journal_id) + tab + str(volume) + tab + \
                                    str(issue) + tab + str(year) + tab + \
                                    str(page_id) + tab + str(text_id) + \
                                    tab + p_text + br)
                text_id = text_id + 1
    page_id = page_id + 1
tab_delimited.close()

# fin
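And if you're still reading, here's a rough sketch of the other half of the demo: indexing the tab-delimited output and searching it. To be clear, this is not the actual setup behind my API (that's an online database the file was imported into); it just uses SQLite's FTS5 full-text extension as a self-contained stand-in (assuming your Python's SQLite was built with FTS5), so you can see a keyword search that also reports which page each hit is on.

# Sketch only: index the script's output with SQLite FTS5 as a
# stand-in for the online database behind the demo. The input file
# name comes from the usage example above.
import csv, sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE VIRTUAL TABLE soundboard USING fts5(
                journal_id UNINDEXED, volume UNINDEXED, issue UNINDEXED,
                year UNINDEXED, page_id UNINDEXED, text_id UNINDEXED, text)""")

with open("V01-n1-1974.txt", encoding="utf-8") as data:
    rows = csv.reader(data, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(rows) # skip the header row the script writes
    conn.executemany("INSERT INTO soundboard VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# A keyword search: the terms can appear anywhere in a passage, in any
# order, and each hit reports the page it came from. LIMIT 10 mirrors
# the demo's hard-coded cap of 10 results.
for query in ["Graz University", "University Graz", "strings Vienna"]:
    hits = conn.execute("SELECT page_id, text FROM soundboard "
                        "WHERE soundboard MATCH ? LIMIT 10", (query,)).fetchall()
    for page_id, text in hits:
        print(query, "-> page", page_id, ":", text[:60] + "...")

Unlike the "find" in Acrobat Reader, all three queries come back with hits here, and each hit carries its page number - which is exactly the information the Windows indexer wouldn't give me.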