I recently wrote about NCSU adding a search engine to its online catalog. But after talking to librarians who asked me, “So what did they get for doing that?” I realized I need to back-pedal and explain how a search engine makes an online catalog easier to use (or, as Andrew Pace puts it, "Why OPACs Suck").
Cream Rising to the Top
I'll start today with relevance ranking—the building block of search, found in any search engine, from Google to Amazon to Internet Movie Database to little old Librarians' Internet Index.
At MPOW (My Place Of Work), as we say on the blogs, we're evaluating new search engines. Every product I've looked at offers relevance ranking, and every search-engine vendor tells me, bells and whistles aside, relevance ranking works pretty much the same everywhere.
By default, when a user conducts a search in a search engine—say, a search for the term million—the search engine should return best matches first. That's relevance ranking: the cream of the search results rising to the top. We're so used to this we don't even think twice when Google's first page of hits for the term million returns satisfying results.
But compare that same search in your typical online catalog. Today I picked two dozen online catalogs from around the country and conducted keyword searches for the term million. Call me picky, but the first page of hits—often the first or second hits—for those catalog searches should not include:
- Hog heaven: the story of the Harley-Davidson empire
- The rock from Mars: a detective story on two planets / Kathy Sawyer
- The Johnstown Flood
- Mosby's 2006 drug consult for nurses
- Hotel Rwanda
- Teens cook dessert
An OPAC that Got Game
You don't have to be a rocket scientist to see these catalogs aren't using relevance ranking. But you shouldn't have to be a rocket scientist to use a library catalog in the first place. Compare those results with the same search for million in the NCSU library catalog, powered by the Endeca search engine. Here are the first seven hits:
- 12 million black voices
- Million man march
- million dollar directory
- Black religion after the Million Man March
- Le Million
- Million dollar prospecting techniques
- Green groatsvvorth of vvit, bought with a million of repentance
So How Do You Make Cream, Anyway?
Relevance ranking is actually fairly simple technology. It's primarily determined by the magic of something every search-engine vendor will talk your ears off about: TF/IDF.
TF, for term frequency, measures the importance of the term in the item you're retrieving, whether you're searching a full-text book in Google or a catalog record. The more often the term million shows up in the document—think of a catalog record for the book A Million Little Pieces—the more important the term million is to that document.
IDF, for inverse document frequency, measures the importance of the word in the database you're searching. The fewer times the term million shows up in the entire database, the more important, or unique, it is.
Put TF and IDF together—the importance of a term in a document, and the uniqueness of the same term in the entire database—and you have basic relevance ranking. If the word million shows up several times in a catalog record, and it's not that common in the database, that record should rise to the top, just as it does in Endeca's ranking of the NCSU catalog.
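To make the idea concrete, here is a minimal sketch of TF/IDF scoring in Python. The toy "catalog" records and their text are invented for illustration, and this is not how Endeca (or any vendor's engine) is actually implemented—real engines add smoothing, field weighting, and much more—but the core arithmetic is the same: multiply how often a term appears in a record by how rare it is across the database.

```python
import math

# Invented toy catalog: each record is just its searchable text.
records = {
    "Million Man March": "the million man march million voices million",
    "A Million Little Pieces": "a million little pieces million memoir",
    "Hog Heaven": "hog heaven the story of the harley davidson empire",
}

def tf(term, text):
    """Term frequency: share of this record's words that match the term."""
    words = text.split()
    return words.count(term) / len(words)

def idf(term, docs):
    """Inverse document frequency: terms rare across the catalog score higher."""
    containing = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / (1 + containing)) + 1  # simple smoothed variant

def score(term, docs):
    """TF x IDF for the term against every record in the catalog."""
    return {title: tf(term, text) * idf(term, docs)
            for title, text in docs.items()}

# Best matches first: records where "million" is frequent rise to the top;
# a record that never mentions it scores zero and sinks to the bottom.
ranked = sorted(score("million", records).items(), key=lambda kv: -kv[1])
```

Run the search for million against this toy catalog and the two records that actually mention the term come back first, with the Harley-Davidson book last—the cream rising to the top, in miniature.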
The users who complain that your online catalog is hard to search aren't stupid; they are simply pointing out the obvious. Relevance ranking is just one of many basic search-engine features missing from online catalogs. NCSU worked around the problem by adding a search engine on top of its catalog database. But the interesting questions are: Why don't online catalog vendors offer true search in the first place? And why don't we demand it? Save the time of the reader!
Technorati tags: library, library catalog, library catalogs, Online catalogs, OPAC