Jason Vaughan Discusses Web Scale Discovery Systems

By Daniel A. Freeman

For people in the library systems world, it's no secret that Web Scale Discovery Systems are a big deal. As Jason Vaughan explains in the new issue of Library Technology Reports, "These services are capable of searching quickly and seamlessly across a vast range of local and remote content and providing relevancy-ranked results in the type of intuitive interface that today’s information seekers expect."

Jason's report provides never-before-seen insight into these services and their potential to transform library systems. The report describes in detail the content, interface, and functionality of web scale discovery services developed by four major library vendors: OCLC, Serials Solutions, EBSCO, and Ex Libris, and provides context and background on each vendor.

Jason talked to us about his research, its practical implications, and what this all means for the library of the present and the library of the future.

You can purchase Jason's issue of Library Technology Reports at the ALA Store, and read Chapter 1 for free on our MetaPress site.

Dan Freeman: So what was the origin of your work? Where did your study start?

Jason Vaughan: In early 2009, the UNLV Libraries held a “Discovery Mini Summit” where library staff could share their ideas about how to enhance information discovery. Over a dozen staff created posters or other presentations to share ideas with their colleagues, and web scale search technology was one of the ideas that surfaced. Search technology is evolving rapidly for our profession, and end users expect systems and services to be easy to use yet comprehensive in scope. They want it all, and they want it now. And while the amount of content in the information universe is growing exponentially, the amount of time students have to conduct research and sift through results isn’t increasing. So, the UNLV Libraries started an internal review in late summer 2009 to take a look at the emerging web scale discovery services. The Dean of Libraries appointed a “Discovery Task Force” and asked me to chair it. The group was made up of staff from all across the Libraries, to help ensure everyone had a voice at the table. A web scale discovery service has implications for both “front of the house” and “back of the house” library operations. Naturally, our first order of business was to firmly grasp what the concept of web scale discovery was – we weren’t even using that phrase at the beginning, nor, necessarily, was the library profession. We took a look at the vendor marketplace and identified only two services which had been commercially released – OCLC WorldCat Local and Serials Solutions Summon. We learned of three others that were under development and scheduled for release in 2010. Given the infancy of these new services, and the associated lack of any substantive scholarship, this looked like a great topic to write about to help other libraries that were doubtless conducting their own evaluations, or planning to in the near future. There’s a lot to think about in considering these services, and there was just nothing out there, no recipe, no cookbook.

DF: OK, so for library systems lightweights, can you give a quick explanation of Web Scale Discovery Services?

JV:  In my view, and put succinctly, it’s a combination of technology and content that holds great potential for those conducting research.  Library vendors have entered into agreements with publishers and aggregators to pre-index content metadata and/or the fulltext.  In a sense, web scale discovery is a natural – but fundamental – evolution.  First were OPACs, which generally provide access to things such as books and journal titles (not journal articles).  “Next generation library catalogs” arrived on the scene around the middle of the last decade.  Such systems took things quite a bit further – in terms of interface design and content covered.  The interfaces were built on more open technologies, and included design cues and features users have come to expect – like faceted browsing.  In addition, these next generation catalogs often had the capacity to harvest other local collections into the same interface – like a library or institution’s digital collections and institutional repository materials.  Now on the scene we have web scale discovery services – I think of these as the real heavyweights.  Considering that the majority of these services were just commercially released in 2010, they are still very new.  These services provide a common interface, receptive to the end user, providing access to both locally hosted content AND a tremendous amount of remotely hosted content – to the journal article level, not just the journal title level, or to the newspaper article level and not just the newspaper title level.  So, we’re talking hundreds of millions of items; in fact, a few of the services claim their centralized index has now surpassed a half billion items, and that’s before any local content is incorporated.  
So you now have a single interface – one search box, one set of relevancy-ranked results – that covers not only your local resources, but also a lot of content that users are used to getting from individual journal searches or A&I and fulltext databases. And given that most of these services harvest and pre-index the content ahead of time into a large centralized index, searches are very fast and lend themselves well to relevancy ranking.

DF: So what about these services is so transformative? 

JV: There’s been a lot of substantive, survey-based research about what researchers want – from reputable organizations like OCLC, the Library of Congress, and Ithaka. No surprise to anyone – students today love Google. It’s often their first stop in research, and, more often than we’d like to admit, their last. Google is great, but by no means should it be considered the alpha and omega for a search. Libraries value accurate, comprehensive information, and also value instilling information literacy skills in their students – so they can be lifelong learners and separate the wheat from the chaff. These web scale discovery tools have a lot to offer. First, the vendors have done quite a few usability studies to help inform their interface design. These tools offer desirable elements like faceted browsing, shopping carts, a multitude of export options, and pretty seamless delivery of fulltext content in a lot of cases. So, they are easy to use, and that’s important if you want something to get used and adopted. Second, the content searched by these new services is often “vetted” content, at least compared to what might come up through a search using a regular web search engine. A large part of the content indexed by these new services consists of scholarly journal articles, content from reputable open access repositories, primary source newspaper articles, ebooks, and of course the books, digital collections, and so forth held at the local institution. Most of these discovery tools have a box or facet you can check to limit results to only peer reviewed or scholarly publications. So overall, the quality and relevance of materials retrieved by these new tools are likely higher than those of materials retrieved in a generic query using a standard web search engine. So, in short, it’s an easy to use service that’s searching a huge amount of high(er) quality content.
Quite frankly, it’s what libraries have long wanted, as evidenced by the trend of federated search technologies which are now over a decade old.  Those technologies offered a lot of promise, but experience (and research) has shown they have quite a few drawbacks, such as having to maintain sources and targets, slow response times, and poor relevancy ranking capabilities.  For many institutions, those drawbacks outweighed the potential benefit.  These new web scale discovery services alleviate many of the shortcomings exhibited by the older federated search technologies. 

DF: So in a practical, day-to-day sense, how would this change library services?

JV: I can see several changes. From the end user perspective, these tools will hopefully help the student or lifelong learner, and that’s one of the main reasons we’re all here. Libraries have often presented end users with many confusing avenues in their quest to find information. You have the online catalog, maybe a digital collection management system, and maybe an institutional repository. Some libraries may be running both a legacy catalog and a newer next generation catalog side by side. Many libraries have webpages listing out dozens if not hundreds of databases in an A-Z list or organized by topic. Many libraries subscribe to and host a journal title A-Z list of thousands if not tens of thousands of electronic journals. It can really be intimidating, and finding what you’re looking for can take some time. From a day to day perspective, I would hope libraries would think of this as a new, powerful tool in their arsenal. It’s not going to stop students from using Google, nor should it. But the discovery services can be A STOP on their information quest. Depending on how well marketing efforts work, such services may be a first stop, at least for some students. Reference staff will still need to maintain proficiency in all the other avenues of information discovery. Web scale discovery tools, at least at this stage, won’t replace the hundreds of databases, the online catalog, etc. But they definitely have a place, and it’ll be important for librarians to learn over time what place that is. For example, in an academic environment, perhaps the web scale discovery tools are a good first stop for many undergraduate research needs. For a graduate student in – pick a field – engineering – there may be an engineering database that’s a more appropriate first stop. So, librarians will need to understand what role the discovery service plays, and this role can vary depending on the user or the research need.

There may also be some changes in other library operations, but the jury is out.  ILL requests may go up as the link resolver (utilized as part of the discovery service) surfaces materials not owned by the library.  Cataloging staff may change some practices, such as which fields are utilized in the ILS or digital collections record, given that some fields may be harvested and incorporated into the discovery service index and interface, and others not. 

Longer term and beyond the day to day, from a collection development viewpoint, it’ll be interesting to see if these new services eventually lead to libraries cancelling some of their existing databases or other resources. For any given journal article, the article can often be sourced from a variety of resources – this publisher, that aggregator, etc. Acknowledging that, there is the potential that the library may be able to trim some of its existing subscriptions and save some money. While in many cases it’s challenging to do, why buy or license the same content multiple times if you can avoid it?

DF: In your study, what were the key differences you found between the different vendors?

JV: While vendors may quibble, I think ultimately there are perhaps more similarities between the services than differences. That said, especially at this stage, there are quite a few differences, and I’ll take just a moment to highlight a few of them. As these products evolve, differences will likely shrink – just like differences have shrunk with the well established integrated library systems. The established integrated library systems all basically do the same thing; the vendors learn from their competitors, and when one product evolves, the others follow, with the result that things generally stay on a level playing field.

For web scale discovery, the amount of content indexed is, of course, important. Generally all vendors now claim indexes numbering in the hundreds of millions of items. The two services out of the gate early (WorldCat Local and Summon) have a head start on the amount of content. The services released later are catching up, and over time, things will probably equalize – with the indexes of all services generally growing at the same rate as newly published content is added. But for today, the size of each index is a differentiator.

Another difference is what descriptive content the discovery service vendors may be getting from the various publishers and aggregators.  Some may be getting more detailed metadata, some may have greater access to index the fulltext.  Regardless of what they get from the publishers, some vendors have in house staff who carry out further work on the metadata – enriching the records even further to optimize discovery – in the hopes that what’s retrieved will be more relevant for a given search. 

As you might expect, components originating from the same vendor may integrate a bit more seamlessly with each other.  For example, a customer purchasing an ILS and a discovery service from the same vendor may find things are a bit better integrated, maybe for the end user display, or maybe with some backend staff workflow.  That said, these new discovery services are more open compared to the turnkey ILS systems of a generation ago (and which many libraries are likely still using).  A library would probably be doing itself a disservice if it overemphasized the importance of getting a discovery service from a particular vendor just because it already had, for example, an ILS from that same vendor.  Ask the questions and do the homework, but don’t lock yourself into a predetermined selection.  This ties into another point, and that’s how customizable these services are at the local library level.  There is quite a bit of variation here.  While each service can be customized to a degree, some services can be greatly customized, in terms of the interface, what functionality is offered, incorporation of widgets, and so on.  In some cases, APIs exist which allow the local library, if they wish, to create and design their own interface from scratch, and populate the results, in part, from the discovery index.  This isn’t just vapor – there are other libraries that have already done this, such as North Carolina State.  Of course, libraries vary broadly – some will likely have the staffing levels to design and maintain customized interfaces, and others won’t, preferring a more out of the box solution. 

Many of the interface design elements and end user capabilities are common across all services. Still, there are differences, some apparent, some subtle. Among the apparent ones, some discovery services offer social community features like user tagging and reviews, some don’t. Some offer optional user accounts allowing users to save materials for later retrieval. Perhaps the majority of differences are quite subtle. Take faceted browsing, for example. They all have it, but there are some differences when you take a closer look. Some (not all) allow the library staff to define new facet categories. Some allow you to choose which out of the box facet categories appear in the interface, or the position of the facet category in the facet list. Some provide the number of matching results for each facet category in parentheses from the brief results view, some don’t. Some have a more granular “include/exclude” facet refinement capability. Some provide more detailed statistics on which facet categories users are clicking on, to help inform design decisions. So, all of these differences, and that’s just with faceted browsing.

And of course, pricing is a big variable. At UNLV, we obtained price quotes from five vendors, and while those quotes are confidential, I can say that while there was some level of consistency, there were definitely some outliers as well, whether looking at it from a flat dollar amount or as a percentage difference. Considering that these services are subscription services – an ongoing cost – pricing can be a big deal for some if not many libraries.

DF: Since there are a lot of libraries out there who can't upgrade their systems right now because of budget problems, what are the key points they should know about Web Scale Discovery Services for making decisions a few years down the road?

JV: That’s a good question and a bit tricky to answer. On a positive note, these services are new, and evolving rapidly. So if a library has to wait a few years to adopt such a service, at least they know that when they can afford to take a look, the discovery services should be better than ever. If a library were currently evaluating some other large system or service – say, something having to do with journal rights management, or a full-scale migration to a different ILS vendor – then they would do themselves a favor by thinking about how web scale discovery may fit into their future environment. There may be some efficiencies gained – either in dollars saved or staff workflow – by considering, sooner rather than later, the overall larger discovery environment and the various components that enable this magic to happen. This magic often includes the underlying local content repositories (such as the ILS), link resolvers, and rights management systems such as an electronic resource management system or journal holdings service. Perhaps there is a discount if multiple products are purchased (or subscribed to) at once from the same vendor (such as an ILS and discovery service from the same vendor). All that said, I stick to what I said earlier – libraries shouldn’t feel locked into one vendor just because they already have products from that vendor.

As far as collection development goes, if libraries are considering entering into any new multi-year, large dollar amount agreements with new content providers, they may want to ask the provider if their content is indexed by one or more of the discovery services. If not, gauge whether it’s on the content provider’s roadmap, or if they appear fundamentally averse to the new discovery services. Ideally, content the library acquires/licenses through existing or newly purchased/licensed databases and publisher packages would be included in a discovery service, whether from the same provider or another source.

Finally, major redesigns of a library website often involve a lot of time and a lot of staff – it could easily take a year as you work your way through the various design, vetting, prototyping, publicity, and feedback stages.  If a library thinks they may be acquiring a web scale discovery service in the next year or two, it’s not too early to consider how such a service may fit into – and impact – the website’s design.  What webpages would you embed the search box on?  Would you tweak your mobile library website to incorporate the discovery service search box?  What marketing might you put in place to help advertise the service?  Would you need to create any tutorials or tweak bibliographic instruction classes?  On a final note, catalogers and metadata experts may want to assess the record quality in their existing systems, as libraries may often find the quality can be uneven.  These new discovery services harvest information from your existing underlying content repositories – and can expose errors or discrepancies that weren’t always highly visible.  So, some cleanup projects may be warranted.