Libraries, Linked Data and the Semantic Web: Positioning Our Catalogs to Participate in the 21st-Century Global Information Marketplace
Jane Rosario, University of California-Berkeley
The symposium was held Friday, January 20, 2012. All presentation slides are online at http://alamw12.scheduler.ala.org/node/22
Libraries in the Web: Weaving a Web of Data
Eric Miller, Cofounder and President of Zepheira
In his opening address, Miller explained that the idea of the semantic web started in 1989, with a memo written by Sir Tim Berners-Lee to his boss at CERN entitled “Information Management: A Proposal.” In effect, Berners-Lee was trying to describe the web before we understood it, using the web as an information manager. Most people think of the web as having a huge technological impact, but its social impact is far greater. The web is, according to Miller, the most successful commerce and communication platform ever conceived.
Currently, most of the web consists of written pages and links, designed for direct human consumption. Linking is limited, and data is hidden. To create the semantic web, we need to move to contextual architecture in a platform that allows graceful evolution into this new paradigm. If we share data by using the semantic web, connecting data and creating more linked data, our information will have more exposure; data can be left where it resides and connections made using the web as architecture.
RDF (Resource Description Framework) is a common model for representing web data. “Us + identifiers + RDF” creates the semantic web. We create context around a bit of data using identifiers, and that data can be used over and over, in different ways; it is “wrapping and exposing” data to reassemble and combine it to meet individual user needs. The slogan of the semantic web is, “I want your data, my way.” Right now, data resides in disconnected silos. The semantic web will allow users to pull data from many different sources and combine it in new ways, eliminating silos.
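The idea of pulling data out of separate silos and recombining it can be sketched in a few lines of Python. This is an illustration only, not something shown at the symposium: the two sources, the `example.org` identifiers, and the property names are all hypothetical, and plain tuples stand in for RDF statements.

```python
# Hypothetical data from two separate "silos". Each statement is a
# (thing, property, value) tuple; every identifier and value is made up.
BOOK = "http://example.org/book/123"

library_catalog = {
    (BOOK, "title", "Example Book"),
    (BOOK, "heldBy", "http://example.org/library/main"),
}

bookseller_site = {
    (BOOK, "title", "Example Book"),
    (BOOK, "inStock", "true"),
}


def combine(*sources):
    """Merge statements from any number of sources into one set."""
    merged = set()
    for source in sources:
        merged |= source
    return merged


def describe(statements, identifier):
    """Collect everything said about one identifier, grouped by property."""
    description = {}
    for thing, prop, value in statements:
        if thing == identifier:
            description.setdefault(prop, set()).add(value)
    return description


merged = combine(library_catalog, bookseller_site)
view = describe(merged, BOOK)
```

Because both sources use the same identifier for the book, the duplicate title statement collapses into one and `view` holds the title, holding library, and stock status together. The shared identifier, rather than any shared record format, is what makes the combination possible.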
Of course, the human element is the important part in assigning identifiers to data. Humans see and understand value where computers simply cannot. The communities of web users value trust in sources and persistence in identifiers. Libraries are one of the most trusted sources. Thus, we are in a good position in terms of viability.
The Library of Congress is currently working on a new bibliographic framework. There has been an increase in funding for this kind of work, as demonstrated by the European Union investing in the Open Data Strategy for Europe (http://europa.eu/rapid/pressReleasesAction.do?reference=IP/11/1524). There is also an increased understanding of the concept of “Return On Investment” (ROI). An example is the retailer Best Buy’s web site: by exposing its data in a way that lets consumers combine it for their own uses, it has seen a 30 percent increase in searches (although not necessarily sales). The library community must seriously consider ROI.
The semantic web extends web architecture to express content; it allows for expression of different points of view. More communities are “surfacing” data, and the trust of connections will be increasingly important. We must think beyond the record to the underlying data and services we manage, take advantage of our cooperative nature, and extend this to our patrons. We must identify sets of data to expose as linked data, foster a discussion about licenses, identify and expose trusted identifiers, develop policies, and share infrastructure. We must document our best practices and guidelines, and continue to clearly identify the benefits to our administrations and patrons.
Thinking Beyond Our Collections: Making Our Models Linked and Linkable
Ross Singer, Interoperability and Open Standard Champion at Talis
Tim Berners-Lee posited four rules for the semantic web:
- Use URIs (Uniform Resource Identifiers, e.g., ISBNs, ISSNs, LCCNs) as names
- Use HTTP (HyperText Transfer Protocol) URIs so people can look them up
- Provide useful information using standards
- Include links to other URIs
RDF is built from statements called triples, each constructed as “subject-predicate-object.” An aggregation of triples is a “graph.” Tools for creating triples include Dublin Core (http://dublincore.org/), Friend of a Friend (FOAF) (http://www.foaf-project.org/), Bibliontology (BIBO) (http://www.bibliontology.com/), SKOS (Simple Knowledge Organization System) (http://www.w3.org/2004/02/skos/), Creative Commons (http://creativecommons.org/), Music Ontology (http://musicontology.com/), and more. RDF is unambiguous and decentralized: there is no notion of static “record data”; data can be distributed everywhere. Using RDF, there will be no provenance of data, only triples. We must know what we are describing and who the intended audience is, and anticipate how the information will be consumed.
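A minimal sketch of the subject-predicate-object structure, written in plain Python rather than with an RDF library. The Dublin Core and FOAF namespace URIs are the published ones; the `example.org` identifiers, the sample values, and the helper names `match` and `creator_names` are assumptions made for illustration. As a simplification, URIs and literal values are both stored as strings.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Published vocabulary namespaces.
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical identifiers, used only for this example.
BOOK = "http://example.org/book/1"
AUTHOR = "http://example.org/person/1"

# A graph is simply an aggregation (here, a set) of triples.
graph = {
    Triple(BOOK, DCTERMS + "title", "An Example Title"),
    Triple(BOOK, DCTERMS + "creator", AUTHOR),
    Triple(AUTHOR, FOAF + "name", "A. N. Author"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def creator_names(graph, book):
    """Follow book -> creator -> name across two triples."""
    names = []
    for link in match(graph, subject=book, predicate=DCTERMS + "creator"):
        for name in match(graph, subject=link.obj, predicate=FOAF + "name"):
            names.append(name.obj)
    return sorted(names)
```

Calling `creator_names(graph, BOOK)` walks from the book to its creator and then to the creator's name. Each statement points one way only: the graph says the book has a creator, so `match(graph, subject=AUTHOR, predicate=DCTERMS + "creator")` finds nothing unless a separate inverse statement is added.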
Singer presented two case studies of data models: IFLA FRBR and SKOS concepts. It is unclear which model will gain mainstream acceptance. Singer ended with examples of datasets to consider modeling around: DBpedia (http://dbpedia.org), Geonames (geonames.org), Musicbrainz (musicbrainz.org), Open Library (http://openlibrary.org/), Bibliontology, and schema.org.
Singer concluded that linked data gives us potential to integrate into the larger web, but we must not insist on incompatible models.
Are We There Yet?
Karen Coyle, Consultant
Coyle addressed the fear and panic in the library world at the prospect of moving to a new data format. How do we get from here to there? The good news: “It’s doable.” Libraries have already been through the transition from card catalogs to MARC format. But MARC was a mark-up of text deriving from catalog cards rather than a new data format altogether. We are now faced with FRBR (Functional Requirements for Bibliographic Records) as a new data format, and we will have to think differently. Coyle listed the “6 stars” for the new format:
- Data, not text
- Identifiers for things
- Statements, not records
- Machine-readable schema
- Machine-readable lists
- Open access on the web
Data is for machines to read; text is for humans to read. Data is much easier to input, which saves time and is more consistent. In a MARC record, the fixed fields are data, and it is important to fill these in. (Some catalogers routinely skip them.) The body of the bibliographic record is for humans to read; it is not important to machine processing. Nontext identifiers are essential. A human can distinguish between dog (canine) and dog (hotdog), but machines cannot. Identifiers cannot be language-based. In a statement, every bit of information is complete. It is strong. Records are fragile; remove any information and it is useless. There is much descriptive data that exists for people, places, events, topics, resources, physical formats, and extents on the web, none of which is unique to library material. Librarians do not need to invent this data. (Libraries could use bibliographic data already in Amazon, for example.) We will need controlled vocabularies in the semantic web and librarians are experts at this. Data on the semantic web will be open, but you will be able to choose what content to make available to the public. You will be able to control it and assure privacy. (Banks do this now.)
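The point about nontext identifiers can be illustrated with a small Python sketch. The two `example.org` concept identifiers, the label table, and the function names are hypothetical; a real controlled vocabulary would be far larger and would typically be published in a scheme such as SKOS.

```python
# Hypothetical, language-neutral identifiers for two different concepts.
DOG_ANIMAL = "http://example.org/concept/0001"
DOG_FOOD = "http://example.org/concept/0002"

# Human-readable labels hang off the identifier, one per language.
labels = {
    DOG_ANIMAL: {"en": "dog", "fr": "chien"},
    DOG_FOOD: {"en": "dog", "fr": "hot-dog"},
}


def concepts_for(text, lang):
    """Return every concept whose label in `lang` equals `text`."""
    return sorted(
        concept for concept, names in labels.items()
        if names.get(lang) == text
    )


def label_for(concept, lang, fallback="en"):
    """Pick a display label for a concept, falling back to English."""
    names = labels[concept]
    return names.get(lang, names[fallback])
```

The English string "dog" matches both concepts, so text alone cannot tell a machine which one is meant, while each identifier resolves to exactly one concept and can be displayed in whatever language the user reads.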
Are we there yet? Not really, but we are at a tipping point. As the Library of Congress creates the new bibliographic format, we will be closer. Changes will be implemented gradually. MARC was brilliant; it carried the previous data over from cards to computers. With the semantic web, we cannot carry over old formats; we must think differently. We must come up with new standards, such as RDA, and RDA is only a step, not an endpoint. This is a moment of opportunity. Let’s grab it.
Of Cataloging and Context: Metadata and Metadata Experts on the Linked Data Web
Corey Harper, Metadata Services Librarian at New York University Library
Harper quoted Emmanuelle Bermès: “Don’t ‘just’ publish data, try to think about actual uses, for end users.” This should be our goal as we work towards linked data on the semantic web. Harper gave examples of what can be accomplished using linked data, such as:
- Thinkbase (http://Thinkbase.cs.auckland.ac.nz), which has a user interface that allows users to manage Freebase data (http://www.freebase.com/)
- Europeana (http://thedatahub.org/dataset/europeana-lod), which currently contains metadata on 3.5 million texts, images, videos and sounds, contributed from providers encompassing around 300 cultural institutions from 17 countries
- the Linked Open CoPAC and Archives Hub (LOCAH) (http://blogs.ukoln.ac.uk/locah/), which is a project working to make data from Copac (http://copac.ac.uk/) and the Archives Hub (http://archiveshub.ac.uk/) available as linked data
- the Digital Public Library of America (DPLA) (http://cyber.law.harvard.edu/research/dpla)
Linked Heritage (http://www.linkedheritage.eu/) allows small museum data to be input into Europeana, where you can build online exhibits using Viewshare (http://viewshare.org/). These projects are examples of a distributed information ecosystem creating navigable, browsable information landscapes, which in turn build relationships between data, weave context, and enrich the user’s experience.
Harper imagines new roles for catalogers in data management, analysis, and mapping. Catalogers and coders should be working together. This issue will be discussed at several conferences this year, including Code4Lib and ALA. Librarians may want to look into “Code Year” from Codecademy (www.codecademy.com). It is a free course designed to help lay people (including librarians) become familiar with code. You can also follow #catcode on Twitter.
Harper gave Jobs4Lib (https://vimeo.com/32848765) as an example of a successful implementation of semantic principles: one can take information from job listing web sites from all over and combine data as needed.
Breaking the Catalog: Navigating Books on Shelves
Peter Brantley, Director of the BookServer Project at the Internet Archive
Brantley, who is struggling with linked data and is ambivalent about it, gave a more skeptical perspective. He questioned Miller’s example (in the first presentation) about Best Buy using linked data on its web site successfully as a business model, as the company is currently on the brink of bankruptcy.
Brantley noted that bibliographic data traditionally underperforms, leading to poor discovery. He gave an example of searching the Barnes & Noble web site for the word “Lincoln.” The search pulled up nothing that he wanted to find, which was the recently published book, Killing Lincoln. One of linked data’s challenges is contributing to discovery, which Brantley described as “metadata contextualized by human desire.” An “open culture” search (like the one on the Barnes & Noble database) is ignorant of the context of user desires. In contrast, when Brantley searched Amazon for the word “Lincoln,” he got what he wanted, Killing Lincoln, as the top link. Amazon increases relevancy by incorporating recent retrievals. Brantley noted that no one thinks linked data is a panacea, but that it can be useful in some contexts.
Linked open data domains assume unbounded sharing. But the issue of rights is complex and unresolved; data may be restricted “downstream.” Brantley stated that for linked data to work well, “we need to aggregate and hold data on a single network platform to the greatest possible extent because that will drive use and obtain intentionality information.” Brantley would like to see a common open platform for linked data developed, and feels that the most powerful opportunity for linked data might be in building central repositories.
In the question-and-answer session that followed, the speakers covered several issues.
Q: Will the same identifiers be used for everything?
A: No. Independent communities would use their own identifiers, but identifiers from different communities can relate to each other. After all, we do not have only one identifier for data now; we use ISBNs, ISSNs, etc.
Q: Can RDF triples go in one or two directions?
A: RDF triples can only go in one direction.
Q: How strictly should we adhere to FRBR?
A: The FRBR structure is completely valid, but should be in the background, not foreground, of description; one should not strictly adhere to it to the point where it does not make sense to the users.
Q: Do we have best practices, or are we experimenting?
A: The short answer is that we do not. We need to develop them. Our modeling will reflect the assets we care most about, but in the larger community, our description must maximize the chance for the aggregation of our data. We must document patterns and reach out to other communities for advice. We will be outside the library comfort zone; we cannot know what it will look like. If our assumptions are wrong, they will be very hard to revise.
Q: Why can’t we simply use MARC code? MARC XML?
A: MARC worked well for forty-plus years, but the transport syntax and interchange standards we face now are new. MARC format should not be confused with an underlying data model. We could use MARC to create linked data, but the problem is that the data will not make sense from a machine’s point of view. There is not a huge gap; true MARC is abstracted away from us; what we work with is mostly a text representation of MARC.
Q: Can these new ideas fit into the current workflow of already overtaxed librarians in a form we can integrate and use? The models are confusing.
A: This is only the beginning; we are not there yet. No one knows what the RDA/cataloging interface will look like.
Q: How is it practical to implement these changes in hard economic times?
A: Crossing communities is driving down costs; codifying to the cloud drives down costs.
Q: What does “open” mean? And what about licensing?
A: Data licensing is in its infancy and the issue is complex. Issues are still not clear in the courts. Linked data does not require openness; some data may be released as public, some kept securely private. All-or-nothing protection is a “record construct.”
To close the symposium, each panelist responded to the question, “What is most important for the library community to do in the next six months?”
Coyle: Make sure the bibliographic format is open.
Brantley: Not sure. The conversation should be broader. Libraries will lose a lot of control over their metadata.
Singer: What, how, and what we should not all be describing. How we relate to our communities. Do we all catalog the same way? Push redundant work to the aggregate level.
Harper: Engage special collections and museums, communities we do not usually engage. Identify gaps and needs.
Miller: Think holistically across libraries, archives, and museums. Identify sets of materials for early exposure. Show the value of linked data and its benefits to patrons.