Out of the Secret Garden: The RDA/DC Initiative

By Karen G. Schneider

(If you're at ALA Annual Conference while you're reading this, the RDA Update Forum is Saturday, June 23, 4:00-5:30 at WCC 206.)

"Libraries have lost their place as primary information providers, surpassed by more agile (and in many cases wealthier) purveyors of digital information delivery services. Although libraries still manage materials that are not available elsewhere, the library's approach to user service and the user interface is not competing successfully against services like Amazon or Google."

-- Karen Coyle and Diane Hillmann, "Resource Description and Access: Cataloging Rules for the 20th Century"

You may not think you care about AACR2 (Anglo-American Cataloguing Rules, Second Edition) or its successor, RDA (Resource Description and Access). That may seem like boring old-school stuff, not nearly as fun or glitzy as romping in Second Life or, as I am wont to do, posting the details of your afternoon snack on Twitter.

But the next time you complain about the limitations of library data—the gazillions of records we have created about the physical items in our libraries—and wonder why none of the cool new applications leverage the millions of library records shared worldwide, or why your expensive catalog can't integrate with a nifty new social software tool, or you wonder why there's no Google mashup to connect readers and books, consider this: to a large extent, it's because our data suck.

Not only that, it's our fault our data suck. Fixing this problem is not simply a matter of pointing at library vendors and saying, "Do better!" In many cases, vendors aren't doing too badly, considering what they have to deal with: our funky, inexplicable, old-fashioned, library-specific data that are the product of our cataloging rules.

We have built a mighty empire filled with standards and rules such as AACR2 and MARC that, long before the rest of the world was online, allowed us to do some amazing things within and across our institutions. If you've ever watched an interlibrary loan librarian buzz through hundreds of libraries with the flick of a wrist, hunting down a book for a patron, you have some idea of how important and powerful our standards have been for us.

Given the potency of library data, it's not surprising that there are many communities online that have expressed interest in our enormous data sets. But we can't share our data with them (let alone explain it to ourselves half the time), because our library data are plagued by an aging, conflicted, poorly elucidated witch's cauldron of practices that are written down on paper but are not embedded within the structure of our data.

Double, double, toil and trouble

Because of this, cataloging is not so much a science as a dark art, driven by informal, implicit understandings rather than clear schema and vocabularies. MARC, despite its name, is only nominally machine-readable, and is not easily usable within the context of modern programming languages. People outside of library software programming have never seen anything like it. It's not all that human-readable, either, as this 045 field demonstrates:

045 2#$bd186405 $bd186408 

Did you catch that this means May – August, 1864?
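For the curious, here is a minimal sketch in Python of what it takes to unpack those two $b values. It assumes the MARC 21 convention for field 045 subfield $b: an era code (c for B.C., d for C.E.) followed by yyyymmddhh, with the month, day, and hour optional. The names PeriodDate, parse_045_b, and describe are made up for this illustration and are not part of any library.

```python
import calendar
import re
from typing import NamedTuple, Optional


class PeriodDate(NamedTuple):
    """One decoded 045 $b value (hypothetical structure for this sketch)."""
    era: str               # "BCE" or "CE"
    year: int
    month: Optional[int]
    day: Optional[int]
    hour: Optional[int]


# Assumed layout: era code, four-digit year, then optional mm, dd, hh.
_ERA_CODES = {"c": "BCE", "d": "CE"}
_PATTERN = re.compile(r"^([cd])(\d{4})(\d{2})?(\d{2})?(\d{2})?$")


def _to_int(text: Optional[str]) -> Optional[int]:
    return int(text) if text is not None else None


def parse_045_b(value: str) -> PeriodDate:
    """Decode a single $b value such as 'd186405'."""
    match = _PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not a recognizable 045 $b value: {value!r}")
    era, year, month, day, hour = match.groups()
    month_number = _to_int(month)
    if month_number is not None and not 1 <= month_number <= 12:
        raise ValueError(f"month out of range in {value!r}")
    return PeriodDate(_ERA_CODES[era], int(year), month_number,
                      _to_int(day), _to_int(hour))


def describe(date: PeriodDate) -> str:
    """Render a decoded value for people, e.g. 'May 1864 CE'."""
    parts = []
    if date.month is not None:
        parts.append(calendar.month_name[date.month])
    parts.append(str(date.year))
    parts.append(date.era)
    return " ".join(parts)


start = parse_045_b("d186405")
end = parse_045_b("d186408")
print(describe(start), "to", describe(end))
```

The point is not that the decoding is hard, but that nothing in the data itself tells a programmer that "d" means an era or that the digits are a year and a month; that knowledge lives in the documentation and in catalogers' heads.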

Even worse, as Karen Coyle and Diane Hillmann warned us earlier this year in an article with the sotto voce humorous subtitle, "Cataloging Rules for the Twentieth Century," RDA, rather than pushing us to cataloging rules compatible with 21st-century requirements, repeats many of the anachronisms found in earlier editions of AACR.

The most profound limitations of RDA to date have to do with its lack of compatibility with machine-manipulable data--that is, data that can be read and processed by computers. RDA may be ponderous--the latest draft proposes 14 chapters and 4 appendices, with a couple of chapters weighing in at over 120 pages--but like the giant reptiles that died out millions of years ago, it does not make up for its girth with intelligence.

Coyle and Hillmann cite the mixed language for "number of units," pointing out that phrases such as "12 posters" are not easily machine-readable, and that many of the rules are still based on the "linear, card-based model" that, incredibly, continues to be the foundation of modern cataloging. One of the most telling anachronisms in RDA is its continuation of notions such as "primary" and "secondary," which, as Coyle and Hillmann point out, are concepts designed for effective use of space on a 3 x 5 card. What possible relevance do "primary" and "secondary" have in the online world, where all access points are created equal?
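To see what "machine-readable" would mean here, consider a rough Python sketch that splits an extent statement such as "12 posters" into a number and a unit. The Extent class and parse_extent function are hypothetical, and the regular expression handles only the simplest "count, then unit" pattern; a real vocabulary would presumably identify the unit with a controlled term rather than free text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """A count and a unit kept as separate values (hypothetical structure)."""
    count: int
    unit: str


def parse_extent(text: str) -> Extent:
    """Split a statement like '12 posters' into its number and its unit.

    Anything that is not 'digits, whitespace, unit' is rejected, which is
    exactly the trouble with free-text phrases: software has to guess.
    """
    match = re.fullmatch(r"\s*(\d+)\s+(.+?)\s*", text)
    if match is None:
        raise ValueError(f"cannot interpret extent statement: {text!r}")
    return Extent(count=int(match.group(1)), unit=match.group(2))


posters = parse_extent("12 posters")
print(posters.count, posters.unit)
```

Once the count is a number and the unit is a separate value, a program can sort, sum, or filter on either one. With the phrase alone, every application has to reinvent the parsing, and a statement like "twelve posters" already defeats this sketch.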

In other words, RDA keeps library data in a walled garden, barely manipulable by our own complex tools and unusable outside the library community.

Though we can commend our profession for being out there early in the world of online sharing--MARC, in its heyday, was an amazing invention--we have to admit that developers worldwide are not flocking to our obscure, poorly articulated standards. Talk to libraries struggling to implement any "cool tool" from outside the library universe, from Endeca, FAST, and Siderean to something as simple as describing a library record with a URI, and you'll find that we're the odd ones out, trying to fit our square pegs into the world's round holes.

Meanwhile, as software engineers worldwide build applications that acknowledge the typical web user's discovery workflow--which begins with a search engine--we in LibraryLand need to plead, lure, and "educate" people to cross the moat and go through the thick doors of our proprietary library databases--never mind enabling ourselves and others to do powerful and interesting things with our data.

Tunneling out of the walled garden

But on May 1, 2007, the moon was in the seventh house and Jupiter aligned with Mars. At least, that's how it appeared to catalogers and other metadata mavens when they learned that the Dublin Core and RDA communities had agreed to pull library data out of its silo and into the Semantic Web.

That's not exactly how the agreement was described at the meeting, but before I start unfolding the catabiblish (that is, librarian language specific to catalogers), some background information is in order.

The concept of the Semantic Web should come naturally to librarians. Wikipedia (so help me) says that "the semantic web is an evolving extension of the World Wide Web in which web content can be expressed not only in natural language, but also in a form that can be read and used by software agents, thus permitting them to find, share and integrate information more easily."

Web pages are designed to be read by people, not machines. Imagine a child seeking a copy of Harry Potter and the Deathly Hallows at her local library. Let's pretend our library data were unambiguous, explicit, and truly machine-readable. The information about that Potter book, rather than being hidden behind the walls of outdated library lingo, could be read by computers and presented on the screen. In other words, a child looking for a book would be able to search the Web and find that book within the larger context of "I am searching the Web for things that interest me," rather than interrupting her workflow, exiting the Web proper, and entering searches into the library-specific databases we call OPACs.

You may be wondering why the Semantic Web is necessary. Why not just export our catalog to the Web, or make a Web page for every record? But this is where we librarians know something about the universe worth sharing with others. Simply exporting our data to the Web is to turn our back on the very important work catalogers have contributed to librarianship (and really, the world) by thinking structurally about data in the first place. Where the world sees primordial soup, we see well-chiseled points of description. It's not that important that we thought up the "title proper"; it's really significant that we know why it's important to have that data in fields in the first place.

We know order matters. Re-expressing our data so they can be read by the Semantic Web is an avenue for retaining that which is good about our view of data--that metadata and structure are useful and meaningful, enrich the discovery process, and (theoretically) allow us to play well with others--while leaving behind the weak, antiquated, solipsistic characteristics of our encoding practices.

It could well be that positioning RDA so it is compatible with 21st-century standards doesn't just make our data more explicable and usable; it could be what saves us as a profession, by clarifying to the world that we contribute a body of thought to information science that truly matters.

Free Harry Potter

How do we get Harry Potter out of the garden and onto the Web? The two communities (RDA and Dublin Core) have agreed to work together to accomplish the following:

Make our data structure explicit and machine-readable. The RDA people call this "developing an RDA Element Vocabulary." Think of it as "putting our data structure in standard, consistent recipes that computers know how to cook." Right now, even when our standards are in writing, they are not easily used by computers. If you've ever worked in a library run by unwritten rules that were hard to interpret by new staff, you know the problem with not having explicit data structures.

Cataloging is a demanding skill, but we make it even harder than it should be by not being fully explicit about our data sets. Try to find a URL leading to an explicit definition of "title proper." It's all buried in the heads of catalogers, who, brilliant mavens that they are, need to follow the advice of human-computer interaction expert Donald Norman and put their information "in the world"--and not just for human consumption, but also so our data can be understood more broadly, within the framework of the Semantic Web.

Clarify our terms. The catabiblish for this is "expose RDA value vocabularies." We have a lot of very specific and yet undefined language in our cataloging framework. Rather than explaining our language explicitly, we share this knowledge through education and practice, creating impossibly high hurdles for people outside our profession (or for any non-cataloging librarian) to fully understand our terms.

For example, Chapter 3 of the current draft of RDA lists "carriers" such as computer chip cartridge, microfilm slip, and stereograph wheel. But these terms aren't explained or defined, only listed, and their meanings aren't self-evident. People have to be taught to apply these terms properly--the sign of a system that isn't explicit. (It doesn't make us "smart" that so much of our knowledge is implicit and is not formally explicated.) We need to explain what we mean by these and other terms so that others (including the next generation of catalogers--and the next generation of software) can understand them.

Describe what we're trying to do. That's done through developing an Application Profile, or AP, which serves as a kind of letter of instruction for conveying intention, building documentation, and enabling interoperability. An AP declares "which metadata terms an organization, information resource, application, or user community uses in its metadata." The AP doesn't tell others how to use our data elements; it just makes them reusable, ensuring that when we go to exchange data we understand the basics behind each other's records.

I'm not going to go in depth about how the AP should be based on FRBR (Functional Requirements for Bibliographic Records) and FRAD (Functional Requirements for Authority Data), because if you're a cataloger you probably already "get it," and if you aren't, your head will explode. But part of the reason we're moving from AACR2 in the first place is that our rules and practices stand in the way of doing some things that have become important since the late 1940s, such as making it easy for OPAC displays to group like items, so that a book will appear next to its CD, DVD, large print, and online versions.

Not everyone is wild about Harry

For those of us not acquainted with the cataloging world, moving RDA to a Semantic Web model doesn't sound threatening. Isn't this an improvement? Don't we want to play well with others? But the idea of change stirs fear in some hearts (some of them fairly highly placed in the ALA hierarchy, by the way), and explains why the May 1 RDA/DC agreement was historic.

One rumor is that the plan is to dumb down library data and put catalogers out of work. The Dublin Core Metadata Initiative (DCMI) is partly to blame for this misconception; people are more familiar with the famous "15 elements" used for Simple Dublin Core, and that has raised fears that the ulterior motive is to move us to a simple cataloging model based around this limited element set. 

But the reality is that Dublin Core can support very robust schemas, and Dublin Core is in many ways incidental to this discussion anyway. It's simply a building-block model for getting our cataloging language modernized, structured, explicit, and usable by others. The significance of the RDA/DC agreement is that the Dublin Core Metadata Initiative has been very involved in attempting to think through what interoperability really means, including but going past the Semantic Web. It's simple but powerful: in sum, it pretty much boils down to formal expression to limit the ambiguity of language, and URIs for identification.
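A minimal sketch of that idea, in Python: each statement about a book is a (subject, property, value) triple, and the properties are identified by URIs rather than by local shorthand. The two property URIs are real Dublin Core terms; the record URI under example.org and the helper values_for are invented for illustration, and nothing here is meant as how RDA or DCMI says data should be stored.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Real DCMI Metadata Terms property URIs.
DCTERMS_TITLE = "http://purl.org/dc/terms/title"
DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"

# Hypothetical identifier for one library record (example.org is a placeholder).
BOOK = "http://example.org/library/record/12345"

# Each statement names its subject and its property by URI, so software
# does not have to know local field conventions to interpret it.
triples: List[Triple] = [
    (BOOK, DCTERMS_TITLE, "Harry Potter and the Deathly Hallows"),
    (BOOK, DCTERMS_CREATOR, "Rowling, J. K."),
]


def values_for(statements: List[Triple], subject: str, prop: str) -> List[str]:
    """Return every value stated for the given subject and property."""
    return [value for s, p, value in statements if s == subject and p == prop]


print(values_for(triples, BOOK, DCTERMS_TITLE))
```

Anyone, inside or outside a library, can look up what http://purl.org/dc/terms/title means by following the URI. That is the "formal expression" half of the bargain; the URI for the record itself is the "identification" half.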

Hug a cataloger today

The key here is to understand that the RDA/DC agreement, if it leads to the actions above--and people such as Diane Hillmann and Gordon Dunsire are working top-speed on this initiative--will ultimately make it possible to get over that moat and get our data out onto the Web in new, interesting, findable, and user-friendly ways. We can do that without abandoning our classic commitment to enrichment of information, and in fact the work will demonstrate, as a proof of concept, why we are committed to these practices in the first place. Whether we succeed or fail in this effort may well determine the future of our profession.