Metadata, Schema.Org, and Getting Your Digital Collection Noticed

By Patrick Hogan | Editor's Note: This post is an excerpt from Improving the Visibility and Use of Digital Repositories Through SEO, by Kenning Arlitsch and Patrick S. O'Brien. The authors, along with Montana State colleagues Jason Clark and Scott Young, will be teaching the online course/workshop Search Engine Optimization (SEO) for Libraries, which starts July 17.

Metadata schemas are powerful frameworks for organizing content, and libraries have long used them to describe their holdings (think MARC). Numerous schemas exist for academic disciplines: CDWA is used for art, Darwin Core for biology, EML for ecology, DDI for social sciences, and so on. Dublin Core is probably the most heavily used schema in digital libraries, and it is perfectly adequate for many applications. The problem with any metadata schema, however, is that most website developers don’t use one at all, and search engines can’t count on the metadata being applied consistently by those who do. The result is that general-purpose search engines like Google tend not to use the metadata even where it is applied appropriately.

Some specialty engines, like Google Scholar, do make extensive use of metadata. Google Scholar, however, wants metadata schemas that can express bibliographic citations specifically and accurately, which Dublin Core does not do very well.

Because search engines crawl the web pages that are generated from databases (rather than crawling the databases themselves), your carefully applied metadata inside the database will not even be seen by search engines unless you write scripts that output the metadata fields and their values as HTML meta tags. It is crucial to understand that any metadata offered to search engines must be recognizable as part of a schema and must be machine-readable, which is to say that the search engine must be able to parse the metadata accurately. For example, if you enter a bibliographic citation into a single metadata field, the search engine probably won’t know how to distinguish the article title from the journal title, or the volume from the issue number. For the search engine to read those citations effectively, each part of the citation must have its own field. Making sure metadata is machine-readable requires patterns and consistency, which will also prepare it for transformation to other schemas. This is far more important than picking any single metadata schema.
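As a rough illustration of the kind of script described above, the following Python sketch turns a record with one field per citation part into HTML meta tags. The record, its field names, and the render_meta_tags function are hypothetical, invented for this example. The citation_* tag names follow the convention that Google Scholar's inclusion guidelines describe, but treat the mapping as an assumption to check against the current guidelines and your own repository software.

```python
from html import escape

# Hypothetical repository record: every part of the citation has its own field.
record = {
    "title": "An Example Article",
    "authors": ["Doe, Jane", "Roe, Richard"],
    "journal_title": "Journal of Examples",
    "volume": "12",
    "issue": "3",
    "publication_date": "2012/05/01",
}

# Assumed mapping from our (hypothetical) field names to citation_* tag names.
FIELD_TO_TAG = {
    "title": "citation_title",
    "authors": "citation_author",
    "journal_title": "citation_journal_title",
    "volume": "citation_volume",
    "issue": "citation_issue",
    "publication_date": "citation_publication_date",
}


def render_meta_tags(record):
    """Return one <meta> tag per field value, skipping fields that are absent."""
    lines = []
    for field, tag in FIELD_TO_TAG.items():
        value = record.get(field)
        if value is None:
            continue
        # Repeatable fields (such as authors) produce one tag per value.
        values = value if isinstance(value, list) else [value]
        for v in values:
            content = escape(v, quote=True)  # keep quotes and ampersands valid in HTML
            lines.append(f'<meta name="{tag}" content="{content}">')
    return "\n".join(lines)


print(render_meta_tags(record))
```

The output would be placed in the head of the item's web page. Because the title, journal title, volume, and issue each arrive in a separately named tag, a crawler does not have to guess where one part of the citation ends and the next begins.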

Introducing Schema.org

We invest a great deal of time and money creating digital collections, and we usually create web pages that describe the collection’s purpose, what it contains, its contributors, and so on, to give visitors some context they can use to understand the collection. We also take great pains in creating metadata that describe each object in the collection to give it meaning and allow users to reference or discuss the item. While humans can understand and associate the concepts they read, search engines have a very limited capacity for interpreting the meaning of the information we so painstakingly provide.

To help search engines understand the context and meaning of our digital objects, we must provide structure to our content using additional tags in our HTML. These tags will say to search engines directly, for example, “this information describes a specific digital object as a scholarly paper, written by an author who works at an academic institution, published by an organization on a certain date.” Sounds easy enough, but communicating with a machine requires an up-front agreement on the specific language and precise vocabulary being used to communicate. The word “bloody” has very different meanings to a person raised in the United States and a person raised in the United Kingdom. Search engines do not understand the regional variations, sarcasm, humor, hand gestures, facial expressions, body language, tone of voice, inflection, and so on that humans rely on heavily to communicate meaning.

Enter schema.org. In 2011 Google, Bing, Yandex (the largest Russian search engine), and Yahoo! “joined forces to create a common set of schemas for structured data markup on web pages” with the aim of helping search engines to better understand websites. Originally, schema.org was planned to use only HTML microdata as the mechanism, or language, for implementing schema.org structured data vocabularies. But it has also recently added support for RDFa as an alternative “language” that developers using “RDF-based tools and Linked Data” can use to implement the schema.org vocabulary.
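To make the idea concrete, here is a minimal Python sketch of a script that wraps an item's metadata in HTML microdata using schema.org vocabulary. It expresses the kind of statement described earlier: a scholarly paper, written by an author affiliated with an institution, published by an organization on a certain date. The item values and the render_microdata function are hypothetical. The types and properties used (ScholarlyArticle, Person, Organization, name, author, affiliation, publisher, datePublished) are part of the schema.org vocabulary, but the choice of which ones suit a given collection is an assumption made for the example.

```python
from html import escape

# Hypothetical item metadata, one value per field.
item = {
    "title": "An Example Paper",
    "author": "Jane Doe",
    "affiliation": "Example State University",
    "publisher": "Example University Library",
    "date": "2012-05-01",
}


def render_microdata(item):
    """Return an HTML fragment that describes the item with schema.org microdata."""
    e = {key: escape(value, quote=True) for key, value in item.items()}
    return "\n".join([
        # itemscope starts a new item; itemtype says what kind of thing it is.
        '<div itemscope itemtype="https://schema.org/ScholarlyArticle">',
        f'  <h1 itemprop="name">{e["title"]}</h1>',
        # The author is itself an item (a Person) nested inside the article.
        '  <p itemprop="author" itemscope itemtype="https://schema.org/Person">',
        f'    <span itemprop="name">{e["author"]}</span>,',
        '    <span itemprop="affiliation" itemscope itemtype="https://schema.org/Organization">',
        f'      <span itemprop="name">{e["affiliation"]}</span>',
        '    </span>',
        '  </p>',
        '  <p itemprop="publisher" itemscope itemtype="https://schema.org/Organization">',
        f'    <span itemprop="name">{e["publisher"]}</span>',
        '  </p>',
        # The datetime attribute carries the machine-readable date.
        f'  <time itemprop="datePublished" datetime="{e["date"]}">{e["date"]}</time>',
        '</div>',
    ])


print(render_microdata(item))
```

Unlike meta tags in the page head, microdata annotates the text that human visitors already see, so the same HTML serves both readers and search engines.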

We think it’s important for repository managers (and especially catalogers) to be aware of these developments because they hold great promise for fulfilling the potential of the semantic web. Sites that already offer microdata provide a great benefit to Google’s users through its “rich snippets,” which display additional details about web pages in the search results. Another example of Google’s use of microdata appears in its “recipe search,” where metadata about recipes provide a faceted navigational search. If Google can do this for recipes, imagine what it could do for library digital repositories that already have rich metadata describing the objects. The techniques recommended by schema.org are the bridge that will allow search engines to understand that rich metadata, and putting those techniques into place in digital repositories is the responsibility of librarians and archivists.