A Report from the ALCTS/CCS/SAC/Subcommittee on Metadata and Subject Analysis
July 1999
CONTENTS
1 Summary of Recommendations
1.1 Vocabulary, Semantics, and Syntax
1.2 Application
1.3 Systems Design
2 Introduction
2.1 Audience
2.2 Premises and Characteristics of Subject Data in the Metadata Record
PART I: VOCABULARY, SEMANTICS, AND SYNTAX
3 Subject Vocabulary
3.1 Keyword vs. Controlled Vocabulary
3.2 Controlled Vocabulary
3.2.1 Sources of Controlled Vocabulary
3.2.2 Semantics
3.2.2.1 Specificity of Terminology
3.2.2.2 Synonym and Homograph Control
3.2.2.3 Term Relationships
3.2.3 Syntax
3.2.4 Metathesaurus and Harmonization of Vocabularies
4 Classification Data
4.1 Inclusion of Classification Data in Metadata Records
4.2 Choice of Classification Schemes
PART II: APPLICATION
5 Implementation Issues
5.1 Verbal Representation
5.1.1 Choice of Vocabulary
5.1.1.1 General Controlled Vocabularies
5.1.1.2 Specialized or Subject-Specific Vocabularies
5.1.2 Specificity and Depth of Analysis
5.1.3 Consistency
5.1.4 Placement of Non-Topical Data
5.2 Classification Data
5.2.1 Choice of Classification Schemes
5.2.2 Depth and Breadth of Scheme
5.2.3 Notation
PART III: SYSTEMS DESIGN
6 Online Systems Features
Glossary
References
Subcommittee Members
Diane Dates Casey, Chair; Julianne Beall; Frank Cervone; Lois Mai Chan; Eric Childress; Becky Culbertson; Bonnie Dede; Lynn El-Hoshy; Aimee Glassel; Jane Greenberg; Karen Greever; Shelby E. Harken; Shannon L. Hoffman; Pat Kuhr; Sara Shatford Layne; Katha Massey; Rebecca Mugridge; Gregory New; Robert E. Pillow; Sandra K. Roe; Ann M. Sandberg-Fox; Bruce Trumble; Marie Whited; Gregory Wool; Liaison from CC:DA/MARBI Metadata Task Force: Sherman Clarke
1 Summary of Recommendations
For subject data in the metadata record, the ALCTS/CCS/SAC/Subcommittee on Metadata and Subject Analysis recommends the following:
1.1 Vocabulary, Semantics, and Syntax (sections 3 and 4)
- A combination of keywords and controlled vocabulary should be used to allow users the choice of simple free-text indexing as well as complex controlled vocabulary indexing. (3.1)
- Use of multiple vocabularies should be accommodated. For a general vocabulary covering all subjects, the Subcommittee recommends the use of LCSH or Sears with or without modification. (3.2.1)
- In order to achieve the desired level of specificity, controlled vocabulary terms assigned to the metadata record could be supplemented and complemented by keywords and other subject-related elements, such as title, abstract, statement of content, etc. (3.2.2.1)
- Synonyms should be handled by system design implementation of the controlled vocabulary or thesaurus. If this is not available, an alternative is to include all identified synonyms and related terms, along with the keywords, in the metadata record. (3.2.2.2)
- Tools such as online thesaurus display should be developed to provide access to controlled vocabulary structures, showing both hierarchically (broader and narrower) and horizontally related terms. (3.2.2.3)
- The metadata record, and the subject element in particular, should be as simple or as complex as desired. Trained catalogers may choose to continue to apply LCSH to the metadata records in the same manner as those assigned to MARC records. For those not trained in subject cataloging, the Subcommittee recommends a simplified syntax. (3.2.3)
- The development and refinement of methods for harmonization of subject terms from different controlled vocabularies should be undertaken, and investigation of the feasibility of developing a general metathesaurus or expanding the medical metathesaurus to include indexing terms covering all subject areas should be encouraged. (3.2.4)
- Classification data should be included in the metadata record by those who have the expertise to do so. For those not trained in the use of classification, further development and improvement of mechanisms for automatic assignment of classification data from different schemes and sources should be encouraged. (4.1)
- The use of as many existing classification schemes (DDC, LCC, NLM, etc.) as are useful and feasible, even within a particular implementation, should be allowed. Multiple class numbers should be permitted in the same record to bring out different topics and aspects treated, provided that they are properly designated and coded. (4.2)
1.2 Application (section 5)
- In the Dublin Core metadata record, the Subcommittee recommends the inclusion in the SUBJECT element of both free-text and controlled terms, where appropriate and feasible, in order to achieve optimal recall and precision in retrieval. (5.1)
- For the sake of semantic interoperability, the Subcommittee recommends adopting an existing vocabulary or vocabularies with or without modification. (5.1.1)
- The adoption or adaptation of Library of Congress Subject Headings or Sears List of Subject Headings (for subject representation on a broader level) as the basis for subject data in the Dublin Core metadata records for a general collection is recommended. (5.1.1.1)
- Criteria for choosing specialized vocabularies should be based on subject matter, the intended audience, term specificity, and syntax. (5.1.1.2)
- Each implementing agency should establish policies regarding the appropriate level of subject representation for its collection. At the appropriate level, the most specific subject terms provided by the chosen controlled vocabulary should be assigned. (5.1.2)
- Within a specific digital collection or project, the application of subject analysis should be consistent; in other words, the same semantics and syntax should be applied throughout. Compatibility with other metadata schemes is also desirable. When a controlled vocabulary is used, the version of the vocabulary should be indicated along with the date on which the subject data are created. (5.1.3)
- With regard to syntax, the use of full LCSH subject strings, if feasible (i.e., if time and trained personnel are available), particularly in the OPAC environment, should be encouraged. For the Dublin Core, the Subcommittee endorses the use of other elements (type, coverage) in addition to the SUBJECT element to accommodate different facets related to subject: topic, place, period, language, etc. Deconstructed subject strings should be so designated. (5.1.4)
- For classification data, the Subcommittee recommends adopting an existing scheme with or without modification. Criteria for choosing classification schemes should be based on subject domain, the nature and scope of the collection being described, and the user community being served. (5.2.1)
- Classification data at the most exhaustive or specific level should be encouraged. (5.2.2)
- Classification notation should be included. However, item (non-topical Cutter) numbers are not necessary because classification data are not used as a shelving device in this context. Multiple classification numbers should be allowed -- either of various classification types or multiple numbers within the same scheme. In the metadata record, captions (i.e., the text accompanying the class numbers) need not be included. If desired, captions could be built in through systems design. (5.2.3)
1.3 Systems Design (section 6)
The development and refinement of the following online system features are highly recommended:
- automatic keyword indexing based on word occurrences in the full-text resources, using natural language processing methods;
- automatic generation of classification data based on the resource itself;
- automatic extraction of subject and classification data from records for similar items;
- availability of online access to controlled vocabularies and classification schemes for creators of metadata records;
- automatic mapping from user input free-text terms to controlled vocabularies and classification data; and,
- availability of online tools and assistance, designed particularly for non-catalogers, to derive appropriate subject terms and/or class numbers.
2 Introduction
The ALCTS/CCS/SAC/Subcommittee on Metadata and Subject Analysis was established in 1997 with the following charge: “Identify and study the major issues surrounding the use of metadata in the subject analysis and classification of digital resources. Provide discussion forums and programs relevant to these issues.”
The following report addresses the issues relating to subject data in metadata schemes used for the description and representation of electronic resources, with particular focus on the Dublin Core. It addresses both broad principles and specific implementation as well as systems design issues. The report is presented in three parts. Part I Vocabulary covers general issues relating to the nature of subject vocabulary, semantics and syntax of controlled vocabularies, and scheme-based classification data. Part II outlines procedural issues relating to the implementation and application of controlled vocabularies and classification data in the metadata record, with specific focus on the Dublin Core. This part contains specific references to the choice of vocabulary and scheme, specificity and depth of indexing, and consistency. Part III covers some of the issues relating to systems design for facilitating subject data assignment. A glossary is included at the end of the report.
2.1 Audience
This report is aimed at a broad spectrum of metadata scheme users, who come from different environments, ranging from library and information professionals who are experts in sophisticated and intricate methods of providing bibliographic data to those who are not trained in such methods. For example, in order to accommodate a wide range of users, the intention of the Dublin Core was restated at the second Metadata Workshop held in 1996:
“The Dublin Core is intended to be usable by non-catalogers as well as by those with experience with formal resource description models.” (The Dublin Core: A Simple Content Description Model for Electronic Resources: Metadata for Electronic Resources 1999)
“The Dublin Core… is intended to be sufficiently rich to support useful fielded retrieval but simple enough not to require specialist expertise or extensive manual effort to create.” (Dempsey and Weibel 1996)
In its deliberations, therefore, the subcommittee took into consideration user groups with different objectives for using metadata and a wide range of application expertise.
2.2 Premises and Characteristics of Subject Data in the Metadata Record
Ideally, metadata schemes should be semantically interoperable across diverse disciplines, objectives, and language boundaries. They should accommodate the multilingual user community and should be flexible and adaptable to simple as well as elaborate description. Consequently, as articulated in the document outlining the characteristics of Dublin Core, a metadata scheme should: (1) be extensible to accommodate a wide range of implementations; (2) be flexible and be able to transcend language and disciplinary boundaries; (3) reflect international consensus; and, (4) be simple to implement and to use. (The Dublin Core: A Simple Content Description Model for Electronic Resources: Metadata for Electronic Resources 1999)
The subject data in the metadata record should be adaptable and flexible to accommodate simple schemes such as the Dublin Core as well as elaborate ones such as AACR2R/MARC. Simplicity requires that the methods and tools be easy to use and to comprehend, particularly by those not trained in subject indexing and classification. Flexibility requires that the methods, tools, and procedures be scalable and extensible in order to be viable in environments ranging from the simplest application to the most complex and detailed implementation. Semantic interoperability mandates that the vocabularies used in different communities lend themselves to harmonization, in order to enable users to search across a wide range of metadata schemes. International consensus requires that, ideally, the methods and tools chosen be capable of cross-linking and mapping.
Based on these principles, the functional requirements of subject data in the metadata record may be stated in the following terms. Schemes for supplying subject data should, to the fullest possible extent:
- be simple and easy to apply and to comprehend;
- be intuitive so that sophisticated training in subject indexing and classification, while highly desirable, is not required in order to implement them;
- be scalable for implementation from the simplest to the most sophisticated;
- be logical so that they require the least effort to understand and to implement; and,
- be appropriate to the specific discipline and subject, and to the domain of implementation such as libraries, museums, archives, information services, the scientific community, and personal knowledge management.
PART I: VOCABULARY, SEMANTICS, AND SYNTAX
3 Subject Vocabulary
This section of the report focuses on subject data with regard to the following aspects: keyword vs. controlled vocabulary, sources of vocabulary, semantics, syntax, and scheme-based classification data. The Subcommittee, based on its deliberations on the issues involved with each aspect, presents the following recommendations.
3.1 Keyword vs. Controlled Vocabulary
Issue statement:
Verbal representation can be free-text or controlled vocabulary. Controlled vocabulary has served the information community long and well. However, it is also more expensive and requires expertise to implement. Basically, there are three options:
(1) Keywords (free-text) only
(2) Keywords and controlled vocabulary
(3) Controlled vocabulary only
While the third option is theoretically available, it is hardly practical, particularly in the web environment.
Recommendation:
The Subcommittee considers the use of a mixture of keywords and controlled vocabulary to be the most viable approach. It would allow users the choice of simple free-text indexing as well as complex controlled vocabulary indexing. For example, the Library of Congress, the National Library of Medicine, and ERIC all use both controlled and uncontrolled vocabularies in their records.
3.2 Controlled Vocabulary
The use of controlled vocabulary, in particular, involves a complex set of issues. These are outlined below:
3.2.1 Sources of Controlled Vocabulary
Issue statement:
The use of controlled vocabulary implies a structured scheme of indexing terms. There are three options for the source of the scheme: (1) use existing scheme(s); (2) adapt or modify existing schemes; or (3) develop new scheme(s). Another related question is whether multiple schemes within a particular implementation should be encouraged. If so, how can terms from different vocabularies be harmonized to improve recall with minimal adverse effect on the precision offered by the use of a particular vocabulary?
Recommendation:
Use of multiple vocabularies should be accommodated. In many cases, they complement and supplement each other, for example, the use of Library of Congress Subject Headings (LCSH) to cover all subjects and Medical Subject Headings (MeSH) for medical resources. The Subcommittee feels that it is important to be able to properly identify or code different schemes. Development of mechanisms for harmonization, such as crosswalks and automatic linking or mapping (Buckland and others 1999), should be encouraged. In cases where existing schemes are not found to be satisfactory or suitable, adaptation or modification may be considered. Developing new schemes for specific subject domains (for example, the simplified subject categories used by the California Digital Library for its many electronic journals) may be feasible. On the other hand, developing a new scheme covering all subject fields, while ideal, is probably not practical, and most certainly is not economical. For a general vocabulary covering all subjects, the Subcommittee recommends LCSH or Sears with or without modification. Details regarding modification will be discussed in Part II below.
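As a sketch of how such a harmonization mechanism might work, the following Python fragment expands a search term with its equivalents from a small crosswalk table, so that a query entered in one vocabulary also retrieves records indexed with another. The LCSH/MeSH term pairs shown are illustrative assumptions, not authoritative mappings.

```python
# Illustrative vocabulary crosswalk: a term entered in one controlled
# vocabulary is expanded with its (hypothetical) equivalents in another,
# improving recall across records indexed with different schemes.

CROSSWALK = [
    # (LCSH-style term, MeSH-style term) -- example pairs only
    ("Cancer", "Neoplasms"),
    ("Heart--Diseases", "Heart Diseases"),
]

def expand_query(term):
    """Return the input term plus any crosswalk equivalents, sorted."""
    terms = {term}
    for lcsh, mesh in CROSSWALK:
        if term in (lcsh, mesh):
            terms.update((lcsh, mesh))
    return sorted(terms)

equivalents = expand_query("Cancer")   # ["Cancer", "Neoplasms"]
```

A production mechanism would of course draw on full published mappings rather than a hand-built table, but the principle, expanding recall while preserving each vocabulary's own terms, is the same.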
3.2.2 Semantics
Several aspects of the semantics of a controlled vocabulary warrant consideration: specificity of terms, synonym and homograph control, and term relationships.
3.2.2.1 Specificity of Terminology
Issue statement:
There are some very specific subjects on the Web. The question is, in a particular situation, whether the chosen vocabulary provides sufficiently detailed terms for their representation. For example, does LCSH, a scheme focusing on book literature, have enough vocabulary to cover the minute details of the web? What level of specificity is most desirable and suitable? What is the best balance between a simple vocabulary that is easy to use and a complex one that is more expressive of the subject content of the resources? It is recognized that a simple controlled vocabulary would be easier to translate to other languages and facilitate interoperability.
Recommendation:
In order to achieve the desired level of specificity, controlled vocabulary terms assigned to the metadata record could be supplemented and complemented by keywords and other subject-related elements, such as title, abstract, statement of content, etc. For example, the level of specificity of LCSH would be a good basis, but terms will need to be added. Thus, a combination of keywords and controlled vocabulary could provide the desired level of specificity.
3.2.2.2 Synonym and Homograph Control
Issue statement:
Without synonym and homograph control, e.g., the use of cross references and qualifiers, the power of a controlled vocabulary would be greatly diminished, and consequently recall and precision would suffer. While homographs are typically resolved by including qualifiers or contextual terms in the subject headings or descriptors in question, synonyms are typically not part of the controlled valid terms. Synonyms are normally accommodated in the OPAC by means of cross references. How should they be implemented in the web environment?
Recommendation:
Synonyms should be handled by system design implementation of the controlled vocabulary or thesaurus. Users need access to these terms through the controlled vocabulary structure. Automatic mapping from entry vocabularies to valid subject terms should be in place. To facilitate subject assignment, particularly by non-catalogers, access to synonymous terms and related terms through an online thesaurus is desirable. If this is not available, an alternative, not totally satisfactory but perhaps preferable to no synonym control at all, is to include all identified synonyms and related terms, along with the keywords, in the metadata record.
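The automatic mapping from entry vocabulary to valid terms described above might be sketched as follows. The synonym sets are hypothetical and not drawn from any actual thesaurus.

```python
# Minimal sketch of synonym control through system design: user-entered
# entry terms are mapped to the valid controlled heading before searching.
# The entry terms and preferred heading below are illustrative only.

ENTRY_VOCABULARY = {
    "automobiles": "Automobiles",   # the preferred heading itself
    "cars": "Automobiles",          # entry term (cross reference)
    "motor cars": "Automobiles",    # entry term (cross reference)
}

def to_preferred(term):
    """Map a free-text entry term to its preferred heading, if one is known."""
    return ENTRY_VOCABULARY.get(term.lower(), term)
```

With such a mapping in place in the retrieval system, the synonyms need not be repeated in each metadata record; the fallback of embedding them in the record is only necessary when no such mechanism exists.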
3.2.2.3 Term Relationships
Issue statement:
Term relationships have proven to be a useful tool in subject retrieval, particularly in broadening searches for improved recall and helping searchers identify and focus on the most appropriate terms for improved precision. While hierarchically structured subject directories provide a measure of assistance in identifying hierarchical relationships, they are typically one-dimensional and fail to display polyhierarchical relationships. Furthermore, there is also a need to show associative (lateral) relationships, particularly across different hierarchies. Term relationships in a controlled vocabulary provide a complementary method of navigation to the use of classification, perhaps even more useful to the end user. While terms related hierarchically or otherwise are included and displayed in most controlled vocabularies, display of term relationships has not been a prominent feature in online systems. In fact, most search engines have ignored this aspect. The question is whether such relationships remain useful in the electronic environment and to what extent they should be retained and implemented.
Recommendation:
The Subcommittee reinforces the importance of users’ access to term relationships and recommends that such access be implemented in online systems and search engines. Tools such as online thesaurus displays are needed to provide access to controlled vocabulary structures, showing both hierarchically (broader and narrower) and otherwise related terms.
3.2.3 Syntax
Issue statement:
Syntax of subject index terms ranges from single-concept descriptors to complex subject heading strings. In other words, terms may be precoordinated or postcoordinated. While single-concept descriptors are easy to assign, this approach sacrifices term relationships and context. Precoordinated subject heading strings place elements in a proper order so as to provide a context to ensure optimal precision in retrieval. When the subject string is broken up, the context and the relationship between elements in the string will be lost. Within a particular implementing environment, a mixture of the two approaches would invite inconsistency. Another consideration is that assignment of subject heading strings requires extensive training. Not all metadata creators, among them many web site developers, can be expected to be able to apply a controlled vocabulary with a complex syntax, such as the syntax used with LCSH in MARC records. Furthermore, a simplified syntax makes mapping and semantic interoperability much easier. The question is, then, which should be preferred?
Recommendation:
The metadata record, and the subject element in particular, should be as simple or as complex as desired. For example, librarians may choose to continue to apply LCSH to the metadata records in the same manner as those assigned to MARC records. For those not trained in subject cataloging, the Subcommittee recommends a simplified syntax. There are two aspects to simplicity - semantics (the words used to represent the subject) and syntax (how the words are put together). The syntax of a controlled vocabulary can be simplified without sacrificing the richness of the terms. For instance, non-topical elements in the LCSH strings can be broken out of the string. Even within the same system, elements in a subject string may be displayed separately or together. For example, the National Library of Medicine (NLM) recently deconstructed the MeSH strings; the 1999 MeSH headings no longer carry any physical form in the subject headings. For NLM's own Integrated Library System, geographic subject subdivisions are stored in the 651 field; and, also, form/genre subdivisions are placed in the 655 field and are no longer part of the subject string. The objective is to make searching in their catalog as close to the Medline Index as possible.
If, in the same environment, some records contain complex subject strings and some do not, it is recommended that they be designated as such, because they would require different approaches in authority control.
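The deconstruction of a precoordinated string into separate facets, along the lines of NLM's treatment of MeSH strings described above, can be sketched as follows. The subject string and the use of "--" as the subdivision delimiter are simplified illustrations (MARC records mark subdivisions with subfield codes rather than literal delimiters).

```python
# Sketch: split a precoordinated subject string into its main heading
# and its subdivisions, which could then be stored or indexed separately.
# "--" stands in for the MARC subfield delimiters; the string is illustrative.

def deconstruct(subject_string):
    """Split a precoordinated string into heading and subdivisions."""
    parts = subject_string.split("--")
    return {"heading": parts[0], "subdivisions": parts[1:]}

facets = deconstruct("France--History--Revolution, 1789-1799--Sources")
# heading: "France"
# subdivisions: "History", "Revolution, 1789-1799", "Sources"
```

As the MeSH example shows, the deconstructed elements can still be displayed together when context is wanted, so simplifying the stored syntax need not sacrifice the richness of the terms.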
3.2.4 Metathesaurus and Harmonization of Vocabularies
Issue statement:
To allow the flexibility of using different controlled vocabularies, even within the same application, reconciling different terminology and syntax poses a particular challenge. Some progress has been made, for example, the metathesaurus of medical terms developed by the Unified Medical Language System (UMLS), the OmniFile (H.W. Wilson Company's subject harmonization project), and Multilingual Access to Subject Headings (MACS), a European project on multilingual access to subject authority files and data currently under investigation.
Recommendation:
The development and refinement of methods for harmonization of subject terms from different controlled vocabularies should be undertaken, and investigation of the feasibility of developing a general metathesaurus or expanding the medical metathesaurus to include indexing terms covering all subject areas should be encouraged.
4 Classification Data
Classification is being used for two different purposes in the web environment. The first is the use of classificatory frameworks to organize large collections of web resources and to serve as navigational and pathfinder tools, typically resulting in multi-level hierarchical subject categories or subject guides such as those devised by web access providers and many libraries. The ALCTS/CCS/SAC/Subcommittee on Metadata and Classification has completed its investigation and evaluation of the use of the Dewey Decimal Classification (DDC), the Library of Congress Classification (LCC), and the National Library of Medicine Classification (NLMC) as pathfinders in library web sites.
The second is the inclusion of classification data in metadata records, similar to their use in cataloging records and bibliographies. The advantages of using classification are many, including logical grouping of related materials in bibliographies and catalogs, shelf ordering, and as access points to metadata records. Class numbers, such as Dewey and LCC, address the multilingual challenge of subject analysis. They can be used effectively as switching and mapping devices among subject vocabularies in different languages. Therefore, the inclusion of classification data in metadata records warrants serious consideration.
4.1 Inclusion of Classification Data in Metadata Records
Issue statement:
While the advantages of classification data are many, the major drawback is that their application requires expertise. Authors and creators of web resources cannot be expected to be able to supply such data.
Recommendation:
The Subcommittee recommends the inclusion of classification data in the metadata record by those who have the expertise to do so. For those not trained in the use of classification, mechanisms to assign class numbers automatically, for example OCLC's Scorpion, which generates DDC numbers by computer (Vizine-Goetz), have been under development. Such tools are designed to help non-catalogers assign appropriate class numbers. The Subcommittee recommends that further development and improvement of mechanisms for automatic assignment of classification data from different schemes and sources be encouraged.
4.2 Choice of Classification Schemes
Issue statement:
Similar to the choice of controlled vocabularies, there are a number of options in the choice of classification schemes. The first choice is between using existing schemes and creating new scheme(s). Should we encourage users to adopt, adapt, or modify existing schemes or develop new ones? How suitable are existing schemes for use in metadata records? Should each application be limited to one scheme only? Or, is the use of multiple schemes desirable?
Recommendations:
The Subcommittee recommends the use of as many existing schemes (DDC, LCC, NLM, etc.) as are useful and feasible, even within a particular implementation. As with subject terms, multiple class numbers may be assigned to the same record to bring out different topics and aspects treated, provided that they are properly designated and coded.
PART II: APPLICATION
5 Implementation Issues
Part II of the report outlines the issues and the Subcommittee's recommendations regarding the implementation of subject data in the metadata records, with particular focus on the Dublin Core scheme.
Dublin Core has been designed to be usable by non-catalogers as well as trained catalogers. The characteristics of the Dublin Core that distinguish it as a prominent candidate for description of electronic resources include the following: simplicity, semantic interoperability, international consensus, and flexibility. These characteristics were taken into account as the following recommendations were developed.
5.1 Verbal Representation
Issue statement:
The SUBJECT element in the Dublin Core metadata record is designed to accommodate both free-text and controlled vocabulary terms. Each of these may be used alone, or they may be used in combination.
Recommendation:
The Subcommittee recommends the inclusion of both free-text and controlled terms, where appropriate and feasible, in the SUBJECT element in the Dublin Core metadata record in order to achieve optimal recall and precision in retrieval.
5.1.1 Choice of Vocabulary
Issue statement:
For controlled vocabulary, there are many existing subject headings lists and thesauri, and they cover both general and specialized or subject-specific vocabularies.
Recommendation:
For the sake of semantic interoperability, the Subcommittee recommends adopting an existing vocabulary or vocabularies with or without modification. Naturally, the most important criterion is subject domain; the vocabulary used must accommodate the nature and scope of the collection being described.
5.1.1.1 General Controlled Vocabularies
Issue statement:
There are relatively few comprehensive controlled vocabularies covering all subjects. Some examples are:
Library of Congress Subject Headings
Sears List of Subject Headings
The question is whether the existing schemes are suitable for description of electronic resources.
Recommendation:
Because Library of Congress Subject Headings (LCSH) has been translated or used as a model to build subject heading lists and thesauri in many languages around the world, it represents the closest realization of an international consensus. The Subcommittee recommends the adoption or adaptation of LCSH as the basis for subject data in the Dublin Core metadata records for a general collection. Sears List of Subject Headings, which is built on similar principles but with simpler terminology, may also be considered for subject representation on a broader level. The use of LCSH or Sears, even with a modified syntax, in metadata records would ensure a measure of compatibility with the enormous store of MARC records in OPACs.
5.1.1.2 Specialized or Subject-Specific Vocabularies
Issue statement:
There are numerous controlled vocabularies covering specific subjects or designed for special collections. Some examples are:
Art and Architecture Thesaurus (AAT)
GeoRef Thesaurus
Legislative Indexing Vocabulary (LIV)
Medical Subject Headings (MeSH)
NASA Thesaurus
Thesaurus of ERIC Descriptors (ERIC)
Others (cf. Chan and Pollard 1988)
There are also multilingual subject thesauri, for example:
OECD Macrothesaurus (Social and economic sciences; English, Spanish, French, German)
Scimp/Scanp Business and Economics Thesaurus (LIRN project; English, French, and Portuguese)
Recommendation:
Criteria for choosing specialized vocabularies should be based on subject matter, the intended audience, term specificity, and syntax.
5.1.2 Specificity and Depth of Analysis
Issue statement:
In indexing, subject terms are typically assigned to correspond as closely as possible to the content of the resource; in other words, the most specific terms provided by the controlled vocabulary are assigned. Two issues are involved here. The first is summarization vs. exhaustive indexing. The question is whether specificity here refers to the overall content of a resource or its subordinate concepts also. The second issue is the level of indexing. The question is at what level index terms should be assigned: for large information units such as entire web sites and databases (analogous to books and periodicals) or for their component parts such as individual records and/or individual web pages (analogous to book chapters and journal articles).
Recommendation:
The level of specificity depends on the application. Author/creator generated metadata would naturally be based on the resource itself. In the case of metadata collections, each implementing agency should establish policies regarding the appropriate level of subject representation for its collection. At the appropriate level, the most specific subject terms provided by the chosen controlled vocabulary should be assigned. Concepts too specific for the controlled vocabulary may be represented with free-text terms in the SUBJECT element in the Dublin Core record or in other elements such as DESCRIPTION. Free-text keywords and sources of controlled vocabulary terms should be clearly labeled or designated. Devices should be developed to assist end users to find the desired level of specificity. However, this could be a design issue for a search engine or database system.
5.1.3 Consistency
Issue statement:
Consistency in application has always been a guiding principle in subject indexing in the more structured environments such as the OPAC and commercial databases because consistency ensures predictability, which in turn improves recall and precision. In the largely unstructured web environment, the question is to what extent consistency is feasible or even desirable.
Recommendation:
While recognizing the difficulty of achieving total consistency in the web environment, the Subcommittee recommends that, within a specific digital collection or project, the application of subject analysis should be consistent; in other words, the same semantics and syntax should be applied throughout. Compatibility with other metadata schemes is also desirable. When a controlled vocabulary is used, the version of the vocabulary should be indicated along with the date on which the subject data are created.
5.1.4 Placement of Non-Topical Data
Issue statement:
Subject-related data such as geographic, chronological, language, and form data may be placed with topical data or separated from them. In MARC records, all such data appear in the subject fields, typically in complex strings, in other words, in a precoordinate approach. The Dublin Core accommodates such data in different elements, thus allowing a faceted, postcoordinate approach.
In the Dublin Core, the element SUBJECT is defined as:
The topic of the resource. Typically, subject will be expressed as keywords or phrases that describe the subject or content of the resource. The use of controlled vocabularies and formal classification schemas is encouraged.
Before we consider the issue of free-text vs. controlled vocabulary, we need to consider a number of other "subject-related" elements in the Dublin Core (http://purl.oclc.org/dc/about/element_set.htm). At least seven of the fifteen elements relate to the content:
A. Content indicators -
1 Title Label: TITLE
3 Subject and keywords Label: SUBJECT
4 Description Label: DESCRIPTION
B. Form data -
8 Resource Type Label: TYPE
9 Format Label: FORMAT
C. Language data -
12 Language Label: LANGUAGE
D. Spatial or temporal data -
14 Coverage Label: COVERAGE
Taken together, these elements imply a faceted, postcoordinate approach to subject representation. Specifically, form, language, place, and time are separate from topical representation in element 3.
Thus, the Dublin Core allows the flexibility of placing non-topical data in the SUBJECT element either in a string or as separate descriptors. Data manifesting different facets (topic, space, time, form) may also be placed in different elements (type, coverage, language, etc.) in the Dublin Core scheme. Each approach has its own advantages and drawbacks. While the faceted, postcoordinate approach is perhaps more compatible with other types of databases and more amenable to current web search engines, the full string approach provides greater compatibility with OPACs and MARC databases.
Recommendation:
The Subcommittee recommends two options:
(1) Using LCSH subject strings, if possible (i.e., if time and trained personnel are available), particularly in the OPAC environment.
(2) Making use of other Dublin Core elements (type, coverage) in addition to the SUBJECT element to accommodate different facets related to subject: topic, place, period, language, etc. Deconstructed subject strings should be so designated.
Between the two options, the Subcommittee endorses the second option, i.e., the use of separate Dublin Core elements for form, type, time, and space, particularly in situations where non-catalogers are involved in the creation of metadata records.
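The contrast between the two options can be illustrated with a short sketch. The record content, the element values, and the simple dictionary structure below are illustrative assumptions for comparison only, not part of any Dublin Core encoding syntax.

```python
# A sketch contrasting option (1), a precoordinated LCSH-style string in
# SUBJECT, with option (2), facets placed in separate Dublin Core elements.
# Record content and dict structure are illustrative, not a DC binding.

# Option (1): one SUBJECT value carries topic, place, time, and form.
option_one = {
    "SUBJECT": ["United States -- History -- Civil War, 1861-1865 -- Maps"],
}

# Option (2): faceted, postcoordinate record, each facet in its own element.
option_two = {
    "SUBJECT": ["History"],                       # topic only
    "COVERAGE": ["United States", "1861-1865"],   # spatial and temporal facets
    "TYPE": ["map"],                              # form/genre facet
    "LANGUAGE": ["en"],
}

def facets(record):
    """Return the element names a record uses, for comparison."""
    return sorted(record)

print(facets(option_one))  # ['SUBJECT']
print(facets(option_two))  # ['COVERAGE', 'LANGUAGE', 'SUBJECT', 'TYPE']
```

As the sketch suggests, option (2) exposes each facet to separate indexing and retrieval, while option (1) keeps the full string intact for compatibility with OPAC displays.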
5.2 Classification Data
The SUBJECT element in the Dublin Core provides for the inclusion of data derived from formal classification schemes.
5.2.1 Choice of Classification Schemes
Issue statement:
There are many existing classification schemes, some covering all subjects and others specializing in specific subject areas or designed for special collections:
General Classification Schemes
There are relatively few comprehensive classification schemes covering all subject areas. The most prominent examples are:
Dewey Decimal Classification (DDC)
Library of Congress Classification (LCC)
Universal Decimal Classification (UDC)
Specialized or Subject-Specific Classification Schemes
There are numerous specialized classification schemes covering specific subject areas. Some examples are:
ACM Computing Classification System (CCS)
International Classification of Diseases (ICD)
Mathematics Subject Classification (MSC)
National Library of Medicine Classification (NLMC)
The first question is whether these existing schemes are suitable for representing electronic resources. The second is which scheme to choose.
Recommendation:
For the sake of interoperability, the Subcommittee recommends adopting an existing scheme, with or without modification. As with controlled vocabularies, the first criterion is subject domain; other considerations include the nature and scope of the collection being described and the user community being served.
5.2.2 Depth and Breadth of Scheme
Issue statement:
Classification data may be assigned at a broad level for the purpose of creating clusters of resources based on subject to facilitate retrieval. They may also be assigned at a specific level, similar to subject index terms, for the purpose of representing the true content of the resources. Do we need close classification or will broad classification serve the purpose just as well?
Recommendations:
The Subcommittee feels that the more exhaustive the classification data, the greater their value to users. Since existing classification schemes, by their nature, provide precoordination, they serve to supplement subject terms that provide postcoordination.
5.2.3 Notation
Issue statement:
Classification data can be assigned without the accompanying notation (numbers). Since users typically cannot decipher the meaning of class numbers, the question, then, is whether it is necessary to include class numbers in metadata records.
Recommendations:
Classification notation should be included. However, item (non-topical Cutter) numbers are not necessary because classification data are not used as a shelving device in this context. Multiple classification numbers should be allowed, whether from different classification schemes or from within the same scheme. In the metadata record, captions (i.e., the text accompanying the class numbers) need not be included; if desired, captions could be built in through systems design.
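Building in captions through systems design might work along the following lines: the system keeps a single table keyed by class number and attaches the caption at display time, so no caption need be stored in any metadata record. The table below is a small hypothetical excerpt in the style of DDC captions, not an authoritative extract of the scheme.

```python
# A sketch of supplying captions through systems design: one shared table
# keyed by class number, consulted at retrieval/display time. The entries
# below are a hypothetical DDC-style excerpt for illustration only.
CAPTIONS = {
    "025.3": "Bibliographic analysis and control",
    "025.4": "Subject analysis and control",
}

def display_class(number):
    """Return the class number with its caption, if one is on file."""
    caption = CAPTIONS.get(number)
    return f"{number}  {caption}" if caption else number

print(display_class("025.4"))  # 025.4  Subject analysis and control
print(display_class("999"))    # prints just "999": no caption on file
```

Because the caption lives in one table rather than in every record, a revision to the scheme's wording requires updating only the table, not the metadata records themselves.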
PART III: SYSTEMS DESIGN
6 Online Systems Features
In addition to the recommendations outlined in Parts I and II, the Subcommittee also considered systems design issues relating to subject data in the metadata record. It believes that the following features can greatly facilitate the assignment of subject data and the retrieval of electronic resources through such data:
- automatic keyword indexing based on word occurrences in the full-text resources, using natural language processing methods;
- automatic generation of classification data based on the resource itself;
- automatic extraction of subject and classification data from records for similar items;
- availability of online access to controlled vocabularies and classification schemes for creators of metadata records;
- automatic mapping from user input free-text terms to controlled vocabularies and classification data; and,
- availability of online tools and assistance, designed particularly for non-catalogers, to derive appropriate subject terms and/or class numbers.
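The automatic-mapping feature listed above can be sketched simply: user free-text terms are normalized and looked up in an entry vocabulary that points to the corresponding valid controlled term. The entry vocabulary below is a small hypothetical sample using MeSH-style headings; a working system would draw on a full vocabulary file or a metathesaurus.

```python
# A sketch of mapping user free-text input to controlled vocabulary terms.
# The entry vocabulary is a small hypothetical sample (MeSH-style headings);
# a real system would load a complete vocabulary or metathesaurus.
ENTRY_VOCABULARY = {
    "cancer": "Neoplasms",                    # synonym -> valid term
    "tumor": "Neoplasms",
    "tumour": "Neoplasms",                    # spelling variant
    "heart attack": "Myocardial Infarction",
}

def map_query(free_text):
    """Map semicolon-separated free-text terms to controlled terms,
    passing unrecognized terms through unchanged."""
    mapped = []
    for term in free_text.lower().split(";"):
        term = term.strip()
        mapped.append(ENTRY_VOCABULARY.get(term, term))
    return mapped

print(map_query("heart attack; tumour"))
# ['Myocardial Infarction', 'Neoplasms']
```

Passing unrecognized terms through unchanged preserves the free-text searching recommended in section 3.1, while recognized terms gain the recall benefits of synonym control.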
Glossary
Controlled vocabulary. A controlled vocabulary is a subset of a language, consisting of pre-selected words and phrases designated as index terms. In a controlled vocabulary, each subject is represented by one valid term only; and, conversely, each term represents only one subject. References are made from equivalent or synonymous terms not selected as valid index terms. Homographs are disambiguated. In addition, a controlled vocabulary contains links among hierarchically or otherwise related terms. Examples of controlled vocabularies include Library of Congress Subject Headings, Thesaurus of ERIC Descriptors, and Medical Subject Headings. The term “controlled vocabulary” is often used in a broad sense to include scheme-based classification data, which also manifest rigorous structures and embody relationships among concepts.
Crosswalk. A crosswalk is a program or algorithm to map elements in different metadata schemes. An example is the Dublin Core/MARC/GILS Crosswalk designed by the Library of Congress.
Harmonization. Harmonization refers to the process of making disparate entities or systems work together. Its purpose is to resolve conflicts and to remove obstacles by overcoming idiosyncrasies of individual systems. Within the context of subject access, harmonization implies efforts to make terms from different controlled vocabularies work together for the benefit of improving retrieval results. Differences may occur in semantics and/or syntax, and among multiple languages. Harmonization provides the ability to accommodate two or more different systems, schemes, or standards to facilitate searching across databases. Methods of harmonization include linking and mapping.
Keyword. A keyword, broadly defined, is any individual word that is searchable. It is often used specifically to refer to uncontrolled vocabulary in free-text searching. In this document, the latter definition is used.
Linking. Linking refers to the process of making connections between or among different entities or elements, including systems, vocabularies, index terms, etc. Examples include hyperlinks in online systems; linking among authority files, bibliographic files, and index files; and cross references between and among individual terms.
Mapping. Mapping refers to a special form of linking, with efforts to identify equivalence or establish one-to-one and, in some instances, one-to-many relationships. Mapping facilitates automatic switching between systems or languages. Recent developments include efforts to match elements in the MARC record with those in other metadata records and efforts to identify equivalent terms among different controlled vocabularies or different languages. Examples of mapping of subject entries include the Omni File (based on the indexes to individual WILSONLINE databases) and MACS (Multi-lingual Access to Subject headings, a European project on multilingual access to subject authority files and data to develop a prototype for the mapping of subject entries based on three controlled vocabularies: Library of Congress Subject Headings [LCSH], RAMEAU, and Schlagwortnormdatei [SWD]).
Metathesaurus. A metathesaurus is, in a sense, a “thesaurus of thesauri,” serving as a framework within which diverse controlled vocabularies are harmonized for the purpose of facilitating cross-file searching. An example is the UMLS (Unified Medical Language System) Metathesaurus developed and maintained by the National Library of Medicine, in which “alternate names [from different source vocabularies] for the same concept (synonyms, lexical variants, and translations) are linked together. Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition.” (National Library of Medicine 1999).
References
Buckland, Michael, and others. (January 1999). Mapping Entry Vocabulary to Unfamiliar Metadata Vocabularies. D-Lib Magazine, 5 (1) (http://www.dlib.org)
Burnard, Lou, Eric Miller, Liam Quin, and C.M. Sperberg-McQueen. (undated). Syntax for Dublin Core Metadata - Recommendations from the Second Metadata Workshop. (http://info.ox.ac.uk/~lou/wip/metadata.syntax.html)
Chan, Lois Mai and Richard Pollard. 1988. Thesauri Used in Online Databases, An Analytical Guide. New York: Greenwood Press.
Dempsey, Lorcan and Stuart Weibel. (July/August 1996). The Warwick Metadata Workshop: A Framework for the Deployment of Resource Description. D-Lib Magazine. (http://www.dlib.org)
The Dublin Core: A Simple Content Description Model for Electronic Resources: Metadata for Electronic Resources. 1999. (http://purl.org/DC/index.htm)
Dublin Core Metadata Element Set: Reference Description. 1997 (http://purl.org/DC/about/element_set.htm)
National Library of Medicine. 1999. Fact Sheet: UMLS ® Metathesaurus ®. (http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html)
Resource Discovery Workshops: Final report from the Archaeology Data Service. 1997. (http://ads.ahds.ac.uk/project/metadata/workshop1_final_report.html. Last modified: 4 August 1997)