Automatic Generation of MARC-Formatted Metadata by Crawling E-Publications

Siew-Phek T. Su, Yu Long, and Daniel E. Cromwell

This paper presents a system called E-pub to MARC (E2M), which automatically generates MARC-formatted metadata by crawling e-publications. The functions of its two key components, the Web Crawler and the MARC Converter, are introduced. The paper presents the methods and tools used for building the system. The process of crawling and gathering pertinent metadata stored in the e-publications and the transformation of the metadata into MARC-formatted records are described in detail. The complexity of the crawling and record-generation processes is also described. A comparison between the computer-aided E2M process and the manual cataloging of e-publications is presented to illustrate that the E2M process is a more cost-effective and efficient method of organizing and providing access to e-publications.

The proliferation of scholarly electronic publications (e-publications) on the Internet has posed a challenging problem for catalog librarians. The process of manually cataloging information in this medium is not only time consuming but also human-resource intensive. In light of this challenge, the impetus of the project team was to develop a more efficient and effective method to catalog this type of material with the aid of a computer.

Bibliographic control of Web resources and its related issues have been widely discussed and written about. 1 The issue of automating the e-publication cataloging process is an important one. However, little has been done in developing systems to automate the entire labor-intensive cataloging process using WebCrawler technology and techniques for automatic data conversion and loading. Although WebCrawlers have been used to extract information from Web pages, they are not programmed to extract the specific metadata needed for constructing catalog records and for loading them into bibliographic databases. For example, the two notable crawlers of the popular search engines, Google and Internet Archive, crawl the entire Web and extract keywords from Web pages to generate indexes for accessing relevant Web pages. 2 Meta-crawlers, such as MetaCrawler and Dogpile, integrate the search results obtained from different search engines. 3 Site-specific crawlers, such as WebSPHINX, allow users to specify site-specific crawling rules and perform so-called personal crawling. 4 The Hermes notification service system uses a component called wrapper to extract bibliographic data from HTML documents on publishers' Web sites and generate XML documents that contain bibliographic data. 5 The bibliographic data are typically the journal's table of contents (TOC). A commercially available tool for cataloging Web resources is the MARCit system. 6 The system provides a template for users to fill in such cataloging information as URL, author, title, and subject headings to convert the information to standard MARC-formatted records, which can be loaded into the local library management system. However, it does not have a WebCrawler component to automatically access and harvest the Web page metadata.

In contrast to the above systems, the E2M system described in this paper deals with the entire e-publication cataloging process. It starts with the automatic extraction of metadata from Web pages and goes on to the conversion of metadata into MARC-formatted records. Next, these records are loaded into the local system for authority verification. The final stage of the process consists of exporting verified MARC records to the Online Computer Library Center (OCLC) catalog for sharing with the bibliographic information community.

Project Domain

The e-publications housed in the Extension Digital Information Source (EDIS) database of the Institute of Food and Agricultural Sciences (IFAS) at the University of Florida (UF) are used as the project domain. 7 The EDIS database is the official electronic database of IFAS's current extension service and research publications. The reason for choosing the EDIS database as the project domain is a utilitarian one. It has been UF Library policy to catalog all IFAS publications. Currently there are more than five thousand documents in the EDIS database, of which nearly four thousand are in electronic format. 8 Moreover, thirty to forty new e-publications are added monthly. The benefits of automating the cataloging process for this increasingly large database are obvious. Furthermore, the structure of the IFAS e-publications is somewhat standardized, making it ideal for developing a WebCrawler to automatically harvest the meta-information.

Components and Operational Process of E2M

Figure 1 shows the components of the E2M system and its operational process.

Figure 1

The WebCrawler accesses and scans Web pages, extracting relevant metadata from them. The extracted metadata are represented in the form of data field names and values. They constitute the input to the MARC Converter. The converter transforms the metadata into MARC-formatted records, which are then loaded into the library's online management system. For this task, we use the FULOAD program. 9 The computer-generated records undergo an authority verification process before being loaded into the OCLC database as acceptable MARC records. We use the BatchBAM program to perform this authority verification. 10 To transfer records to OCLC, we use the upload feature of the CLARR program developed by Gary L. Strawn of Northwestern University Library. 11 The FULOAD and BatchBAM programs were also written by Strawn specifically to assist the UF library staff in the automated processing of bibliographic records. This paper describes only the two key components of the E2M system: the WebCrawler and the MARC Converter.

WebCrawler and Metadata

What is a WebCrawler, and why is it ideal for harvesting metadata of e-publications? A WebCrawler is "a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced." 12 Crimmins presents an excellent review of Web crawling in which he examines various strategies and approaches used for developing different types of crawlers, such as scalable crawlers, agent-based crawlers, and crawlers designed for finding specific information. The design of a WebCrawler depends very much on its application. 13 As mentioned earlier, the WebCrawler developed for the E2M project finds the specific metadata in each e-publication page, such as the author, title, publisher, date of publication, and notes, necessary to generate a MARC record.

The e-publications residing in the EDIS database have a somewhat standardized structure. Each e-publication has all or part of the following metadata: title, authors, section titles, subsection titles, summary, bibliography, footnotes, and copyright information, in that order. However, the specific format and placement of these metadata vary across e-publications. The challenge for the crawler is to detect these differences by scanning the HTML representation of each e-publication, in which these data fields are not explicitly tagged. This, of course, is more difficult than scanning and extracting metadata from an XML-formatted document in which metadata are tagged by field names. Our solution is to make the best use of the tags in HTML to find the pertinent metadata we need to construct the bibliographic record in MARC format. The following examples illustrate the challenge and the solution:

The title of the e-publication always appears as the first data element in an IFAS e-publication. For authors, which usually come after the title, we search for the information between the tag "</h1>", which is the end tag for the section containing the title information, and the tag "<sup>", which is the beginning tag for the superscripted footnote number that always follows the last author's name.
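As a sketch of this tag-based extraction, the author span can be pulled out with a regular expression anchored on the two tags. The class and method names below are illustrative, not the actual E2M code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: pull the author string from IFAS-style HTML, where
// the authors sit between the title's closing </h1> tag and the <sup> tag
// of the footnote number that follows the last author's name.
public class AuthorExtractor {
    private static final Pattern AUTHOR_SPAN =
        Pattern.compile("</h1>(.*?)<sup>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    public static String extractAuthors(String html) {
        Matcher m = AUTHOR_SPAN.matcher(html);
        if (!m.find()) {
            return null;  // the real crawler would log this to the error file
        }
        // Strip any residual tags and collapse whitespace in the captured span.
        return m.group(1).replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }
}
```

A page without the expected tag pair yields `null`, which mirrors the error-file behavior described later in the paper.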

A publication number (PN) is located in the footnote section of the e-publication and has the following possible formats:

  • Alpha-numeric string with the alpha part always in the upper case, such as:
    • NEY-250
    • FCS 8155
  • Name followed by number, such as:
    • Bulletin 810
    • Fact Sheet ENH-88

Using the same tag-based method used to locate the author information, the program looks for the characteristic pattern of the PN: a string starting with an uppercase letter and ending with a numeral.

For the publication date, which also resides in the footnote section, the program searches for the four-digit year number.
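The footnote parsing for the publication number and date can be approximated with two regular expressions covering the formats listed above. Both patterns and the method names are illustrative approximations, not the actual E2M code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the footnote parsing: the publication number starts
// with an uppercase letter and ends with a numeral (e.g., "NEY-250",
// "FCS 8155", "Bulletin 810", "Fact Sheet ENH-88"), and the publication date
// is taken to be the first four-digit year in the footnote text.
public class FootnoteParser {
    // Optional capitalized name words, an optional all-caps code, then digits.
    private static final Pattern PUB_NUMBER =
        Pattern.compile("\\b(?:[A-Z][a-z]+\\s)*(?:[A-Z]{2,}[-\\s])?\\d+\\b");
    private static final Pattern YEAR = Pattern.compile("\\b(?:19|20)\\d{2}\\b");

    public static String publicationNumber(String footnote) {
        Matcher m = PUB_NUMBER.matcher(footnote);
        return m.find() ? m.group() : null;
    }

    public static String publicationYear(String footnote) {
        Matcher m = YEAR.matcher(footnote);
        return m.find() ? m.group() : null;
    }
}
```

In practice the search would be restricted to the footnote section of the page, since the number pattern alone is loose enough to match other strings elsewhere in the document.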

In the process of harvesting the relevant information from e-publications, the crawler occasionally encounters errors such as incorrect URLs, broken URLs, incorrect HTML tags, or other system-generated errors. When an error occurs, the crawler records it in an error file and the search and extraction process continues, leaving the errant e-publications to be dealt with individually by the staff. Thus, the process is not blocked when an error is encountered.

The Crawler uses a breadth-first-search algorithm to locate all linked pages. 14 The search starts from a root page, fetches all the URLs on that page, and puts them into a first-in-first-out (FIFO) queue. Next, it uses the first URL in the queue to locate the next page, extracts all its URLs, and puts these newly fetched URLs into the queue again. Whenever it finds a URL for an e-publication, it starts to extract the relevant metadata. It is called a breadth-first-search method because it will finish searching all the URLs on one level before searching for the URLs on the next level. To illustrate the algorithm more clearly, we provide the flowchart of the crawling process in figure 2.

Figure 2
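The breadth-first traversal just described can be sketched as follows. To keep the example self-contained, the "Web" is represented as a map from a page's URL to the URLs found on that page; the real crawler fetches pages over HTTP, and all names here are illustrative:

```java
import java.util.*;

// Illustrative sketch of the breadth-first crawl: start from the root page,
// put newly found URLs into a FIFO queue, and visit one level completely
// before moving to the next.
public class BfsCrawler {
    public static List<String> crawl(String root, Map<String, List<String>> web) {
        List<String> visitOrder = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();   // FIFO queue of URLs
        queue.add(root);
        seen.add(root);
        while (!queue.isEmpty()) {
            String url = queue.removeFirst();       // earliest-found URL first
            visitOrder.add(url);
            // Here the real crawler would extract metadata if url points
            // to an e-publication.
            for (String link : web.getOrDefault(url, List.of())) {
                if (seen.add(link)) {               // enqueue each URL only once
                    queue.addLast(link);
                }
            }
        }
        return visitOrder;
    }
}
```

Because the queue is first-in-first-out, all URLs found on the root page are processed before any URL found one level deeper, which is exactly the breadth-first property described above.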

The user enters the starting URL of a Web page (i.e., the root page) through the WebCrawler interface (shown in figure 3). The system implementation and user interfaces are described in detail in the later part of this paper.

Figure 3

MARC Conversion Process

The MARC conversion process deals with the automatic generation of a standard MARC-formatted record from the extracted metadata of an e-publication; it is embedded in the crawling process, as illustrated in figure 4. In the conversion process, two types of data are combined to form a MARC record: the crawled data and the constant data. The crawled data are extracted from the e-publications; data such as author and title vary from one e-publication to another. The constant data are data that should appear in all generated MARC records. Some examples of constant data are given below:

  1. Publisher information in the 260 field: "[Gainesville, Fla.] : University of Florida Cooperative Extension Service, Institute of Food and Agricultural Sciences, EDIS"
  2. Technical note in the 538 field: "Internet access required."
  3. At head of title in the 500 field: "University of Florida, Cooperative Extension Service, Institute of Food and Agricultural Sciences, EDIS."

The constant data consist of two subtypes: data that appear in every IFAS e-publication (see examples 1 and 3), and data that are necessary to form valid MARC records according to cataloging rules (see example 2). The user inputs the constant data through the RecordFormat interface shown in figure 5.

Figure 4

Figure 5

The crawled data are stored in memory as an ElecRecord object and the constant data are stored as a MarcFormat object. These data are used to construct a MARC-formatted record based on the structural specification of the MARC bibliographic record. In order to achieve better performance, we store the generated MARC record in an in-memory buffer called OutputRecord instead of writing it to a file immediately after its generation. The accumulated MARC records in the buffer are written to a file only when the buffer is full. After the output operation, the buffer is cleared and reused for the next batch of records. Each batch contains the number of records set by the user through the WebCrawler user interface as the value of the variable named "Records per file" (see figure 3). In this way, we can reduce the number of time-consuming output operations. However, the value of Records per file should not be set too high because some computers that run the E2M system may have a limited amount of main memory space. From our experience, we recommend that the value should not be set higher than one hundred records.
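The buffering scheme can be sketched as follows. A `StringBuilder` stands in for the output file to keep the example self-contained, and the class and method names are illustrative, not the E2M code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the OutputRecord buffering scheme: generated MARC
// records accumulate in memory and are written out only when the batch size
// ("Records per file") is reached, reducing the number of output operations.
public class OutputRecordBuffer {
    private final List<String> buffer = new ArrayList<>();
    private final StringBuilder file = new StringBuilder();  // stand-in for the output file
    private final int recordsPerFile;                        // recommended: no more than 100

    public OutputRecordBuffer(int recordsPerFile) {
        this.recordsPerFile = recordsPerFile;
    }

    public void add(String marcRecord) {
        buffer.add(marcRecord);
        if (buffer.size() >= recordsPerFile) {
            flush();                 // one output operation per batch, not per record
        }
    }

    public void flush() {
        for (String record : buffer) {
            file.append(record).append('\n');
        }
        buffer.clear();              // buffer is cleared and reused for the next batch
    }

    public String written() {
        return file.toString();
    }
}
```

The trade-off noted in the text is visible here: a larger `recordsPerFile` means fewer flushes but more records held in main memory at once.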

For creating a record with the proper MARC format, we first determine the various data fields and values that are to appear in the finished product. We then create a field-by-field list of cataloging rules to be used by the converter in placing data fields and values in the proper format of a valid MARC record. The generated record can then be confidently used by the bibliographic community. We base the rules on the MARC format standard specification, the Anglo-American Cataloguing Rules (AACR2), the International Standard Bibliographic Description for Electronic Resources (ISBD[ER]), and OCLC Bibliographic Formats and Standards. 15 For example, the rule for the author/title field is to provide only one author access point and to observe the following rules:

  • Use the first-named author as the main entry (100:1)
  • If there are more than three authors, use the title as the main entry (245:0) and the first-named author as an added entry (700:1)

In addition to these rules, we have to introduce additional rules for handling metadata elements that do not conform to the data field restriction of the local online library system. For example, the 520 field of the NOTIS system has a length restriction of one thousand characters. However, the summary of an e-publication may exceed this limit. It does not make sense to simply truncate the summary after the thousandth character. A reasonable rule for the converter to follow is to find the last sentence that fits the thousand-character limit and put the remainder of the summary in an additional 520 field(s).
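The sentence-boundary splitting rule can be sketched as follows. The name and the simple "period followed by a space" sentence test are illustrative; the real converter would need more careful sentence detection:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the 520-field splitting rule: rather than truncating
// a long summary at the field-length limit (1,000 characters in NOTIS), break
// it at the last sentence boundary that fits and carry the remainder into
// additional 520 fields.
public class SummarySplitter {
    public static List<String> split(String summary, int limit) {
        List<String> fields = new ArrayList<>();
        String rest = summary.trim();
        while (rest.length() > limit) {
            // Last sentence-ending period (followed by a space) within the limit.
            int cut = rest.lastIndexOf(". ", limit - 1);
            if (cut < 0) {
                cut = limit - 1;     // no sentence boundary found: hard split as a last resort
            }
            fields.add(rest.substring(0, cut + 1));
            rest = rest.substring(cut + 1).trim();
        }
        if (!rest.isEmpty()) {
            fields.add(rest);
        }
        return fields;
    }
}
```

Each returned string becomes one 520 field, so no text is lost and every field stays within the length restriction.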

The upper and lower cases of letters in words that appear in an IFAS e-publication pose a problem in converting them to the correct cases in a MARC record. For example, in an IFAS e-publication, all words in a title begin with uppercase letters. However, according to the AACR2, only the first word in the title should be in uppercase, with the exception of acronyms, proper nouns, and directional words followed by proper nouns. To deal with this problem, we developed a proper name table, which consists of a set of acronyms, proper nouns, and directional words. The converter uses this table to identify those words whose uppercase letters should not be changed in the process of conversion. Some examples of words in the table include:

  • geographic names, such as "Florida";
  • directional word before a geographic name, such as "South Florida"; and
  • acronyms, such as "EDIS."
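A minimal sketch of the table lookup might look like the following. The table contents and names are illustrative; in particular, storing a directional word together with its geographic name as a two-word phrase is one possible way (ours, not necessarily E2M's) to keep "South" capitalized only before "Florida":

```java
import java.util.Locale;
import java.util.Set;

// Illustrative sketch of the proper-name-table lookup used when lowercasing
// titles per AACR2: every word after the first is lowercased unless it appears
// in the table of acronyms, proper nouns, and directional-word phrases.
public class TitleCaser {
    // Multi-word entries handle a directional word before a geographic name.
    private static final Set<String> PROPER = Set.of("Florida", "South Florida", "EDIS");

    public static String toAacr2Case(String title) {
        String[] words = title.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            String pair = i + 1 < words.length ? word + " " + words[i + 1] : null;
            boolean keep = i == 0                               // first word keeps its capital
                    || PROPER.contains(word)                    // proper noun or acronym
                    || (pair != null && PROPER.contains(pair))  // directional word + name
                    || PROPER.contains(words[i - 1] + " " + word);
            out.append(keep ? word : word.toLowerCase(Locale.ROOT));
            if (i + 1 < words.length) out.append(' ');
        }
        return out.toString();
    }
}
```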

Before the process of automatically generating the MARC record could take place, we also had to decide on the type of record we would like the converter to generate. We opted for encoding level K (less-than-full level) records to avoid having to assign subject headings and call numbers or to establish entries for secondary authors. To make up for the lack of subject headings, we decided to include section and subsection titles as content notes and the summary as a summary note to provide keyword access.

Another problem we had to solve was how to handle different versions of an IFAS e-publication. Unlike printed documents, for which each edition is separately published and cataloged, only the current version of an IFAS e-publication is made available, replacing the old version. The new version has the same URL as the old version and has a notation in the text indicating that the publication date is the revision date. The policy of keeping only the current version creates a problem and a challenge for maintaining the accuracy of a bibliographic record. Technically, an existing bibliographic record should be modified to contain the publication date of the new version. However, it is quite costly not only to update the date each time a new version appears but also to track when the new version appears. The approach we have taken to deal with the current version problem is to code the type of date and publication status as "m" (a range of dates) and leave an open-ended publication date. We also include a note to indicate the date that the crawler viewed the Web page. An example of this is "Title from Web page viewed on July 25, 2001." The view date is the date that the metadata was harvested. In this way, we can use a single bibliographic record to describe the potentially changing content of the document.

The maintenance of the crawled URLs is another issue we had to consider. The unpredictable mobility of Internet resources creates a serious problem for librarians because it compromises their services to the users and imposes a burden on catalog maintenance. The Persistent URL (PURL) resolution developed by OCLC serves as a general solution to this problem. 16 We decided to use the PURL server maintained by the Florida Center for Library Automation (FCLA) to create two PURLs for each IFAS e-publication; one pointing to the HTML version of the Web page and the other to the PDF version.

Even though the basic structure of the e-publications in the EDIS database is somewhat standardized, the formats of the data values in these e-publications may vary, as mentioned earlier. The inconsistent data formats make it a real challenge to write a general program that extracts the correct data and generates MARC records in adherence to strict cataloging rules. The extraction of the author information illustrates the complexity. The author information can appear as:

  • "P.J. van Blokland" and "van Blokland, P.J."
  • "John Smith, Ph.D." or "John Smith, Jr."
  • two or more authors separated by "and," by comma, or by space

In the first example, it is not possible for the program to distinguish correctly which character string constitutes the first name and which the last name. In the second example, unless a specific rule is written for the program to ignore titles such as "Ph.D." and "Jr.," it will not know that these characters are not a part of the name. In the third example, there is no way for the program to count the number of authors when there is not a consistent way of separating each author unless all the possible separators are made known to the program.
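A pattern-based parser along the lines described could look like the following sketch. The title list and separator rules are illustrative; the real converter recognizes many more patterns, and the handling of suffixes such as "Jr." would need its own rule:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the author-parsing rules discussed above: strip
// personal titles such as "Ph.D." from the string, then split the remainder
// on the known separators ("and" or a comma). A space-only separator between
// authors would need additional rules this sketch does not attempt.
public class AuthorParser {
    private static final List<String> TITLES = List.of("Ph.D.", "Psy.D.", "M.D.", "Ed.D.");

    public static List<String> parse(String raw) {
        String s = raw;
        for (String title : TITLES) {
            s = s.replace(", " + title, "").replace(title, "");
        }
        // Split on "and" (as a whole word) or on commas.
        String[] parts = s.split(",|\\band\\b");
        List<String> authors = new ArrayList<>();
        for (String part : parts) {
            String name = part.trim();
            if (!name.isEmpty()) authors.add(name);
        }
        return authors;
    }
}
```

Note that this sketch still cannot tell first name from last name in a string like "P.J. van Blokland"; as the text says, that distinction requires knowledge the character string alone does not carry.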

Our current solution is to identify as many different patterns and structures of metadata as possible and to program the converter to recognize them. However, this approach requires that the program code be modified or extended if more deviant patterns and structures are found. A better solution is to introduce a rule language and write the deviant patterns and structures as rules. The converter can then be programmed to interpret the rules during the parsing of the character string to correctly extract the metadata. The development of such a rule-driven parser for the converter is contemplated.

There are a number of Spanish language e-publications in the EDIS database. The use of diacritics in the Spanish language text posed a unique challenge. The representations of such diacritics as acute, tilde, and umlaut in the HTML source code are different from the MARC 21 Specification for Character Sets needed to construct MARC records. 17 We have programmed the system to perform the conversion of acute, tilde and umlaut, which are the most common diacritics in the Spanish language. Future work can be done to set up a lookup table for special characters in different languages.
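The conversion can be sketched as a lookup table from HTML entities to MARC-8 (ANSEL) sequences, in which the combining mark precedes the base letter. The entity list is illustrative, and the byte values shown follow the ANSEL convention as we understand it (acute 0xE2, tilde 0xE4, umlaut 0xE8); they should be verified against the MARC-8 code tables before use. The Java characters U+00E2, U+00E4, and U+00E8 stand in for those bytes and map to them under Latin-1 encoding:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the diacritic conversion: HTML entities in the source
// are rewritten as MARC-8 (ANSEL) sequences, where the combining mark precedes
// the base letter. A fuller table would cover more characters and languages.
public class DiacriticConverter {
    private static final Map<String, String> HTML_TO_MARC8 = new LinkedHashMap<>();
    static {
        HTML_TO_MARC8.put("&aacute;", "\u00E2" + "a");  // combining acute + a
        HTML_TO_MARC8.put("&eacute;", "\u00E2" + "e");
        HTML_TO_MARC8.put("&iacute;", "\u00E2" + "i");
        HTML_TO_MARC8.put("&oacute;", "\u00E2" + "o");
        HTML_TO_MARC8.put("&uacute;", "\u00E2" + "u");
        HTML_TO_MARC8.put("&ntilde;", "\u00E4" + "n");  // combining tilde + n
        HTML_TO_MARC8.put("&uuml;",   "\u00E8" + "u");  // combining umlaut + u
    }

    public static String convert(String htmlText) {
        String out = htmlText;
        for (Map.Entry<String, String> entry : HTML_TO_MARC8.entrySet()) {
            out = out.replace(entry.getKey(), entry.getValue());
        }
        return out;
    }
}
```

Extending the system to other languages, as the text suggests, would mostly be a matter of enlarging this table.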

The screen shots in figures 6, 6b, 7, and 8 summarize the transformation process from an e-publication to a MARC communication format record to the final product, the OCLC record. Figures 6 and 6b show the Web pages of an IFAS e-publication that the WebCrawler extracted; the image has been split to show the beginning and ending of the document. The complete document can be found at the following Web site: (Accessed April 30, 2002). 18 Figure 7 shows the corresponding harvested record in MARC communication format, and figure 8 shows the corresponding OCLC MARC record.

Figure 6

Figure 6b

Figure 7

Figure 8

The transformation of the author's name can be seen as the record evolves. As shown in figure 6, the form of the author's name in the e-publication Web page is "Garret D. Evans, Psy.D." The harvested MARC record in figure 7 shows that the form of the name has been transformed to "Evans, Garret D." with the elimination of the title "Psy.D." The final product, as shown in figure 8, presents the authoritative form of the name, "Evans, Garrett D, 1965- ," after the record has gone through the authority verification process. A cataloger manually changed the form of the name to the established one. An NAR (Name Authority Record) is created if the correct form has not been established. The number of our NAR submissions to the NACO (Name Authority Cooperative) Program has increased due to the loading of these E2M-generated records. Another example of the transformation from e-publication to MARC record is in the title proper. In figure 6, the title displays as "The 'Fool-Proof' Time-Out." E2M transformed the title into "The 'fool-proof' time-out," with the proper lowercase letters, as shown in figures 7 and 8. The second 500 field, with the quoted note shown in figure 8 and its corresponding MARC communication record in figure 7, is extracted from the Footnotes section of the e-publication's ending page shown in figure 6: "This document is Fact Sheet FCS 2113, a series of the Department of Family, Youth, and Community Sciences, Florida Cooperative Extension Service, Institute of Food and Agricultural Sciences, University of Florida. Publication date: April 1997. First published as HE 2101 June 1996. Reviewed: April 1997."

From the Footnote information, the WebCrawler also harvested pertinent data such as the publication date (260 |c) and the publication number (246 |b). Additional information in other fields and subfields, such as the 043, 246 (except for the publication number), 260 (except for the publication date), 500 "At head of title" note, and 538 fields, is the constant data taken from the record template, as shown in figure 5.

System Implementation and User Interfaces

The key components of the E2M system, the WebCrawler and the MARC Converter, are both written in the Java programming language. 19 We chose Java not only because its platform independence allows us to install the system on any Java-enabled computer, but also because its object-oriented features make the system easy to program, debug, and update.

Different library institutions or users may want to extract different metadata from e-publications and convert them into different MARC records. To avoid reprogramming the converter each time a different MARC record is desired, we use a form-driven approach to implement the converter. A record template that contains a standard set of data fields for monographs is predefined and accessed as a form by the user through a browser (see figure 5). The user completes the form by assigning values to the data fields shown in the template. All these values will appear in each of the generated MARC records. For example, in figure 5, the Format (FMT) field will always be coded as "B," the Record Status (STAT) as "n," and the Encoding Level (E/L) as "K" to indicate that the generated MARC record is a less-than-full level bibliographic record.

After filling in the first form, the user accesses the second form shown in figure 3. This form allows the user to enter the URL of a Web page (i.e., the root page), from which the Crawler would do a breadth-first search for all the hyperlinks as described in the WebCrawler and Metadata section of this paper. The user can also enter the URL of a particular e-publication for accessing the document directly.

The user interface shown in figure 3 also allows the user to specify a range of dates for traversing a subset of hyperlinks that fall into the specified range, or a set of specific hyperlinks for accessing the e-publications pointed to by them. For example, in the Web page that contains the hyperlinks to EDIS' new documents, the hyperlinks are partitioned by dates as shown in figure 9. The user may want the crawler to crawl for hyperlinks within the date range from September 21, 2001, to September 14, 2001. (Note that the dates are reversed because of their position on the Web page. See figure 9.) Another option the user has is to enumerate those hyperlinks from which the crawler should perform the search (e.g., IG148 to AN110). The reason for providing these two search options is to give the user the added flexibility to specify which e-publications are to be processed.

Figure 9

The "Records per file" field in figure 3 is for setting the desired number of records for each output file. The "035 suffix" field is for inputting a unique suffix to be included in the 035-field (system control number field).

To facilitate the editing of the proper name table, we developed a form consisting of three function-buttons: "Add," "Find," and "Delete" for adding, finding, and deleting entries from the table, respectively (see figure 10).

Figure 10


Discussion and Ongoing Work

We implemented and fully tested the system using a sample of approximately five hundred e-publications. Experimental results indicated that the system was an effective and efficient method for cataloging the e-publications residing in the EDIS database. We then put the system to use in the real library environment in July 2001. Using the E2M system, we generated and loaded over 2,500 MARC-formatted records into our local online database and uploaded more than two thousand records to OCLC. The process of harvesting the relevant metadata of the e-publications, converting the crawled metadata to MARC-formatted records, and loading the MARC records into the local database was very fast. The most time-consuming part was the manual review of each record and the follow-up work resulting from the authority verification process to make sure that the records met acceptable standards for sharing with the bibliographic universe. The individual tasks and their throughput times per hundred records were:

  • Crawl for metadata and convert them to MARC records: 10 minutes
  • Load records into online library management system: 2 minutes
  • Authority verification and corrections: 180 minutes
  • Create PURLs: 15 minutes
  • Transfer records to OCLC: 60 minutes
  • Total time: 267 minutes (2.7 minutes per record)

It is important to note here that the above processing times are the average times taken from the results of system testing at different times of the day. System performance can vary depending on network bandwidth and workload.

We trained a student assistant to conduct the entire process, with the exception of the authority verification process and the final review, which are performed by higher-level support staff or a catalog librarian. Thus, the human resources needed to catalog this type of e-publication are minimal.

A comparison between the cataloging process of IFAS e-publications using the computer-aided E2M process and manual cataloging would certainly be useful. When we cataloged the IFAS publications in the past, we used the full-level encoding code, including assigning call numbers and subject headings. The throughput time for such original manual cataloging is about two records per hour (thirty minutes per record). It is not fair to compare this time with the time it takes using the E2M process because in the latter we use a less-than-full encoding code. Even if the time to do manual cataloging by using less-than-full encoding level is, say, 25 percent of the full-level encoding time (i.e., 7.5 minutes per record), the use of E2M is still more advantageous (i.e., 2.7 minutes per record versus 7.5 minutes per record).

A more direct and fair comparison is with the manual cataloging of theses and dissertations, which uses the less-than-full encoding level "K" code. Like the E2M process, the thesis and dissertation cataloging process does not entail the assignment of subject headings, and several fields in the record also have constant data. Its authority verification process and its record transfer process to OCLC are quite similar to those of the E2M process. A student assistant is also involved in thesis and dissertation cataloging. Our performance measure for cataloging theses and dissertations manually is about seven records per hour (8.57 minutes per record), compared with 2.7 minutes per record for cataloging IFAS e-publications using the E2M process. It is obvious that the E2M cataloging process is more efficient and effective, and thus a less costly method of cataloging this type of e-publication. A side benefit of using the E2M system is that it allows us to contribute original bibliographic records to the OCLC database and gain a monetary credit of $4.05 per record for the library.

All the E2M-generated records loaded to OCLC are standard MARC less-than-full encoding level records that follow AACR2. To avoid loading records with errors, these harvested records went through manual review as well as an authority verification process in which changes were made where necessary. These records are also new and unique additions to the OCLC database. Before the initial loading of the E2M-generated MARC records to OCLC, we checked the OCLC database for possible duplicates. We found several records for IFAS extension documents, but they are for older paper documents, not for e-publications. More than a decade ago, IFAS published paper documents and distributed copies to other land grant university libraries and the National Agricultural Library (NAL). The output of publications from most land grant institutions, including UF, was prolific, making it difficult to catalog these documents in a timely manner. To help alleviate this problem, NAL encourages each home institution to take responsibility for cataloging its own publications. Thus, it is unlikely that a library will catalog other institutions' extension publications. In the meantime, IFAS abandoned publishing extension documents in paper format in favor of e-publications. Hence, the possibility of these documents landing in the hands of catalogers at other institutions is slim because they now reside only on the Web.

During the period of July through October 2001, we loaded more than two thousand E2M-generated MARC records into the OCLC database. In mid-March 2002 we searched WorldCat to determine whether other institutions were using these records as the basis for their own cataloging. As we suspected, the search results show that only a handful of records have another institution's holdings symbol besides UF's. We believe it is unlikely that other institutions would download many of the E2M-generated records in OCLC because of the nature of the publications. The importance of these records going into the OCLC database is that they are now available on FirstSearch and WorldCat to be used by researchers, not so much to be downloaded into other library databases. From our many years of experience in cataloging extension publications in OCLC, we have found no pattern of libraries aggressively seeking to catalog other institutions' electronic extension documents, except selectively. We also believe that, in general, if other libraries were to derive our E2M-generated MARC records, some of them might improve these records by adding subject headings and perhaps classification numbers; otherwise, the standard description that we provide as a less-than-full encoding level record should be sufficient. The search we did in mid-March 2002 bears this out: we checked the online database of a deriving institution and found that subject headings had been added locally to the derived record, while none of the records with other institutions' holdings symbols attached had been enhanced in OCLC. It would be interesting to track the usage of these records in the OCLC and WorldCat databases by repeating this search periodically to determine whether the initial analysis holds. The result could well change as these records age, since the chance that they will be found and used increases over time.

In addition to producing AACR2 MARC-quality records and offering a cost-effective and efficient method of organizing and providing access to the IFAS e-publications, the E2M project demonstrates two of the action steps identified by the Library of Congress in its "Bibliographic Control of Web Resources: A Library of Congress Action Plan." 20 The E2M system is the product of a collaborative effort among the IFAS Information Technology group, the researchers in the Database Systems Research and Development Center of the Computer and Information Science and Engineering Department, and the UF librarians. This collaboration has led to the ongoing exploration of better methods for organizing Web resources and the development of other information access and dissemination mechanisms. The project has developed and demonstrated WebCrawler technology, techniques for automatic metadata extraction from e-publications, and the automatic generation and loading of standard MARC-formatted records into the local library online management system and the OCLC database. The computer-generated records include such additional descriptive information as tables of contents and summaries to enable keyword searching. They also provide explicit linking from the bibliographic record to the e-publication via two PURLs.
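The PURL-based linking described above is carried in the MARC 856 (Electronic Location and Access) field of each generated record. The following sketch, which is illustrative and not the E2M source code, shows how two such fields might be rendered for the HTML and PDF versions of a document; the PURLs and notes are hypothetical placeholders.

```java
// Illustrative sketch (not the E2M implementation): rendering the two
// MARC 21 856 fields that link a bibliographic record to an e-publication
// via PURLs. The URLs below are hypothetical placeholders.
public class Marc856Sketch {
    // Build one 856 field as a human-readable tagged line:
    // tag, indicators, $u (URL), and an optional $z (public note).
    static String field856(String indicators, String url, String note) {
        StringBuilder sb = new StringBuilder("856 ");
        sb.append(indicators).append(" $u").append(url);
        if (note != null) {
            sb.append(" $z").append(note);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two access points for one record, e.g. HTML and PDF versions.
        System.out.println(field856("40",
                "http://purl.example.org/IFAS/AB123",
                "Electronic full text (HTML)"));
        System.out.println(field856("40",
                "http://purl.example.org/IFAS/AB123.pdf",
                "Electronic full text (PDF)"));
    }
}
```

Because the PURL, not the document's current location, is stored in the record, the library need only update the PURL resolver when an e-publication moves.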

Although the E2M project is a work in progress, it has already proven effective in automating cataloging. Several tasks could further improve the performance and flexibility of the system. For example, transferring records to OCLC in batches, rather than one at a time, would reduce transfer time. The creation of PURLs could be automated and programmed as part of the WebCrawler's function. Our ongoing work also includes developing a rule-driven parser for the crawler so that it can extract metadata from other types of e-publications without requiring the crawler itself to be modified or reprogrammed. A look-up table for special characters in different languages would allow the system to process non-English e-publications. All of these tasks are part of our ongoing work to make E2M a more generic software system.
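The rule-driven parser idea can be sketched as follows: extraction rules live in a table that pairs a MARC tag with a pattern over the e-publication's HTML, so supporting a new document type means supplying a new rule table rather than reprogramming the crawler. This is a minimal sketch under our own assumptions, not the E2M code; the tags and patterns shown are hypothetical examples.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of a rule-driven metadata extractor. Each rule maps
// a MARC tag to a regular expression whose first capture group is the
// field's value; the rule table, not the parser, changes per document type.
public class RuleDrivenParser {
    static Map<String, String> extract(String html, Map<String, Pattern> rules) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(html);
            if (m.find()) {
                metadata.put(rule.getKey(), m.group(1).trim());
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Hypothetical rule table: 245 = title statement, 520 = summary note.
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("245", Pattern.compile("<title>(.*?)</title>"));
        rules.put("520", Pattern.compile("<meta name=\"description\" content=\"(.*?)\""));

        String page = "<title>The Fool-Proof Time-Out</title>"
                + "<meta name=\"description\" content=\"A parenting fact sheet.\">";
        System.out.println(extract(page, rules));
        // prints {245=The Fool-Proof Time-Out, 520=A parenting fact sheet.}
    }
}
```

A look-up table for special characters could take the same data-driven form, mapping each non-ASCII character to its MARC-8 encoding so that non-English e-publications pass through the same pipeline.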


We would like to thank Jimmie Lundgren, Betsy Simpson, Diana Hagan, and Howard Beck for their helpful comments and valuable discussions on this project. We would also like to thank Professor Stanley Su for his support and advice on the technical aspects of this project.


   1. Abdus Sattar Chaudhry, "A Study of Current Practices of Selected Libraries in Cataloging Electronic Journals," Library Review 50, no. 9 (Oct. 2001): 434-43. Martin Dillon and Erik Jul, "Cataloging Internet Resources: The Convergence of Libraries and Internet Resources," Cataloging and Classification Quarterly 22, no. 3/4 (1996): 197-238. International Federation of Library Associations and Institutions, Functional Requirements for Bibliographic Records, March 1998. Accessed Feb. 27, 2002, Kristen H. Gerhard, "Cataloging Internet Resources: Practical Issues and Concern," The Serials Librarian 32, no. 1/2 (1997): 123-37. Yiu-On Li, "Computer Cataloging of Electronic Journals in Unstable Aggregator Database: The Hong Kong Baptist University Library Experience," Library Resources and Technical Services 45, no. 4 (Oct. 2001): 198-211. Proceedings of the OCLC Internet Cataloging Colloquium, San Antonio, Jan. 19, 1996. Accessed Feb. 27,
2002, University of Virginia, "Proceedings of the Seminar on Cataloging Digital Documents." Accessed Feb. 27, 2002, "Tools for Cataloging Internet Resources." Accessed Feb. 27, 2002,

   2. Google. Accessed Dec. 10, 2001, Internet Archive. Accessed Dec. 10, 2001,

   3. Metacrawler. Accessed Dec. 10, 2001, Dogpile. Accessed Dec. 10, 2001,

   4. WebSPHINX: A Personal, Customizable WebCrawler. Accessed Dec. 10, 2001,

   5. D. Faensen et al., Hermes--A Notification Service for Digital Libraries. Accessed Sept. 30, 2002,

   6. MARCit. Accessed Feb. 27, 2002,

   7. EDIS Web Site. Accessed Dec. 10, 2001,

   8. Information on the EDIS Database. Accessed Dec. 10, 2001,

   9. Gary L. Strawn, "Instructions to Accompany FULOAD.EXE: A Record Management Program Written for the University of Florida." Accessed Dec. 10, 2001,

   10. Gary L. Strawn, "Batch BAMming: A Method for the Automatic Validation of Headings in Defined Sets of Bibliographic Records." Accessed Dec. 10, 2001,

   11. Gary L. Strawn, "User's Guide to Accompany CLARR, the Cataloger's Toolkit." Accessed Dec. 10, 2001,

   12. M. Koster, "The Web Robots Pages." Accessed Dec. 10, 2001,

   13. Francis Crimmins, "WebCrawler Review." Accessed Dec. 10, 2001,

   14. M. Najork and J. Wiener, "Breadth-First Search Crawling Yields High-Quality Pages," in Proceedings of the Tenth International World Wide Web Conference, Hong Kong, May 2001. Accessed Dec. 10, 2001,

   15. Library of Congress, MARC Standards. Accessed Dec. 10, 2001, American Library Association, Anglo American Cataloguing Rules, 2d ed. 1998 rev. (Chicago: ALA, 1998). International Federation of Library Associations and Institutions, ISBD(ER): International Standard Bibliographic Description for Electronic Resources. Accessed Dec. 10, 2001, OCLC, Bibliographic Formats and Standards, 2d ed. (Dublin, Ohio: Online Computer Library Center, Inc., 1996).

   16. Keith Shafer, Stuart Weibel, and Erik Jul, Introduction to Persistent Uniform Resource Locators. Accessed Dec. 10, 2001, L/INET96.

   17. MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media CHARACTER SETS: Part 3, 2000. Accessed Dec. 10, 2001,

   18. Garret D. Evans, The "Fool-Proof" Time-Out. Accessed Apr. 30, 2002,

   19. Thom Blum et al., Writing a WebCrawler in the Java Programming Language. Accessed Jan. 23, 2002,

   20. Library of Congress, Bibliographic Control of Web Resources: A Library of Congress Action Plan. Accessed Feb. 27, 2002,

   Siew-Phek T. Su ( is Associate Chair for Central Bibliographic Services, George A. Smathers Libraries, and Yu Long ( is Graduate Student, Electrical and Computer Engineering Department, University of Florida, Gainesville; Daniel E. Cromwell ( is LMS Field Specialist, Technical Services, Florida Center for Library Automation, Gainesville.