A Resource Description Device Used for More Efficient Library Services

Markos Dendrinos and Stelios Bakamidis

This article presents a special portable device designed to let library users retrieve concise resource descriptions. The device reads a bar code attached to the back of the resource and retrieves the corresponding information stored in a nonvolatile, rewritable memory. The information retrieved by the device can also be used in the loan process. Resource descriptions can be displayed and dictated through a speech synthesizer simultaneously, constituting a valuable tool for individuals with special needs, including the deaf and blind.


This article discusses a device designed to provide a concise description of each resource located in a given section of a library. Description information is retrieved by reading a unique bar code attached to the back of each resource. This information can also be used to automate the loan process, since the resource's details are already recorded and the only information to be added is the borrower's details and the loan and delivery dates. Resource descriptions can be displayed and dictated through a speech synthesizer simultaneously, constituting a valuable tool for individuals with special needs, such as the deaf or blind.

What must be stressed here is the great importance of such a system for visually impaired persons. IFLA's Section of Libraries for the Blind (SLB) was established in 1983 as a forum for libraries for the blind. SLB participates in the annual IFLA conference and also in a biannual IFLA preconference for the section. The 2001 preconference, which took place in Washington, D.C., focused on increased information choices through Web-based technologies, future library services for blind students, digital delivery for the blind, and mainstreaming library services for blind and print-disabled users. 1

Despite growing technological developments in the information and communications area, only a small percentage of documents are actually made available to the blind in accessible formats, which include speech output, braille output, tactile devices, or even simple adjustments to a browser. 2

Integration of blind and visually impaired persons into schools, universities, and training centers is being considered through projects such as BrailleNet. 3 BrailleNet concerns Internet document delivery and aims to achieve integration through Web-accessible assistive technologies and teaching materials. The delivery of these special books is further enabled through cooperation with publishers, adaptation centers, and printing centers.

The National Library Service for the Blind and Physically Handicapped (NLS) of the Library of Congress is making use of the Internet to deliver a number of its services. 4 A continuously growing number of Web-Braille titles (currently 3,800) has been made available to 1,500 users.

Speech input and output through speech recognition and synthesis is the most user-friendly interface for delivering information to visually impaired users, and it contributes greatly to the effectiveness of the initiatives mentioned above.

Presentation of the Operations of the Resource Description Device

The library is organized in a series of thematic sections. Each section consists of a number of resources, such as books, serials, images, videos, audio recordings, and maps. Each resource is described through certain bibliographic fields along with a summary text. The whole description is stored in a special bibliographic database, allowing any field to be retrieved through tools included in related management software packages. Each library section is to be associated with a particular device. This infrared device will be connected to the computer server hosting the database just mentioned. In this way the resource descriptions corresponding to a section can be downloaded and stored in the device's nonvolatile, rewritable memory. These data can, of course, be changed or deleted to reflect changes in the state of the collection.
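To make this data flow concrete, the following Python sketch (a minimal illustration, not the device's actual firmware) shows how the records of one section might be downloaded from the server and held in the device's memory, keyed by bar code. The class and field names (ResourceRecord, DeviceMemory, barcode, title, author, summary) are assumptions made for the example.

    # Minimal sketch of the download step: each record carries a bar-code key,
    # a few bibliographic fields, and a summary text. Field names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class ResourceRecord:
        barcode: str
        title: str
        author: str
        summary: str

    class DeviceMemory:
        """Stands in for the device's nonvolatile, rewritable memory."""
        def __init__(self):
            self._records = {}

        def load_section(self, records):
            """Overwrite the stored section with freshly downloaded records."""
            self._records = {r.barcode: r for r in records}

        def lookup(self, barcode):
            return self._records.get(barcode)

    # Example: the server pushes the records of one thematic section.
    section = [
        ResourceRecord("9780000000011", "Sample Title", "A. Author",
                       "A one-paragraph summary of the resource."),
    ]
    memory = DeviceMemory()
    memory.load_section(section)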

The library user can get the device from the information desk and go to the part of the section he or she is interested in. As the device is passed over the back of each resource, the unique bar code attached to the resource triggers the device's sensor, which retrieves the part of memory containing the resource description. The retrieved information is displayed on a low-energy-consumption LCD and at the same time dictated through a speech synthesizer. The visual output is of great importance for persons with hearing problems, while the acoustic output strongly supports persons with visual problems (see figure 1).


Figure 1
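A minimal Python sketch of the scan-and-describe step is given below. The description store is shown as a plain dictionary keyed by bar code, and display() and speak() stand in for the LCD driver and the speech synthesizer; none of these names are taken from the actual device.

    # Sketch of the scan-and-describe step: when the sensor reads a bar code,
    # the matching description is both displayed and dictated.
    descriptions = {
        "9780000000011": "Sample Title, by A. Author. A one-paragraph summary.",
    }

    def describe(barcode, display, speak):
        text = descriptions.get(barcode, "Resource not found in this section")
        display(text)   # visual output, important for users with hearing problems
        speak(text)     # acoustic output, important for users with visual problems

    # Using print() for both outputs in this illustration.
    describe("9780000000011", display=print, speak=print)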

The speech synthesis preprocessing stage includes an automatic language detection and routing module followed by a multilingual text-to-speech (TTS) module. TTS is based on rules specific to each language. The system is intended to operate with all the official European Union (EU) languages. The inclusion of Greek in such a system is of national strategic importance, aiming at the equal linguistic participation of EU partners regardless of the size of their populations. The speech synthesizer is described below.
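The routing idea can be illustrated with the Python sketch below, under strong simplifying assumptions: the detector only distinguishes Greek script from Latin script, whereas a production module would cover all official EU languages, and the per-language TTS engines are placeholders.

    # Illustrative preprocessing stage: detect the language of the incoming
    # description, then route it to a language-specific TTS rule set.
    def detect_language(text):
        # crude script-based test: any character in the Greek Unicode block
        if any('\u0370' <= ch <= '\u03ff' for ch in text):
            return "el"
        return "en"   # fallback; a real module covers all EU languages

    def route_to_tts(text, tts_engines):
        return tts_engines[detect_language(text)](text)

    # tts_engines maps language codes to language-specific synthesizers (placeholders here)
    tts_engines = {
        "el": lambda t: f"[Greek TTS rules] {t}",
        "en": lambda t: f"[English TTS rules] {t}",
    }
    print(route_to_tts("Βιβλίο για την ιστορία της Αθήνας", tts_engines))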

Lastly, the device is to be equipped with all necessary navigation, control, and scroll buttons as well as an acoustic and visual help guide, offering added value for blind and deaf users.

After being informed about the resource, the user can take it to read visually or listen to it acoustically (through earphones) in an area of the library. Another alternative is to borrow the resource. All the information already retrieved by the description device can be combined with the borrower's details, the loan date, and the delivery date, thus completing the record of each loan.
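As an illustration of how the loan record could be assembled, the Python sketch below combines the fields already retrieved for the resource with the borrower's identifier and the loan and delivery dates. The field names and the thirty-day loan period are assumptions made for the example.

    # Sketch of completing a loan: the resource fields come from the description
    # device; the borrower's details and the dates are added at the desk.
    from dataclasses import dataclass
    from datetime import date, timedelta

    @dataclass
    class LoanRecord:
        barcode: str
        title: str
        borrower_id: str
        loan_date: date
        delivery_date: date   # the date the resource is due back

    def create_loan(barcode, title, borrower_id, loan_period_days=30):
        today = date.today()
        return LoanRecord(barcode, title, borrower_id,
                          today, today + timedelta(days=loan_period_days))

    loan = create_loan("9780000000011", "Sample Title", borrower_id="user-042")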

The Speech Synthesizer

A number of commercial and laboratory prototype systems have been presented for text-to-speech (TTS) synthesis. The majority of them are based on one of the three most popular paradigms:

  • Rule-based speech synthesis. 5
  • Speech synthesis based on time-domain techniques. 6
  • Speech synthesis based on articulatory models of the human speech production system.

Each method possesses quite different characteristics that render it more suitable for specific application areas. Where speed of execution is most important, a time-domain technique is the prime candidate. For memory-sensitive application environments, formant-based techniques present a distinct advantage.

Modeling the human speech production system is a demanding task, since the incorporated articulatory models require intense calculations. This fact severely inhibits the implementation of articulatory models into real-world commercial applications.

Time-domain text-to-speech (TD-TTS) conversion relies on a large database of prerecorded natural speech segments that are appropriately concatenated to obtain the speech transcription of arbitrary text. 7 By employing sophisticated algorithms for seaming the discrete segments, one can achieve quite natural synthetic speech. 8 Rule-based text-to-speech (RB-TTS) conversion, on the other hand, models the human speech production system more closely, requiring a more profound examination and a direct modeling of all the phenomena involved. A number of high-quality, state-of-the-art systems based on RB-TTS have been presented, confirming the value of this method.
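The core of the time-domain approach can be illustrated with a short Python sketch: prerecorded segments, represented here as NumPy sample arrays, are joined with a simple linear crossfade. This naive seaming is only a stand-in; real systems use far more sophisticated algorithms, such as pitch-synchronous overlap-add, to hide the joins.

    # Concatenative synthesis in miniature: join stored segments with a short
    # linear crossfade so the splice points are less audible.
    import numpy as np

    def concatenate(segments, fade_samples=64):
        out = segments[0].astype(float)
        ramp = np.linspace(0.0, 1.0, fade_samples)
        for seg in segments[1:]:
            seg = seg.astype(float)
            # blend the tail of the running output with the head of the next segment
            out[-fade_samples:] = out[-fade_samples:] * (1 - ramp) + seg[:fade_samples] * ramp
            out = np.concatenate([out, seg[fade_samples:]])
        return out

    # Two synthetic waveforms stand in for prerecorded speech segments.
    a = np.sin(np.linspace(0, 200, 1000))
    b = np.sin(np.linspace(0, 300, 1200))
    speech = concatenate([a, b])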

Synthetic speech quality, especially naturalness, is largely dependent on the sophistication of the prosodic modeling and the prosodic rules employed. On the other hand, detailed prosodic implementation substantially increases the intelligibility of the system even at the segmental level.

The majority of TTS systems are based on sentence-level prosody, which provides various degrees of intelligibility but hardly any quasi-natural output from a prosodic, and thus phonetic, point of view. The main directions for improving the naturalness of synthetic speech involve studying the synthetic signal quality as well as the prosodic modeling of natural speech; both aspects are the subject of intense research activity. 9

Porting an existing speech synthesizer to a different language is a task requiring language-specific resources. Focusing on the TD-TTS approach, the creation of a high-quality speech synthesizer consists of developing an ensemble of modules for the target language. These may be divided into the areas of linguistic processing and digital signal processing, and are briefly described as follows:

  • text-to-phoneme module: converting written character complexes into phonemes to be dictated;
  • segment database: creating a database of segments (and associated corpus) that covers sufficiently the target language;
  • text decomposer: deriving an algorithm for decomposing text into segments;
  • prosodic modeling: creating a prosody generator for the target language that provides the desired synthetic speech quality;
  • speech corpus: obtaining an adequate corpus of prerecorded utterances, which will provide the basis for defining speech segments in various environments to be concatenated during synthesis;
  • synthesis algorithms: designing the algorithms that join the segments so as to generate the synthetic speech signal;
  • unit selection: providing multiple instances of each segment, with different prosodic properties, in the database to improve the speech quality. An algorithm is then used to select the unit that most closely resembles the prosodic characteristics dictated by the model, thus minimizing the audible mismatches (a minimal illustration follows this list). 10
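The unit-selection step noted in the last item can be sketched as a search for the stored instance whose prosodic properties best match those requested by the prosody model. In the Python sketch below, the cost function and its weights are illustrative assumptions; real systems combine richer target and concatenation costs.

    # Unit selection in miniature: several stored instances of the same segment
    # are compared against the target prosody, and the closest one is chosen.
    from dataclasses import dataclass

    @dataclass
    class UnitInstance:
        segment: str        # e.g., a diphone label
        pitch_hz: float
        duration_ms: float

    def select_unit(instances, target_pitch_hz, target_duration_ms,
                    w_pitch=1.0, w_duration=0.5):
        def cost(u):
            # weighted prosodic mismatch; weights are illustrative
            return (w_pitch * abs(u.pitch_hz - target_pitch_hz)
                    + w_duration * abs(u.duration_ms - target_duration_ms))
        return min(instances, key=cost)

    candidates = [
        UnitInstance("a-n", 110.0, 80.0),
        UnitInstance("a-n", 135.0, 65.0),
        UnitInstance("a-n", 120.0, 95.0),
    ]
    best = select_unit(candidates, target_pitch_hz=125.0, target_duration_ms=70.0)

Minimizing such prosodic mismatches at selection time reduces the audible discontinuities that the concatenation algorithms must then smooth.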

Conclusions

The proposed device is an economical and realizable means of improving the library services offered to users, especially users with special needs. An efficient process both for informing users about resource contents and for lending is accomplished through a device available at the library desk. The design of the device ensures low energy consumption. It must also be stressed that such a device contributes to the preservation of the resources, especially the books, as there is no need to remove them from the shelf for browsing, since the user obtains all the necessary information from the bar code attached to the back of the resource.

References

   1. Jenny Craven, "The Development of Digital Libraries for Blind and Visually Impaired People." Accessed Feb. 27, 2002, www.ariadne.ac.uk/issue30/ifla.

   2. K. Miesenberg, "Future Library Services: Developing Research Skills among Blind Students," in Digital Libraries of the Blind and the Culture of Learning in the Information Age: Conference Proceedings of the IFLA SLB Preconference (Washington, D.C.: IFLA/SLB, 2001).

   3. D. Burger, "BrailleNet: Digital Document Delivery for the Blind in France," in Digital Libraries of the Blind and the Culture of Learning in the Information Age: Conference Proceedings of the IFLA SLB Preconference (Washington, D.C.: IFLA/SLB, 2001).

   4. C. Sung, "The Future of Lifelong Learning in the Next Generation of Library Services," in Digital Libraries of the Blind and the Culture of Learning in the Information Age: Conference Proceedings of the IFLA SLB Preconference (Washington, D.C.: IFLA/SLB, 2001).

   5. A. Conkie and S. Isard, "Optimal Coupling of Diphones," in J. Van Santen et al., eds., Progress in Speech Synthesis (New York: Springer-Verlag, 1997), 279-82.

   6. Conkie and Isard, "Optimal Coupling of Diphones"; E. Moulines and F. Charpentier, "Pitch Synchronous Waveform Processing Techniques for Text-to-Speech Using Diphones," Speech Communication 9, no. 5-6 (autumn 1993): 453-70.

   7. T. Dutoit, An Introduction to Text-to-Speech Synthesis (Dordrecht, Netherlands: Kluwer Academic Pr., 1997).

   8. T. Dutoit and H. Leich, "Text-to-Speech Synthesis Based on an MBE Resynthesis," Speech Communication 13, no. 3-4 (autumn 1993): 435-40; Y. Stylianou, "Harmonic Plus Noise Models for Speech, Combined with Statistical Methods, for Speech and Speaker Modification" (Ph.D. thesis, École Nationale Supérieure des Télécommunications, 1996); Y. Stylianou, "Removing Linear Phase Mismatches in Concatenative Speech Synthesis," IEEE Transactions on Speech and Audio Processing 9, no. 3 (Mar. 2001): 232-39.

   9. E. Keller et al., Improvements in Speech Synthesis: COST 258: The Naturalness of Synthetic Speech (Chichester, England: John Wiley & Sons, 2002).

   10. Conkie and Isard, "Optimal Coupling of Diphones"; M. Founda et al., in Proceedings of the Eurospeech 2001 Conference, vol. 2 (Aalborg, Denmark: Center for PersonKommunikation, 2001), 837-40; E. A. M. Klabbers and R. Veldhuis, "On the Reduction of Concatenation Artifacts in Diphone Synthesis," in Proceedings of the ICSLP 98 Conference, vol. 5 (Sydney, Australia: ASSTA, 1998), 1983-86.


   Markos Dendrinos (mdendr@teiath.gr) is an Assistant Professor in the Department of Library Studies at the Technological Educational Institution of Athens (TEI-A), Greece, and a Researcher in the Speech Technology Department of the Institute for Language and Speech Processing (ILSP), Athens, Greece. Stelios Bakamidis (bakam@ilsp.gr) is the Head of the Speech Technology Department of the Institute for Language and Speech Processing (ILSP), Athens, Greece.