Lesson 2: VoiceXML Background

Lesson goal: Familiarize yourself with VoiceXML

This tutorial contains useful background information about VoiceXML that developers should understand. However, this information is not included in the VoiceXML Certification Test.

Contents

2.1. Motivation of Speech Applications
Goal: Learn why speech applications are useful

2.2. Voice Application Languages
Goal: Learn how VoiceXML relates to other languages for developing speech applications

2.3. The VoiceXML Architecture
Goal: Extend the Web architecture to support telephones and cell phones

2.4. Review Questions

2.1.  Motivation of Speech Applications

Goal: Learn why speech applications are useful

This section explains the benefits of speech applications and why users should speak and listen to a computer via a telephone. This section motivates VoiceXML, and can be skimmed quickly.

People speak and listen to each other. People should be able to speak and listen to a computer.  

People use telephones to speak and listen to others who are not physically present. People should be able to speak and listen to computers that are not physically present.

People can listen to portable devices such as radios and televisions. People can speak into portable devices such as tape recorders. People should be able to speak and listen to handheld devices such as PDAs and pocket computers.

People can listen to computers and respond by pressing the buttons on touchtone telephones and cell phones. However, this “half speech and half button press” style of dialog is not always natural and occasionally results in the caller being “lost in space” while traversing long menu sequences. People should be able to both speak and listen to a computer using a telephone or cell phone.

VoiceXML is a markup language for building speech interfaces; it is the verbal equivalent of HTML. VoiceXML is affecting the speech industry dramatically by changing how developers create speech-enabled Internet applications. Because VoiceXML hides many low-level details, developers create speech-enabled applications by specifying high-level menus and forms rather than writing procedural program code. The resulting savings in programmer time and effort enable developers to perform additional iterations of usability testing and design refinement. In this way, VoiceXML is lowering the entry barrier for creating speech applications.

VoiceXML makes iterative design and testing of speech-enabled applications possible. Developers can quickly mock up designs for evaluation by prospective callers, and then quickly identify and fix trouble spots. Because VoiceXML hides complex programming details, the developer can concentrate on the overall design and on refining the detailed wording of prompts and messages spoken to the caller. VoiceXML does NOT displace the need for user testing; it makes it possible to perform more user testing.

VoiceXML enables callers to speak and listen to a computer…

…despite physical handicaps such as blindness or poor physical dexterity. Speaking enables impaired callers to access computers. Callers with poor physical dexterity (who cannot type) can use speech to enter requests to the computer. Sight-impaired users can listen to the computer as it speaks. When visual and/or mechanical interfaces are not an option, callers can perform transactions by saying what they want done and supplying the appropriate information. If a person with an impairment can speak and listen, that person can use a computer.

…to bypass the limitations of small keyboards and screens. As devices become smaller, our fingers do not. Keys on the keypad shrink—often to the point where people with thick fingers press two or more keys with one finger stroke. The small screens on some cell phones may be difficult to see, especially in extreme lighting conditions. Even PDAs with QWERTY keyboards are awkward. (QWERTY is a sequence of six keys found on traditional keyboards used by most English and Western-European language speakers.) Users hold the device with one hand and “hunt and peck” with the forefinger of the other hand. It is impossible to use both hands to touchtype and hold the device at the same time. By speaking, callers can bypass the keypad (except possibly for entering private data in crowded or noisy environments). By speaking and listening, callers can bypass the small screen of many handheld electronic devices.

…if the device has no keyboard. Many devices have no keypad or keyboard. For example, stoves, refrigerators, and heating and air conditioning thermostats have no keyboards. These appliances may have a small control panel with a couple of buttons and a dial. The physical controls are good for turning the appliance on and off and adjusting its temperature and time. Without speech, a user cannot specify complex instructions such as, “turn the temperature in the oven to 350 degrees for 30 minutes, then change the temperature to 250 degrees for 15 minutes, and finally leave the oven on warm.” Without speech, the appliance cannot ask questions such as, “When on Saturday morning do you turn the heat on?” Any sophisticated dialog with these appliances will require speech input.

…while callers work with their hands and eyes. Speaking and listening are especially useful in situations where the caller’s eyes and/or hands are busy. Drivers need to keep their eyes on the road and their hands on the steering wheel. If they must use a computer when driving, the interface should be speech only. (It is not recommended that you hold and use a cell phone while driving a car.) Machine operators, whose hands must work the controls and whose eyes must stay on the machine, can also use speech to communicate with a computer. Mothers and caregivers with children in their arms may also appreciate speaking and listening to a doctor’s Web page or medical service. If a person can speak and listen to others while they work, they can speak and listen to a computer while they work.

…at any time during the day. Many telephone help lines and receptionists are available only during working hours. Computers can automate much of this activity, such as accepting messages, providing information, and answering callers’ questions. Callers can access these automated services 24 hours a day, 7 days a week via a telephone by speaking and listening to a computer. If a person can speak and listen, they can interact with a computer at any time of the day or night.

…with instant connection without being placed on “hold.” Callers become frustrated when they hear “your call is very important to us” because this message means they must wait. “Thanks for waiting, all of our operators are busy” means more waiting. When using speech to interact with an application, there are no hold times. The computer responds quickly. (Computers can become saturated, which results in delays, but such delays occur less frequently than waits for a human operator.) Because many callers can be serviced by voice-enabled applications, the human operators are freed to resolve more difficult caller problems.

…using languages that do not lend themselves to keyboarding. Some languages do not lend themselves to data entry using the traditional QWERTY keyboard. Rather than force Asian language users to mentally translate their words and phrases to phonetic sounds and then press the corresponding keys on the QWERTY keyboard, a much better solution is to let them speak and listen. Speech and handwriting recognition will be the key to enabling Asian language speakers to gain full use of computers. If a person can speak and listen in an Asian language, they can interact with a computer using that language.

…to convey emotion. To convey emotions in written text, people frequently add emoticons (keyboard symbols that represent emotions) to their text messages. Example emoticons include :) for happy or joke, and :( for sad. With speech, these emotions can be conveyed naturally by changing the inflection, speed, and volume of the speaking voice.

…to access time-sensitive data. Mobile devices (cell phones, handheld computers, and laptop computers) are portable. They can be used anywhere: at home, at work, and on the road. These devices provide the caller with time-sensitive data that is available and up to date when the caller needs it. Examples of time-sensitive data include being notified of a late-breaking news item about a company in which the caller owns stock or being notified that a flight is delayed by three hours. Verbal alerts such as these could replace paging systems.

…to access location-sensitive data. Mobile devices provide callers with location-sensitive data—data available where the caller needs it, such as the changing traffic conditions relevant to callers as they drive home, where a hotel is, or where the nearest hospital is. People who travel or move away from their desk need location-sensitive data. Salesmen, delivery people, and other on-the-move employees can access enterprise data—which includes contact information, order entry, order status, catalog item availability, customer information, and many types of enterprise database information relevant to a customer they are visiting. Enterprise employees no longer will need to assume the office position—sitting in front of their computer keyboards and screens to access enterprise information. With mobile devices, callers have access to location-sensitive data from wherever they are. 

…to access public and personal information. A telephone or cell phone enables consumers to access a variety of public information including driving directions, traffic conditions, weather information, yellow page information, banking, stock quotes, general news, sport scores, horoscopes, and various entertaining activities. Consumers are able to serve themselves by locating the information they need or placing orders for desired goods and services. For example, a caring husband could ask the system to “Send flowers to my wife.” A telephone or cell phone enables consumers to access a variety of personal information including e-mail, business reminders, and medical information whenever the user needs the information.

…to control computerized processes and activities. Callers can remotely control security devices and appliances within their homes. For example, while on vacation, a home owner can call and verify that the stove hot plate is in the off position. Callers can perform transactions such as purchases, payments, and transfers. If callers can speak and listen, they can remotely control a variety of computerized functions.

There are about 1.3 billion telephones in the world and only about a quarter billion PCs.  Many more users can access the World Wide Web using telephones than PCs.

2.2. Voice Application Languages

Goal: Learn how VoiceXML relates to other languages for developing speech applications

The first part of this section summarizes VoiceXML’s pedigree. Its purpose is to emphasize that VoiceXML is based on years of experience with earlier languages. The second part of this section describes the W3C Speech Interface Framework, which summarizes the various languages that make up a VoiceXML application. Developers of speech applications will need to understand and use each of the languages in the W3C Speech Interface Framework.

The genealogy of the VoiceXML language is illustrated below in Figure 2-1. AT&T Bell Labs developed a Phone Markup Language (PML). When AT&T was split up, AT&T and Lucent versions of PML evolved separately. IBM developed a language known as SpeechML. Motorola’s language was VoxML. IBM, AT&T, Lucent, and Motorola formed the VoiceXML Forum and created VoiceXML (briefly known as VXML). The preliminary result was published as VoiceXML 0.9 in August 1999, with Version 1.0 published in March 2000.  VoiceXML 1.0 was widely adopted, primarily because of its similarity to HTML. This enabled programmers to develop speech applications quickly.

Figure 2-1: Genealogy of VoiceXML      

The VoiceXML Forum asked the W3C Voice Browser Working Group to take over language evolution while the Forum concentrated on conformance and educational activities. In 2001 the W3C published drafts of three languages for specifying conversational dialogs in which callers speak and listen to speech-enabled applications: VoiceXML 2.0, Speech Recognition Grammar Specification (SRGS), and Speech Synthesis Markup Language (SSML). In February 2004 the W3C published a draft of a set of eight additional features added to VoiceXML 2.0, called VoiceXML 2.1. No changes are necessary to VoiceXML 2.0 applications to run under VoiceXML 2.1.  
 
Two interesting submissions have been made to the W3C since the publication of VoiceXML 2.0:

  1. IBM, Motorola, and Opera have submitted XHTML + Voice Profiles, frequently referred to as “X+V.” Effectively, this proposal partitions VoiceXML 2.0 into modules which can be inserted into HTML for developing multimodal applications.
  2. The SALT Forum has submitted Speech Application Language Tags (SALT), which specifies a small number of tags for managing speech resources. SALT tags must be embedded within a host language such as HTML, SMIL, or a scripting language which provides flow control. SALT may be used to develop telephony (speech-only) and multimodal applications.

The W3C Voice Browser Working Group has plans to publish additional versions of VoiceXML as the language evolves in response to requests for language extensions from VoiceXML browser vendors and application developers.

The W3C Voice Browser Working Group has separated the elements of VoiceXML 1.0 into five languages, which appear in blue in the W3C Speech Interface Framework in Figure 2-2 below.

Figure 2-2: W3C Speech Interface Framework  

Components of the W3C Speech Interface Framework include the following:

The five languages of the W3C Speech Interface Framework were separated from VoiceXML 1.0 so they could be used independently from each other. For example:

The latest version of each of the five languages in the W3C Speech Interface Framework is available at http://www.w3.org/voice/. The most current specifications are included with this guide.

VoiceXML 2.0 supports two types of programming styles:

  1. Declarative—The developer specifies what the computer should do but does not specify details about how the computer should perform each task. VoiceXML 2.0 uses the declarative programming style for specifying menus and forms. 
  2. Executable—The developer specifies the details of how the computer should perform each task. VoiceXML uses the executable (or procedural) programming style for specifying event handling, post processing of data collected from the user, and other situations where declarative specifications are not appropriate. 
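The contrast between the two styles may be easier to see in a small document. The following is a minimal sketch, not an excerpt from any real application: the form name "order", the field name "quantity", the prompt wording, and the eight-ticket limit are all invented for illustration, and the sketch assumes a platform that supports the optional built-in "number" grammar. The <field> element with its prompts is the declarative part; the <filled> element holds the executable part.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="order">
    <!-- Declarative: state WHAT to collect. The interpreter decides how
         to play the prompt, listen, and repeat the question on errors. -->
    <field name="quantity" type="number">
      <prompt>How many tickets would you like?</prompt>
      <noinput>Sorry, I did not hear you. <reprompt/></noinput>
      <nomatch>Sorry, I did not understand. <reprompt/></nomatch>

      <!-- Executable: step-by-step logic that runs once the field
           has a value. The limit of 8 is a hypothetical business rule. -->
      <filled>
        <if cond="quantity &gt; 8">
          <prompt>Sorry, the limit is eight tickets per call.</prompt>
          <!-- Clearing the field causes the question to be asked again. -->
          <clear namelist="quantity"/>
        <else/>
          <prompt>You asked for <value expr="quantity"/> tickets.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Note that the document never says when to listen or how to loop back after an error; the interpreter supplies that behavior. Only the ticket-limit check, which is specific to this hypothetical application, is written out procedurally.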

Novice programmers like the declarative programming style because they avoid all of the low-level details of specifying how to perform each task. Novice programmers are able to write simple programs quickly. Traditional programmers sometimes feel uncomfortable with declarative programming because it is a different style from the procedural programming with which they are more familiar.

The declarative portion of VoiceXML is easy to learn. After seeing a couple of examples of how to implement menus and forms, novice programmers can write simple speech applications to solicit data from users. The executable (procedural) portion of VoiceXML will seem familiar to most Web programmers, who already know similar concepts widely used in Web programming. However, both novice and experienced Web programmers initially tend to develop speech applications with poor user interfaces because they are not familiar with guidelines for developing usable speech interfaces. See the following books for guidelines and tips for developing speech applications:

Bruce Balentine and David P. Morgan, How to Build a Speech Recognition Application—A Style Guide for Telephony Dialogs, Second Edition, ISBN 0-96712278-2-3, Enterprise Integration Group, Inc., 2410 San Ramon Valley Blvd., Suite 225, San Ramon, CA 94583, to order see http://www.eignic.com

Michael H. Cohen, James P. Giangola, and Jennifer Balogh, Voice User Interface Design, ISBN 0-321-18576-5, Addison Wesley, 2004.

James A. Larson, VoiceXML—Introduction to Developing Speech Applications, ISBN 0-13-009262-2, Prentice Hall, 2003.

Many HTML documents are generated using ColdFusion, Active Server Pages, PHP, and other techniques for decorating database content with the appropriate HTML markup. Many VoiceXML documents are similarly generated on a server, using these same techniques, when the VoiceXML browser requests them. The topic of automatic generation of VoiceXML documents is not covered by the VoiceXML Certification Test and is beyond the scope of this tutorial.

Exercise 2-1

1.  Which of the following languages can be used to develop speech applications?
  • VoiceXML 2.0
  • SALT
  • X+V
2.  Which of the following languages can be used to develop multimodal applications?
  • VoiceXML 2.0
  • SALT
  • X+V
3.  Which of the following languages have been replaced by VoiceXML 2.0?
  • Phone Markup Language (PML)
  • SpeechML
  • VoxML
  • VoiceXML 1.0
  • SALT
  • X+V
Answers Exercise 2-1

2.3. The VoiceXML Architecture

Goal: Extend the Web architecture to support telephones and cell phones

This section describes how a VoiceXML platform works. If you are familiar with how an HTML browser works, then you are already familiar with many of the concepts in this section. If you are not familiar with how an HTML browser works, then read this section carefully.

Figure 2-3 below illustrates the use of a Web server and a visual browser. A visual browser fetches HTML documents from the Web server and interprets them. The documents are displayed on the screen for the user to read. The visual browser accepts mouse clicks and data entered by the user. The visual browser may, following directions from the HTML document it is interpreting, access data in the database server and invoke applications in the application server.

Figure 2-3: Using a Web Server and a Visual Browser

As illustrated in Figure 2-4 below, a voice browser works in the same way as a visual browser. The voice browser fetches VoiceXML 2.0 documents from the Web server and sends them to a VoiceXML 2.0 interpreter, which verbalizes the documents for the caller to hear and accepts voice input from the caller. The voice browser may, under direction from the VoiceXML 2.0 document it is interpreting, access data in the database server and invoke applications in the application server.  

Figure 2-4: Using a Web Server and a VoiceXML 2.0 Browser

There is, however, one major difference between visual and voice browsers. Telephones have no computing ability, so the VoiceXML 2.0 browser and interpreter cannot run on a telephone. Instead, the voice browser resides on another device, called a voice server. The voice server provides voice services including (1) the capture of information from the caller by recording the caller’s voice, (2) the recognition of what the caller says via speech recognition, and (3) the interpretation of touchtone input entered by the caller on the telephone keypad. It also speaks to the caller by replaying prerecorded audio files and producing synthesized speech. All input and output is controlled by a VoiceXML document, which the voice browser fetches from the server. During the interpretation of a VoiceXML document, data may be accessed on the backend database management system and applications within an application server may be invoked. Often, these are the same databases and applications accessed from an HTML browser.
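Each of the voice services described above corresponds to markup that a VoiceXML document can contain. The sketch below is illustrative only: the audio file welcome.wav, the www.example.com URLs, the identifiers "main" and "message", and the variable name "msg" are hypothetical placeholders, and the prompt wording is invented. It shows prerecorded audio, synthesized speech, speech and touchtone input, voice recording, and a request that sends the result back to the Web server.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main">
    <prompt>
      <!-- Prerecorded audio. The enclosed text is synthesized instead
           if the (hypothetical) file cannot be fetched or played. -->
      <audio src="welcome.wav">Welcome.</audio>
      <!-- Plain text in a prompt is rendered as synthesized speech. -->
      For opening hours, say hours or press 1.
      To leave a message, say message or press 2.
    </prompt>
    <!-- Each choice accepts a spoken phrase or a touchtone key.
         The first one fetches another document from the Web server. -->
    <choice dtmf="1" next="http://www.example.com/hours.vxml">hours</choice>
    <choice dtmf="2" next="#message">message</choice>
  </menu>

  <form id="message">
    <!-- Capture the caller's voice as an audio recording. -->
    <record name="msg" beep="true" maxtime="30s">
      <prompt>Please speak after the beep.</prompt>
    </record>
    <block>
      <!-- Send the recording to a server-side application, which is
           expected to reply with the next VoiceXML document. -->
      <submit next="http://www.example.com/save-message" method="post"
              enctype="multipart/form-data" namelist="msg"/>
    </block>
  </form>
</vxml>
```

As with an HTML page, the document only names the resources and the server URLs; the fetching, recognition, recording, and playback are carried out by the voice browser on the voice server.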

Figure 2-4 above also illustrates a gateway that connects one or more telephone lines with the speech server. A gateway converts traffic between the telephone network and the Internet Protocol (IP) network of the Internet, in both directions. A gateway enables users to use a telephone or cell phone to interact with the computer. While the speech server and gateway perform very different functions, they are often combined into a single box. There are at least three options for obtaining the gateway and speech server necessary for an operational voice portal:

  1. Purchase a turnkey gateway and speech server containing the necessary hardware and software. This approach is often used when a company does not have the in-house expertise to design and build the needed application, so a third party is hired to assemble the hardware and software.
  2. Purchase the hardware and software components and assemble the gateway and speech server. This approach is used when a company has the in-house expertise to assemble its own hardware and construct its own software.
  3. Rent a gateway and speech server from a voice service provider. The software development may also be outsourced to outside specialists. This approach is used when a company wants to minimize the risk of developing and building a new system.
Exercise 2-2

Which of the following are benefits of outsourcing?
  • Flexible capacity 
  • Minimal capital investment 
  • Minimal experience
  • Lowers the risk and initial price of entry into voice applications
  • Lowers overall expense
Answer Exercise 2-2

2.4.  Review Questions

Lesson summary. VoiceXML applications enable users to interact with the computer by speaking and listening. Users may access voice applications via a traditional telephone or a cell phone. Cell phones have the advantage of being portable, so voice applications are always available from work, from home, or wherever you are.

VoiceXML makes speech applications easy to write and maintain. The W3C Speech Interface Framework consists of several languages useful in developing speech applications, the most important being VoiceXML 2.0, a language for specifying dialogs between a telephone user and an artificial agent in the computer. Much of VoiceXML 2.0 is a high-level, declarative language in which developers specify “what” rather than “how,” making it easy to write speech applications by avoiding the many details that developers had to specify in earlier languages.

VoiceXML 2.0 is now the de facto language for developing speech applications. Over forty vendors provide speech platforms that support VoiceXML 2.0. Microsoft is the only major exception; its speech platform supports SALT, a collection of speech tags that are embedded into a host language such as HTML or JavaScript.

After completing the exercises and review questions, you should be ready to start learning more about the structure of VoiceXML applications in Lesson 3, Application Structure.







© 2002 Larson Technical Services