Table of Contents

Lesson 1:

XML Background
As its name suggests, VoiceXML is an XML language. If you are not familiar with XML languages, this lesson introduces the major XML concepts that are the basis of the VoiceXML language. XML constructs include elements, attributes, documents, and namespaces. Understanding XML is necessary for understanding VoiceXML. However, this information is not explicitly included in the VoiceXML Application Developer Certification Test. If you already understand XML or an XML language such as XHTML, then skim this lesson quickly and proceed to the next lesson. If you have never worked with an XML language, then review this lesson to prepare for learning VoiceXML.

Lesson 2:

VoiceXML Background
People speak and listen with other people. People should be able to speak and listen with computers. Applications that enable users to speak and listen with the computer have many benefits, which are enumerated at the beginning of this lesson. VoiceXML and its related languages are designed to develop applications that enable users to interact with a computer by using a telephone or cell phone to speak and listen. The W3C Speech Interface Framework currently consists of five separate languages that fit together like pieces of a jigsaw puzzle to specify all aspects of a speech application. VoiceXML, the most important of these languages, is a dialog language that controls the exchange of information between the user and the computer. Just as browsers such as Internet Explorer or Netscape Navigator enable users to interact with information in the World Wide Web by viewing, clicking, and typing, a VoiceXML browser enables users to interact with information by listening and speaking.

Lesson 3:

VoiceXML Application Structure
Just as an HTML application consists of one or more documents containing HTML instructions, a VoiceXML application consists of one or more documents that contain VoiceXML instructions. While HTML documents may contain a variety of interaction techniques, such as pull-down menus, input boxes, and scroll bars, VoiceXML documents contain only two types of dialogs: menus and forms. Menus and forms contain messages called prompts that encourage users to speak answers to specific questions. VoiceXML documents may also contain event handlers to assist users when they do not respond appropriately to verbal prompts.

Lesson 4:

Menus
Menus are one of the two dialogs supported by VoiceXML. Like their visual counterpart in HTML, verbal menus present choices to the user who responds by selecting a choice. VoiceXML menus present a verbal message called a prompt to the user that often enumerates available choices. The user indicates the desired choice by either speaking the choice or by pressing the buttons on a touchtone telephone or cell phone. Developers use one or more VoiceXML menus to enable the user to narrow his or her request by “navigating” to the desired information. This lesson presents the syntax of the VoiceXML <menu> and <choice> elements.
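
As a rough illustration of the description above, a minimal menu might look like the following sketch. It assumes a VoiceXML 2.0 interpreter; the dialog names (main, weather, news) and the prompt wording are made up for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- The menu prompts the caller and lists the choices. -->
  <menu id="main">
    <prompt>Please choose one of the following: <enumerate/></prompt>
    <!-- Each choice can be spoken or selected with a touchtone key. -->
    <choice dtmf="1" next="#weather">weather</choice>
    <choice dtmf="2" next="#news">news</choice>
  </menu>

  <!-- Hypothetical target dialogs that the choices navigate to. -->
  <form id="weather">
    <block>Here is the weather report.</block>
  </form>
  <form id="news">
    <block>Here are the news headlines.</block>
  </form>
</vxml>
```

The <enumerate/> element asks the interpreter to read out the choices, so the prompt does not have to repeat them by hand.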

Lesson 5:

Forms and the Form Interpretation Algorithm
Forms are the second of the two dialogs supported by VoiceXML. Like its paper equivalent, a verbal form contains instructions and blanks into which the user enters information. The instructions are presented verbally to the user, who responds by speaking or pressing the buttons on a touchtone telephone or cell phone. Just as a paper form may contain multiple blank slots, a verbal form may contain multiple slots called fields. A special algorithm called the Form Interpretation Algorithm determines the order in which the user is prompted to enter information into the fields of a VoiceXML form. Developers use VoiceXML forms to capture information. This lesson presents the syntax of the <form> element and describes how the Form Interpretation Algorithm works.
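
The following sketch shows a form with two fields, assuming a VoiceXML 2.0 interpreter that supports the built-in "number" and "boolean" field types (platform support for built-in types is optional). The form and field names are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="survey">
    <!-- Visited first: its variable "age" is still undefined. -->
    <field name="age" type="number">
      <prompt>How old are you?</prompt>
    </field>
    <!-- Visited next, once "age" has been filled. -->
    <field name="subscribe" type="boolean">
      <prompt>Would you like to subscribe to our newsletter?</prompt>
    </field>
    <!-- Runs after both fields are filled. -->
    <block>
      <prompt>Thank you. You said your age is <value expr="age"/>.</prompt>
    </block>
  </form>
</vxml>
```

In this simple case the algorithm visits the form items in document order, each time selecting the first item whose variable has not yet been set.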

Lesson 6:

Input Form Items—<field> and <record> Elements
While a paper form contains only blank slots into which the user writes information, a VoiceXML form contains a variety of input form items. This lesson describes two of these input form items: <field> and <record>. A <field> element contains verbal prompts that encourage the user to enter information by speaking or pressing the buttons on a touchtone telephone or cell phone. If the user speaks, a speech recognition engine converts the speech into text, which is placed into a variable associated with the VoiceXML <field> element. A <record> element contains verbal prompts that encourage the user to speak. The recorded speech is placed into a variable associated with the VoiceXML <record> element. This lesson presents the syntax of the <field> and <record> elements.

Lesson 7:

Executable Content and Navigation
VoiceXML is really two languages integrated into one. One language is declarative and specifies how a VoiceXML interpreter should process the VoiceXML <form> and <menu> elements. The Form Interpretation Algorithm (FIA) provides the procedural processing for these declarative specifications. The other language is a set of procedural elements for declaring, setting, and testing variables. The procedural language is integrated into VoiceXML so developers do not need to use a separate scripting language for simple procedural controls. While the FIA determines the sequence for visiting elements within a <form> element, additional elements are needed to specify control flow among <form> and <menu> elements. This lesson describes the places where procedural elements can be used: the <block> and <filled> elements and event handlers. (Lesson 8 will describe the procedural elements.) This lesson also describes navigation control among <form> and <menu> elements.

Lesson 8:

Procedural Elements
VoiceXML language designers created procedural elements in VoiceXML so application developers do not need to use a separate scripting language for simple procedural control. This lesson describes the various procedural elements that make VoiceXML behave like a traditional programming language, even though the Form Interpretation Algorithm makes the <menu> and <form> elements appear declarative rather than procedural.
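
A small sketch of procedural elements inside a <filled> element follows. It assumes a VoiceXML 2.0 interpreter with the optional built-in "boolean" field type; the form names and prompts are illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="confirm">
    <field name="wantsNews" type="boolean">
      <prompt>Do you want to hear the news?</prompt>
      <!-- Executable content runs once the field has a value. -->
      <filled>
        <if cond="wantsNews">
          <!-- Navigate to another dialog in the same document. -->
          <goto next="#news"/>
        <else/>
          <prompt>Goodbye.</prompt>
          <exit/>
        </if>
      </filled>
    </field>
  </form>

  <form id="news">
    <block>Here are the news headlines.</block>
  </form>
</vxml>
```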

Lesson 9:

Form Items—<object>, <subdialog> and <transfer> Elements
Lesson 6 described two of the input form items, which are elements within a <form> element. This lesson describes three additional form items. These three elements enable the reuse of existing code within multiple VoiceXML applications. The <subdialog> element enables the user to interact with a module written using the VoiceXML language, while the <object> element enables the user to interact with a module written in other languages. The <transfer> element enables the user to use the telephone system to connect with another user on a telephone or cell phone.

Lesson 10:

Variables
Several types of variables are available to VoiceXML applications. Session variables are global, read-only variables that provide general information about a VoiceXML session. Application variables contain information about the most recent results from the speech recognition engine. The VoiceXML elements <var> and <value> support the creation of and access to programmer-specified VoiceXML variables. ECMAScript is the standardized form of JavaScript and can be embedded into VoiceXML applications. Like JavaScript, ECMAScript supports the creation of and access to variables. A speech application developer can create variables using ECMAScript and the VoiceXML <var> element. The speech application developer may access any of these variables, as well as session variables and application variables.
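
The sketch below declares one variable with <var> and one with an embedded script, then reads both with <value>. It assumes a VoiceXML 2.0 interpreter; the variable names are hypothetical, and the session variable shown may be undefined on some platforms.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Document-scope variable declared with a VoiceXML element. -->
  <var name="attempts" expr="0"/>
  <!-- Document-scope variable declared in an ECMAScript script. -->
  <script>
    var greeting = "Welcome";
  </script>

  <form id="start">
    <block>
      <assign name="attempts" expr="attempts + 1"/>
      <!-- Read-only session variable, written to the platform log. -->
      <log>Caller: <value expr="session.connection.remote.uri"/></log>
      <prompt>
        <value expr="greeting"/>. The counter is now <value expr="attempts"/>.
      </prompt>
    </block>
  </form>
</vxml>
```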

Lesson 11:

Events
An event occurs as a result of a user or system action causing a situation that requires action by the user or the application. Event handlers respond to events when they occur. While there are default event handlers for most events, speech application developers frequently write event handlers that override the default event handlers, especially to provide just-in-time help to users who fail to respond properly to verbal prompts.
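
As an illustration, the field below overrides the default handlers for the noinput, nomatch, and help events. This is a sketch that assumes a VoiceXML 2.0 interpreter; the city names and prompt wording are made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="ask_city">
    <field name="city">
      <prompt>Which city are you flying to?</prompt>
      <grammar type="application/srgs+xml" version="1.0" root="city">
        <rule id="city">
          <one-of>
            <item>Boston</item>
            <item>Chicago</item>
          </one-of>
        </rule>
      </grammar>
      <!-- The caller said nothing before the timeout. -->
      <noinput>I did not hear anything. <reprompt/></noinput>
      <!-- The caller said something the grammar does not cover. -->
      <nomatch>I did not understand. <reprompt/></nomatch>
      <!-- From the second nomatch on, give more specific help. -->
      <nomatch count="2">Please say Boston or Chicago.</nomatch>
      <!-- The caller asked for help. -->
      <help>Say the name of your destination city, for example Boston.</help>
    </field>
  </form>
</vxml>
```

The count attribute lets the handler escalate: the more often the caller fails, the more detailed the guidance becomes.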

Lesson 12:

Resource Management
Resources stored on remote servers may include audio files, grammars, scripts, objects, and VoiceXML documents. These resources must be fetched into a cache on the speech server before the VoiceXML interpreter can access them. Speech application developers provide hints to the cache management system by specifying when to fetch resources, what to do if the fetch fails, what to present to a waiting user while resources are being fetched, and whether to use expired resources or refetch them into the cache. This lesson describes the VoiceXML syntax that a speech application developer uses to specify these hints.
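
The following sketch shows some of these hints expressed as VoiceXML 2.0 attributes and properties. The file names (hold-music.wav, welcome.wav, next.vxml) and the timing values are assumptions chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Audio to play to the caller while a document is being fetched. -->
  <property name="fetchaudio" value="hold-music.wav"/>

  <!-- What to do if a fetch fails. -->
  <catch event="error.badfetch">
    <prompt>Sorry, that information is unavailable right now.</prompt>
    <exit/>
  </catch>

  <form id="welcome">
    <block>
      <!-- Fetch early, wait at most 5 seconds, accept a cached copy
           up to one hour old, and never use an expired copy. -->
      <audio src="welcome.wav" fetchhint="prefetch" fetchtimeout="5s"
             maxage="3600" maxstale="0">Welcome.</audio>
      <goto next="next.vxml" fetchtimeout="10s"/>
    </block>
  </form>
</vxml>
```

The text inside the <audio> element serves as a spoken fallback if the audio file cannot be played.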

Lesson 13:

Properties
Properties are variables that represent the state and control of processes such as the speech recognition engine, DTMF recognition engine, prompt manager, and the resource manager. Speech application developers use the <property> element to set and access selected properties. Some properties can be overridden by attributes of selected VoiceXML elements.

Lesson 14:

Introduction to Grammars
The speech application needs to inform the speech recognition engine what it should listen for, specifically what individual words the user may speak, and how those words may appear in spoken word patterns. A grammar is a set of rules used to describe words and phrases that users may speak in response to a prompt. Grammars enable the speech recognition engine to recognize words relevant to the current point in the dialog, improve the response time and recognition accuracy of the speech recognition engine, and recognize when a user fails to answer a prompt. This lesson describes the Speech Recognition Grammar Specification (SRGS) language.
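
A small grammar in the XML form of SRGS might look like the sketch below. The rule name and vocabulary are hypothetical; the grammar accepts phrases such as "coffee", "a tea", or "orange juice please".

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="drink">
  <rule id="drink" scope="public">
    <!-- Optional leading word. -->
    <item repeat="0-1">a</item>
    <!-- Exactly one of these alternatives must be spoken. -->
    <one-of>
      <item>coffee</item>
      <item>tea</item>
      <item>orange juice</item>
    </one-of>
    <!-- Optional trailing word. -->
    <item repeat="0-1">please</item>
  </rule>
</grammar>
```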

Lesson 15:

Using Grammars in VoiceXML
While Lesson 14 described the syntax of the Speech Recognition Grammar Specification (SRGS) language, this lesson describes how SRGS can be embedded into VoiceXML. The scope of a grammar refers to the places in the VoiceXML code where the grammar is active. The speech application developer needs to understand scoping rules in order to understand where one or more grammars are active and when a grammar overrides another grammar.

Lesson 16:

Writing Complex Grammars
Large complex grammars can be partitioned into smaller grammars which can be systematically combined. A grammar and grammar fragments can be shared among several VoiceXML applications by storing them separately from the applications and referencing them from within each application. Grammars may contain semantic interpretation scripts that extract and translate words recognized by the speech recognition engine into structures appropriate for later processing.

Lesson 17:

Speech Synthesis Markup Language
Speech application developers can override the structure, wording, pronunciation, and prosody of synthesized speech prompts produced with a speech synthesis engine by embedding Speech Synthesis Markup Language (SSML) elements directly into the <prompt> element of VoiceXML. Speech application developers can also replace synthesized speech with prerecorded audio files. SSML may also be used in a stand-alone mode to create audio books and other applications independent of VoiceXML.
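
As a sketch of SSML embedded in a prompt, the example below uses a few common SSML elements. It assumes a VoiceXML 2.0 interpreter; the audio file name (thanks.wav) and the wording are made up, and how each element is rendered depends on the speech synthesis engine.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="balance">
    <block>
      <prompt>
        Your balance is <emphasis>low</emphasis>.
        <!-- Pause for half a second. -->
        <break time="500ms"/>
        <!-- Speak the next sentence more slowly. -->
        <prosody rate="slow">Please make a deposit soon.</prosody>
        <!-- Prerecorded audio, with synthesized text as a fallback. -->
        <audio src="thanks.wav">Thank you for calling.</audio>
      </prompt>
    </block>
  </form>
</vxml>
```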

Lesson 18:

Introduction to Semantic Interpretation
Semantic Interpretation is a scripting language based on JavaScript. Speech application developers embed Semantic Interpretation scripts into grammars to manipulate the results from the speech recognition engine. Semantic Interpretation scripts may extract and translate words and phrases returned by the speech recognition engine into structures useful to the VoiceXML application.
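
The sketch below attaches a script to each alternative of a grammar so that a spoken city name is translated into an airport code. It assumes the tag syntax of the final Semantic Interpretation for Speech Recognition 1.0 specification (tag-format "semantics/1.0" and the out variable); earlier drafts and some platforms use a different syntax. The rule name and vocabulary are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="airport"
         tag-format="semantics/1.0">
  <rule id="airport" scope="public">
    <one-of>
      <!-- Each tag sets the value returned to the application. -->
      <item>Los Angeles <tag>out = "LAX";</tag></item>
      <item>San Francisco <tag>out = "SFO";</tag></item>
      <item>Boston <tag>out = "BOS";</tag></item>
    </one-of>
  </rule>
</grammar>
```

With this grammar the application would receive "LAX" rather than the words "Los Angeles", which is usually easier to process later.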

Lesson 19:

Semantic Interpretation—Towards Natural Language Understanding
Natural language processing is still a research topic undergoing development and refinement. However, a limited form of natural language processing is possible in VoiceXML by embedding natural language processing algorithms as scripts into grammars. Specifically, speech application developers use semantic interpretation scripts to extract and transform words produced by the speech recognition engine into ECMAScript objects suitable for natural language processing algorithms. Some semantic interpretation scripts may contain basic natural language processing. This lesson also describes how to serialize (transform) ECMAScript objects into XML format for processing by XML languages.

Lesson 20:

Creating Voice User Interfaces for Novice Callers
Different classes of users require very different styles of dialog for speaking and listening to VoiceXML applications. Novice users require a system-directed style user interface which prompts the user by asking questions to which the user responds. This enables novice users to perform application tasks with little prior knowledge about the application. Guidelines for formulating system-directed speech user interfaces are presented.

Lesson 21:

Creating Voice User Interfaces for Average Callers
Different classes of users require very different styles of dialog for speaking and listening to VoiceXML applications. Average users have some understanding of the application and may complain that system-directed dialogs are too time-consuming, especially when users must listen to prompt messages that they have previously heard many times. This lesson describes how to enable barge-in so that users can speak the answer to a prompt before hearing the complete prompt, thus accelerating the system-directed user interface.

Lesson 22:

Creating Voice User Interfaces for Experienced Callers
Different classes of users require very different styles of dialog for speaking and listening to VoiceXML applications. Experienced users need shortcuts and advanced features that enable them to efficiently and effectively complete their tasks. These enhancements effectively change the user interface from system-directed to mixed-initiative in which the user occasionally takes control of the dialog by offering information even before the system asks for it. These advanced speech user interfaces offer switching among multiple active tasks, speaking the values for multiple fields in a single speech utterance, and even speaking values in out-of-sequence order.

Lesson 23:

Introduction to CCXML
VoiceXML has a limited capability to answer and terminate telephone calls. The Call Control XML Language (CCXML) is a language for managing telephone calls (incoming calls, outgoing calls, and call management including holding, transferring, and disconnecting) and conference calls. This lesson briefly describes the terminology of telephone calls and introduces the CCXML language for responding to events from the telephone system, from VoiceXML, and from the CCXML application itself.

Lesson 24:

CCXML and VoiceXML
A CCXML application has no end-user interface. It relies upon VoiceXML to solicit information from and present information to the user. A CCXML application can initiate and stop a VoiceXML application. CCXML can also be used to process incoming calls and initiate outgoing calls.
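
A minimal sketch of a CCXML document that answers an incoming call and starts a VoiceXML dialog follows. It assumes the element and event names of the CCXML 1.0 specification, which may differ from earlier drafts; the document name hello.vxml is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <!-- An incoming call is ringing: answer it. -->
    <transition event="connection.alerting">
      <accept/>
    </transition>
    <!-- The call is connected: start the VoiceXML dialog. -->
    <transition event="connection.connected">
      <dialogstart src="'hello.vxml'"/>
    </transition>
    <!-- The VoiceXML dialog has finished: end the CCXML session. -->
    <transition event="dialog.exit">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```

Note that the src attribute holds an ECMAScript expression, which is why the document name is wrapped in single quotes.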

Lesson 25:

CCXML Conferencing
In addition to managing connections between callers and VoiceXML, CCXML also manages conferences involving two or more users. This lesson describes how to create and tear down conference calls.

Lesson 26:

ECMAScript
ECMAScript is a version of JavaScript designed to work efficiently in a client computer with limited computing resources. Speech application developers embed ECMAScript scripts directly into VoiceXML within the <script> element. ECMAScript is also used with several VoiceXML elements that perform conditional testing. Finally, ECMAScript is the basis for the Semantic Interpretation language. The lesson is not a complete tutorial on how to write ECMAScript code. Instead, it lists the frequently used features of ECMAScript.

Lesson 27:

VoiceXML 2.1
VoiceXML 2.1 is the result of identifying eight new features that individual VoiceXML vendors have implemented. The VoiceXML 2.1 working draft is still new, so many vendors have not yet had a chance to implement all eight features. VoiceXML 2.0 applications should work without modification under VoiceXML 2.1.