Lesson 1: |
XML Background
As its name suggests, VoiceXML is an XML language. If you are
not familiar with XML languages, this lesson introduces
the major XML concepts that are the basis of the VoiceXML
language. XML constructs include elements, attributes, documents,
and namespaces. Understanding XML is necessary for understanding
VoiceXML. However, this information is not explicitly included in the VoiceXML
Application Developer Certification Test. If you already understand XML or
an XML language such as XHTML, then skim this lesson quickly and proceed to
the next lesson. If you have never worked with an XML language, then review
this lesson to prepare for learning VoiceXML. |
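As a minimal sketch of the constructs named in Lesson 1, the small VoiceXML document below labels a document, its elements, an attribute, and a namespace declaration. The `id` value and the spoken text are illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A document has exactly one root element; here it is <vxml>. -->
<!-- version is an attribute; xmlns declares the default namespace
     for <vxml> and for every element nested inside it. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- <form> is a child element with one attribute, id. -->
  <form id="greeting">
    <!-- Text between a start tag and its end tag is element content. -->
    <block>Hello, world.</block>
  </form>
</vxml>
```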
Lesson 2: |
VoiceXML Background
People speak with and listen to other people, and they should be able to do the same
with computers. Applications that enable users to speak with and listen to
the computer have many benefits, which are enumerated at the beginning of this
lesson. VoiceXML and its related languages are designed for developing applications
that enable users to interact with a computer by using a telephone or cell phone
to speak and listen. The W3C Speech Interface Framework currently consists of
five separate languages that fit together like pieces of a jigsaw puzzle to specify
all aspects of a speech application. VoiceXML, the most important of these languages,
is a dialog language that controls the exchange of information between the user
and the computer. Just as Internet Explorer or Netscape Navigator browsers enable
users to interact with information in the World Wide Web by viewing, clicking,
and typing, a VoiceXML browser enables users to interact with information by
listening and speaking. |
Lesson 3: |
VoiceXML Application Structure
Just as an HTML application consists of one or more documents
containing HTML instructions, a VoiceXML application consists of
one or more documents that contain VoiceXML instructions. While HTML
documents may contain a variety of dialogs (interaction techniques
such as pull-down menus, input boxes, and scroll bars), VoiceXML documents
contain only two types of dialogs: menus and forms. Menus and forms
contain messages called prompts that encourage users to speak answers
to specific questions. VoiceXML documents also may contain event
handlers to assist users when they do not respond appropriately to
verbal prompts. |
Lesson 4: |
Menus
Menus are one of the two dialogs supported by VoiceXML. Like
their visual counterparts in HTML, verbal menus present choices to the
user, who responds by selecting a choice. VoiceXML menus present a verbal
message called a prompt to the user that often enumerates available
choices. The user indicates the desired choice by either speaking the
choice or by pressing the buttons on a touchtone telephone or cell
phone. Developers use one or more VoiceXML menus to enable the user
to narrow his or her request by “navigating” to the desired
information. This lesson presents the syntax of the VoiceXML <menu> and <choice> elements. |
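A minimal sketch of a menu as described in Lesson 4, assuming a VoiceXML 2.0 interpreter. The dialog names (`main`, `weather`, `news`) and the spoken text are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main">
    <!-- <enumerate/> speaks the text of each <choice> in turn. -->
    <prompt>Welcome. Say one of: <enumerate/></prompt>
    <!-- A choice is selected by speaking its text or pressing its key. -->
    <choice dtmf="1" next="#weather">weather</choice>
    <choice dtmf="2" next="#news">news</choice>
  </menu>

  <form id="weather">
    <block>Today is sunny.</block>
  </form>
  <form id="news">
    <block>There is no news today.</block>
  </form>
</vxml>
```

Each `next` value names the dialog to visit, so a chain of menus lets the caller navigate toward the desired information.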
Lesson 5: |
Forms and the Form Interpretation Algorithm
Forms are the second of the two dialogs supported by VoiceXML.
Like its paper equivalent, a verbal form contains instructions
and blanks into which the user enters information. The instructions
are presented verbally to the user, who responds by speaking or pressing
the buttons on a touchtone telephone or cell phone. Just as a paper
form may contain multiple blank slots, a verbal form may contain
multiple slots called fields. A special algorithm called the Form
Interpretation Algorithm determines the order in which the user is
prompted to enter information into the fields of a VoiceXML form.
Developers use VoiceXML forms to capture information. This lesson
presents the syntax of the <form> element and describes
how the Form Interpretation Algorithm works. |
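The form below is a simplified sketch of how the algorithm orders prompts in the default case: it repeatedly visits the first form item, in document order, whose variable is still undefined. The field names and prompt wording are hypothetical, and the built-in `number` and `boolean` field types are assumed to be available on the platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="order">
    <!-- Visited first: its variable is still undefined. -->
    <field name="quantity" type="number">
      <prompt>How many tickets would you like?</prompt>
    </field>
    <!-- Visited second, once quantity holds a value. -->
    <field name="confirmed" type="boolean">
      <prompt>
        You asked for <value expr="quantity"/> tickets. Is that right?
      </prompt>
    </field>
    <!-- Runs once both fields hold a value. Afterwards no unfilled
         item remains, so the algorithm leaves the form. -->
    <filled mode="all" namelist="quantity confirmed">
      <prompt>Thank you.</prompt>
    </filled>
  </form>
</vxml>
```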
Lesson 6: |
Input Form Items—<field> and <record> Elements
While a paper form contains only blank slots into which
the user writes information, a VoiceXML form contains a variety of
input form items. This lesson describes two of these input form items—<field> and <record>. A <field> element
contains verbal prompts that encourage the user to enter information
by speaking or pressing the buttons on a touchtone telephone or cell
phone. If the user speaks, a speech recognition engine converts the
speech into text which is placed into a variable associated with
the VoiceXML <field> element. A <record> element
contains verbal prompts that encourage the user to speak. The recorded
speech is placed into a variable associated with the VoiceXML <record> element.
This lesson presents the syntax of the <field> and <record> elements. |
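A short sketch contrasting the two input form items from Lesson 6. The item names, the time limits, and the `audio/x-wav` format are illustrative choices; which recording formats are supported depends on the platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="voicemail">
    <!-- <field>: recognized speech or key presses are stored as text. -->
    <field name="mailbox" type="digits">
      <prompt>Which mailbox number are you calling?</prompt>
    </field>
    <!-- <record>: the caller's audio itself is stored in the variable. -->
    <record name="message" beep="true" maxtime="30s" finalsilence="3s"
            dtmfterm="true" type="audio/x-wav">
      <prompt>After the beep, record your message.</prompt>
      <filled>
        <prompt>You said: <audio expr="message"/></prompt>
      </filled>
    </record>
  </form>
</vxml>
```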
Lesson 7: |
Executable Content and Navigation
VoiceXML is really two languages integrated into one. One
language is declarative and specifies how a VoiceXML interpreter
should process the VoiceXML <form> and <menu> elements.
The Form Interpretation Algorithm (FIA) provides the procedural processing
for these declarative specifications. The other language is a set
of procedural elements for declaring, setting, and testing variables.
The procedural language is integrated into VoiceXML so developers
do not need to use a separate scripting language for simple procedural
controls. While the FIA determines the sequence for visiting elements
within a <form> element, additional elements
are needed to specify control flow among <form> and <menu> elements.
This lesson describes the places where procedural elements can be
used: the <block> and <filled> elements
and event handlers. (Lesson 8 will describe the procedural elements.)
This lesson also describes navigation control among <form> and <menu> elements. |
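The sketch below shows the two places inside a form where procedural elements can appear, `<block>` and `<filled>`, together with `<goto>` for navigation between dialogs. Dialog and field names are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="start">
    <!-- <block> holds executable content that runs when visited. -->
    <block>Welcome to the help line.</block>
    <field name="wantsAgent" type="boolean">
      <prompt>Would you like to speak to an agent?</prompt>
      <!-- <filled> runs after the field receives a value. -->
      <filled>
        <if cond="wantsAgent">
          <goto next="#agent"/>
        <else/>
          <goto next="#goodbye"/>
        </if>
      </filled>
    </field>
  </form>
  <form id="agent"><block>Please hold.</block></form>
  <form id="goodbye"><block>Goodbye.</block></form>
</vxml>
```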
Lesson 8: |
Procedural Elements
VoiceXML language designers created procedural elements
in VoiceXML so application developers do not need to use a separate
scripting language for simple procedural control. This lesson describes
the various procedural elements that make VoiceXML behave like a
traditional programming language, even though the Form Interpretation
Algorithm makes the <menu> and <form> elements
appear declarative rather than procedural. |
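As an illustration of the procedural elements, this sketch declares a counter with `<var>`, updates it with `<assign>`, and branches with `<if>`, `<elseif>`, and `<else>`. The three-attempt limit and all names are assumptions made for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="login">
    <!-- Declare a form-level variable and give it a starting value. -->
    <var name="attempts" expr="0"/>
    <field name="pin" type="digits?length=4">
      <prompt>Please say your four digit PIN.</prompt>
      <nomatch>
        <assign name="attempts" expr="attempts + 1"/>
        <if cond="attempts &gt;= 3">
          <prompt>Too many attempts. Goodbye.</prompt>
          <exit/>
        <elseif cond="attempts == 2"/>
          <prompt>One attempt remains.</prompt>
          <reprompt/>
        <else/>
          <prompt>That did not sound like four digits.</prompt>
          <reprompt/>
        </if>
      </nomatch>
    </field>
  </form>
</vxml>
```

Conditions are ECMAScript expressions, so `>=` must be written as `&gt;=` inside the XML attribute.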
Lesson 9: |
Form Items—<object>, <subdialog> and <transfer> Elements
Lesson 6 described two of the input form items, which are elements
within a <form> element. This lesson describes three additional
input form items. These three elements enable the reuse of existing
code within multiple VoiceXML applications. The <subdialog> element
enables the user to interact with a module written using the VoiceXML
language, while the <object> element enables the user
to interact with a module written in other languages. The <transfer> element
uses the telephone system to connect the user with another party on
a telephone or cell phone. |
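A hedged sketch of two of these form items. The document `login.vxml` and the telephone number are hypothetical; the subdialog is assumed to end with a `<return>` element. The `<object>` element is omitted because its attributes depend on the platform-specific module being called.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="support">
    <!-- Reuse a dialog written in VoiceXML and stored separately. -->
    <subdialog name="auth" src="login.vxml">
      <filled>
        <prompt>Welcome back.</prompt>
      </filled>
    </subdialog>
    <!-- Connect the caller to another party and wait for the outcome. -->
    <transfer name="call" dest="tel:+15555550100" bridge="true"
              connecttimeout="20s">
      <prompt>Connecting you to an agent.</prompt>
      <filled>
        <if cond="call == 'busy'">
          <prompt>The line is busy.</prompt>
        <elseif cond="call == 'noanswer'"/>
          <prompt>Nobody answered.</prompt>
        </if>
      </filled>
    </transfer>
  </form>
</vxml>
```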
Lesson 10: |
Variables
Several types of variables are available to VoiceXML
applications. Session variables are global, read-only variables
that provide general information about a VoiceXML session. Application
variables contain information about the most recent results
from the speech recognition engine. The VoiceXML elements <var> and <value> support
the creation of and access to programmer-specified VoiceXML variables.
ECMAScript is the standardized form of JavaScript and can be embedded into
VoiceXML applications. Like JavaScript, ECMAScript also supports
the creation of and access to variables. A speech application developer
can create variables using ECMAScript and the VoiceXML <var> element.
The speech application developer may access any of these variables,
as well as access session variables and application variables.
|
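The sketch below touches each kind of variable from Lesson 10: one created with `<var>`, one created in ECMAScript, the standard session variable `session.connection.remote.uri`, and the application variable `application.lastresult$`. The variable names `greeting` and `visits` are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="lookup">
    <!-- Created with the VoiceXML <var> element. -->
    <var name="greeting" expr="'Hello'"/>
    <!-- Created with ECMAScript. -->
    <script>var visits = 1;</script>
    <field name="zip" type="digits">
      <prompt><value expr="greeting"/>. What is your zip code?</prompt>
      <filled>
        <!-- Application variable: the most recent recognition result. -->
        <prompt>
          I heard <value expr="application.lastresult$.utterance"/>.
          This is visit number <value expr="visits"/>.
        </prompt>
        <!-- Session variable: read-only information about the call. -->
        <log>caller: <value expr="session.connection.remote.uri"/></log>
      </filled>
    </field>
  </form>
</vxml>
```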
Lesson 11: |
Events
An event occurs as a result of a user or system action causing
a situation that requires action by the user or the application. Event
handlers respond to events when they occur. While there are default
event handlers for most events, speech application developers frequently
write event handlers that override the default event handlers, especially
to provide just-in-time help to users who fail to respond properly
to verbal prompts. |
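A sketch of event handlers that override the defaults to give progressively more help. The `count` thresholds and the application-defined event name `com.example.giveup` are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="account">
    <field name="number" type="digits">
      <prompt>What is your account number?</prompt>
      <!-- Runs when the caller asks for help. -->
      <help>Your account number is printed on your statement.</help>
      <!-- count selects a handler by how often the event has occurred. -->
      <noinput count="1">Please say your account number.</noinput>
      <noinput count="2">You can also key in the number.</noinput>
      <!-- On the third recognition failure, raise a custom event. -->
      <nomatch count="3"><throw event="com.example.giveup"/></nomatch>
    </field>
    <!-- Form-level handler for the custom event. -->
    <catch event="com.example.giveup">
      <prompt>Sorry, I could not understand you. Goodbye.</prompt>
      <exit/>
    </catch>
  </form>
</vxml>
```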
Lesson 12: |
Resource Management
Resources stored on remote servers may include audio files,
grammars, scripts, objects, and VoiceXML documents. These resources
must be fetched into a cache on the speech server before the VoiceXML
interpreter can access them. Speech application developers provide
hints to the cache management system by specifying when to fetch
resources, what to do if the fetch fails, what to present to a
waiting user while resources are being fetched, and whether to
use expired resources or refetch them into the cache. This lesson
describes the VoiceXML syntax so a speech application developer
can specify these hints. |
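The fetching hints described in Lesson 12 map onto attributes such as `fetchhint`, `fetchtimeout`, `fetchaudio`, `maxage`, and `maxstale`. In this sketch the file names and the numeric values are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="report">
    <block>
      <!-- fetchhint: when to fetch; fetchtimeout: how long to wait;
           maxage and maxstale (seconds): whether a cached copy may be
           used. The enclosed text is spoken if the audio is unavailable. -->
      <audio src="intro.wav" fetchhint="prefetch" fetchtimeout="5s"
             maxage="3600" maxstale="0">
        Welcome to the daily report.
      </audio>
      <!-- fetchaudio: what the caller hears while the next document loads. -->
      <goto next="report.vxml" fetchaudio="hold-music.wav"
            fetchtimeout="10s"/>
    </block>
  </form>
</vxml>
```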
Lesson 13: |
Properties
Properties are variables that represent the state and control
of processes such as the speech recognition engine, DTMF recognition
engine, prompt manager, and the resource manager. Speech application
developers use the <property> element to set and access
selected properties. Some properties can be overridden by attributes
of selected VoiceXML elements. |
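A brief sketch of setting properties at different levels and overriding them with attributes. The specific values are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document-level settings apply to every dialog below. -->
  <property name="timeout" value="5s"/>
  <property name="confidencelevel" value="0.6"/>

  <form id="ask">
    <!-- A form-level setting overrides the inherited value. -->
    <property name="bargein" value="false"/>
    <field name="ok" type="boolean">
      <!-- Attributes on <prompt> override both properties here. -->
      <prompt bargein="true" timeout="8s">Shall I continue?</prompt>
    </field>
  </form>
</vxml>
```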
Lesson 14: |
Introduction to Grammars
The speech application needs to inform the speech recognition
engine what it should listen for, specifically what individual words
the user may speak, and how those words may appear in spoken word
patterns. A grammar is a set of rules used to describe words and
phrases that users may speak in response to a prompt. Grammars enable
the speech recognition engine to recognize words relevant to the
current point in the dialog, improve the response time and recognition
accuracy of the speech recognition engine, and recognize when a user
fails to answer a prompt. This lesson describes the Speech Recognition
Grammar Specification (SRGS) language. |
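A minimal SRGS grammar in XML form, sketched to show words and the patterns in which they may appear. The rule name and vocabulary are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="drink">
  <rule id="drink" scope="public">
    <!-- Optional leading word. -->
    <item repeat="0-1">a</item>
    <!-- Exactly one of the alternatives must be spoken. -->
    <one-of>
      <item>coffee</item>
      <item>tea</item>
      <item>orange juice</item>
    </one-of>
    <!-- Optional trailing word. -->
    <item repeat="0-1">please</item>
  </rule>
</grammar>
```

This grammar accepts utterances such as "tea", "a coffee", and "orange juice please"; anything else is reported as a failure to match.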
Lesson 15: |
Using Grammars in VoiceXML
While Lesson 14 described the syntax of the Speech Recognition
Grammar Specification (SRGS) language, this lesson describes how SRGS
can be embedded into VoiceXML. The scope of a grammar refers to the
places in the VoiceXML code where the grammar is active. The speech
application developer needs to understand scoping rules in order to
understand where one or more grammars are active and when a grammar
overrides another grammar. |
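To illustrate scope, this sketch places one grammar at document level inside a `<link>` and another inside a field. The external file `drink.grxml` is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document scope: active in every dialog of this document. -->
  <link next="#main">
    <grammar version="1.0" root="home" mode="voice" xml:lang="en-US">
      <rule id="home" scope="public">main menu</rule>
    </grammar>
  </link>

  <form id="main">
    <field name="drink">
      <!-- Field scope: active only while this field collects input. -->
      <grammar src="drink.grxml" type="application/srgs+xml"/>
      <prompt>What would you like to drink?</prompt>
    </field>
  </form>
</vxml>
```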
Lesson 16: |
Writing Complex Grammars
Large complex grammars can be partitioned into smaller
grammars which can be systematically combined. A grammar and grammar
fragments can be shared among several VoiceXML applications by storing
them separately from the applications and referencing them from within
each application. Grammars may contain semantic interpretation scripts
that extract and translate words recognized by the speech recognition
engine into structures appropriate for later processing. |
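A sketch of a grammar assembled from smaller rules, with semantic interpretation tags that build a result object. It assumes the `semantics/1.0` tag format; the external file `drinks.grxml` and all rule names are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="order"
         tag-format="semantics/1.0">
  <rule id="order" scope="public">
    <item repeat="0-1">I would like a</item>
    <!-- Reference a rule defined later in this grammar. -->
    <ruleref uri="#size"/>
    <tag>out.size = rules.size;</tag>
    <!-- Reference a rule stored in a separate, shared grammar file. -->
    <ruleref uri="drinks.grxml#drink"/>
    <tag>out.drink = rules.latest();</tag>
  </rule>

  <rule id="size">
    <one-of>
      <!-- Translate the spoken word into a code for later processing. -->
      <item>small <tag>out = "S";</tag></item>
      <item>large <tag>out = "L";</tag></item>
    </one-of>
  </rule>
</grammar>
```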
Lesson 17: |
Speech Synthesis Markup Language
Speech application developers can override the structure,
wording, pronunciation, and prosody of synthesized speech prompts
produced with a speech synthesis engine by embedding Speech Synthesis
Markup Language (SSML) elements directly into the <prompt> element
of VoiceXML. Speech application developers can also replace synthesized
speech with prerecorded audio files. SSML may also be used in
a stand-alone mode to create audio books and other applications
independent of VoiceXML. |
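A small sketch of SSML elements embedded in a VoiceXML prompt. The audio file name and wording are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="total">
    <block>
      <prompt>
        <!-- Prerecorded audio; the text is spoken if the file is missing. -->
        <audio src="chime.wav">Thank you.</audio>
        Your total is <emphasis level="strong">twenty dollars</emphasis>.
        <!-- Insert a half-second pause. -->
        <break time="500ms"/>
        <!-- Slow the speaking rate for this sentence. -->
        <prosody rate="slow">Please have your card ready.</prosody>
      </prompt>
    </block>
  </form>
</vxml>
```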
Lesson 18: |
Introduction to Semantic Interpretation
Semantic Interpretation is a scripting language based
on JavaScript. Speech application developers embed Semantic Interpretation
scripts into grammars to manipulate the results from the speech recognition
engine. Semantic interpretation scripts may extract and translate
words and phrases returned by the speech recognition engine into
structures useful to the VoiceXML application. |
Lesson 19: |
Semantic Interpretation—Towards Natural Language Understanding
Natural language processing is still a research topic undergoing
development and refinement. However, a limited form of natural language
processing is possible in VoiceXML by embedding natural language
processing algorithms as scripts into grammars. Specifically, speech
application developers use semantic interpretation scripts to extract
and transform words produced by the speech recognition engine into
ECMAScript objects suitable for natural language processing algorithms.
Some semantic interpretation scripts may contain basic natural language
processing. This lesson also describes how to serialize ECMAScript objects,
that is, transform them into XML format for processing by XML languages. |
Lesson 20: |
Creating Voice User Interfaces for Novice Callers
Different classes of users require very different
styles of dialog for speaking and listening to VoiceXML applications.
Novice users require a system-directed style of user interface, which
prompts the user by asking questions to which the user responds.
This enables novice users to perform application tasks with little
prior knowledge about the application. Guidelines for formulating
system-directed speech user interfaces are presented.
|
Lesson 21: |
Creating Voice User Interfaces for Average Callers
Different classes of users require very different styles
of dialog for speaking and listening to VoiceXML applications. Average
users have some understanding of the application and may complain that
system-directed dialogs are too time-consuming, especially when users
must listen to prompt messages that they have previously heard many
times. This lesson describes how to enable barge-in so that users can
speak the answer to a prompt before hearing the complete prompt, thus
accelerating the system-directed user interface.
|
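Barge-in can be enabled on an individual prompt, as in this sketch. The prompt wording is hypothetical, and `bargeintype="speech"` is one of the two values defined by VoiceXML 2.0 (the other is `hotword`).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="banking">
    <field name="choice" type="digits">
      <!-- The caller may answer at any point while this prompt plays. -->
      <prompt bargein="true" bargeintype="speech">
        For your balance, press 1. For recent transactions, press 2.
        To transfer funds, press 3. For all other requests, press 4.
      </prompt>
    </field>
  </form>
</vxml>
```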
Lesson 22: |
Creating Voice User Interfaces for Experienced Callers
Different classes of users require very different styles of dialog for speaking
and listening to VoiceXML applications. Experienced users need shortcuts and
advanced features that enable them to efficiently and effectively complete their
tasks. These enhancements effectively change the user interface from system-directed
to mixed-initiative in which the user occasionally takes control of the dialog
by offering information even before the system asks for it. These advanced speech
user interfaces offer switching among multiple active tasks, speaking the values
for multiple fields in a single speech utterance, and even speaking values in
out-of-sequence order.
|
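A sketch of a mixed-initiative form. The form-level grammar `flight.grxml` is hypothetical and is assumed to return values named `origin` and `destination`, so one utterance can fill both fields; `city.grxml` is likewise hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="flight">
    <!-- Form-level grammar: may fill several fields at once. -->
    <grammar src="flight.grxml" type="application/srgs+xml"/>

    <!-- Open question that lets the caller take the initiative. -->
    <initial name="start">
      <prompt>Where would you like to fly from and to?</prompt>
      <nomatch count="2">
        <prompt>Let us take it one step at a time.</prompt>
        <!-- Mark <initial> as done so the fields are asked in turn. -->
        <assign name="start" expr="true"/>
      </nomatch>
    </initial>

    <!-- Only fields still empty after the open question are prompted. -->
    <field name="origin">
      <grammar src="city.grxml" type="application/srgs+xml"/>
      <prompt>Which city are you leaving from?</prompt>
    </field>
    <field name="destination">
      <grammar src="city.grxml" type="application/srgs+xml"/>
      <prompt>Which city are you flying to?</prompt>
    </field>

    <filled mode="all">
      <prompt>
        Flying from <value expr="origin"/> to <value expr="destination"/>.
      </prompt>
    </filled>
  </form>
</vxml>
```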
Lesson 23: |
Introduction to CCXML
VoiceXML has a limited capability to answer and terminate
telephone calls. The Call Control XML Language (CCXML) is a language
for managing telephone calls—incoming calls, outgoing calls,
and call management including holding, transferring, and disconnecting—and
conference calls. This lesson briefly describes the terminology of
telephone calls and introduces the CCXML language for responding to
events from the telephone system, from VoiceXML, and from the CCXML
application itself.
|
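A rough sketch of a CCXML document that answers a call, starts a VoiceXML dialog, and exits. It assumes the element and event names of the CCXML 1.0 specification; earlier working drafts used different names. The document `welcome.vxml` is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <!-- Event from the telephone system: an incoming call is ringing. -->
    <transition event="connection.alerting">
      <accept/>
    </transition>
    <!-- The call is answered: hand the caller to a VoiceXML dialog.
         src is an ECMAScript expression, hence the inner quotes. -->
    <transition event="connection.connected">
      <dialogstart src="'welcome.vxml'"/>
    </transition>
    <!-- Event from VoiceXML: the dialog has finished. -->
    <transition event="dialog.exit">
      <exit/>
    </transition>
    <!-- The caller hung up. -->
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```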
Lesson 24: |
CCXML and VoiceXML
A CCXML application has no end-user interface. It relies
upon VoiceXML to solicit information from and present information to
the user. A CCXML application can initiate and stop a VoiceXML application.
CCXML can also be used to process incoming calls and initiate outgoing
calls.
|
Lesson 25: |
CCXML Conferencing
In addition to managing connections between callers and VoiceXML,
CCXML also manages conferences involving two or more users. This
lesson describes how to create and tear down conference calls.
|
Lesson 26: |
ECMAScript
ECMAScript is a version of JavaScript designed to work efficiently
in a client computer with limited computing resources. Speech application
developers embed ECMAScript scripts directly into VoiceXML within the <script> element.
ECMAScript is also used with several VoiceXML elements that perform
conditional testing. Finally, ECMAScript is the basis for the Semantic
Interpretation language. This lesson is not a complete tutorial on how
to write ECMAScript code. Instead, it lists the frequently used features
of ECMAScript.
|
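This sketch shows ECMAScript in the two roles described in Lesson 26: a function defined inside `<script>`, and expressions in the `expr` and `cond` attributes. The function name `total` and the prices are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="price">
    <!-- CDATA keeps characters such as < and & from being read as XML. -->
    <script><![CDATA[
      function total(price, quantity) {
        return price * quantity;
      }
    ]]></script>
    <field name="quantity" type="number">
      <prompt>How many tickets would you like?</prompt>
      <filled>
        <!-- expr is evaluated as an ECMAScript expression. -->
        <var name="cost" expr="total(12, quantity)"/>
        <!-- cond is an ECMAScript condition. -->
        <if cond="cost &gt; 100">
          <prompt>That is a large order.</prompt>
        </if>
        <prompt>The total is <value expr="cost"/> dollars.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```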
Lesson 27: |
VoiceXML 2.1
VoiceXML 2.1 is the result of identifying eight new features
that individual VoiceXML vendors have implemented. The VoiceXML 2.1
working draft is still new, so many vendors have not had a chance to
implement these eight new features yet. VoiceXML 2.0 applications should
work without modification under VoiceXML 2.1.
|