Putting speech to work in business

(Combining Speech Recognition Update
and Telephone Strategy News)
Speech Strategy News
provides news and analysis of commercial applications of advanced speech
technology, including speech recognition, text-to-speech synthesis,
speaker authentication, and voice search. The newsletter provides
extensive coverage of market segments such as telecommunications, both
contact-center and service-provider applications; automotive telematics;
wireless devices and mobility; PC applications, including healthcare
report generation; industrial/warehouse applications; and games. The
broad coverage is managed so that articles of interest to specific
subscribers can be identified quickly, with two-level headlines providing
the gist of the articles.
Speech Strategy News is written by
Bill Meisel, a well-known industry analyst with decades of
experience in both the technology and business of speech technology. Bill
works hard to distill the news, analyze its importance, and include the
details you need when the news is particularly relevant to you. His
Editor’s Notes each month have helped
those impacted by the maturing technology to understand and anticipate
important trends. Direct insights from other sources are included in
interviews with influential executives and in a popular guest column, VUI
Visions.
Technologies covered:
Voice search
is a feature rather than a technology. It makes finding
things easier, like a helpful assistant. It suggests an analogy to web
search, a popular feature on PCs. In its most direct implementation, it
allows a just-say-what-you-want model of user interaction.
Speech recognition
has achieved a high level of accuracy and is rapidly becoming a required
part of the user interface in a number of fields. In telephony, the
Voice User Interface (VUI) is a huge leap in usability over the touch-tone
interface and is being widely adopted in areas such as contact center
applications and unified communications, with a surge expected in its use
in directory assistance, voice search, and
marketing applications in general.
Text-to-speech synthesis
has both supplemented speech recognition to augment two-way dialog in the
Voice User Interface and supported applications on its own. The current
technology has gone beyond the required intelligibility to a level of
naturalness that can challenge recorded prompts in many applications.
Speaker authentication
(speaker verification) can be used to add a level of biometric security to
voice applications. While it has not yet been widely adopted, the
technology works and can be a critical feature in applications where
security is a concern.
Audio search is a
form of speech recognition targeted at large audio files containing
speech, for example, podcasts, webcasts, or the audio track on videos. It
can be used to find which audio files contain particular content, using
text-specified search terms, and to locate that content within the files.
A brief glossary of speech
technology terms:
Technologies
Voice User Interface
(VUI): The collective term for using speech technology to interact with a
user.
Speech Recognition
(Automated Speech Recognition, ASR, speech-to-text, voice recognition*):
Automated recognition of the content of speech for the purposes of
representing it as text or taking an appropriate action.
Text-To-Speech
(TTS, text-to-speech synthesis): Given text, automatically speaking that
text in a synthetic voice (typically using a phonetic dictionary, with
letter-to-sound rules for words not in the dictionary).
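The dictionary-first lookup with a rule-based fallback can be sketched in a few lines of Python. This is only an illustration: the dictionary entries, the phoneme symbols, and the one-letter-per-sound rules are all assumptions made for the example, and real systems use far larger dictionaries and context-dependent rules.

```python
# Hypothetical, tiny phonetic dictionary (ARPAbet-like symbols).
PHONETIC_DICT = {
    "speech": ["S", "P", "IY", "CH"],
    "voice": ["V", "OY", "S"],
}

# Toy letter-to-sound rules: one fixed sound per letter (an assumption;
# real rules depend on the surrounding letters).
LETTER_TO_SOUND = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"],
    "f": ["F"], "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"],
    "k": ["K"], "l": ["L"], "m": ["M"], "n": ["N"], "o": ["AA"],
    "p": ["P"], "q": ["K"], "r": ["R"], "s": ["S"], "t": ["T"],
    "u": ["AH"], "v": ["V"], "w": ["W"], "x": ["K", "S"], "y": ["Y"],
    "z": ["Z"],
}


def to_phonemes(word):
    """Return phonemes from the dictionary, else fall back to the rules."""
    word = word.lower()
    if word in PHONETIC_DICT:
        return list(PHONETIC_DICT[word])
    phonemes = []
    for letter in word:
        # Characters without a rule (digits, punctuation) are skipped.
        phonemes.extend(LETTER_TO_SOUND.get(letter, []))
    return phonemes
```

Here "speech" is found in the dictionary, while a word such as "bad" is not and is spelled out by the fallback rules.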
Speaker verification
(speaker authentication, speaker recognition, Voice ID, voiceprints): A
biometric identification using the quality of the person’s voice,
sometimes supplemented by requiring content (such as a password or account
number) known only to the person. Normally, the speaker makes a claim of
identity (e.g., through an account number), and the system verifies that
claim. However, speaker recognition can refer to discrimination of which
of a number of potential speakers the voice belongs to, often requiring a
different technical approach.
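The difference between verifying a claimed identity and picking one speaker out of many can be illustrated with a small Python sketch. Everything here is hypothetical: the enrolled "voiceprint" vectors, the account identifiers, the cosine-similarity comparison, and the threshold are assumptions chosen for illustration, not a description of how any particular product works.

```python
import math

# Hypothetical enrolled voiceprints, keyed by account identifier.
ENROLLED = {
    "acct-1001": [0.9, 0.1, 0.3],
    "acct-1002": [0.1, 0.8, 0.5],
}
THRESHOLD = 0.95  # assumed acceptance threshold


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify(claimed_id, sample):
    """Verification: compare the sample only against the claimed identity."""
    voiceprint = ENROLLED.get(claimed_id)
    if voiceprint is None:
        return False
    return cosine_similarity(voiceprint, sample) >= THRESHOLD


def identify(sample):
    """Identification: choose the closest of all enrolled speakers."""
    return max(ENROLLED, key=lambda sid: cosine_similarity(ENROLLED[sid], sample))
```

Verification answers a yes/no question about one claim, while identification must compare against every enrolled speaker, which is one reason the two tasks can call for different approaches.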
Audio search
(audio mining, speech analytics): Searching an audio source or sources for
specific content (e.g., keywords or subjects). Speech analytics goes
beyond looking for single occurrences of phrases to a broader analysis of
the context of a search phrase, and may use metadata (text sources that
label the file or its location, for example).
Hidden Markov Models
(HMM): A statistical method at the heart of much of today’s speech
recognition technology.
Grammars
(defined grammars): In a speech recognition system, a specification of the
range of possible responses by a speaker in a particular context compiled
by a designer/developer, e.g., in response to a prompt such as, “What is
your account number?”
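As a rough illustration of a defined grammar, the Python sketch below accepts only the responses a designer has anticipated for an account-number prompt: exactly four spoken digits. The digit vocabulary, the four-digit length, and the function name are assumptions made for the example; it operates on already-recognized words rather than on audio.

```python
# Spoken forms the designer chose to allow for each digit (hypothetical).
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}
ACCOUNT_LENGTH = 4  # assumed account-number length


def parse_account_number(utterance):
    """Return the digits if the utterance fits the grammar, else None."""
    words = utterance.lower().split()
    if len(words) != ACCOUNT_LENGTH:
        return None
    if any(word not in DIGIT_WORDS for word in words):
        return None
    return "".join(DIGIT_WORDS[word] for word in words)
```

An utterance such as "four oh two nine" fits the grammar, while "my number is four" is rejected because it falls outside the range of responses the designer specified.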
Statistical language model
(SLM): A specification of what the user may speak that is less constrained
than defined grammars. Typically, an SLM is created from a text database
of typical responses or finished text, and generalizes those examples.
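One simple way to build such a model from example text is to count word pairs (bigrams). The Python sketch below is a minimal, unsmoothed illustration under that assumption; the training sentences and function names are hypothetical, and production SLMs use much more data and smoothing.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows another in the example text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def bigram_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) as a relative frequency (no smoothing)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


def sentence_probability(counts, sentence):
    """Multiply bigram probabilities across the whole sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        probability *= bigram_probability(counts, prev, nxt)
    return probability
```

Trained on "check my balance", "check my account", and "pay my bill", the model assigns a nonzero probability to "pay my balance" even though that exact sentence never appeared, which is the sense in which an SLM generalizes its examples.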
* “Voice
recognition” has at times been used to refer both to speech
recognition (recognizing the content of speech) and to speaker
verification (recognizing who is speaking). “Speech
recognition” and “speaker verification” are less ambiguous terms.
Voice Search is a
user interface philosophy, summarized roughly as taking advantage of
advances in speech technology to reduce the amount of dialog required to
achieve a task, as opposed to using a deep hierarchy that simplifies the
speech recognition problem at the expense of making navigation slow
and/or non-intuitive.
Interactive systems
Dialog
(dialogue): In an automated speech context, a turn-taking conversation
between a person and an automated system to move toward a goal.
Multimodal
(multi-modal): In the context of speech technology, mixing speech input or
output with other modes of user input or system output, e.g., keyboards or
a stylus for user input or text display for system output. Most telephone
applications are multimodal if one includes the keypad (touch-tone
interaction), but this adjective usually refers to modalities that include
more than voice and touch-tone.
VoiceXML
(Voice eXtensible Markup Language, VXML): A standard for speech dialog
systems, particularly oriented toward telephony.
SALT
(Speech Application Language Tags): A standard for speech dialog systems,
oriented toward telephony and multimodal applications.
X+V
(XHTML + Voice): A mixture of the Web standard XHTML and VoiceXML for
multimodal dialog applications.
Interactive Voice Response
(IVR): A telephony platform for dialog systems, with touch-tone
interaction and recorded voice response at a minimum, and with speech
recognition and text-to-speech as increasingly common options.
IP Telephony:
A growing trend toward “convergence” of telephony and computer/Internet
standards that makes telephony less of a specialized technology. The trend
doesn’t specifically imply speech technologies, but makes them easier to
add.