TMA Associates
Resources
Voice search is a feature rather than a technology. It makes finding things easier, like a helpful assistant. It suggests an analogy to web search, a popular feature on PCs. In its most direct implementation, it allows a just-say-what-you-want model of user interaction.

Speech recognition has achieved a high level of accuracy and is rapidly becoming a required part of the user interface in a number of fields. In telephony, the Voice User Interface (VUI) is a huge leap in usability over the touch-tone interface and is being widely adopted in areas such as contact center applications and unified communications, with a surge expected in its use in directory assistance, voice search, and marketing applications in general.

Text-to-speech synthesis has both supplemented speech recognition to augment two-way dialog in the Voice User Interface and supported applications on its own. The current technology has gone beyond the required intelligibility to a level of naturalness that can challenge recorded prompts in many applications.

Speaker authentication (speaker verification) can be used to add a level of biometric security to voice applications. While it has not yet been widely adopted, the technology works and can be a critical feature in applications where security is a concern.

Audio search is a form of speech recognition targeted at large audio files containing speech, for example, podcasts, webcasts, or the audio track of videos. It can be used to find which audio files contain particular content, using search terms specified as text, and to locate that content within the files.

Spoken language identification determines the language being spoken, either in an interactive application or as batch processing.

A brief glossary of common terms in speech technology

Audio search (audio mining, speech analytics): Searching an audio source or sources for specific content (e.g., keywords or subjects). Speech analytics goes beyond looking for single occurrences of phrases to a broader analysis of the context of a search phrase, and may use metadata (text sources that label the file or its location, for example).
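
As a rough illustration of the search step, the sketch below assumes that a speech recognizer has already produced a time-stamped transcript for each audio file. The `Word` type, the `find_phrase` function, and the file names are hypothetical, introduced only for this example; real audio search products may index phonemes or recognition lattices rather than plain words.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio file


def find_phrase(transcripts, phrase):
    """Return (file_name, start_time) for each occurrence of the phrase."""
    terms = phrase.lower().split()
    hits = []
    if not terms:
        return hits
    for name, words in transcripts.items():
        tokens = [w.text.lower() for w in words]
        # Slide a window the length of the phrase over the transcript.
        for i in range(len(tokens) - len(terms) + 1):
            if tokens[i:i + len(terms)] == terms:
                hits.append((name, words[i].start))
    return hits


# Hypothetical transcripts, as if produced by a recognizer.
transcripts = {
    "podcast_01.wav": [Word("welcome", 0.0), Word("to", 0.4),
                       Word("voice", 0.6), Word("search", 1.0)],
    "webcast_02.wav": [Word("speech", 0.0), Word("recognition", 0.5)],
}
```

Here `find_phrase(transcripts, "voice search")` reports one hit in `podcast_01.wav` starting at 0.6 seconds, which covers both finding the file and locating the content within it.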

Dialog (dialogue): In an automated speech context, a turn-taking conversation between a person and an automated system to move toward a goal.

Grammars (defined grammars): In a speech recognition system, a specification of the range of possible responses by a speaker in a particular context compiled by a designer/developer, e.g., in response to a prompt such as, “What is your account number?”
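
To make the idea concrete, here is a minimal sketch of a defined grammar for the account-number prompt, assuming a six-digit account number spoken one digit at a time. The word list, the six-digit length, and the `parse_account_number` function are assumptions made for illustration; deployed systems normally express grammars in a dedicated grammar format rather than in application code.

```python
# Words the grammar accepts, mapped to the digit each one stands for.
DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def parse_account_number(utterance, length=6):
    """Return the digit string if the utterance fits the grammar, else None."""
    words = utterance.lower().split()
    # The grammar allows exactly `length` digit words and nothing else.
    if len(words) != length or any(w not in DIGITS for w in words):
        return None
    return "".join(DIGITS[w] for w in words)
```

With this grammar, `parse_account_number("four oh seven two one nine")` returns `"407219"`, while an out-of-grammar response such as `"my account is four"` returns `None`.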

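Hidden Markov Models (HMM): A statistical method at the heart of much of today’s speech recognition technology. An HMM treats a sequence of observations (such as acoustic measurements) as being generated by a sequence of underlying states (such as speech sounds) that cannot be observed directly.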
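
The sketch below shows one standard computation on an HMM, the Viterbi algorithm, which finds the most likely sequence of hidden states for a sequence of observations. The two-state model (silence versus speech, observed through low or high signal energy) and all of its probabilities are made up for illustration; real recognizers use far larger models and work with log probabilities to avoid numerical underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a non-empty observation list."""
    # best[t][s]: probability of the most likely path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            prev = max(states, key=lambda p: best[-1][p] * trans_p[p][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the most likely final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))


# Toy model with made-up probabilities.
states = ["silence", "speech"]
start_p = {"silence": 0.5, "speech": 0.5}
trans_p = {"silence": {"silence": 0.8, "speech": 0.2},
           "speech": {"silence": 0.2, "speech": 0.8}}
emit_p = {"silence": {"low": 0.9, "high": 0.1},
          "speech": {"low": 0.1, "high": 0.9}}
```

For the observations `["low", "low", "high", "high"]`, this toy model yields the path `["silence", "silence", "speech", "speech"]`.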

Interactive Voice Response (IVR): A telephony platform for dialog systems, with touch-tone interaction and recorded voice response at a minimum, and with speech recognition and text-to-speech increasingly offered as options.

IP Telephony: A growing trend toward “convergence” of telephony and computer/Internet standards that makes telephony less of a specialized technology. The trend doesn’t specifically imply speech technologies, but makes them easier to add.

Multimodal (multi-modal): In the context of speech technology, mixing speech input or output with other modes of user input or system output, e.g., keyboards or a stylus for user input or text display for system output. Most telephone applications are multimodal if one includes the keypad (touch-tone interaction), but this adjective usually refers to modalities that include more than voice and touch-tone.

Speaker verification (speaker authentication, speaker recognition, Voice ID, voiceprints): A biometric identification using the quality of the person’s voice, sometimes supplemented by requiring content (such as a password or account number) known only to the person. Normally, the speaker makes a claim of identity (e.g., through an account number), and the system verifies that claim. However, speaker recognition can refer to discrimination of which of a number of potential speakers the voice belongs to, often requiring a different technical approach.
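
The difference between verifying a claimed identity and picking one speaker out of many can be sketched as follows. The example assumes each voice has already been reduced to a short numeric feature vector (a "voiceprint"); the vectors, the account identifiers, the 0.8 threshold, and the function names are all hypothetical, and cosine similarity stands in for whatever scoring method a real system would use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify(claimed_id, sample, enrolled, threshold=0.8):
    """Verification: does the sample match the one identity that was claimed?"""
    voiceprint = enrolled.get(claimed_id)
    return voiceprint is not None and cosine(sample, voiceprint) >= threshold


def identify(sample, enrolled):
    """Identification: which enrolled speaker does the sample match best?"""
    return max(enrolled, key=lambda sid: cosine(sample, enrolled[sid]))


# Hypothetical enrolled voiceprints and a new voice sample.
enrolled = {"acct-1001": [0.9, 0.1, 0.3], "acct-2002": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.35]
```

With these values, `verify("acct-1001", sample, enrolled)` accepts the claim, `verify("acct-2002", sample, enrolled)` rejects it, and `identify(sample, enrolled)` returns `"acct-1001"`. Verification compares against a single voiceprint, while identification must compare against every enrolled speaker.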

Speech Recognition (Automated Speech Recognition, ASR, speech-to-text, voice recognition): Automated recognition of the content of speech for the purposes of representing it as text or taking an appropriate action.

Spoken language identification: Determining the language being spoken, either in an interactive application or as batch processing.

Statistical Language Model (SLM): A specification of what the user may speak that is less constrained than defined grammars. Typically, an SLM is created from a text database of typical responses or finished text, and generalizes those examples.
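
One simple way to build such a model is to count word pairs (bigrams) in the example text, as in the sketch below. This is an illustrative toy under stated assumptions, not a description of any particular product: the three training sentences and the function names are invented, and practical SLMs add smoothing so that unseen word pairs do not receive zero probability.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each other word in the examples."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def bigram_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) from the counts; 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


def sentence_probability(counts, sentence):
    """Multiply the bigram probabilities along the sentence."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= bigram_probability(counts, prev, nxt)
    return prob


# Hypothetical database of typical caller responses.
model = train_bigram_model([
    "check my balance",
    "check my account balance",
    "pay my bill",
])
```

The sentence "pay my balance" does not appear in the examples, yet `sentence_probability(model, "pay my balance")` is greater than zero because each adjacent word pair was seen somewhere in the training text. That is the sense in which the model generalizes beyond its examples, whereas a defined grammar would accept only what the designer listed.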

Text-To-Speech (TTS, text-to-speech synthesis): Given text, automatically speaking that text in a synthetic voice (typically using a phonetic dictionary and letter-to-sound rules for words not in the dictionary).
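
The dictionary-plus-rules step can be sketched as below. Everything here is a simplified assumption for illustration: the two dictionary entries, the phoneme symbols, and the one-letter-to-one-sound table are invented, and real letter-to-sound rules depend on context (for example, neighboring letters). The sketch covers only the conversion of text to phonemes, not the generation of audio.

```python
# Hypothetical phonetic dictionary: word -> list of phoneme symbols.
PHONETIC_DICTIONARY = {
    "speech": ["S", "P", "IY", "CH"],
    "voice": ["V", "OY", "S"],
}

# Naive letter-to-sound fallback: one phoneme per letter (toy rules only).
LETTER_TO_SOUND = {
    "a": "AE", "b": "B", "d": "D", "e": "EH", "g": "G", "i": "IH",
    "k": "K", "l": "L", "m": "M", "n": "N", "o": "AA", "p": "P",
    "r": "R", "s": "S", "t": "T", "u": "AH",
}


def to_phonemes(text):
    """Convert text to phonemes, preferring the dictionary over the rules."""
    phonemes = []
    for word in text.lower().split():
        if word in PHONETIC_DICTIONARY:
            phonemes.extend(PHONETIC_DICTIONARY[word])
        else:
            # Fall back to letter-to-sound rules; unknown letters are skipped.
            phonemes.extend(LETTER_TO_SOUND[ch] for ch in word
                            if ch in LETTER_TO_SOUND)
    return phonemes
```

For example, `to_phonemes("voice tag")` takes "voice" from the dictionary and falls back to the letter rules for "tag", giving `["V", "OY", "S", "T", "AE", "G"]`.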

Voice Search: In its broadest use, a user interface philosophy, summarized roughly as taking advantage of advances in speech technology to reduce the amount of dialog required to achieve a task, as opposed to using a deep hierarchy of menus that simplifies the speech recognition problem at the expense of making navigation slow and/or non-intuitive. In a narrower use, it can refer to initiating a Web search by speaking the search terms rather than typing them.

Voice User Interface (VUI): The collective term for using speech technology to interact with a user.

VoiceXML (Voice eXtensible Markup Language): A standard for speech dialog systems, particularly oriented toward telephony.