TMA Associates

Speech Technology News and Analysis

 

Putting speech to work in business

Speech Strategy News

(Combining Speech Recognition Update and Telephone Strategy News)

 

Speech Strategy News provides news and analysis of commercial applications of advanced speech technology, including speech recognition, text-to-speech synthesis, speaker authentication, and voice search. The newsletter provides extensive coverage of market segments such as telecommunications (both contact-center and service-provider applications); automotive telematics; wireless devices and mobility; PC applications, including healthcare report generation; industrial/warehouse applications; and games. Two-level headlines give the gist of each article, so subscribers can quickly find the coverage that matters to them.

 

Speech Strategy News is written by Bill Meisel, a well-known industry analyst with decades of experience in both the technology and business of speech technology. Bill distills the news, analyzes its importance, and includes the details you need when the news is particularly relevant to you. His Editor’s Notes each month have helped those affected by the maturing technology to understand and anticipate important trends. Direct insights from other sources are included in interviews with influential executives and in a popular guest column, VUI Visions.

 

Technologies covered:

Voice search is a feature rather than a technology. It makes finding things easier, like a helpful assistant. It suggests an analogy to web search, a popular feature on PCs. In its most direct implementation, it allows a just-say-what-you-want model of user interaction.

Speech recognition has achieved a high level of accuracy and is rapidly becoming a required part of the user interface in a number of fields. In telephony, the Voice User Interface (VUI) is a huge leap in usability over the touch-tone interface and is being widely adopted in areas such as contact center applications and unified communications, with a surge expected in its use in directory assistance, voice search, and marketing applications in general.

 

Text-to-speech synthesis has both supplemented speech recognition to augment two-way dialog in the Voice User Interface and supported applications on its own. The current technology has gone beyond the required intelligibility to a level of naturalness that can challenge recorded prompts in many applications.

 

Speaker authentication (speaker verification) can be used to add a level of biometric security to voice applications. While it has not yet been widely adopted, the technology works and can be a critical feature in applications where security is a concern.

 

Audio search is a form of speech recognition targeted at large audio files containing speech, for example, podcasts, webcasts, or the audio track on videos. Given text search terms, it can be used to find which audio files contain particular content and to locate that content within the files.

 

Spoken language identification determines the language being spoken, either in an interactive application or as batch processing.

 

A brief glossary of speech technology terms:

Technologies

Voice User Interface (VUI): The collective term for using speech technology to interact with a user.

Speech Recognition (Automated Speech Recognition, ASR, speech-to-text, voice recognition*): Automated recognition of the content of speech for the purposes of representing it as text or taking an appropriate action.

Text-To-Speech (TTS, text-to-speech synthesis): Given text, automatically speaking that text in a synthetic voice (typically using a phonetic dictionary, with letter-to-sound rules for words not in the dictionary).

Speaker verification (speaker authentication, speaker recognition, Voice ID, voiceprints): Biometric identification based on the characteristics of a person’s voice, sometimes supplemented by requiring content (such as a password or account number) known only to that person. Normally, the speaker makes a claim of identity (e.g., through an account number), and the system verifies that claim. However, speaker recognition can also refer to determining which of a number of potential speakers a voice belongs to, which often requires a different technical approach.
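
The difference between verifying a claimed identity and picking a speaker from a set can be sketched as follows. This is a minimal illustration, not how any particular product works: the three-number "voiceprints", the account IDs, the cosine-similarity score, and the 0.9 threshold are all assumptions made up for the example.

```python
import math

# Hypothetical enrolled voiceprints: account ID -> feature vector.
# Real systems derive much larger vectors from the audio itself.
ENROLLED = {
    "acct-1001": [0.9, 0.1, 0.3],
    "acct-1002": [0.1, 0.8, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def verify(claimed_id, sample, threshold=0.9):
    """Verification: does the sample match the ONE identity being claimed?"""
    voiceprint = ENROLLED.get(claimed_id)
    if voiceprint is None:
        return False
    return cosine(voiceprint, sample) >= threshold

def identify(sample):
    """Identification: which enrolled speaker is the closest match?"""
    return max(ENROLLED, key=lambda sid: cosine(ENROLLED[sid], sample))
```

Verification compares the sample against a single enrolled voiceprint and accepts or rejects it, while identification compares it against every enrolled speaker, which is one reason the two tasks can call for different approaches.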

Audio search (audio mining, speech analytics): Searching an audio source or sources for specific content (e.g., keywords or subjects). Speech analytics goes beyond looking for single occurrences of phrases to a broader analysis of the context of a search phrase, and may use metadata (text sources that label the file or its location, for example).
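
A minimal sketch of keyword-style audio search, assuming a recognizer has already produced word-level timestamps for each file. The file names, transcripts, and the `search_audio` helper are hypothetical; practical systems often search phonetic or lattice representations rather than a single best transcript.

```python
# Hypothetical recognizer output: file name -> list of (start_seconds, word).
TRANSCRIPTS = {
    "podcast_01.mp3": [
        (0.0, "welcome"), (0.6, "to"), (0.8, "the"),
        (1.0, "speech"), (1.5, "recognition"), (2.3, "podcast"),
    ],
    "webcast_07.mp3": [
        (0.0, "quarterly"), (0.7, "earnings"), (1.4, "call"),
    ],
}

def search_audio(phrase, transcripts=TRANSCRIPTS):
    """Return (file name, start time) for each place the phrase occurs."""
    terms = phrase.lower().split()
    if not terms:
        return []
    hits = []
    for name, words in transcripts.items():
        tokens = [word for _, word in words]
        # Slide a window the length of the phrase across the transcript.
        for i in range(len(tokens) - len(terms) + 1):
            if tokens[i:i + len(terms)] == terms:
                hits.append((name, words[i][0]))
    return hits
```

Each hit names the file that contains the phrase and the offset at which it starts, which covers both halves of the task: finding the right files and locating the content within them.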

Hidden Markov Models (HMM): A statistical method at the heart of much of today’s speech recognition technology.
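
As a rough illustration of how an HMM is used, the sketch below runs Viterbi decoding, a standard way to find the most likely hidden-state sequence, on a toy two-state model. The states (`sil`, `speech`), the `low`/`high` energy observations, and all probabilities are invented for the example; real recognizers model phonetic units over acoustic feature vectors.

```python
import math

# Toy model (all numbers are illustrative assumptions).
STATES = ["sil", "speech"]
START_P = {"sil": 0.8, "speech": 0.2}
TRANS_P = {
    "sil": {"sil": 0.7, "speech": 0.3},
    "speech": {"sil": 0.2, "speech": 0.8},
}
EMIT_P = {
    "sil": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.3, "high": 0.7},
}

def viterbi(observations, states=STATES, start_p=START_P,
            trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most likely hidden-state sequence for the observations."""
    if not observations:
        return []
    # Each column maps state -> (best log-probability so far, previous state).
    columns = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Best predecessor for state s at this time step.
            score, prev = max(
                (columns[-1][p][0] + math.log(trans_p[p][s]), p)
                for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        columns.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi(["low", "low", "high", "high", "high"])` labels each frame as silence or speech; log-probabilities are used so that long sequences do not underflow.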

Grammars (defined grammars): In a speech recognition system, a specification, compiled by a designer or developer, of the range of responses a speaker may give in a particular context, e.g., in response to a prompt such as “What is your account number?”
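
The idea of a defined grammar can be sketched as a function that accepts only the responses the designer anticipated. The example below is a simplified stand-in, assuming a four-digit account number spoken as digit words; the word list, the length, and the `parse_account_number` helper are hypothetical, and deployed systems typically express grammars in a dedicated grammar format rather than in application code.

```python
# Words the grammar accepts, mapped to the digit each one stands for.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}

def parse_account_number(utterance, length=4):
    """Return the digit string if the utterance fits the grammar, else None."""
    words = utterance.lower().split()
    if len(words) != length:
        return None
    if any(word not in DIGIT_WORDS for word in words):
        return None
    return "".join(DIGIT_WORDS[word] for word in words)
```

Anything outside the grammar, such as "my number is four", is rejected outright; that restriction is what makes recognition easier but also limits what the caller can say.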

Statistical language model (SLM): A specification of what the user may speak that is less constrained than defined grammars. Typically, an SLM is created from a text database of typical responses or finished text, and generalizes those examples.
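
One simple kind of SLM is a bigram model, which estimates how likely each word is to follow the previous one. The sketch below trains one on a tiny made-up corpus; the sentences and helper names are assumptions for illustration, and the model is unsmoothed, whereas practical SLMs use far more data and smoothing so that unseen word pairs do not get zero probability.

```python
from collections import Counter, defaultdict

# Hypothetical examples of what callers typically say.
CORPUS = [
    "check my balance",
    "check my account balance",
    "pay my bill",
    "pay my phone bill",
]

def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts

def sentence_probability(sentence, counts):
    """Probability of the sentence under the unsmoothed bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        followers = counts.get(prev, Counter())
        total = sum(followers.values())
        if total == 0:
            return 0.0
        prob *= followers[cur] / total
    return prob

MODEL = train_bigrams(CORPUS)
```

The sentence "pay my account balance" never appears in the corpus, yet `sentence_probability("pay my account balance", MODEL)` is greater than zero because every adjacent word pair was seen somewhere. That is the sense in which an SLM generalizes from its examples instead of listing every allowed response.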

Voice search: A user interface philosophy that takes advantage of advances in speech technology to reduce the amount of dialog required to achieve a task, rather than using a deep menu hierarchy that simplifies the speech recognition problem at the expense of slow or non-intuitive navigation.

Spoken language identification: Determining the language being spoken, either in an interactive application or in batch processing.

* “Voice recognition” has at times been used to refer both to speech recognition (recognizing the content of speech) and to speaker verification (recognizing a voice rather than what is said). “Speech recognition” and “speaker verification” are less ambiguous terms.

Interactive systems

Dialog (dialogue): In an automated speech context, a turn-taking conversation between a person and an automated system to move toward a goal.

Multimodal (multi-modal): In the context of speech technology, mixing speech input or output with other modes of user input or system output, e.g., keyboards or a stylus for user input or text display for system output. Most telephone applications are multimodal if one includes the keypad (touch-tone interaction), but this adjective usually refers to modalities that include more than voice and touch-tone.

VoiceXML (Voice eXtensible Markup Language, VXML): A standard for speech dialog systems, particularly oriented toward telephony.

SALT (Speech Application Language Tags): A standard for speech dialog systems, oriented toward telephony and multimodal applications.

X+V (XHTML + Voice): A mixture of Web standard XHTML and VoiceXML for multimodal dialog applications.

Interactive Voice Response (IVR): A telephony platform for dialog systems, with touch-tone interaction and recorded voice response at a minimum, and with speech recognition and text-to-speech as increasingly common options.

IP Telephony: A growing trend toward “convergence” of telephony and computer/Internet standards that makes telephony less of a specialized technology. The trend doesn’t specifically imply speech technologies, but makes them easier to add.