From Speech Strategy News, May 2010
Google seeks “ubiquitous availability” for a
speech option
Michael Cohen, Manager, Speech Technology, Google, gave a keynote address,
“Transparent Mobility,” at the Mobile Voice Conference in San Francisco on
April 22, 2010. One aspect of the
use of mobile phones, Cohen noted, is the growing expectation of completely
ubiquitous availability: Users increasingly expect to have constant
access to the information and services of the web. Small devices and
use while on the move have increased the importance of speech technology in meeting
this need for ubiquitous mobile access—“any
time, any place, any usage scenario, as part of any type of activity.”
Cohen said that achieving ubiquity of spoken
access requires two things: (1) availability
in every possible interaction where speech input or output makes sense, and (2)
performance (it works so well that
the modality adds no friction to the interaction). If speech is always
available when a keyboard can be used, the only addition necessary is some sort
of “speak” or “listen” button icon (although training users to use such a
button properly is a bit challenging). Cohen said that there is a strong
measured correlation between speech recognition accuracy and repeat usage,
suggesting that users learn to use a speech interface to best advantage with
practice. Speech input whenever a keyboard is displayed will be
available on the Google Nexus One mobile phone released in January, Cohen said.
Cohen said that, to get to this point, delivery
from the cloud and operating at a large scale are important, two capabilities
that Google obviously can support. He said that Google voice search gets “lots
of use.” Its goal is to recognize any search query by voice. Beyond search,
navigation is an important application for voice. Google wants to support
navigation requests from the search bar. In late April, Google moved further
toward ubiquitous availability of voice search in Google Maps with a release of
a version that lets users of Windows Mobile and Symbian S60 phones search by
voice. The feature has been available on BlackBerry and Android phones, but is
still unavailable for the iPhone.
In another aspect of scale, Google voice search
got off to a good start because Google had a huge text database of searches to
create the initial language model. The initial English model was created from a
240-billion-word database and contains one million unique words. (The
dictionary contains fewer entries because it doesn’t count variations, such as
“write,” “wrote,” “written,” and “writing,” as individual entries.) It takes 70 CPU
years to train a language model with this amount of data, Cohen said. Testing
shows that more complex language models do improve accuracy, he indicated.
And the model is continuously adapted as Google
learns how people say their search requests. “People speak differently in real
use than in the laboratory,” Cohen noted. One difference, for example, is that
people will speak more “Wh” (“What,” “Where,” “When,” etc.) questions than one
finds in typing.
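To make the idea of building a language model from typed queries concrete, here is a minimal sketch of a bigram model with add-alpha smoothing. It is an illustration only and is not Google's method: the function names, the three sample queries, and the smoothing choice are all assumptions, and a production model trained on hundreds of billions of words would look very different.

```python
from collections import Counter, defaultdict


def train_bigram_model(queries):
    """Count word pairs in a list of query strings (toy stand-in for query logs)."""
    unigrams = Counter()             # how often each word appears as a left context
    bigrams = defaultdict(Counter)   # bigrams[prev][word] = count of "prev word"
    for query in queries:
        words = ["<s>"] + query.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            unigrams[prev] += 1
            bigrams[prev][word] += 1
    return unigrams, bigrams


def bigram_probability(model, prev, word, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | prev)."""
    unigrams, bigrams = model
    return (bigrams[prev][word] + alpha) / (unigrams[prev] + alpha * vocab_size)


# Hypothetical typed queries, invented for this example.
queries = ["pizza near me", "pizza delivery", "where is the nearest pizza place"]
model = train_bigram_model(queries)
vocab = {w for q in queries for w in q.lower().split()} | {"</s>"}

p_seen = bigram_probability(model, "pizza", "near", len(vocab))
p_unseen = bigram_probability(model, "pizza", "the", len(vocab))
```

A word pair that appears in the query text ("pizza near") receives a higher probability than one that does not ("pizza the"), which is the kind of prior a recognizer can use to rank competing hypotheses. Adapting the model to spoken queries would amount to adding counts from recognized speech, for example more "Wh" questions.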
Acoustic models, which are representations of particular
sounds (usually with units larger than a single phoneme), are also continuously
improved. The system uses “machine learning” techniques to do so: rather than
requiring human transcription of the speech, it assumes that transcriptions with
a high speech recognition confidence score are correct and trains on them.
Cohen said that it takes about twice as much data to get the same improvement
using unsupervised learning as it would if the data were transcribed.
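The confidence-based selection described above can be sketched in a few lines. This is a simplified illustration, not Google's pipeline: the `Hypothesis` class, the 0.9 threshold, and the sample utterances are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    audio_id: str
    text: str          # the recognizer's own transcription
    confidence: float  # the recognizer's confidence score, assumed to lie in [0, 1]


def select_training_pairs(hypotheses, threshold=0.9):
    """Keep only transcriptions the recognizer is confident about.

    The kept (audio_id, text) pairs are treated as if they were human
    transcripts and can be fed back into acoustic model training.
    """
    return [(h.audio_id, h.text) for h in hypotheses if h.confidence >= threshold]


# Hypothetical recognition results, invented for this example.
recognized = [
    Hypothesis("utt-001", "weather in san francisco", 0.97),
    Hypothesis("utt-002", "pizza near knee", 0.41),
    Hypothesis("utt-003", "directions to the airport", 0.92),
]
training_pairs = select_training_pairs(recognized)
```

The threshold is the main design choice: setting it higher discards more data but lets fewer recognition errors leak into the training set.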
For real-time recognition, the system uses a
number of models in parallel, covering aspects of the speech signal such as
accent, speaker-specific characteristics, and background noise. The powerful
Google cloud allows substantial parallelization.
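One plausible way to picture several models running in parallel is to run each on the same audio and keep the best-scoring result. The talk did not say how Google combines the outputs, so the selection rule, the two toy models, and the `noise_level` feature below are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def quiet_room_model(audio):
    """Toy model that scores well when there is little background noise."""
    return ("quiet-room", 1.0 - audio["noise_level"])


def noisy_street_model(audio):
    """Toy model that scores well when background noise is high."""
    return ("noisy-street", audio["noise_level"])


MODELS = [quiet_room_model, noisy_street_model]


def recognize_in_parallel(audio, models=MODELS):
    """Run every model on the same audio concurrently and return the top scorer."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        results = list(pool.map(lambda model: model(audio), models))
    return max(results, key=lambda result: result[1])


# Hypothetical utterance recorded in a noisy environment.
best = recognize_in_parallel({"noise_level": 0.8})
```

In this sketch the noisy-condition model wins for a noisy utterance. The same pattern extends to models specialized by accent or speaker, at the cost of running every model on every utterance.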
In a separate session at the Mobile Voice
Conference, Bill Byrne, Voice Interface Engineer, and Alex Gruenstein, Software
Engineer, Google, presented “Google speech APIs: How developers can use
available speech resources.” In short, Google will make its network-based
speech recognition available to other developers through Application
Programming Interfaces. The software returns text that is a transcription of
the speech using its statistical language models. While Google is notoriously
cautious about pre-announcing any plans, it appears that the APIs will not
allow tuning of the language models or acoustic models for the foreseeable
future, but will simply deliver the best-guess text for an application to
analyze.
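If the APIs deliver only best-guess text, interpreting that text falls to the application. The sketch below shows what such downstream analysis might look like. It does not call or imitate any Google API: `interpret_transcript` is a hypothetical application-side function, and the two intents are invented for the example.

```python
import re


def interpret_transcript(text):
    """Map a recognizer's best-guess text to a simple application action.

    The application cannot tune the recognizer, so it normalizes the text
    and applies its own rules to whatever comes back.
    """
    cleaned = text.strip().lower()
    match = re.match(r"(?:navigate|directions) to (.+)", cleaned)
    if match:
        return {"intent": "navigate", "target": match.group(1)}
    # Anything that matches no rule is treated as a plain search query.
    return {"intent": "search", "target": cleaned}


navigation = interpret_transcript("Navigate to the Ferry Building")
search = interpret_transcript("best sushi in san francisco")
```

Because the language model is outside the application's control, rules like these need to tolerate recognition variants (for example, "directions to" as well as "navigate to") rather than expect one exact phrasing.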
Byrne also spoke in another session, titled
“Solving New Design Dilemmas in Multi-modal Mobile UIs.” He noted that
the introduction of voice input features for both the Android platform and
Google Mobile App has uncovered a brand new set of User Interface (UI) design
issues involving utterance capture, correction, and disambiguation, just to
name a few. Byrne analyzed these new puzzles and provided a range of ideas and
strategies to address them, based on large amounts of user data.
The Google Mobile App (GMA) supports voice
search in English, Mandarin, or Japanese. It includes location support so that
one can avoid typing a current location while searching for nearby businesses.
Google Suggest in the GMA shows suggested options as one types, and those can
be chosen by tapping. Search history allows one to quickly search again for
queries that were recently performed. One can also search the phone’s email or
contacts. One can search in contexts such as Google Maps, Images, News, and
Shopping or navigate to other Google services, such as Gmail.
Another Google challenge was addressed by Pedro
Moreno, Speech Internationalization Tech Lead, Google: “Speech
Internationalization: The Google Experience.” In 2009, Google’s speech team
embarked on a massive new effort to internationalize the company’s
voice search product. Moreno discussed the researchers’ experiences in porting
Google’s voice search products to new languages and cultures. Google had to
resolve problems in data collection, text normalization, lexicon development,
language modeling, and complete system evaluation.
The overall impression of Google’s extensive participation
in the conference (and its role as a principal sponsor) is that (1) Google
recognizes the importance of the mobile device as a new and competitive
frontier [not a surprise, since its CEO gave a talk at another conference
entitled “Mobile First” (SSN, March 2010, p. 1)]; and (2) Google thinks that
speech will be an important part of the mobile user interface.