TMA Associates

From Speech Strategy News, May 2010

 

Google seeks “ubiquitous availability” for a speech option

Keynote and talks at the Mobile Voice Conference

Michael Cohen, Manager, Speech Technology, Google, gave a keynote address entitled “Transparent Mobility” at the Mobile Voice Conference in San Francisco on April 22, 2010. One aspect of mobile phone use, Cohen noted, is the growing expectation of completely ubiquitous availability: users increasingly expect constant access to the information and services of the web. Small devices and use while on the move have increased the importance of speech technology in meeting this need for ubiquitous mobile access: “any time, any place, any usage scenario, as part of any type of activity.”

Cohen said that achieving ubiquity of spoken access requires two things: (1) availability in every possible interaction where speech input or output makes sense, and (2) performance (it works so well that the modality adds no friction to the interaction). If speech is always available whenever a keyboard can be used, the only addition necessary is some sort of “speak” or “listen” button icon (although training users to use such a button properly is a bit challenging). Cohen said that there is a strong measured correlation between speech recognition accuracy and repeat usage, suggesting that users learn to use a speech interface to best advantage with practice. Speech input will be offered whenever a keyboard is displayed on the Google Nexus One mobile phone released in January, Cohen said.

Cohen said that, to get to this point, delivery from the cloud and operating at a large scale are important, two capabilities that Google obviously can support. He said that Google voice search gets “lots of use.” Its goal is to recognize any search query by voice. Beyond search, navigation is an important application for voice. Google wants to support navigation requests from the search bar. In late April, Google moved further toward ubiquitous availability of voice search in Google Maps with a release of a version that lets users of Windows Mobile and Symbian S60 phones search by voice. The feature has been available on BlackBerry and Android phones, but is still unavailable for the iPhone.

In another aspect of scale, Google voice search got off to a good start because Google had a huge text database of searches from which to create the initial language model. The initial English model was created from a 240-billion-word database and contains one million unique words. (The dictionary contains fewer entries because it doesn’t count variations, such as “write,” “wrote,” “written,” and “writing,” as individual entries.) It takes 70 CPU-years to train a language model with this amount of data, Cohen said. Testing shows that more complex language models do improve accuracy, he indicated.
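To make the idea of building a language model from typed searches concrete, here is a minimal sketch in Python. It is not Google's method, which the talk did not describe in detail; it assumes a simple bigram model with add-one smoothing, and the query list and function names are hypothetical.

```python
from collections import Counter

def train_bigram_counts(queries):
    """Count unigrams and bigrams over whitespace-tokenized queries."""
    unigrams, bigrams = Counter(), Counter()
    for query in queries:
        # <s> and </s> mark the start and end of each query.
        tokens = ["<s>"] + query.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one smoothing over the observed vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy stand-in for a log of typed searches (hypothetical data).
typed_queries = [
    "pizza near me",
    "pizza delivery",
    "weather san francisco",
    "pizza near union square",
]
unigrams, bigrams = train_bigram_counts(typed_queries)

# Word sequences seen often in the log receive a higher probability.
likely = bigram_prob("pizza", "near", unigrams, bigrams)
unlikely = bigram_prob("pizza", "weather", unigrams, bigrams)
```

A recognizer can use such probabilities to prefer word sequences that resemble real searches. A production model trained on hundreds of billions of words would need far more sophisticated smoothing and distributed training than this sketch suggests.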

And the model is continuously adapted as Google learns how people say their search requests. “People speak differently in real use than in the laboratory,” Cohen noted. One difference, for example, is that people speak more “Wh” questions (“What,” “Where,” “When,” etc.) than they type.

Acoustic models, representations of particular sounds (usually with units larger than a single phoneme), are also continuously improved. The system uses “machine learning” techniques to do so: rather than requiring human transcription of the speech, it assumes that transcriptions scoring high on the speech recognizer’s confidence measure are correct. Cohen said that it takes about twice as much data to get the same improvement using unsupervised learning as it would if the data were transcribed.
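The confidence-based selection described above can be sketched as follows. This is an illustration only: the Hypothesis record, the 0.9 threshold, and the sample utterances are assumptions, not details given in the talk.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    audio_id: str      # identifier of the recorded utterance (hypothetical)
    text: str          # recognizer's best-guess transcription
    confidence: float  # recognizer's confidence score, assumed to lie in [0, 1]

def select_training_pairs(hypotheses, threshold=0.9):
    """Keep only utterances whose automatic transcription is trusted as a label."""
    return [(h.audio_id, h.text) for h in hypotheses if h.confidence >= threshold]

# Hypothetical recognizer output for three utterances.
recognized = [
    Hypothesis("utt-001", "pizza near me", 0.97),
    Hypothesis("utt-002", "whether san francisco", 0.55),
    Hypothesis("utt-003", "directions to the airport", 0.92),
]

# Only the high-confidence utterances would be fed back into acoustic training.
training_pairs = select_training_pairs(recognized)
```

Discarding low-confidence utterances avoids training on likely errors. Some wrong transcriptions will still pass the threshold, which may be part of why such unsupervised data is less efficient than human-transcribed data, as Cohen's twofold estimate indicates.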

For real-time recognition, the system uses a number of models in parallel, covering aspects of the speech signal such as accent, speaker-specific characteristics, and background noise. The powerful Google cloud allows substantial parallelization.
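One plausible way to run several condition-specific models side by side is sketched below. The talk did not say how the parallel outputs are combined, so picking the single best-scoring model is an assumption, as are the model names and the toy scoring function.

```python
from concurrent.futures import ThreadPoolExecutor

def make_model(name, expected_level):
    """Build a toy 'model' that scores features by closeness to an expected level."""
    def score(features):
        # Higher (less negative) score means a better match; purely illustrative.
        return name, -sum((x - expected_level) ** 2 for x in features)
    return score

# Hypothetical models for different background-noise conditions.
models = [
    make_model("quiet-room", 0.0),
    make_model("street-noise", 0.5),
    make_model("car-noise", 1.0),
]

def best_model(features):
    """Score the same features with every model concurrently and keep the best."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        results = list(pool.map(lambda model: model(features), models))
    return max(results, key=lambda result: result[1])

winner = best_model([0.4, 0.6, 0.5])
```

Because each model scores the input independently, the work spreads naturally across many machines, which is where a large cloud infrastructure helps.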

In a separate session at the Mobile Voice Conference, Bill Byrne, Voice Interface Engineer, and Alex Gruenstein, Software Engineer, Google, presented a talk, “Google speech APIs: How developers can use available speech resources.” In short, Google will make its network-based speech recognition available to other developers through Application Programming Interfaces. The software uses its statistical language models to return a text transcription of the speech. While Google is notoriously cautious about pre-announcing any plans, it appears that the APIs will not allow tuning of the language models or acoustic models for the foreseeable future, but will simply deliver the best-guess text for an application to analyze.

Byrne also spoke in another session, on the subject “Solving New Design Dilemmas in Multi-modal Mobile UIs.” He noted that the introduction of voice input features for both the Android platform and Google Mobile App has uncovered a brand new set of User Interface (UI) design issues involving utterance capture, correction, and disambiguation, just to name a few. Byrne analyzed these new puzzles and provided a range of ideas and strategies to address them, based on large amounts of user data.

The Google Mobile App (GMA) supports voice search in English, Mandarin, or Japanese. It includes location support so that one can avoid typing a current location while searching for nearby businesses. Google Suggest in the GMA shows suggested options as one types, and those can be chosen by tapping. Search history allows one to quickly search again for queries that were recently performed. One can also search the phone’s email or contacts. One can search in contexts such as Google Maps, Images, News, and Shopping or navigate to other Google services, such as Gmail.

Another Google challenge was addressed by Pedro Moreno, Speech Internationalization Tech Lead, Google: “Speech Internationalization: The Google Experience.” In 2009 Google’s speech team embarked on a massive new effort, namely to internationalize the company’s voice search product. Moreno discussed the researchers’ experiences in porting Google’s voice search products to new languages and cultures. Google had to resolve problems in data collection, text normalization, lexicon development, language modeling, and complete system evaluation.

The overall impression of Google’s extensive participation in the conference (and their being a principal sponsor) is that (1) Google recognizes the importance of the mobile device as a new and competitive frontier [not a surprise, since their CEO gave a talk at another conference entitled “Mobile First” (SSN, March 2010, p. 1)]; and (2) Google thinks that speech will be an important part of the mobile user interface.