From Speech Strategy News, September 2007

Vlingo attempts a universal speech interface for mobile devices

Founded by industry veterans, features adaptive “Hierarchical Language Model”

Vlingo, Inc. is a start-up with impressive founders and an interesting product. Readers of this newsletter may be familiar with Mike Phillips, co-founder and Chief Technology Officer of Vlingo, who held the same position at SpeechWorks (later acquired by ScanSoft, now Nuance, where he was also CTO) and spent some time at MIT before co-founding Vlingo. Dave Grannan, Vlingo president and CEO, was General Manager, Mobility Solutions at Nokia and an AVP and General Manager at Sprint PCS. John Nguyen, Vlingo co-founder and VP Engineering, was also at ScanSoft, as senior director of Network ASR Products, and was a VP of Engineering at Groove Mobile.

Investors include well-known firms Charles River Ventures and Sigma Partners. In addition to $6.5 million in financing, the firms contributed board members Izhar Armony, general partner for Charles River Ventures; Robert Davoli, managing director with Sigma Partners; and Jeff Dunn, former COO of the Nickelodeon Networks group and chief executive of Nickelodeon Film and Enterprises.

On August 21, Vlingo, Inc. announced a beta version of software that creates a user interface on a mobile phone using thin client software and network-based speech recognition. (Speech is sent over the data channel.) The Vlingo beta is currently available on the Vlingo web site directly to consumers with most 3G phone models.

The key feature, Phillips said, is that the user interface is consistent for all applications. Users say what they want to enter into a Vlingo-enabled application and the text is entered into that application. Users can enter text manually when they don’t want to use voice input. Vlingo has no application-specific grammars or scripted interactions.
“Consumers haven’t completely embraced mobile data services yet for one simple reason: they’re being held hostage by 12 tiny keys,” said Grannan. “Vlingo removes this obstacle of the past by giving consumers control over their phones with the power of speech. By opening up the potential for these mobile data services, Vlingo gives carriers and mobile application providers a quantum leap in usability and the corresponding revenue opportunities with the only voice user interface ‘plug in’ on the market.”

Unlike some large-vocabulary solutions, the technology is speaker-independent in that no enrollment is required, but the system is speaker-adaptive. Phillips said Vlingo is performing unsupervised (or user-supervised given correction feedback) adaptation on acoustic models and pronunciations, making it possible, for example, to take account of a Southern accent. He said, “We are doing all of these forms of adaptation both on a per-user basis and across users and groups of users.”

The key to the potential success of this ambitious speech-to-text technology is accuracy, in particular, creating a language model that will not frustrate the user. Vlingo uses a licensed speech recognition technology, the source of which the company declined to name. The speech vendor has given Vlingo access to the technology at a deep enough level that Vlingo can supplement the core speech technology with a new approach they call adaptive Hierarchical Language Models (HLMs), a form of Statistical Language Models (SLMs).

One key to HLMs is that they adapt to the user’s habits and the specific text box in a particular application, increasing accuracy over time. For example, Phillips said, the user might often send “how’s it goin’” text messages. This adaptation in itself would not be remarkable, since classical SLMs can adapt.
HLMs go further, in that, in effect, a different language model is being used for each user depending on his/her history, but the HLM is nevertheless maintained as one large language model. A key to this approach is that, while it might appear that the technology is user-specific, it is in fact collaborative. The HLM takes into account what other users say in the text box as well as what the current user says (presumably with different weights). For example, Grannan noted, suppose a new singer appears on the scene and the speech recognition gets the name wrong in a song request. If the user then types it in, that correction becomes a new vocabulary word in the system, available to all users. In order to maintain this centralized adaptive technology, Vlingo is offered to application developers and service providers as a hosted service.

The HLM approach, Phillips said, supports previously difficult speech tasks, such as tasks with unbounded vocabularies; tasks that have very large vocabularies (more than 1 million entries); and tasks with rapidly changing vocabularies. A demo is available at www.vlingomobile.com/demo.

Phillips said that integration with software on the phone is easy, in that the speech engine is in the network. Integration is through Application Program Interfaces (APIs) that do audio capture and send the audio over the network. Vlingo’s solution is designed for higher-end phones that have more open environments.

A possible deficiency is that the application is not eyes-free (desirable, for example, while driving), in that it uses the visual interface on the device; this can potentially be remedied by text-to-speech software on the phone, versions of which are available independently. Phillips said that Vlingo has in fact found it useful to use text-to-speech verification in some applications.

Another possible difficulty is that many text applications on mobile phones are menu-driven and hierarchical in themselves.
When Vlingo voice-enables such applications or the user navigates between them, the interface becomes more “modal” and less consistent. The best use of Vlingo technology may be interfaces that are already flexible, such as Web search or text messaging. Phillips commented, “While this can coexist with menu-driven navigation and multiple text boxes for more structured interactions, we encourage application developers to move to a more open-search sort of approach in their applications.”
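
As an illustration of the collaborative adaptation described earlier in this article, where a single model draws on the current user, the specific text box, and all users, presumably with different weights, the idea can be sketched as a weighted mix of three sets of phrase counts. Vlingo has not disclosed how HLMs are actually built, so everything here is an assumption made for illustration: the class names, the weights, the use of whole-phrase counts, and the choice of simple linear interpolation.

```python
from collections import Counter


class LevelModel:
    """Phrase counts for one level of the hierarchy (hypothetical structure)."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, phrase):
        self.counts[phrase] += 1
        self.total += 1

    def prob(self, phrase):
        # Relative frequency; 0.0 if this level has seen nothing yet.
        return self.counts[phrase] / self.total if self.total else 0.0


class InterpolatedModel:
    """One shared model that scores a phrase differently per user and text box."""

    def __init__(self, weights=(0.2, 0.3, 0.5)):
        # Assumed weights for (all users, this text box, this user in this box).
        self.weights = weights
        self.global_level = LevelModel()
        self.box_level = {}
        self.user_level = {}

    def observe(self, user, box, phrase):
        # Every observation updates all three levels, so one user's input
        # also benefits other users of the same text box.
        self.global_level.observe(phrase)
        self.box_level.setdefault(box, LevelModel()).observe(phrase)
        self.user_level.setdefault((user, box), LevelModel()).observe(phrase)

    def prob(self, user, box, phrase):
        levels = (
            self.global_level,
            self.box_level.get(box),
            self.user_level.get((user, box)),
        )
        score = 0.0
        weight_used = 0.0
        for weight, level in zip(self.weights, levels):
            if level is not None and level.total:
                score += weight * level.prob(phrase)
                weight_used += weight
        # Renormalize over the levels that actually have data.
        return score / weight_used if weight_used else 0.0


model = InterpolatedModel()
model.observe("alice", "sms", "how's it goin'")
model.observe("alice", "sms", "how's it goin'")
model.observe("bob", "sms", "see you soon")
model.observe("bob", "music", "new singer")

p_alice = model.prob("alice", "sms", "how's it goin'")
p_bob = model.prob("bob", "sms", "how's it goin'")
p_carol = model.prob("carol", "music", "new singer")
```

In this sketch the phrase scores highest for the user who says it often, still scores above zero for another user of the same text box, and a name entered by one user in the song-request box is immediately available to a brand-new user. A production recognizer would work with word sequences and smoothing rather than whole-phrase counts.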