TMA Associates

From Speech Strategy News, May 2010

AT&T provides speech processing in the cloud

“Speech and mobile services are perfect together”

AT&T has long been a pioneer in speech technology, starting with its Bell Labs heritage. The company previously introduced AT&T Navigator, which uses speech recognition for voice entry of addresses and points of interest on mobile phones (SSN, August 2008, p. 13), and “Speech Mashups,” a way for Web and mobile developers to use speech recognition, currently at no cost (SSN, September 2008, p. 1, and discussion at the end of this article). Jay Wilpon, Executive Director, Speech Services Research, AT&T Labs, gave a keynote address at the Mobile Voice Conference in April on “The Changing Ecosystem: Mobile Devices, Cloud Computing and Speech.” Jay, who began his career in 1977, is an IEEE Fellow and an AT&T Fellow, one of the world’s pioneers in speech technology, and AT&T’s chief evangelist for speech technologies and services.

Jay noted that speech and mobile services are perfect together, not just because of the small form factor of the devices, but because their wireless connection to the cloud allows the union of speech and cloud computing. Speech processing in the cloud side-steps device-dependent issues such as software management, limited processing, and battery life.

Jay said that the number of mobile broadband subscribers passed the number of fixed broadband subscriptions in 2009. Mobile devices other than handsets are projected to account for $90 billion in operator revenue by 2013 (Source: Rethink Wireless, September 2008). AT&T has seen 6700% growth in wireless data use over 13 quarters through Q3 2009. Strategy Analytics predicted revenues from mobile apps delivered through App Stores would reach $2.7 billion in the US alone in 2013.

Mobile phones have certainly evolved. The Nokia 3210, released in 2Q 1999, sold 160 million units, but had a monochrome screen showing six lines of text and memory for 250 phone book entries. The Apple iPhone 3GS, released in 2Q 2009, has a high-resolution touch screen and 32 GB of memory. Speech recognition can be largely independent of device or platform today, but the variety of platforms, particularly the number of operating systems on mobile phones, presents a challenge to the developer.

Speech in the cloud is changing the ecosystem of the speech technology industry. The ecosystem has evolved from proprietary all-in-one systems that required the legendary forklift upgrade to a more distributed system. Today, speech recognition in the cloud can evolve largely independent of platform and can be purchased with a pay-as-you-go model. AT&T provides its Watson speech recognition and Natural Voices text-to-speech through its speech mashups cloud-based service.

Jay described AT&T Speak4it, an application that uses AT&T’s cloud-based speech resources (www.speak4it.com). Speak4it is a multimodal voice-driven local search app for the Apple iPhone, iPad, and iPod touch. Just press the “Push to speak” button and say what you’d like to find. You can even point to a spot on the map and ask what’s there. Speak4it knows about most of the businesses in the United States.

AT&T has also partnered with Vlingo (p. 14). Vlingo offers an application that performs web search, local business search, voice dialing, and voice dictation of entries for Twitter, Facebook, texting, and e-mail.

Another company using the AT&T services is Qooco, which provides a web-based service for learning English as a second language. The service is initially targeted at a Chinese audience. Jay provided a video demo of the application, which lets the user converse with an avatar in typical situations, such as ordering a beverage at a coffee shop. He showed additional applications as well.

Jay noted that, despite good initial success, the ultimate size of the market for speech and mobile services and its key players are far from clear. The forces at work are complex and vary substantially with the class of application—characterizing them holds the key to predicting the success of speech and mobile services.

Developer ecosystem

Jay noted that AT&T is developing a “world-class developer ecosystem.” The AT&T service delivery approach includes core network capabilities through its voice and data network, and the company has to date taken the view that increasing use of that core system justifies the speech technology support it is providing. At the platform level, AT&T is providing application enablers and APIs. At the Apps/Services level, AT&T is creating “intelligent services.” It is also providing a computing infrastructure and speech technology in the network, along with developer support. The result is a much less difficult environment for developing and delivering speech-enabled applications. A tool named Plusmo will let an app be developed once and then converted for deployment on multiple mobile platforms (see http://att.com/sdk). Some of the tools and APIs are not yet available but are planned for introduction during 2010. One possible plan is a “Pull Thru” model, in which one gives an API away free and monetizes through other services (similar to Google Maps and PayPal); Jay indicated that this is the best plan for app developers.

Speech Mashups

“Speech mashups” provide an easy way for web developers to incorporate a speech interface into their web apps so their users can use voice commands and receive back spoken responses. All speech and language processing—including speech recognition, text-to-speech conversion, and natural language processing—is performed on AT&T servers.

Speech mashups work as follows. Audio or text from a mobile device or a web browser is relayed over the network to the speech mashup manager. The manager oversees the entire process: it accesses the AT&T servers where the speech and language processing takes place, and then relays the result (interpreted into a form the program can use) to the web application. If the application result is to be spoken, the speech mashup manager sends it for TTS conversion before relaying the spoken response back to the user. A developer’s guide with instructions and examples is available from the speech mashup portal (http://service.research.att.com/smm/).
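
For readers who think in code, the flow just described can be sketched roughly as follows. This is only an illustration of the sequence of steps, not AT&T's API: every function and class name here is hypothetical, and the recognition, language processing, and TTS steps are local stand-ins for work that would actually happen on AT&T servers.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class MashupResult:
    text: str               # what the web application returned
    audio: Optional[bytes]  # spoken form of the reply, if one was requested


def recognize(audio: bytes) -> str:
    """Hypothetical stand-in for server-side speech recognition."""
    # Pretend the audio bytes already contain the transcript.
    return audio.decode("utf-8")


def interpret(transcript: str) -> dict:
    """Hypothetical stand-in for natural language processing."""
    # Turn the transcript into a structure a program can use.
    return {"query": transcript.strip().lower()}


def synthesize(text: str) -> bytes:
    """Hypothetical stand-in for text-to-speech conversion."""
    return b"AUDIO:" + text.encode("utf-8")


def speech_mashup_manager(
    user_input: Union[bytes, str],
    web_app: Callable[[dict], str],
    spoken_reply: bool = True,
) -> MashupResult:
    """Relay input through recognition and interpretation to the web app,
    then optionally through TTS on the way back to the user."""
    # Text input skips recognition; audio input is recognized first.
    if isinstance(user_input, bytes):
        transcript = recognize(user_input)
    else:
        transcript = user_input
    interpretation = interpret(transcript)
    reply = web_app(interpretation)
    audio = synthesize(reply) if spoken_reply else None
    return MashupResult(text=reply, audio=audio)


def local_search_app(interpretation: dict) -> str:
    """Hypothetical web application that receives the interpreted result."""
    return "Searching for " + interpretation["query"]


if __name__ == "__main__":
    result = speech_mashup_manager(b"  Coffee Shops ", local_search_app)
    print(result.text)
    print(result.audio)
```

The point of the sketch is the division of labor: the web application only ever sees the interpreted result and returns text, while everything speech-related sits behind the manager.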