From Speech Strategy News, May 2010
AT&T has long been a pioneer in speech technology, starting
with its Bell Labs heritage. The company previously introduced AT&T Navigator, which uses speech recognition for
voice entry of addresses and points of interest on mobile phones (SSN, August 2008, p. 13)
and “Speech Mashups,” a way for Web and mobile developers to use speech
recognition, currently at no cost (SSN, September 2008, p. 1, and discussion at
the end of this article). Jay Wilpon, Executive Director, Speech Services
Research, AT&T Labs, gave a
keynote address at the Mobile Voice Conference in April on “The Changing Ecosystem:
Mobile Devices, Cloud Computing and Speech.” Jay, who began his career in 1977,
is an IEEE Fellow and an AT&T Fellow, one of the world’s pioneers in speech technology, and
AT&T’s chief evangelist for speech technologies and services.
Jay noted that speech and mobile services are
perfect together, not just because of the small form factor of the devices, but
because their wireless connection to the cloud allows the union of speech and
cloud computing. Speech processing in the cloud side-steps device-dependent
issues such as software management, limited processing, and battery life.
Jay said that the number of mobile broadband
subscribers passed the number of fixed broadband subscriptions in 2009. Mobile
devices other than handsets are projected to account for $90 billion in operator revenue by 2013 (Source: Rethink
Wireless, September 2008). AT&T has seen 6700% growth in wireless data
use over 13 quarters through Q3 2009. Strategy Analytics predicted
revenues from mobile apps delivered through App Stores would reach $2.7 billion
in the US alone in 2013.
Mobile phones have certainly evolved. The Nokia 3210, released in 2Q 1999, sold 160
million units, but had a monochrome screen showing six lines of text and memory
for 250 phone book entries. The Apple iPhone 3GS, released 2Q 2009, has a high-resolution touch screen
and 32 GB of memory. Speech recognition can be largely independent of
device or platform today, but the variety of platforms, particularly the number
of operating systems on mobile phones, presents a challenge to the developer.
Speech in the cloud is changing the ecosystem
of the speech technology industry. The
ecosystem has evolved from proprietary all-in-one systems that required the
legendary forklift upgrade to a more distributed system. Today, speech
recognition in the cloud can evolve largely independent of platform and can be
purchased with a pay-as-you-go model. AT&T provides its Watson speech
recognition and Natural Voices text-to-speech through its speech mashups
cloud-based service.
Jay
described AT&T Speak4it, an application that uses AT&T’s cloud-based
speech resources (www.speak4it.com). Speak4it is a multimodal voice-driven
local search app for the Apple iPhone, iPad, and iPod touch. Just press the
“Push to speak” button and say what you’d like to find. You can even point to a
spot on the map and ask what’s there. Speak4it knows about most of the businesses
in the United States.
AT&T has also partnered with Vlingo (p. 14). Vlingo offers an
application that performs web search,
local business search, voice dialing, and voice dictation of entries for
Twitter, Facebook, texting, and e-mail.
Another
company using the AT&T services is Qooco, which provides a web-based
service for learning English as a second language. The service is initially
targeted at a Chinese audience. Jay provided a video demo of the application,
which allows conversing with an avatar in typical situations, such as ordering
a beverage at a coffee shop. He showed additional applications as well.
Jay noted that, despite good initial success,
the ultimate size of the market for speech and mobile services and its key
players are far from clear. The forces at work are complex and vary
substantially with the class of application—characterizing them holds the key
to predicting the success of speech and mobile services.
Developer ecosystem
Jay noted that AT&T is developing a
“world-class developer ecosystem.” The AT&T
service delivery approach includes core network capabilities through its voice
and data network, and the company has to date taken the view that increasing
use of that core system justifies the speech technology support it is
providing. At the platform level, AT&T is providing application
enablers and APIs. At the Apps/Services level, AT&T is creating
“intelligent services.” It is providing a computing infrastructure and speech
technology in the network. It provides developer support. The result is a much
less difficult environment for developing and delivering speech-enabled
applications. A tool named Plusmo will convert an app developed once for
deployment on multiple mobile platforms (see http://att.com/sdk). Some of the tools and APIs are not currently
available, but are planned for introduction throughout 2010. One possible plan
is a “Pull Thru” model, which Jay indicated is the best plan for app
developers, where one gives an API away free and monetizes through other
services (similar to Google Maps and PayPal).
Speech Mashups
“Speech mashups” provide an easy way for web
developers to incorporate a speech interface into their web apps so their users
can use voice commands and receive back spoken responses. All speech and
language processing—including speech recognition, text-to-speech conversion, and
natural language processing—is performed on AT&T servers.
Speech mashups work as follows: Audio or text from a mobile
device or a web browser is relayed over the mobile network to the speech mashup
manager, which manages the entire process by accessing AT&T servers where
the speech and language processing takes place, and then relaying the result
(interpreted into programming language) to the web application. If the
application result is to be spoken, the speech mashup manager sends it for TTS
conversion before relaying the spoken response back to the user. A developer's
guide with instructions and examples is available from the speech mashup portal
(http://service.research.att.com/smm/).
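The flow described above can be sketched in a few lines of Python. This is only an illustration of the sequence of steps, not AT&T's API: every name here (speech_mashup_manager, recognize, interpret, synthesize, local_search_app, MashupResult) is hypothetical, and the recognition, language processing, and TTS functions are local stand-ins for work that the real service performs on AT&T servers. The developer's guide on the portal is the authority for the actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class MashupResult:
    interpretation: dict           # structured result handed to the web application
    reply_text: str                # the web application's textual answer
    reply_audio: Optional[bytes]   # synthesized speech, or None if not requested


def recognize(audio: bytes) -> str:
    """Hypothetical stand-in for server-side speech recognition."""
    # For illustration, pretend the audio bytes are their own transcript.
    return audio.decode("utf-8")


def interpret(text: str) -> dict:
    """Hypothetical stand-in for server-side natural language processing."""
    return {"intent": "search", "query": " ".join(text.lower().split())}


def synthesize(text: str) -> bytes:
    """Hypothetical stand-in for server-side text-to-speech conversion."""
    return b"AUDIO:" + text.encode("utf-8")


def speech_mashup_manager(
    user_input: Union[bytes, str],
    web_app: Callable[[dict], str],
    speak_reply: bool = True,
) -> MashupResult:
    """Relay input through recognition and interpretation to the web app,
    then optionally through TTS on the way back to the user."""
    # Audio goes through recognition; text input skips that step.
    text = recognize(user_input) if isinstance(user_input, bytes) else user_input
    # The interpreted result is what the web application receives.
    interpretation = interpret(text)
    reply_text = web_app(interpretation)
    # Only a reply that is to be spoken is sent for TTS conversion.
    reply_audio = synthesize(reply_text) if speak_reply else None
    return MashupResult(interpretation, reply_text, reply_audio)


def local_search_app(interpretation: dict) -> str:
    """Hypothetical web application that consumes the interpreted request."""
    return f"Results for {interpretation['query']}"


result = speech_mashup_manager(b"coffee shops nearby", local_search_app)
silent = speech_mashup_manager("Pizza", local_search_app, speak_reply=False)
```

In this sketch the web application never touches audio: it sees only the interpreted request and returns text, which mirrors the article's point that all speech and language processing stays on the server side.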