From Telephone Strategy News -
April 2004
Microsoft launches Microsoft Speech Server 2004 with pricing details
Bill Gates keynote at
AVIOS~SpeechTEK 2004 includes Microsoft vision for speech
Microsoft Chairman and Chief Software Architect Bill
Gates launched Microsoft Speech Server in his keynote address on March 24, 2004
at the co-located AVIOS~SpeechTEK Spring 2004, Microsoft Mobile
Developer Conference 2004, and Fawcette Technical Publication’s VSLive!
San Francisco 2004. In the context of the company’s vision for “Seamless
Computing,” Gates highlighted plans in mobile computing, Visual Studio
development tools, and the Microsoft Speech Server 2004 (MSS) for telephone and
multimodal applications. A follow-on keynote by Kai-Fu Lee, corporate vice
president for the Microsoft Speech Server product group, gave further details
and examples of beta deployments.
MSS includes Microsoft’s
telephone speech recognition, a SALT (Speech Application Language Tags)
interpreter, ScanSoft TTS, and other platform software. It is priced on a
per-CPU basis, rather than the per-port pricing more common in telephony
applications. Development tools, which work within a Microsoft Visual Studio
Integrated Development Environment, are free. An overview of the architecture is
presented visually at the end of this article. MSS supports DTMF (touch-tone)
only, combined speech and DTMF, and multimodal (mixed speech and visual)
applications.
The Microsoft Speech Server
2004 comes in two versions: Standard Edition and Enterprise Edition. The
versions are technically equivalent, except that the Standard Edition is
software-limited to 24 simultaneous speech recognition processes. The Enterprise
Edition has no such software limit; it is constrained only by the number of
processes its host server can support, which depends on the complexity of the
application. As an example of a high-load configuration, James Mastan, director
of marketing for the Microsoft Speech Server, described a 96-port system
handled by three dual-CPU servers (one running Telephony Application Services
and two running Speech Engine Services; see the figure at the end of this
article) under a worst-case load and mix of applications. For organizations that
want to run a four-port trial, Intel and Microsoft are offering a $995 Starter
Kit that includes an Intel four-port telephone interface board and MSS Standard
Edition with a 180-day license (p. TBB).
MSS will be generally
available on June 1, although the Starter Kit may be available in May. The
initial release of MSS is in English and targeted at North America. The product
builds on existing Microsoft development environments, and should be
particularly attractive to companies already using Windows Server 2003 and
Visual Studio .NET 2003 development tools.
A wide range of partners
supporting the release have been previously announced, including Intel,
Intervoice, and ScanSoft. Intel and Intervoice announced details of their
respective Telephone Interface Manager (TIM) software supporting MSS (pp. 11 and
13); a TIM is separately priced and is required to run MSS. The system works
with Intel Dialogic 4-port and 16-port analog boards, and 48-channel and
96-channel digital boards. ScanSoft’s text-to-speech is bundled with MSS at no
additional cost. MSS also supports the ScanSoft OpenSpeech Recognizer engine if
purchased separately. Other partners have been previously announced (e.g., TSN,
March 2004, p. 1), and new partners announced support timed to coincide with the
Microsoft announcement (p. 7). Announcements include tools, hardware, and
applications. Brooktrout announced that it would also make its telephone
interface boards compatible with MSS (p. 7 and 19).
The announcement emphasized
several key objectives: (1) a lower price point, making telephone speech
technology more accessible to small-to-medium enterprises; (2) integration with
existing Web development tools, so that developers familiar with those tools
could more easily incorporate speech solutions; (3) integration with existing
Web delivery technologies, so that the work done on existing Web solutions and
their continuing management could be leveraged for speech; (4) an integrated
solution, so that companies could move quickly to deployment; and (5) a flexible
platform that could support pure telephone solutions, but also multimodal
solutions within the same environment. “For years now, this technology has been
accessible only to a short list of Fortune 500 companies because it has been so
difficult and expensive to implement,” Lee said. “Both large and midsize
companies need a lower cost of entry and lower total cost of ownership. A key
value of Speech Server is to dramatically reduce the cost and complexity of
developing and deploying speech applications, making the technology more
accessible to a broader range of enterprise customers.” Gates emphasized the
current strength and value of the Visual Studio .NET and .NET Framework
ecosystem for customers and partners, highlighting the distribution of more than
80 million copies of the .NET Framework, the 2.5 million developers currently
using Visual Studio .NET, the more than 180 Visual Studio Industry Partners and
the fact that more than 60% of Fortune 100 companies run on the .NET
Framework.
Mastan addressed the
“speech opportunity space.” He said that Microsoft is initially targeting the
thousands of large organizations with more than 250 agents in their call centers
and the hundreds of thousands of medium organizations with call centers of 25 to
250 agents. In addition, there is a market of millions of small organizations
with fewer than 25 agents that is more difficult to address today.
George Platt, senior vice
president of marketing and corporate strategy, Intervoice, said, “Call centers
have been the traditional market for speech technology. The launch of Microsoft
Speech Server gives speech technology a lot of visibility and credibility, and
early on you’ll see new solutions for call centers of all kinds, from financial
services to insurance companies and retailers. But other kinds of applications
will soon follow, such as human resources solutions that provide employee
self-service. That’s really going to help this industry grow.”
Pricing
The estimated retail price
(ERP) for MSS 2004 Standard Edition is $7,999 per processor; for MSS 2004
Enterprise Edition it is $17,999 per processor. A Microsoft spokesperson said
that there may be significant discounting from the ERP (20-50%, depending on
volume licensing arrangements that take into account the total amount of
Microsoft software purchased, not just MSS software). Using the previously
mentioned example of a 96-port application handled by three dual-CPU servers
(six CPUs total), the Enterprise Edition ERP is about $1,125 per port. For the
Standard Edition, Mastan gave the example of a single dual-CPU server (with both
Telephony Application Services and Speech Engine Services on one server)
handling 24 ports, for an ERP of about $667 per port. Both examples assume a
100% load (all channels operating simultaneously) with three applications
running simultaneously (outbound banking alerts, a mid-sized, mixed-initiative
travel booking application, and a 100,000-name auto-attendant application).
Mastan said a less intensive application should be able to support 96 ports
with two (rather than three) dual-processor servers.
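The per-port figures above follow directly from the per-CPU list prices. The short Python sketch below reproduces that arithmetic; the function and variable names are illustrative only, and the two-server case simply applies Mastan's "less intensive application" scenario to the same Enterprise Edition price.

```python
# Estimated retail price (ERP) per processor, as quoted in the article.
ERP_PER_CPU = {"standard": 7_999, "enterprise": 17_999}


def erp_per_port(edition: str, servers: int, cpus_per_server: int, ports: int) -> float:
    """Spread the MSS license ERP for all CPUs across the supported ports."""
    total_cpus = servers * cpus_per_server
    return ERP_PER_CPU[edition] * total_cpus / ports


# 96 ports on three dual-CPU servers (Enterprise Edition).
enterprise_96 = erp_per_port("enterprise", servers=3, cpus_per_server=2, ports=96)
# 24 ports on one dual-CPU server (Standard Edition).
standard_24 = erp_per_port("standard", servers=1, cpus_per_server=2, ports=24)
# Lighter load: 96 ports on two dual-CPU servers (Enterprise Edition).
lighter_96 = erp_per_port("enterprise", servers=2, cpus_per_server=2, ports=96)

print(round(enterprise_96), round(standard_24), round(lighter_96))  # 1125 667 750
```

These are list prices only; the 20-50% volume discounts mentioned by Microsoft would lower each figure proportionally.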
MSS requires a Telephone
Interface Manager (TIM) software module to connect to Dialogic phone interface
cards from Intel. The TIM versions are available separately from either
Intervoice (p. 11) or Intel (“NetMerge Call Manager,” p. 13). (Other card
manufacturers will support MSS; see p. 19). The cost of the Intervoice TIM
ranges from $125 per port for four ports to $300 per port for large numbers of
ports. A CTI (computer telephony integration) software module is an option for
the TIM from Intervoice; it allows easier connection to many existing telephone
systems that route calls or support agents, but is not required for all
installations. The CTI module is about $300 per port. The Intel TIM, NetMerge
Call Manager, is expected to be priced by distributors at $75-$100 per port. It
also does not include CTI functions, which Intel handles differently than
Intervoice (see p. 13).
Using the cases mentioned,
and assuming the upper estimate of $100 per port for the Intel NetMerge Call
Manager software, the cost of the MSS software plus the TIM would be about $767
per port for 24 ports and $1,225 per port for 96 ports. For very small port
counts, the per-port price rises rapidly because of the per-CPU MSS pricing. As
a point of comparison, the Nuance Voice Platform lists at $1,600 to $3,200 per
port, depending on the complexity of the application.
The Intel cards are sold
through distributors and VARs, often in bundles or with added services, and, in
some cases, with discounting. As an estimate, a four-port card without extra
services would be around $1,500; a 48-port card around $11,000; and a 96-port
card around $16,000. If a dual-CPU server is about $4,000, a 96-port system’s
core price (without installation or other services, without application
software, without management software, and without discounts) with three
dual-processor servers is in the ballpark of $146,000 [3x4,000 + 6x17,999 +
16,000 + 96x100].
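The bracketed estimate can be itemized as a short calculation. The Python sketch below uses only the figures quoted in this article (a $4,000 server, Enterprise Edition ERP, a $16,000 96-port card, and $100 per port for the TIM); the function and key names are illustrative.

```python
def core_price_breakdown(servers, cpus_per_server, ports,
                         server_price, mss_erp_per_cpu,
                         board_price, tim_per_port):
    """Return the itemized core price of an MSS system, before any discounts."""
    return {
        "servers": servers * server_price,
        "mss_licenses": servers * cpus_per_server * mss_erp_per_cpu,
        "telephony_board": board_price,
        "tim_software": ports * tim_per_port,
    }


breakdown = core_price_breakdown(
    servers=3, cpus_per_server=2, ports=96,
    server_price=4_000, mss_erp_per_cpu=17_999,
    board_price=16_000, tim_per_port=100,
)
total = sum(breakdown.values())

for item, cost in breakdown.items():
    print(f"{item:16s} ${cost:>9,}")
print(f"{'total':16s} ${total:>9,}")  # total comes to $145,594
```

The MSS licenses account for roughly three-quarters of the total, which is why the per-CPU price, and any discount on it, dominates the estimate.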
Deployments
Mastan indicated that there
were 600 beta participants for MSS (chosen from 1,100 applicants). He said that
the effort generated 6,000 proactive customer leads, over 60 partners, and 24
Early Adopter Program (EAP) customers.
Microsoft Speech Server was
run continually at a high load for 33 days without failure, according to Mastan,
with 95% of calls having a user-perceived latency of less than 1.5 seconds.
Microsoft estimates that current industry solutions for such tests have 80% of
calls with user-perceived latency of under 2.5 seconds. Mastan said that Speech
Server achieved a 95% successful call completion rate in Microsoft’s tests.
In the AVIOS~SpeechTEK
announcement, Microsoft highlighted EAP participants Grange Insurance,
the New York City Department of Education (NYC DOE), and Southwest
Alabama Integrated Criminal Justice System (SAICS). Grange Insurance
automated the handling of routine customer requests, such as “Did you get my
payment?” or “When is my payment due?” through a contract with Tata
Consultancy Services, a Microsoft Global System Integrator. Grange was so
enamored of the system that it deployed it live in April using the beta version
of MSS.
The NYC DOE is working with
Microsoft and Intervoice to develop a voice-enabled telephony application
for parents to check such things as their child’s attendance record, course
grades and daily lunch menus. MSS was attractive to the NYC DOE in part because
it enabled the department to leverage an already existing interface developed
for the department’s Web site. Adding the telephone helped bridge the “digital
divide,” according to Richard Langford, deputy chief information officer for the
NYC DOE, and made the system accessible to parents without Internet access or
away from their computers.
SAICS worked with MSS
partner ComputerTalk to develop a speech-enabled law enforcement
application to help officers in the field and to reduce the burden on
dispatchers. Officers can access driver’s license, Social Security, and license
plate data over the phone with a direct voice query. The application transmits
data verbally and visually for devices that support multimodal applications.
Microsoft has also
partially deployed an MSS-based auto-attendant system called MS Connect within
Microsoft. When fully deployed, the new system will provide callers with fast
and accurate calling access to any of the 50,000 Microsoft employees through a
speech-enabled application, rather than requiring a human operator.
Partners
The Microsoft Speech Server
Partner Program now includes more than 60 companies that provide prepackaged
applications and services to customers implementing MSS. Partners joining Lee on
stage at AVIOS~SpeechTEK included Accenture, Solar Software, and
Voice Automation Inc. Accenture demonstrated speech-enabled telephony and
multimodal commerce solutions; Solar Software demonstrated a voice-enabled IT
administrator and Help Desk automation solution based on Windows networks and
Active Directory; and Voice Automation demonstrated a speech-enabled Microsoft
customer relationship management (CRM) application. For a sampling of other
partners, see p. 7.
Development tools
The Microsoft Speech
Application SDK (SASDK) was built for the .NET platform and leverages the Visual
Studio .NET development environment familiar to many developers. The SALT
specification, rather than defining a language in itself, specifies how one can
add “Speech Application Language Tags” to Web applications. An interpreter
processes the SALT tags at runtime to manage speech and multimodal dialogs.
The tools build on the
drag-and-drop and other features of the Visual Studio environment, adding speech
“controls,” functional pieces that perform specific actions. These include basic
controls that respond to specific events, dialog controls that encapsulate
building blocks of dialog and track the state of the user-computer interaction,
and application controls that accelerate the development of common voice-only
scenarios by composing dialog controls with additional features.
The SASDK includes a prompt
editor and grammar editor. The prompt editor allows listening to and editing
individual recorded prompts. The grammar editor provides several ways to view
the grammar, including a graphical view. It allows mapping the grammar to a
semantic interpretation, such as mapping all the various ways to say “yes” to
one generic YES. There is also a grammar library of commonly used grammars that
can be used within a larger grammar.
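The idea of mapping many spoken variants to one semantic value can be illustrated outside the SDK. The Python sketch below is not the SASDK grammar editor or the W3C grammar format; the phrase lists and the `interpret` function are hypothetical, chosen only to show the many-to-one mapping.

```python
from typing import Optional

# Hypothetical phrase lists; a real application would author these in a grammar.
YES_PHRASES = {"yes", "yeah", "yep", "sure", "correct", "that's right"}
NO_PHRASES = {"no", "nope", "no thanks", "incorrect"}


def interpret(utterance: str) -> Optional[str]:
    """Map a recognized utterance to a generic semantic value, or None."""
    # Normalize case and drop surrounding whitespace and trailing punctuation.
    text = utterance.strip().lower().rstrip(".!?")
    if text in YES_PHRASES:
        return "YES"
    if text in NO_PHRASES:
        return "NO"
    return None  # Not covered by either phrase list.


for said in ["Yeah!", "That's right", "Nope.", "maybe"]:
    print(f"{said!r} -> {interpret(said)}")
```

The application logic then only has to branch on YES or NO, regardless of how the caller phrased the answer.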
The grammar format used by
MSS is the World Wide Web Consortium’s (W3C) standard Speech Recognition Grammar
Specification (SRGS), the same grammar format used by VoiceXML. The prompt
engine provided with MSS supports the W3C’s open-standard Speech Synthesis
Markup Language (SSML) version 1.0, also used by VoiceXML. For call control,
MSS uses CSTA, an Ecma standard.
The Prompt Manager in the
runtime system offers more than its name might suggest. It is a cross between
standard recorded prompts and text-to-speech (TTS) synthesis. While not as
flexible as TTS, it can concatenate recorded phrases, making intelligent
decisions about which recorded phrases to use to sound most natural. With a large
database of pre-recorded prompts, companies fielding conventional applications
may be able to avoid or minimize paying for outside professional recording.
GM Voices recorded the prompt database and offers additional recordings with
the same male and female voices (p. 7).
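One simple way to picture phrase concatenation is a greedy longest-match over a set of recorded phrases. The Python sketch below is an illustration only, not Microsoft's algorithm: the `plan_prompt` function and the sample phrases are hypothetical, and the fallback to TTS for uncovered words is an assumption made for the example.

```python
def plan_prompt(text, recorded):
    """Cover a prompt with the longest available recorded phrases.

    Returns a list of ("recorded", phrase) or ("tts", words) segments.
    """
    words = text.lower().split()
    plan = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at word i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in recorded:
                plan.append(("recorded", phrase))
                i = j
                break
        else:
            # No recording starts here: hand the word to TTS (assumed fallback),
            # merging it with an immediately preceding TTS segment if any.
            if plan and plan[-1][0] == "tts":
                plan[-1] = ("tts", plan[-1][1] + " " + words[i])
            else:
                plan.append(("tts", words[i]))
            i += 1
    return plan


recorded_phrases = {"your flight", "your flight departs at", "departs at", "from gate"}
segments = plan_prompt("Your flight departs at 9 from gate 12", recorded_phrases)
for kind, words in segments:
    print(kind, "->", words)
```

Preferring the longest recording keeps the number of splice points low, which is one plausible reason concatenated prompts can sound more natural than word-by-word assembly.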
The SASDK includes tools
for testing and debugging. These include a Telephony Application Simulator so
that the developer need not be attached to a telephone system to check the basic
operation of the application.
Reporting tools include the
tools that are part of the Visual Studio environment, as well as tools specific
to speech applications. The latter include a call viewer that provides a
summary view of calls and their status; a user can drill into the details of a
call as desired. Users can also create custom speech application reports of
frequently used data, taking advantage of SQL Server Reporting Services.
Future plans
As MSS evolves,
Microsoft plans to add languages other than English, to further enhance natural
language support, and to add speech recognition for embedded devices. In
addition, Microsoft is refining Speech Server feature requirements specific to
telecommunication carrier needs. An example of this is Microsoft’s work with
Huawei Technologies Company Ltd., the leading provider of carrier-grade
telecommunications solutions in China. Microsoft is working to enable future
versions of Microsoft Speech Server to serve both the enterprise and
telecommunication carrier customers.
Seamless computing
Using the theme of
“seamless computing,” Gates put the Microsoft announcements in a broader
context: “We won’t have many different networks, the TV network, the voice
network, the data network, all of these things will be the same, and they’ll be
driven by rich applications. Even things like the set-top box and the videogame,
those platforms will come together, and your TV screen will be connected the
same way that your PC is, providing a seamless experience.”