TMA Associates

Speech Technology News and Analysis

 


From Telephone Strategy News - April 2004

Microsoft launches Microsoft Speech Server 2004 with pricing details

Bill Gates keynote at AVIOS~SpeechTEK 2004 includes Microsoft vision for speech

Microsoft Chairman and Chief Software Architect Bill Gates launched Microsoft Speech Server in his keynote address on March 24, 2004 at the co-located AVIOS~SpeechTEK Spring 2004, Microsoft Mobile Developer Conference 2004, and Fawcette Technical Publications’ VSLive! San Francisco 2004. In the context of the company’s vision for “Seamless Computing,” Gates highlighted plans in mobile computing, Visual Studio development tools, and the Microsoft Speech Server 2004 (MSS) for telephone and multimodal applications. A follow-on keynote by Kai-Fu Lee, corporate vice president for the Microsoft Speech Server product group, gave further details and examples of beta deployments.

MSS includes Microsoft’s telephone speech recognition, a SALT (Speech Application Language Tags) interpreter, ScanSoft TTS, and other platform software. It is priced on a per-CPU basis, rather than the per-port pricing more common in telephony applications. Development tools, which work within a Microsoft Visual Studio Integrated Development Environment, are free. An overview of the architecture is presented visually at the end of this article. MSS supports DTMF (touch-tone) only, combined speech and DTMF, and multimodal (mixed speech and visual) applications.

Microsoft Speech Server 2004 comes in two versions: Standard Edition and Enterprise Edition. The versions are technically equivalent, except that the Standard Edition is software-limited to 24 simultaneous speech recognition processes. The Enterprise Edition has no software limit; it is bounded only by the number of processes the host server can support, which depends on the complexity of the application. As an example of a high-load application, James Mastan, director of marketing for the Microsoft Speech Server, described a 96-port configuration handled by three dual-CPU servers (one running the Telephony Application Services and two running Speech Engine Services; see the figure at the end of this article) under a worst-case load and mix of applications. For organizations that want to run a four-port trial, Intel and Microsoft are offering a $995 Starter Kit that includes an Intel four-port telephone interface board and MSS Standard Edition with a 180-day license (p. TBB).

MSS will be generally available on June 1, although the Starter Kit may be available in May. The initial release of MSS is in English and targeted at North America. The product builds on existing Microsoft development environments and should be particularly attractive to companies already using Windows Server 2003 and Visual Studio .NET 2003 development tools.

A wide range of partners supporting the release have been previously announced, including Intel, Intervoice, and ScanSoft. Intel and Intervoice announced details of their respective Telephone Interface Manager (TIM) modules supporting MSS (pp. 11 and 13); the TIM software is separately priced and is required to run MSS. The system works with Intel Dialogic 4-port and 16-port analog boards, and 48-channel and 96-channel digital boards. ScanSoft’s text-to-speech is bundled with MSS at no additional cost. MSS also supports the ScanSoft OpenSpeech Recognizer engine if purchased separately. Other partners have been previously announced (e.g., TSN, March 2004, p. 1), and new partners announced support timed to coincide with the Microsoft announcement (p. 7). These announcements cover tools, hardware, and applications. Brooktrout announced that it would also make its telephone interface boards compatible with MSS (pp. 7 and 19).

The announcement emphasized several key objectives: (1) a lower price point, making telephone speech technology more accessible to small-to-medium enterprises; (2) integration with existing Web development tools, so that developers familiar with those tools could more easily incorporate speech solutions; (3) integration with existing Web delivery technologies, so that the work done on existing Web solutions and their continuing management could be leveraged for speech; (4) an integrated solution, so that companies could move quickly to deployment; and (5) a flexible platform that could support pure telephone solutions, but also multimodal solutions within the same environment.

“For years now, this technology has been accessible only to a short list of Fortune 500 companies because it has been so difficult and expensive to implement,” Lee said. “Both large and midsize companies need a lower cost of entry and lower total cost of ownership. A key value of Speech Server is to dramatically reduce the cost and complexity of developing and deploying speech applications, making the technology more accessible to a broader range of enterprise customers.”

Gates emphasized the current strength and value of the Visual Studio .NET and .NET Framework ecosystem for customers and partners, highlighting the distribution of more than 80 million copies of the .NET Framework, the 2.5 million developers currently using Visual Studio .NET, the more than 180 Visual Studio Industry Partners, and the fact that more than 60% of Fortune 100 companies run on the .NET Framework.

Mastan addressed the “speech opportunity space.” He said that Microsoft is initially targeting the thousands of large organizations with more than 250 agents in their call centers and the hundreds of thousands of medium organizations with 25-250 agent call centers. In addition, there is a market of millions of small organizations with fewer than 25 agents that is more difficult to address today.

George Platt, senior vice president of marketing and corporate strategy, Intervoice, said, “Call centers have been the traditional market for speech technology. The launch of Microsoft Speech Server gives speech technology a lot of visibility and credibility, and early on you’ll see new solutions for call centers of all kinds, from financial services to insurance companies and retailers. But other kinds of applications will soon follow, such as human resources solutions that provide employee self-service. That’s really going to help this industry grow.”

Pricing

The estimated retail price (ERP) for MSS 2004 Standard Edition is $7,999 per processor; for MSS 2004 Enterprise Edition, it is $17,999 per processor. A Microsoft spokesperson said that there may be significant discounting from the ERP (20-50%, depending on volume licensing arrangements that take into account the total amount of Microsoft software purchased, not just MSS). In the previously mentioned example of a 96-port application handled by three dual-CPU servers (six CPUs total), the Enterprise Edition ERP works out to about $1,125 per port. For the Standard Edition, Mastan gave the example of a single dual-CPU server (with both the Telephony Application Services and Speech Engine Services on one server) handling 24 ports, for an ERP of about $667 per port. Both examples assume a 100% load (all channels operating simultaneously) with three applications running simultaneously: outbound banking alerts, a mid-sized, mixed-initiative travel booking application, and a 100,000-name auto-attendant application. Mastan said a less intensive application should be able to support 96 ports with two (rather than three) dual-processor servers.
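The per-port figures above follow directly from the per-CPU list prices. As a minimal sketch of that arithmetic (the function and variable names are illustrative, not part of any Microsoft tool, and no discounting is applied):

```python
def per_port_erp(price_per_cpu: float, servers: int, cpus_per_server: int, ports: int) -> float:
    """Return the MSS license ERP per port under per-CPU pricing."""
    total_cpus = servers * cpus_per_server
    return price_per_cpu * total_cpus / ports


# Enterprise Edition: three dual-CPU servers handling 96 ports
enterprise = per_port_erp(17_999, servers=3, cpus_per_server=2, ports=96)

# Standard Edition: one dual-CPU server handling 24 ports
standard = per_port_erp(7_999, servers=1, cpus_per_server=2, ports=24)

print(f"Enterprise, 96 ports: ${enterprise:,.0f} per port")  # $1,125
print(f"Standard, 24 ports:   ${standard:,.0f} per port")    # $667
```

Under Mastan's lighter-load case (two dual-processor servers for 96 ports), the same function can be called with `servers=2` to see how the per-port figure falls.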

MSS requires a Telephone Interface Manager (TIM) software module to connect to Dialogic phone interface cards from Intel. TIM versions are available separately from either Intervoice (p. 11) or Intel (“NetMerge Call Manager,” p. 13). (Other card manufacturers will support MSS; see p. 19.) The cost of the Intervoice TIM ranges from $125 per port for four ports to $300 per port for large numbers of ports. A CTI (Computer Telephone Interface) software module is an option for the Intervoice TIM; it allows easier connection to many existing telephone systems that route calls or support agents, but it is not required for all installations. The CTI module is about $300 per port. The Intel TIM, NetMerge Call Manager, is expected to be priced by distributors at $75-$100 per port. It likewise does not include CTI functions, which Intel handles differently than Intervoice (see p. 13).

Using the cases mentioned, and taking the upper estimate of $100 per port for the Intel NetMerge Call Manager software, the cost of the MSS software plus TIM would be about $767 per port for 24 ports and $1,225 per port for 96 ports. For very small port sizes, the per-port price rises rapidly because of the per-CPU MSS pricing. As a point of comparison, the Nuance Voice Platform lists at $1,600 to $3,200 per port, depending on the complexity of the application.
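To illustrate how quickly the per-port price climbs at small port counts, here is a rough sketch. It assumes, purely for illustration, that every configuration runs Standard Edition on one dual-CPU server (as in Mastan's 24-port example) and uses the upper $100-per-port estimate for the Intel TIM; a smaller site might well use a single-CPU server, which would halve the license term.

```python
STANDARD_ERP_PER_CPU = 7_999  # Standard Edition ERP from the article
TIM_PER_PORT = 100            # upper end of the $75-$100 NetMerge estimate
CPUS = 2                      # assumption: one dual-CPU server at every size


def software_cost_per_port(ports: int) -> float:
    """MSS license plus TIM cost per port, before any discounts."""
    license_cost = STANDARD_ERP_PER_CPU * CPUS  # fixed regardless of port count
    return license_cost / ports + TIM_PER_PORT


for ports in (4, 8, 16, 24):
    print(f"{ports:>2} ports: ${software_cost_per_port(ports):,.0f} per port")
```

Because the license cost is fixed per CPU, the per-port figure at four ports is several times the 24-port figure.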

The Intel cards are sold through distributors and VARs, often in bundles or with added services, and, in some cases, with discounting. As an estimate, a four-port card without extra services would be around $1,500; a 48-port card around $11,000; and a 96-port card around $16,000. If a dual-CPU server is about $4,000, a 96-port system’s core price (without installation or other services, without application software, without management software, and without discounts) with three dual-processor servers is in the ballpark of $146,000 [3x4,000 + 6x17,999 + 16,000 + 96x100].
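The bracketed estimate can be itemized as follows. This is only a restatement of the article's own ballpark figures (server price, card price, and the $100-per-port TIM estimate are all approximations given above); the dictionary labels are illustrative.

```python
# Core price components for a 96-port Enterprise Edition system
components = {
    "servers (3 x $4,000)": 3 * 4_000,
    "MSS Enterprise licenses (6 CPUs x $17,999)": 6 * 17_999,
    "96-port Intel card": 16_000,
    "TIM software (96 ports x $100)": 96 * 100,
}

total = sum(components.values())

for label, cost in components.items():
    print(f"{label:<45} ${cost:>9,}")
print(f"{'total':<45} ${total:>9,}")  # $145,594, i.e. roughly $146,000
```

The MSS licenses account for roughly three-quarters of the total, which is why volume discounts on the ERP matter more than any other line item.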

Deployments

Mastan indicated that there were 600 beta participants for MSS (chosen from 1,100 applicants). He said that the effort generated 6,000 proactive customer leads, over 60 partners, and 24 Early Adopter Program (EAP) customers.

Microsoft Speech Server was run continually at a high load for 33 days without failure, according to Mastan, with 95% of calls having a user-perceived latency of less than 1.5 seconds. Microsoft estimates that current industry solutions for such tests have 80% of calls with user-perceived latency of under 2.5 seconds. Mastan said that Speech Server performance runs at a 95% successful call completion rate in their tests.

In the AVIOS~SpeechTEK announcement, Microsoft highlighted EAP participants Grange Insurance, the New York City Department of Education (NYC DOE), and Southwest Alabama Integrated Criminal Justice System (SAICS). Grange Insurance automated the handling of routine customer requests, such as “Did you get my payment?” or “When is my payment due?” through a contract with Tata Consultancy Services, a Microsoft Global System Integrator. Grange was so enamored with the system that it deployed live in April using the beta version of the MSS.

The NYC DOE is working with Microsoft and Intervoice to develop a voice-enabled telephony application for parents to check such things as their child’s attendance record, course grades and daily lunch menus. MSS was attractive to the NYC DOE in part because it enabled the department to leverage an already existing interface developed for the department’s Web site. Adding the telephone helped bridge the “digital divide,” according to Richard Langford, deputy chief information officer for the NYC DOE, and made the system accessible to parents without Internet access or away from their computers.

SAICS worked with MSS partner ComputerTalk to develop a speech-enabled law enforcement application to help officers in the field and to reduce the burden on dispatchers. Officers can access driver’s license, social security, and license plate data over the phone with a direct voice query. The application transmits data verbally and visually for devices that support multimodal applications.

Microsoft has also partially deployed an MSS-based auto-attendant system called MS Connect within Microsoft. When fully deployed, the new system will provide callers with fast and accurate calling access to any of the 50,000 Microsoft employees through a speech-enabled application, rather than requiring a human operator.

Partners

The Microsoft Speech Server Partner Program now includes more than 60 companies that provide prepackaged applications and services to customers implementing MSS. Partners joining Lee on stage at AVIOS~SpeechTEK included Accenture, Solar Software, and Voice Automation Inc. Accenture demonstrated speech-enabled telephony and multimodal commerce solutions; Solar Software demonstrated a voice-enabled IT administrator and Help Desk automation solution based on Windows networks and Active Directory; and Voice Automation demonstrated a speech-enabled Microsoft customer relationship management (CRM) application. For a sampling of other partners, see p. 7.

Development tools

The Microsoft Speech Application SDK (SASDK) was built for the .NET platform and leverages the Visual Studio .NET development environment familiar to many developers. The SALT specification, rather than being a language in itself, specifies how one can add “Speech Application Language Tags” to Web applications. A SALT interpreter processes the tags at runtime to manage speech and multimodal dialogs.

The tools build on drag-and-drop and other features of the Visual Studio environment, adding speech “controls,” functional pieces that perform specific actions. These include basic controls that respond to specific events, dialog controls that encapsulate building blocks of dialog and track the state of the user-computer interaction, and application controls that accelerate the development of common voice-only scenarios by composing dialog controls with additional features.

The SASDK includes a prompt editor and grammar editor. The prompt editor allows listening to and editing individual recorded prompts. The grammar editor provides several ways to view the grammar, including a graphical view. It allows mapping the grammar to a semantic interpretation, such as mapping all the various ways to say “yes” to one generic YES. There is also a grammar library of commonly used grammars that can be used within a larger grammar.
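The idea of semantic interpretation can be shown outside any particular toolkit. The following is a conceptual sketch only, not the SASDK grammar editor's format or API: the phrase lists and the `interpret` function are hypothetical, and a real grammar would attach the semantic value inside the grammar itself.

```python
import string
from typing import Optional

# Hypothetical phrase lists: many ways of speaking map to one semantic value
SEMANTIC_MAP = {
    "YES": {"yes", "yeah", "yep", "sure", "correct", "that's right"},
    "NO": {"no", "nope", "no thanks", "incorrect"},
}


def interpret(utterance: str) -> Optional[str]:
    """Map a recognized phrase to its generic semantic value, if any."""
    # Lowercase and trim surrounding whitespace and punctuation
    normalized = utterance.lower().strip(string.punctuation + string.whitespace)
    for value, phrases in SEMANTIC_MAP.items():
        if normalized in phrases:
            return value
    return None  # phrase is outside the grammar


print(interpret("Yeah!"))   # YES
print(interpret("Nope."))   # NO
print(interpret("maybe"))   # None
```

The application logic then only has to test for `YES` or `NO`, regardless of which wording the caller used.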

The grammar format used by MSS is the World Wide Web Consortium’s (W3C) standard Speech Recognition Grammar Specification (SRGS), the same grammar format used by VoiceXML. The prompt engine provided with MSS supports the W3C’s open-standard Speech Synthesis Markup Language (SSML) Specification version 1.0, also used by VoiceXML. MSS also uses CSTA, an Ecma call control standard.
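For readers unfamiliar with the W3C grammar format, here is a small sketch of what a grammar in its XML form looks like, loaded with Python's standard library. The grammar content (a three-department auto-attendant rule) is invented for illustration and is not taken from any MSS sample; it uses only the basic `grammar`, `rule`, `one-of`, and `item` elements.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"  # W3C SRGS XML namespace

# Hypothetical grammar: the caller names one of three departments
GRAMMAR_XML = """
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice" root="department">
  <rule id="department" scope="public">
    <one-of>
      <item>sales</item>
      <item>billing</item>
      <item>technical support</item>
    </one-of>
  </rule>
</grammar>
"""

grammar = ET.fromstring(GRAMMAR_XML.strip())

# The root attribute names the rule where recognition starts
root_rule_id = grammar.get("root")
rule = grammar.find(f"{{{SRGS_NS}}}rule[@id='{root_rule_id}']")

# Collect the alternative phrases the rule accepts
phrases = [item.text.strip() for item in rule.iter(f"{{{SRGS_NS}}}item")]
print(phrases)  # ['sales', 'billing', 'technical support']
```

Because the format is a W3C standard, a grammar like this is in principle portable between platforms that support the specification, which is the interoperability point the shared standard with VoiceXML implies.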

The Prompt Manager in the runtime system offers more than its name might suggest. It is a cross between standard recorded prompts and text-to-speech (TTS) synthesis. While not as flexible as TTS, it can concatenate recorded phrases, making intelligent decisions about which recorded phrases to use to sound most natural. With a large database of pre-recorded prompts, companies fielding conventional applications may be able to avoid or minimize paying for outside professional recording. GM Voices recorded the prompt database and offers additional recordings with the same male and female voices (p. 7).
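One simple way to picture phrase concatenation is a greedy longest-match over a library of recordings. The sketch below is an assumption-laden illustration, not Microsoft's algorithm: the `plan_prompt` function, the phrase library, and the fallback to TTS for unrecorded words are all hypothetical.

```python
from __future__ import annotations


def plan_prompt(text: str, recorded: set[str]) -> list[tuple[str, str]]:
    """Cover the text with the longest available recordings, left to right."""
    words = text.lower().split()
    plan: list[tuple[str, str]] = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at word i first
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in recorded:
                plan.append(("recorded", candidate))
                i = j
                break
        else:
            # No recording starts here: assume this word is synthesized
            plan.append(("tts", words[i]))
            i += 1
    return plan


library = {"your payment is due on", "your payment", "march", "first"}
print(plan_prompt("Your payment is due on March first", library))
# [('recorded', 'your payment is due on'), ('recorded', 'march'), ('recorded', 'first')]
```

Preferring longer recordings reduces the number of splices, which is one plausible reason a concatenated prompt would sound more natural; a production system would presumably weigh other factors as well.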

The SASDK includes tools for testing and debugging. These include a Telephony Application Simulator so that the developer need not be attached to a telephone system to check the basic operation of the application.

Reporting tools include the tools that are part of the Visual Studio environment, as well as tools specific to speech applications. The latter includes a call viewer that provides a summary view of calls and their status. A user can get details of a call as desired. Users can also create custom speech application reports of frequently used data, taking advantage of SQL Reporting Services.

Future plans

As MSS evolves, Microsoft plans to add languages other than English, to further enhance natural language support, and to add speech recognition for embedded devices. In addition, Microsoft is refining Speech Server feature requirements specific to telecommunication carrier needs. An example of this is Microsoft’s work with Huawei Technologies Company Ltd., the leading provider of carrier-grade telecommunications solutions in China. Microsoft is working to enable future versions of Microsoft Speech Server to serve both enterprise and telecommunication carrier customers.

Seamless computing

Using the theme of “seamless computing,” Gates put the Microsoft announcements in a broader context: “We won’t have many different networks, the TV network, the voice network, the data network, all of these things will be the same, and they’ll be driven by rich applications. Even things like the set-top box and the videogame, those platforms will come together, and your TV screen will be connected the same way that your PC is, providing a seamless experience.”

TMA Associates

P.O. Box 570308

Tarzana, CA 91357-0308

1(818)708-0962

FAX: 1(818)232-0368

info@tmaa.com