TMA Associates

From Speech Strategy News, August 2009

 

Editor's Notes

Adapting speech recognition based on human corrections

Bill Meisel, Publisher & Editor


Adaptive speech recognition, in which the system uses corrections made by users to adjust its parameters and improve its performance, has long been incorporated in some dictation systems. In Nuance's eScription healthcare dictation system, for example, the corrections that medical transcriptionists make to the speech recognition output are used by the speech recognition software to increase its accuracy over time (p. 22). Using editors to review a speech-recognition-processed medical record is conventional and not controversial, perhaps because medical transcriptionists were handling these medical reports long before speech recognition entered the picture.

People also review and adjust speech recognition in call centers. Automated customer service applications require tuning based on a review of the unexpected things callers say that cause the automated system to fail. Dialog-design experts have long used recordings of call center conversations and similar tools to adjust the speech recognition grammars to cover those cases.

Yet SpinVox was recently blasted by the British Broadcasting Corporation for using transcriptionists in foreign countries alongside its automated speech technology. The BBC report also questioned the security of using humans in the process, claiming that transcriptionists at outsourced centers abroad were seeing entire messages, some with sensitive contents. The company attributed the reports to "disgruntled employees," according to one source.

SpinVox has always admitted that its speech recognition was aided by human transcriptionists, much like the common practice in medical transcription (SSN, March 2009, p. 17). Unlike medical transcription, however, SpinVox claims it gives the transcriptionists only the parts of messages that score low in speech recognition processing. (Presumably, the caller's name is also not available to transcriptionists.) SpinVox also uses the transcriptionists' corrections to improve the speech recognition processing, presumably resulting in a larger proportion of calls being fully automated. One could view this process as having two objectives: (1) an investment in R&D to improve speech recognition in the environments that SpinVox covers through feedback of errors; and (2) an investment in building users' confidence in the results to increase acceptance of speech-to-text processing in telephone applications.
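
The general idea of sending only low-scoring parts of a message to a person, and keeping the corrections as feedback, can be illustrated with a minimal Python sketch. This is not SpinVox's implementation, which has not been published; the names Segment, route_segments, and apply_corrections, the 0-to-1 confidence scale, and the 0.8 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str          # recognizer's hypothesis for one part of a message
    confidence: float  # hypothetical recognizer score in [0, 1]


def route_segments(segments, threshold=0.8):
    """Split segments into auto-accepted ones and ones needing human review.

    The threshold is an assumed tuning knob, not a published value.
    Each returned item is an (index, segment) pair.
    """
    accepted, for_review = [], []
    for i, seg in enumerate(segments):
        if seg.confidence >= threshold:
            accepted.append((i, seg))
        else:
            for_review.append((i, seg))
    return accepted, for_review


def apply_corrections(segments, corrections):
    """Merge human corrections (index -> corrected text) into the message.

    Returns the final text and a list of (hypothesis, correction) pairs
    that could later serve as feedback for adapting the recognizer.
    """
    final_parts, feedback_pairs = [], []
    for i, seg in enumerate(segments):
        if i in corrections:
            final_parts.append(corrections[i])
            feedback_pairs.append((seg.text, corrections[i]))
        else:
            final_parts.append(seg.text)
    return " ".join(final_parts), feedback_pairs


message = [
    Segment("call me back", 0.95),
    Segment("at for thirty", 0.42),  # low score: only this part goes to a person
    Segment("tomorrow", 0.90),
]
accepted, for_review = route_segments(message)
final_text, feedback = apply_corrections(message, {1: "at four thirty"})
```

In this sketch the reviewer sees only the one low-scoring segment rather than the whole message, and the pair of hypothesis and correction is retained for later adaptation. Raising the automation rate would correspond to fewer segments falling below the threshold over time.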

"Investment" is perhaps the correct word. Similar to web companies that build up their businesses by giving things away for free, SpinVox is almost certainly losing money with its current method, despite not giving the service away. It presumably must increase its automation rate, perhaps eventually moving to a fully automated system like Google's (with Google Voice, p. 6 and 7). One presumes that this is the ultimate objective.

The core speech technology has a strong history. SpinVox's Cambridge-based Advanced Speech Group (ASG) is run by Dr. Tony Robinson, formerly of the Cambridge University Machine Intelligence Laboratory, and Prof. Philip Woodland of Cambridge is a consultant. ASG has more than 20 speech specialists and PhDs working to develop and refine the system, according to SpinVox.

SpinVox raised $100 million in March 2008 from Goldman Sachs and other venture funds. SpinVox recently also offered employees the option of taking salaries in stock options to conserve cash. The company has moved to expand its speech-to-text service through partners using its Application Programming Interface (p. 18). Increasing volume, however, requires increasing automation in the long run.

The controversy should be placed in perspective. SpinVox isn't "cheating" by using transcriptionists; it is using a common technique for improving speech recognition technology, with the additional benefit of delivering more acceptable final transcriptions. The controversy is more accurately placed in the context of other companies' strategies outside of speech recognition--building market share and proving the desirability of a service before building profits. Think Twitter or Facebook. Of course, one must prove the service can be profitable eventually.