For most of us, voice is our first user interface. Unfortunately, it's been surprisingly tough to bring that interface to the PC. But a top Microsoft researcher I spoke with last week says that human-quality speech recognition -- good enough to let your computer reliably transcribe a newspaper read out loud -- is now about a decade away. In the meantime, limited, but still useful, speech-based apps will only become more and more common.
That's all according to Xuedong Huang, general manager of Microsoft's .Net Speech Technologies. His immediate task is to make speech applications easier to develop, less expensive to implement, and, over time, increasingly intelligent.
I met with Dr Huang -- who goes by the initials "XD" -- in his office at Microsoft's Redmond campus, one of a series of meetings I held there as part of my search for the Next Big Thing.
One of the world's leading experts on voice recognition, Huang said his estimates are based on historical progress in voice technology. His benchmark of success is for computers to match the human brain in their ability to translate speech into data. We're still a long way from that goal.
For example, today's error rate for computers doing newspaper speech transcription is 3 percent -- which is just bad enough to render the transcriptions almost unusable. Humans perform the same task with a 0.9 percent error rate. Bridging that gap, Huang estimates, will take a bit less than 11 years.
Freestyle speech recognition -- meaning an understanding of a regular human conversation, complete with contractions, slang, and a large vocabulary -- today generates a 30 percent error rate for computers; for humans, the error rate is 4 percent. Bridging that gap, Huang says, will take 19 years. (Dictation apps like IBM's ViaVoice have error rates in the 5 to 15 percent range -- not bad if you don't mind cleaning up the transcribed text after every use.)
By comparison, it will take 15 years for your computer to recognise individual letters of the alphabet with the same reliability (a 1 percent error rate) as humans; the error rate now is 5 percent. Number strings are easier for computers to understand; the error rate today is 0.7 percent. But it'll take a whopping 41 years for machines to match the 0.009 percent error rate attained by human beings.
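Why would number strings, where machines already do best, take the longest? A rough way to see it is to divide each machine error rate by the human one. The short Python sketch below does that arithmetic with the figures quoted above. It is only an illustration, not Huang's method: the names `TASKS`, `error_gap` and `gaps_by_task` are made up for this example, the human figure for number strings is read as 0.009 percent (an assumption), and "a bit less than 11 years" is rounded to 11.

```python
# Error rates in percent as quoted in the article:
# (machine, human, estimated years to close the gap).
# Assumption: the human rate for number strings is 0.009 percent.
# Assumption: "a bit less than 11 years" is rounded to 11.
TASKS = {
    "newspaper transcription": (3.0, 0.9, 11),
    "freestyle conversation": (30.0, 4.0, 19),
    "alphabet letters": (5.0, 1.0, 15),
    "number strings": (0.7, 0.009, 41),
}


def error_gap(machine_pct, human_pct):
    """Return how many times more errors the machine makes than a human."""
    return machine_pct / human_pct


def gaps_by_task(tasks=TASKS):
    """Map each task name to its machine-to-human error ratio."""
    return {name: error_gap(m, h) for name, (m, h, _years) in tasks.items()}


if __name__ == "__main__":
    # List tasks from the smallest gap to the largest.
    for name, gap in sorted(gaps_by_task().items(), key=lambda kv: kv[1]):
        years = TASKS[name][2]
        print(f"{name:25s} {gap:5.1f}x the human error rate, about {years} years")
```

On these figures the number-string gap is by far the widest, because humans almost never get digits wrong, and the tasks with wider gaps are also the ones with longer estimates.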
Though computer speech recognition won't be "human quality" for a while, it's already good enough for many applications. And some of them -- such as those interactive voice-response systems used by airlines, directory services, utilities, and other service organisations -- have already become quite common.
For example, my Sprint PCS voice-dialling service needs to recognise only a handful of standard commands plus the names I've entered into the system. Given this relatively small -- but useful to me -- pool of possible inputs, I can get excellent results with commands like, "Call Dan in the office."
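To see why a small pool of inputs helps so much, consider the toy Python sketch below. It is not how Sprint's service works; it compares text rather than sound, using the standard library's `difflib.get_close_matches` as a stand-in for acoustic matching. The phrase list, the `snap_to_command` helper and the 0.8 cutoff are all assumptions chosen for illustration. The point is that when only a few phrases are allowed, even a badly garbled guess can be snapped to the right one.

```python
from difflib import get_close_matches

# Hypothetical list of the only phrases the system is allowed to hear.
ALLOWED_COMMANDS = [
    "call dan in the office",
    "call dan on his mobile",
    "call mary at home",
]


def snap_to_command(heard, commands=ALLOWED_COMMANDS, cutoff=0.8):
    """Return the allowed command closest to what was heard, or None.

    `heard` is a rough text guess from a recogniser. A phrase that is not
    similar enough to any allowed command is rejected instead of guessed.
    """
    matches = get_close_matches(heard.lower(), commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For instance, `snap_to_command("coll dan in the offise")` comes back as "call dan in the office", while an out-of-list request such as "what is the weather tomorrow" is rejected with `None`.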
(Alas, I did not ask Dr Huang to predict when we will no longer have to repeat our account information to a human being -- and wait while they manually key it into their workstation -- immediately after dictating that same info to a computer. I'll know the world has really changed when, after I tell a computer my account number, a human being then answers, "Hello, Mr Coursey!" But I digress...)
Microsoft is building similar, constrained voice recognition into its user interfaces. Huang demonstrated how he was able to access information and send email in Microsoft Outlook using voice commands alone. He pointed out that the technology handled his Chinese-accented English quite well; it wasn't so long ago that accents threw recognition engines for a loop.
As long as the application can provide the voice-recognition system with some context, some sense of what to expect next, current technology works fairly well. But that can be tough to do. The typical voice recogniser today is aware of only the last three words it has heard, making it difficult to "guess" the word that will follow. That context problem is one of the major issues facing scientists developing recognition engines.
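The "last three words" limit can be pictured with a simple word-counting model of the kind usually called an n-gram model. The Python sketch below is a minimal illustration under that assumption; the article does not say which technique Microsoft's recogniser uses, and the names `train`, `predict_next` and `CONTEXT_WORDS` are invented here. The model counts which word followed each run of three words in some training text, then guesses the most frequent follower.

```python
from collections import Counter, defaultdict

CONTEXT_WORDS = 3  # the "last three words" mentioned in the article


def train(sentences, n=CONTEXT_WORDS):
    """Count which word follows each run of n words in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n):
            context = tuple(words[i:i + n])
            counts[context][words[i + n]] += 1
    return counts


def predict_next(counts, heard, n=CONTEXT_WORDS):
    """Guess the next word from the last n words heard.

    Returns None when that exact run of words never appeared in training,
    which is where a short context window leaves the model with no guess.
    """
    context = tuple(heard.lower().split()[-n:])
    followers = counts.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]
```

Trained on "call dan in the office", "call dan in the office now" and "call dan in the car", the model guesses "office" after "please call dan in the". Given "send mail to the", it has nothing to go on and returns `None`, because everything before the last three words is invisible to it and those three words never occurred together in training.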
Another problem is noise -- that is, knowing what to pay attention to and what to ignore. People are very good at this, Huang said, but teaching computers how to make that distinction remains a formidable challenge.
Despite the hurdles, Microsoft hopes to make voice as common a user interface as keyboarding -- perhaps more so. Every keyboard-based computer could benefit from having voice recognition available as an input option; once that technology is sufficiently improved, it could make possible entirely new form factors and applications.
Huang's primary goal right now is to bring voice recognition to corporate apps, enabling business customers to access and input information vocally. The way to do that is to make a speech recognition server part of .Net. That addition, along with XML extensions and development tools, should allow those corporate customers to access a single store of information using both Web- and voice-based interfaces.
Even though general-purpose voice recognition is still a decade away, I found my visit with Dr Huang to be very encouraging. In the past few years, a trickle of voice-recognition applications has entered my life. I like them and look forward to more. If Microsoft succeeds in making voice a standard user interface, the floodgates could soon open. Over the next three to five years we may find ourselves talking to more and more computers -- and getting more and more done as a result.