This story was printed from ZDNet Australia.
Voice recognition: Past, present and future

By Jeanne-Vida Douglas, ZDNet Australia
June 20, 2002
URL: http://www.zdnet.com.au/news/communications/soa/Voice-recognition-Past-present-and-future/0,130061791,120266107,00.htm


"You have called Telstra's directory assistance, please say only the name you want." There is a pause as an 18-month-old software system awaits your response.

With a vocabulary covering the 2000 most sought-after listings, the Alcatel-based system was initially designed to handle about 15 percent of enquiries to Telstra's directory assistance service. At the same age, an average human child would be able to produce roughly five utterances, and no doubt understand many more, as it would be just about to hit the steep side of the early language learning curve. Telstra's voice-recognition software, on the other hand, has to forget some of its words before it can start to remember any new ones.

It can learn new ways to say the same word - but, unlike the toddler, it can't learn any new ones without explicit intervention.

For the majority of us, this kind of call centre automation is our most likely contact with voice recognition technology. And while call automation, in the form of touchtone technology, has been around for some time, speech recognition vendors are having some success talking up the benefits of their more-expensive systems.

"There are some intangible benefits on the soft side of it all," explains Peter Chidiac, Australian managing director of speech recognition vendor Speechworks. "There is ample evidence of a better customer experience."

According to Chidiac, one of the more tangible benefits of speech recognition technology is its ability to cut through the menus on which touchtone technology depends.

"You can't call into a banking application and say 'what is my stock price?', instead you have to sift through menus," Chidiac says. "Speech recognition offers a smaller foot print for conducting larger transactions."

However, speech recognition goes a lot further than call centre applications. According to IBM's Richard Gray, speech technology within Big Blue is broken into three main areas: speaker-dependent unlimited vocabulary, speaker-independent limited vocabulary, and command control.

While call centre applications generally depend on speaker-independent limited-vocabulary systems (ie Telstra's 2000 words, and unlimited customers), speaker-dependent unlimited-vocabulary systems first have to "learn" which combinations of sounds the speaker is likely to use to represent different words.

"Think of it as dictation software," Gray says. "You have to read a script and train it to recognise your voice."

Gray says the early adopters of IBM's ViaVoice software are mainly professionals who often find themselves outside the office, such as doctors and lawyers.

"It can take the grind out of dictation. All the PAs have to do when they get the data back is fine-tune the text, rather than type out the whole thing," Gray says.

Command control has been popularised on film and TV. From the sinister user/computer interactions in Stanley Kubrick's 2001: A Space Odyssey to the more light-hearted banter of Red Dwarf, sci-fi writers introduced speech recognition in all its forms long before the technology was ready to deliver, and the ability to control computers remotely with voice alone is one of the most common.

Gray believes one of the first places such technology will become prevalent is in the car - where hands-free access to devices, information services, the Internet and telephones could prove popular.

IBM's Driver Assistance and Information Systems (or DAISY for short) has already been picked up by DaimlerChrysler, and although it is only available in prototypes at car shows at the moment, the company is looking to integrate voice control as a standard capability in new cars some time in 2003.

Voicing concerns about integration

Voice recognition research is by no means a young science. Back in the 1950s IBM computers began pumping out statistical correlations between "sounds and the words they represent".

By 1964 Big Blue had created a "shoe box recogniser" to recognise spoken digits.

Suffice to say, the technology has been around for a while - so why is it only recently becoming visible?

Two factors: cost and integration.

Clive Summerfield, a speech recognition technology consultant, believes it won't be long before speech recognition technology breaks into the mainstream applications market.

"Speech is a natural-interface technology, but at this stage the price point is such that it is only applicable in high-value applications where you are using speech recognition to replace human agents," Summerfield says. "In many cases you are touching the customers at the business' most sensitive point, you cannot afford to get it wrong, because everything you do will have a significant impact on the way the business is perceived."

While the software systems are now capable of recognising millions of words - converting these to text and even reproducing them in an entirely new language - a poorly implemented solution could have a profoundly damaging effect on any company.

"Typically people think of speech recognition as a replacement for touchtone technology, but it is a very different interface," Summerfield says. "You need people like linguists who are far more attuned with the best ways user interfaces are put together, and can replicate a dialogue the user might have with an agent."

As for increased integration with consumer applications, IBM's Gray says it is just a matter of time before products become more prevalent.

"We are at the stage where the basic technology exists, and will continue to develop based on commercial need," Gray said. "Voice has the potential to fill the void between the Web and mobile access, and the next phase of development has the potential to mix speech and display response, providing multiple ways to deliver the same information."

Talking hubs: voice recognition hits the net

In preparation for this new wave of voice-based Web access, Dr Rolf Schwitter, a lecturer at Sydney's Macquarie University, has integrated training in VoiceXML into an introductory course on Web technology.

"We cover basic speech technologies, speech synthesis, text to speech, then expand into an introduction to VoiceXML," Schwitter explains.

Schwitter believes that, as speech technology becomes more prevalent, developers will need to understand both the engineering requirements of implementing a solution and the psychological and linguistic requirements of writing a dialogue flow.

"You have to ask the right questions to get the right information," Schwitter says. "If they are interested after having completed the first phase of the course, we have a unit called Interactive natural language systems, devoted to the subject."

Accordingly, the course material he has been involved with was developed in conjunction with industry partners such as Motorola and Philips.

"We are formulating the course based on what they want from their future employees, and teaching students what they will have to know," Schwitter says.

Traversing the technological plateau

While vendors are characterising the next phase of development in speech recognition technology as application- and integration-focussed, researchers in the field recognise that the technology behind such applications has largely reached a plateau.

Dr Steve Cassidy, senior lecturer in computing at Macquarie, says there is a trend in the academic literature towards discussing what the next quantum leap in the technology might be.

"As far as the vendors are concerned the technology is at the state where you can do lots of useful things with it, and while the researchers are always trying to push things as far as they can, most of the work being done is based on incremental changes," Cassidy says.

Dr David Grayden, research fellow at the Bionic Ear Institute in Melbourne, believes that alongside advances in processing power there have been three main advances in speech recognition technology.

"The first breakthrough was the introduction of data-based approaches, rather than trying to understand every little speech event. Then came dynamic time warping, which enabled the software to compare incoming speech with stored versions of the speech," says Grayden. "Next came hidden Markov models, which allowed continuous speech to be recognised and form the basis of dictation-type models."

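Grayden's description of dynamic time warping, comparing incoming speech with stored versions, can be illustrated with a small sketch. This is only a toy under stated assumptions: real recognisers compare sequences of acoustic feature vectors, whereas here each "utterance" is a short list of made-up numbers, and the names dtw_distance, recognise and TEMPLATES are hypothetical, not taken from any system mentioned in this story.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Each sample in one sequence may be matched to one or more samples
    in the other, so a slow and a fast rendition of the same pattern
    can still line up.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the best of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def recognise(utterance, templates):
    """Return the stored word whose template is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Made-up stored templates for a two-word vocabulary (illustrative values only).
TEMPLATES = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 1.0],
}

# The same "yes" contour spoken more slowly: several samples are repeated.
slow_yes = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 3.0, 1.0]

print(recognise(slow_yes, TEMPLATES))
```

Because the warping lets repeated samples map onto a single template sample, the slower utterance still matches the stored "yes" exactly. It also shows why such a system is tied to a fixed vocabulary: it can only ever return a word that already has a stored template.
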
While conceding he is probably in the minority among engineering-focussed researchers, Grayden argues that an earlier move away from integrating linguistic physiology into speech recognition research has ultimately proven detrimental.

"In the early days there was a notion that every time the research became more engineering based there was a leap forward in the technology," Grayden says. "I believe that is why it plateaued. There is now a need for a breakthrough, something new that will give us a jump in performance, and I believe it will come from a mixture of skills, including computer engineering, linguistics and physiology."

In the meantime, Cassidy is focussing on training graduates for an employment market where voice applications development is likely to provide the bulk of the work opportunities.

"Customer acceptance is growing, and it is fairly inevitable that these kinds of voice systems will take off and that the possibility for a bigger voice industry is already there, even given the limitations of the current technologies," Cassidy surmises.


Copyright © 2009 CBS Interactive, a CBS Company. All Rights Reserved.
ZDNET is a registered service mark of CBS Interactive. ZDNET Logo is a service mark of CBS Interactive.