Learning to talk
For all its benefits, implementing speech recognition is now relatively painless, thanks to more than a decade in which the technology grew from a laboratory fantasy into an inexpensive, practical new interface for human-machine interaction.
This growth came as researchers, dedicated to finding a way to help computers understand a broad range of words and spoken accents, steadily improved their software's ability to recognise discrete phonemes.
Recognising these sounds, which make up the basic units of all speech, lies at the core of today's speech recognition engines.
In the early days, software simply compared the sound wave of each spoken word with recordings of words stored in a massive database.
But this approach was quite clunky, forcing speakers to pause after each word and demanding a noticeable amount of time for the system to search through its database. Even then, it wasn't particularly accurate: homonyms, tricky accents, background noise, and inexperienced users made early speech recognition an exercise in frustration.
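To make the early approach concrete, here is a minimal sketch of whole-word template matching. It is an illustration only, not a reconstruction of any real product: the function names, the toy "recordings" and the use of plain Euclidean distance are all assumptions made for the example.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length sample sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognise(utterance, templates):
    """Return the word whose stored recording is closest to the utterance."""
    return min(templates, key=lambda word: distance(utterance, templates[word]))


# Hypothetical toy "recordings": a few amplitude samples per word.
TEMPLATES = {
    "yes": [0.1, 0.9, 0.4, 0.0],
    "no": [0.8, 0.2, 0.1, 0.7],
}

# An utterance that is a slightly distorted version of the stored "yes".
print(recognise([0.2, 0.8, 0.5, 0.1], TEMPLATES))
```

Every utterance has to be checked against every stored word, which hints at why searching a large database was slow, and a speaker whose timing or pitch differs from the recording can easily land closer to the wrong template.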
The technology's big breakthrough came in the late 1980s with the introduction of hidden Markov models (HMMs), a statistical technique for finding the most probable match between an observed signal (in this case, a digitised waveform) and a model of the sounds that could have produced it. Thanks to HMMs, speech recognition systems became far more flexible, gaining the ability to compensate for imperfections in the speaker's voice.
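The idea of a "most probable match" can be sketched with the Viterbi algorithm, the standard way of finding the likeliest sequence of hidden states in an HMM. The two-phoneme model below, its state names ("ah", "ee"), its "low"/"high" acoustic labels and all of its probabilities are invented for illustration; a real engine would use far larger models trained on recorded speech.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[s] holds the probability and path of the best route ending in s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the predecessor state that maximises the path probability.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda candidate: candidate[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda candidate: candidate[0])


# Hypothetical toy model: two phonemes and two coarse acoustic labels.
STATES = ["ah", "ee"]
START_P = {"ah": 0.6, "ee": 0.4}
TRANS_P = {"ah": {"ah": 0.7, "ee": 0.3}, "ee": {"ah": 0.4, "ee": 0.6}}
EMIT_P = {"ah": {"low": 0.8, "high": 0.2}, "ee": {"low": 0.1, "high": 0.9}}

probability, phonemes = viterbi(["low", "low", "high"], STATES, START_P, TRANS_P, EMIT_P)
print(phonemes, probability)
```

Because the model scores every candidate sequence by probability rather than demanding an exact match, a slightly unusual pronunciation lowers the score without ruling the right answer out.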
Over time, computers became more powerful and were able to pick increasingly small phonemic units out of a stream of speech. This paved the way for natural language speech recognition (NLSR), which finally freed users from the need to speak in halting phrases with breaks between words.
Now that NLSR has moved from the desktop to the server side, the technology has been successfully extended into a massively scalable speech processing infrastructure. This infrastructure is typically made up of one or more standard servers clustered together into a high-availability node. As a rule of thumb, plan to have one NLSR server per 30 phone lines.
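The rule of thumb translates into a simple sizing calculation, sketched below. Only the figure of 30 lines per server comes from the text; the function name and the optional `spare` parameter (extra servers held in reserve for high availability) are assumptions added for the example.

```python
import math

LINES_PER_SERVER = 30  # rule of thumb quoted above


def servers_needed(phone_lines, spare=0):
    """Estimate NLSR servers for a given number of phone lines.

    `spare` is a hypothetical allowance for standby servers in the cluster.
    """
    if phone_lines < 0:
        raise ValueError("phone_lines must not be negative")
    # Round up: a partly used server is still a whole server.
    return math.ceil(phone_lines / LINES_PER_SERVER) + spare


for lines in (31, 180, 6000):
    print(lines, "lines ->", servers_needed(lines), "servers")
```

On this estimate a 180-line installation would need six servers and a 6000-line one would need 200, before any allowance for standby capacity.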
"There was a lot of marketing pressure behind speech recognition in the early and mid 1990s," says Clive Summerfield, a longtime speech researcher who founded Syrinx Speech Systems over a decade ago. "But over the past five years, the technology has started to come of age and is now really moving from high-cost, high-value niche applications into more mainstream style applications."
A one-time Australian success story, Syrinx helped kick-start the global speech recognition market with world-class technology that secured a 6000-line customer care contract with US telecommunications giant AT&T in 1998.
Last year, Syrinx installed 180 lines of speech recognition for online trader ComSec, then gained new management and a new name (Sayso!) before going into voluntary administration after a major investor pulled out this year.
"Speech recognition models have been trained on very large databases of speech to create phoneme models," Summerfield continues. "Vendors have processed those large databases and therefore the technology now is very robust, and the more they're used, the better they become.
"This is a true learning machine, and the first of the artificial learning technologies to find wide-ranging commercial applications. When implemented correctly, it can return an extremely effective return on investment very quickly."