Speech technology: 'Hey, I'm talkin' to you!'

Speech technologies compared


There are a number of different speech interfaces, which perform different functions and are available via different technologies.

Text-to-Speech (TTS)

Text-to-speech is the ability of a computer or system to take a normal written sentence and turn it into spoken words. The technology has been around for some time, but has only recently started to improve.

The Macintosh OS has featured text-to-speech capabilities for a number of years. Error messages can be read out, and integration with applications means documents and web pages can be read out as well. Windows 2000 now provides similar functionality via Narrator.

There are also special-purpose programs, such as JAWS, that allow visually impaired people to use computers and navigate the World Wide Web with the information read out by a synthesised voice. This technology makes information accessible to a wide range of people who may not otherwise have been able to view it.

The main issue with text-to-speech systems is that they usually sound very unnatural: the voice is synthesised and its intonation can be quite monotone. Concatenated speech (recorded sound bites of a real person's voice strung together) can also sound very unnatural, as the words do not flow as they do in regular speech. For example, when some systems read out phone numbers it is particularly obvious that the speech is concatenated, as the start and stop sounds of the individual numbers do not fit well together in a sequence.
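The phone number case can be sketched in a few lines. This is only an illustration of the concatenation idea, not how any particular product works; the clip file names are hypothetical stand-ins for recorded audio.

```python
# Hypothetical clip names standing in for recorded audio of each digit.
DIGIT_CLIPS = {
    "0": "zero.wav", "1": "one.wav", "2": "two.wav", "3": "three.wav",
    "4": "four.wav", "5": "five.wav", "6": "six.wav", "7": "seven.wav",
    "8": "eight.wav", "9": "nine.wav",
}


def clips_for_phone_number(number: str) -> list[str]:
    """Return the clips to play, in order, for the digits in `number`.

    Each digit always maps to the same recording, whatever comes before or
    after it, which is why the joins between clips can sound abrupt.
    """
    return [DIGIT_CLIPS[ch] for ch in number if ch in DIGIT_CLIPS]


print(clips_for_phone_number("555 0102"))
```

Because the same recording of "five" is reused in every position, nothing smooths the transition from one digit to the next.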

Interactive Voice Response

Interactive Voice Response (IVR) systems are the type of speech interface most of us have come across at least once. They are the recorded voice at the other end of the phone that you may encounter when you call to pay a telephone bill or get in touch with your bank. They present a menu of options, and you select the appropriate option using the telephone keypad.

Some IVRs include a basic speech recognition system, which means that instead of pressing one, the user says "one" into the telephone. This doesn't really provide much value and increases the chance of error.
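A rough sketch of such a menu shows why spoken digits add little: the spoken word is simply mapped back onto the same keypad option. The menu entries and function name here are made up for illustration, and the input is assumed to be already-recognised text rather than audio.

```python
from typing import Optional

# Hypothetical menu; a real IVR would route the call instead of returning text.
MENU = {"1": "pay a bill", "2": "check a balance", "3": "speak to an operator"}

# Spoken digits are mapped straight back onto the keypad options.
SPOKEN_DIGITS = {"one": "1", "two": "2", "three": "3"}


def select_option(user_input: str) -> Optional[str]:
    """Return the chosen menu option, or None if the input is not understood."""
    key = user_input.strip().lower()
    key = SPOKEN_DIGITS.get(key, key)
    return MENU.get(key)


print(select_option("1"))    # keypad press
print(select_option("one"))  # spoken digit, same result
print(select_option("won"))  # a misrecognition falls through to None
```

The caller still has to follow the same fixed menu; the only change is an extra recognition step that can fail.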

The problem with IVRs is that they are very structured and rigid, and make customers choose between options that may seem irrelevant, often leaving them to guess which is the right one. People often find IVRs frustrating because they are forced to interact with the service or products the way the company wants them to, which may not be the way they see things. Recent usability studies show that the company usually has a 'silo' view of its products, while customers look at them in a completely different way. When a new model reflecting the customers' view was tested, it proved significantly easier to use and customers were more comfortable with the system. Obviously, this can have a significant impact on the business. Some IVRs are much better than others, with clear paths and appropriate language and terminology.

Speech Recognition

Dictation
One of the first forms of speech recognition, and one that is slowly becoming usable, is dictation software. This involves a person speaking to the computer via a microphone, with the computer turning the spoken words into text and displaying them onscreen.

However, unlike the good ear of a secretary, computers can have trouble determining things like context, which is necessary for spelling words correctly (e.g. there vs. their), and with pronunciation. Computers also have problems with punctuation. You can't finish a sentence and expect the computer to know to put in a full stop; you have to say 'full-stop', which in itself is quite unnatural.
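Verbal punctuation can be pictured as a simple substitution pass over the recognised text. The sketch below is a toy, with an illustrative list of spoken phrases; it is not taken from any of the products mentioned in this article.

```python
import re

# Spoken phrase (written as a regex) -> punctuation mark. Illustrative only.
SPOKEN_PUNCTUATION = [
    (r"full[ -]stop", "."),
    (r"comma", ","),
    (r"question mark", "?"),
]


def apply_verbal_punctuation(transcript: str) -> str:
    """Replace spoken punctuation phrases with the marks they name."""
    text = transcript
    for pattern, mark in SPOKEN_PUNCTUATION:
        # Swallow the space before the phrase so the mark hugs the previous word.
        text = re.sub(r"\s*\b(?:" + pattern + r")\b", mark, text, flags=re.IGNORECASE)
    return text


print(apply_verbal_punctuation("hello there comma how are you question mark"))
```

The same pass also shows the drawback: a sentence that genuinely contains the word "comma" would be rewritten too, unless the speaker does something special to prevent it.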

There are two types of dictation software: discrete and continuous. Discrete dictation requires you to pause between words so that the computer can process one word at a time. Continuous dictation is a more natural way of speaking to the computer, as it is less jerky and therefore faster. Additionally, continuous speech recognition sometimes allows context to be determined, as the words are processed in groups rather than individually.
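To see why processing words in groups helps with context, consider a toy that chooses between "there" and "their" by looking at the following word. The counts below are invented purely for illustration, and real recognisers use far richer language models; this is a sketch of the idea only.

```python
# Invented word-pair counts, for illustration only.
PAIR_COUNTS = {
    ("their", "house"): 9, ("there", "house"): 1,
    ("there", "is"): 9, ("their", "is"): 1,
}

# Words that sound alike and so cannot be told apart from the audio alone.
HOMOPHONES = {"there": ("there", "their"), "their": ("there", "their")}


def pick_spelling(heard: str, next_word: str) -> str:
    """Pick the spelling of `heard` that most often precedes `next_word`."""
    candidates = HOMOPHONES.get(heard, (heard,))
    return max(candidates, key=lambda word: PAIR_COUNTS.get((word, next_word), 0))


print(pick_spelling("there", "house"))  # the next word suggests "their"
print(pick_spelling("their", "is"))     # the next word suggests "there"
```

A discrete system that commits to each word before hearing the next one has no such neighbouring word to consult.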

Most dictation systems have to be trained to achieve the best possible accuracy. This can take a number of hours and most people just can't be bothered. Training also limits the amount of variation that can be tolerated in the voice, and a voice can change significantly, for example when a person is sick or when a male is going through puberty.

Current products and vendors offering solutions include Dragon NaturallySpeaking, IBM ViaVoice, Philips FreeSpeech, Lernout & Hauspie Voice Xpress, and MacSpeech. Microsoft claims on its Web site that Office XP incorporates both dictation and command speech interfaces via the Speech.net API.

Speech Commands
Controlling a computer by speech is another powerful tool, when it works correctly. This involves giving voice commands to the computer or device, which then completes the action automatically.

One of the simplest examples of speech commands is voice dialling based on pattern recognition, a common feature on some high-end mobile phones. It allows you to say the name of the person you want to call, and the phone then dials the number. Many users have trouble making it work consistently, often because the initial input isn't clear enough or background noise interferes with the input.
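The matching step can be approximated in text form. Phones of this kind compare audio patterns rather than spelled names, so the sketch below is only an analogy: it uses Python's standard `difflib` to match a noisy "heard" name against a hypothetical contact list, and refuses to dial when nothing is close enough.

```python
import difflib
from typing import Optional

# Hypothetical contacts; names are stored in lower case for matching.
CONTACTS = {"alice smith": "555-0101", "bob jones": "555-0102"}


def number_to_dial(heard_name: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the number for the closest contact, or None if no match is close enough."""
    matches = difflib.get_close_matches(
        heard_name.lower(), list(CONTACTS), n=1, cutoff=cutoff
    )
    return CONTACTS[matches[0]] if matches else None


print(number_to_dial("Alice Smith"))  # clean input
print(number_to_dial("alise smith"))  # slightly garbled input still matches
print(number_to_dial("xyz"))          # too far from any contact, so nothing is dialled
```

The `cutoff` value is the trade-off users run into: set it high and noisy input is rejected, set it low and the phone may dial the wrong person.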

Voice commands are also available in some operating systems. If you own a Macintosh, the speech capabilities are quite impressive, especially with Mac OS X 10.1.

Voice commands are also appropriate for smaller devices, where they offer an easier method of navigation, especially on the go, as only one hand is needed to hold the device. The new Compaq iPaq, for example, includes IBM ViaVoice Command and Control software that is used with the Calendar, Contacts and Inbox applications. Dictation software for the handheld will be a major benefit given the difficulty of current handwriting text input methods and onscreen keyboards.

Natural Language Speech Recognition

While most IVR speech recognition interfaces, such as those described above, rely on a fairly limited vocabulary and very set patterns of interaction, some can understand a much wider vocabulary and indeed whole phrases, which may contain more than one piece of information, not necessarily in a fixed order. This type of speech interaction is called Natural Language Speech Recognition (NLSR). This is the exciting space to be in, as this is where speech offers a quite realistic interface rather than forcing people to communicate in a prescribed way. It is much more flexible and allows the customer to provide information in a much more natural manner without going through static paths. NLSR also affects the output of the computer, allowing it to respond in a logical and 'smart' way. The NLSR program can ask appropriate questions or, if you don't seem to be providing the right information, rephrase the question.

The aim of Natural Language Speech Recognition is to allow people to deal with computers in an everyday conversational manner.

For example, customers using the Credit Union Australia service, which is built on Speechworks technology, can simply ask for the balance of a particular account and there is a good chance they will have it read out to them. They can also transfer money between accounts without lifting a finger. In fact, they can do the whole transaction in one sentence without going through complex menu systems or needing to press buttons.
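The "whole transaction in one sentence" idea amounts to pulling several pieces of information out of a single utterance, in whatever order they arrive, and asking only for what is missing. The sketch below is a simplified illustration working on already-transcribed text; the account names and phrasing patterns are assumptions, and it does not represent the Speechworks technology.

```python
import re

ACCOUNTS = ("savings", "cheque")  # hypothetical account names
_ACCOUNT = "(" + "|".join(ACCOUNTS) + ")"


def parse_transfer(utterance: str) -> dict:
    """Extract amount, source and target account from one sentence, in any order."""
    text = utterance.lower()
    amount = re.search(r"(\d+(?:\.\d+)?) dollars", text)
    source = re.search(r"\bfrom (?:my )?" + _ACCOUNT + r"\b", text)
    target = re.search(r"\b(?:to|into) (?:my )?" + _ACCOUNT + r"\b", text)
    return {
        "amount": float(amount.group(1)) if amount else None,
        "source": source.group(1) if source else None,
        "target": target.group(1) if target else None,
    }


def missing_slots(slots: dict) -> list:
    """List the pieces of information the system still needs to ask for."""
    return [name for name, value in slots.items() if value is None]


print(parse_transfer("Transfer 50 dollars from savings to cheque"))
print(parse_transfer("Into my cheque account from savings, move 50 dollars"))
print(missing_slots(parse_transfer("Transfer from savings to cheque")))
```

Both word orders produce the same result, and the third call shows how the system would know to follow up with a question about the amount.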

Regent Taxis on the Gold Coast lets customers call a system that simply asks whether there are four or fewer passengers and whether they are ready now. If this is the case you simply say 'yes' and a taxi will be ordered with no more information required. It knows your address using Caller Line Identification. If you have more than four passengers or you want to book for tomorrow, it will prompt you to answer a couple of questions and book the appropriate taxi at the appropriate time. If it doesn't quite understand, it will ask you to confirm the details. This was a Ve-commerce designed solution utilising the Nuance speech operating system.
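The shape of that conversation can be sketched as a small decision function. This is a guess at the flow from the description above, not the actual Ve-commerce design; the function name, prompts and return format are all hypothetical.

```python
def next_step(address, quick_yes, passengers=None, pickup_time=None):
    """Return ("book", details) when enough is known, otherwise ("ask", question).

    `address` is assumed to have been looked up from the caller's number.
    `quick_yes` is True when the caller confirmed "four or fewer, ready now".
    """
    if quick_yes:
        # The common case: one 'yes' is enough to order the taxi.
        return ("book", {"address": address, "passengers": "four or fewer", "time": "now"})
    if passengers is None:
        return ("ask", "How many passengers are travelling?")
    if pickup_time is None:
        return ("ask", "When would you like the taxi?")
    return ("book", {"address": address, "passengers": passengers, "time": pickup_time})


print(next_step("12 Example St", quick_yes=True))
print(next_step("12 Example St", quick_yes=False))
print(next_step("12 Example St", quick_yes=False, passengers=6, pickup_time="tomorrow 9am"))
```

The design point is that the short path covers the most common request, and extra questions appear only when the caller's needs differ from it.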

Who's out there?
There are potentially three types of company that contribute to making a speech recognition solution.

First, there are those that make the core speech recognition engine and surrounding technology, such as Speechworks, Nuance, Syrinx and Philips. These companies have significant research and development processes which are bent on improving the speech recognition engines. Then there are companies that develop the application that the customer interacts with, such as Inflection Technologies, Appen, Ve-commerce and Holly.

Inflection Technologies focuses solely on speech application design. This involves conducting an application discovery process to ensure that all aspects of the customer experience have been considered before the application is designed. Another company, Appen, provides professional services in application design and tuning, lexicon development and transcription, including for foreign-language customers.

Applications for Natural Language Speech Recognition

Voice interaction can be used as an interface to Internet services, either through phone systems or through Web sites themselves. Telephones are cheaper than computers and are more widely accessible. However, some information is not suitable for delivery through a speech interface, such as pictures and information that needs to be seen all at once to provide a conceptual model. WAP is a key area in which voice is expected to be beneficial as a front end, replacing the very limited keypad input and limited viewable text output.

In-car navigation systems are also an area where speech plays a starring role. However, when designing the system there are a number of usability issues which need to be taken into account. For example, how far ahead do you want to be given the signal to turn: 50 metres, 100 metres or both?
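The "50 metres, 100 metres or both" question is really a configuration choice, which a short sketch makes concrete. The function name and the way distances are supplied are assumptions for illustration; no real navigation product is being described.

```python
def prompts_due(previous_distance, current_distance, announce_at=(100, 50)):
    """Return the announcement distances (in metres) crossed since the last update.

    `announce_at` is the design decision: (100,), (50,) or both.
    """
    return [
        threshold
        for threshold in sorted(announce_at, reverse=True)
        if current_distance <= threshold < previous_distance
    ]


print(prompts_due(120, 90))                     # crossing 100 metres
print(prompts_due(90, 40))                      # crossing 50 metres
print(prompts_due(120, 90, announce_at=(50,)))  # only the 50 metre prompt is configured
```

Testing different `announce_at` settings with real drivers is one way to answer the usability question, since the right lead distance is likely to depend on speed and road type.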


Talkback: 1 comment

  1. Anonymous -- 08/09/04

    Don't forget there are hearing impaired people who cannot hear!!! I'm not impressed by this article about voice communication.

    This technology will leave hearing impaired people without jobs that require voice communication and can affect them in many ways.

    Hearing impaired user!

