This story was printed from ZDNet Australia.
Speech technology: 'Hey, I'm talkin' to you!'
By Oliver Weidlich, Performance Technologies Group
October 12, 2001
URL: http://www.zdnet.com.au/news/communications/soa/Speech-technology-Hey-I-m-talkin-to-you-/0,130061791,120261133,00.htm
It was one of the first promises made at the beginning of the PC revolution: computers that could understand human speech and speak back. Nevertheless, voice recognition has had problems making the leap from Hollywood fantasy to the real world. ZDNet Australia takes a look at the current state of play.
Most of us take the gift of speech for granted, but what about talking to computers, or talking through a telephone to a recorded voice on a computer? Do people want to be able to talk to computers? How do they want to talk to them? Even stranger is the notion of computers talking back! Although speech interfaces have been around for some time, it is only now that they are actually being used in a wide variety of applications and delivered through different technologies to provide different services. In the following article, we take a look at the ways in which people communicate with each other in order to arrive at a better understanding of how we can communicate with computers. We then provide a brief overview of speech interfaces currently in use, their differences, and the issues that surround each system. The three main areas of speech interface examined are:
Text-to-Speech (TTS)
Interactive Voice Response (IVR)
Speech Recognition
Whether it is dictating to a PC, an over-the-phone credit card payment system, a taxi booking system, or a natural language banking service, speech offers a method of interaction with computers and intelligent devices that can save us significant time, effort and even physical pain (such as RSI). Finally, this article looks at the usability of speech interfaces and what needs to be considered when developing Speech IVRs and Natural Language Speech Recognition interfaces.

How we communicate

Before we take a look at what is available, let's consider what it takes to talk and listen. Communicating with devices via speech is probably more difficult than you think. Consider the amount of information that is conveyed between two people in a vocal exchange.

To start with, some 70 percent of communication between people is through non-verbal cues, such as facial expressions and body language. This obviously cannot be conveyed through speech interfaces.

Secondly, there is a lot of information provided by the human voice itself. Consider when you are talking to someone on the telephone: you can usually tell if they are happy, sad or in a rush through their intonation, pitch, pronunciation and punctuation. While all this information might not be relevant when communicating with a computer, some aspects are, such as the punctuation and intonation (i.e. a statement vs. a question). The computer cannot always identify where you have finished a sentence and where the next one starts.

Finally, remember that the human brain's ability to process speech is amazing when it comes to understanding what we are saying, while a computer simply doesn't yet have the processing power or algorithms to accurately deal with everything a human might be trying to convey.

Not only is it the words and how they sound that provide information, but it is also the order that we put them in. There are certain guidelines to interaction that psychologists call 'scripts'.
These define the order in which things generally take place and assist us in determining the right information at the right time. For example, when you go to a restaurant, there are certain procedures that are usually followed. When the waiter first asks you what you would like to order, you know it is the start of the meal so you may order an entree; when they ask again you may order a main; and the third time you order the double chocolate mud cake. A lot is also dependent upon the context of the scenario.

These guidelines also shape the interaction by narrowing down the possibilities. The waiter may ask what you would like to eat and you respond with fish; he may then ask which type of fish you would like and then how you would like it cooked: grilled or battered. This process is important as it allows the agent, in this case the waiter, to quickly identify the exact item you want rather than having to read through all the items and all the options.

It's no wonder then that it takes a lot to develop a high quality, usable and effective speech interface, drawing on the talents of programmers, psychologists, linguists, anthropologists and sociologists to ensure that the person on the other end can communicate with a box of wires.

What can speech offer?

For all the trouble they take to successfully develop, speech interfaces can offer a number of benefits over other types of interfaces, including:

Speed of text entry - When accuracy is achieved, speech is a much faster method of entering text for most customers, especially for people who may not have much interaction with computers or similar types of devices. It requires very little learning or training for most people.
Allows for different form factors - Keyboards are sometimes impractical in terms of their size or shape. Accurate speech systems allow smaller devices that have no keyboard, and can therefore allow for all different types of form factors.
Reduced physical strain - By eliminating the keyboard as a text input device, you reduce the risk of physical strain such as repetitive strain injury.
Reduced concentration on one place - Speech interfaces can offer a suitable method of interaction when the customer may be visually concentrating on something else and the speech provides support information. I know I'm always looking at the keyboard when I type and then when I look up all the words are spelt funny. With speech recognition, the person can also be doing other things with their hands, as they don't need to be typing, such as making gestures on a touch screen.
Easy to access - The most common speech interface device is the telephone, which most people have, if not a mobile phone as well. This reduces the cost of accessing the information and therefore increases the chance of the service being used. By using a telephone as the technology, customers can access services from virtually anywhere in the world. Speech interfaces can also be provided in multiple languages so the customer can be accommodated in the most appropriate manner.

Speech technologies compared

There are a number of different speech interfaces, which perform different functions and are available via different technologies.

Text-to-Speech (TTS)

Text-to-speech is the ability of a computer or system to take a normal written sentence and turn it into spoken words. This has been around for some time, but only recently has it started to improve. The Macintosh OS has featured text-to-speech capabilities for a number of years. Error messages can be read out, and integration with applications allows documents and web pages to be read out as well. Windows 2000 now provides similar functionality via Narrator. There are also special purpose programs, such as JAWS, that allow visually impaired people to use computers and navigate the World Wide Web with the information read out by a synthesised voice.
This technology provides a method of making information accessible to a wide range of people who may not otherwise have been able to view it.

The main issue with text-to-speech systems is that they usually sound very unnatural due to the synthesised voice that is used, while the intonation of the voice can also be quite monotone. When concatenated speech is used (recorded sound bites of a real person's voice strung together), it can also sound very unnatural, as the words do not seem to flow as they do in regular speech. For example, when some systems read out phone numbers it is particularly obvious that the speech is concatenated, as the start and stop sounds of the individual numbers do not fit well together when in a sequence.

Interactive Voice Response

Interactive Voice Response (IVR) systems are the type of speech interface most of us have come across at least once. They are the recorded voice at the end of the phone you may encounter when you call to pay a telephone bill or get in touch with your bank. They involve a menu of options with the ability to select the appropriate option by using the telephone keypad. Some IVRs include a basic speech recognition system, which means that instead of pressing one, the user has to say "one" into the telephone. This doesn't really provide much value and increases the chance for error.

Issues with IVRs are that they are very structured and rigid, and make the customer choose options that may seem irrelevant, often leaving them to guess which is the right option. People often find IVRs frustrating as they are forced to interact with the service or products the way the company wants them to, which may not be the way they view it. Recent usability studies show that the company usually has a 'silo' view of the products, while customers look at them in a completely different way. When the new model was tested, it showed significant ease of use and customers were more comfortable with the system.
Obviously, this can have a significant impact on the business. Some IVRs are much better than others, with clear paths and appropriate language and terminology.

Speech Recognition
Dictation

However, unlike the good ear of the secretary, computers can have issues with determining things like context, which is necessary for the spelling of words (e.g. there vs. their), and pronunciation. Computers also have problems interpreting verbal punctuation. You can't finish a sentence and expect the computer to know to put a full stop; you have to say 'full-stop', which in itself is quite unnatural.

There are two types of dictation software: discrete and continuous. Discrete requires you to place pauses between the words in order to allow the computer to process a word at a time. Continuous is the more natural way of speaking to the computer as it is not as jerky and is therefore faster. Additionally, continuous speech recognition sometimes allows context to be determined as the words are processed in groups rather than individually.

Most dictation systems have to be trained to gain the best possible accuracy. This can take a number of hours and most people just couldn't be bothered. Training also limits the amount of variation that can be allowed in the voice. The voice can change significantly, for example, when a person is sick or when a male is going through puberty.

Current offerings include Dragon Naturally Speaking, IBM Via Voice, Philips FreeSpeech, Lernout & Hauspie Voice Xpress, and MacSpeech. Microsoft claims on its Web site that Office XP incorporates both dictation and command speech interfaces via the Speech.net API.

Speech Commands

One of the simplest examples of speech commands is pattern recognition, which is a common feature on some high end mobile phones. It allows you to say the name of the person you want to call and the phone then dials the number. Many users have trouble making it work consistently. This is often because the initial input isn't clear enough or background noise interferes with the input.

Voice commands are also available via some operating systems.
If you own a Macintosh, the speech capabilities are quite impressive, especially with OS X 10.1. Voice commands are also appropriate for smaller devices, allowing an easier method of navigation, especially when on the go, as they then only require one hand to hold the device. The new Compaq iPaq, for example, includes IBM Via Voice Command and Control software that is used with the Calendar, Contacts and Inbox applications. Dictation software for the handheld will be a major benefit given the difficulty of current handwriting text input methods and onscreen keyboards.

Natural Language Speech Recognition

While most IVR speech recognition interfaces, such as those described above, rely on a fairly limited vocabulary and very set patterns of interaction, some have the ability to understand a much wider vocabulary and indeed whole phrases, which may contain more than one piece of information and not necessarily in the right order. This type of speech interaction is called Natural Language Speech Recognition (NLSR).

This is the exciting space to be in, as this is where speech offers a quite realistic interface compared to communicating in a prescribed way. It is much more flexible and allows the customer to provide information in a much more natural manner without going through static paths. NLSR also affects the output of the computer, allowing it to respond in a logical and 'smart' way. The NLSR program can ask appropriate questions or, if you don't seem to be providing the right information, rephrase the question. The aim of Natural Language Speech Recognition is to allow people to deal with computers in an everyday conversational manner.

For example, customers using the Credit Union Australia service, which is built on the underlying Speechworks technology, can simply ask for an account balance on a particular account and there is a good chance they will get the account balance read out to them. You can also transfer money between accounts without lifting a finger.
In fact, you can do the whole transaction in one sentence without going through complex menu systems or needing to press buttons.

Regent Taxis on the Gold Coast allows users to call, and the system will simply ask if there are four or fewer passengers and whether they are ready now. If this is the case you simply say 'yes' and a taxi will be ordered with no more information required. It knows your address using Caller Line Identification. If you have more than four passengers or you want to book for tomorrow it will prompt you to answer a couple of questions and book the appropriate taxi at the appropriate time. If it doesn't quite understand it will ask you to confirm the details. This was a Ve-commerce designed solution utilising the Nuance speech operating system.
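To make the taxi example more concrete, the call flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Regent Taxis or Nuance implementation: the `ask` function stands in for the entire speech layer (play a prompt, return the recognised reply as text), and the caller table, phone number, address, prompt wording and the hand-over to an operator are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-in for a Caller Line Identification lookup.
KNOWN_CALLERS = {"0755550100": "12 Example St, Surfers Paradise"}

YES_WORDS = {"yes", "yeah", "yep"}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}


@dataclass
class Booking:
    address: str
    max_passengers: int
    pickup: str  # "now", or whatever time the caller gave


def parse_count(reply: str) -> Optional[int]:
    """Turn a recognised reply such as 'six' or '6' into a number, if possible."""
    word = reply.strip().lower()
    if word.isdigit():
        return int(word)
    return NUMBER_WORDS.get(word)


def book_taxi(caller_number: str, ask: Callable[[str], str]) -> Optional[Booking]:
    # Use the caller's number to find the address; only ask if it is unknown.
    address = KNOWN_CALLERS.get(caller_number) or ask("What is your pickup address?")

    # Fast path: one yes/no question covers the most common booking.
    reply = ask("Are there four or fewer passengers, ready now?")
    if reply.strip().lower() in YES_WORDS:
        return Booking(address, 4, "now")

    # Slow path: collect the missing details, re-asking once if not understood.
    passengers = None
    for prompt in ("How many passengers?", "Sorry, how many passengers was that?"):
        passengers = parse_count(ask(prompt))
        if passengers is not None:
            break
    if passengers is None:
        return None  # assumption: a real service would transfer to an operator

    pickup = ask("When would you like the taxi?").strip()

    # Confirm the details before committing the booking.
    confirmed = ask(f"A taxi for {passengers}, {pickup}. Is that right?")
    if confirmed.strip().lower() in YES_WORDS:
        return Booking(address, passengers, pickup)
    return None
```

The point of the design is the one the article makes: the most common request is settled with a single 'yes', and the longer sequence of questions only appears when the caller's situation needs it.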
Applications for Natural Language Speech Recognition

Voice interaction can be used as an interface for Internet services through phone systems or Web sites themselves. Telephones are cheaper than computers and are more widely accessible. However, there is information that would not be suitable for delivery through a speech interface, such as pictures and information that needs to be seen all at once to provide a conceptual model.

WAP is a key area in which voice is expected to be beneficial as a front end, rather than having to use the very limited keypad input and limited viewable text output.

In-car navigation systems are also an area where speech plays a starring role. However, when designing the system there are a number of usability issues which need to be taken into account, for example, how far ahead you want to be given the signal to turn (50 metres, 100 metres or both).

Usability issues with Speech Interfaces

Speech is a wonderful thing, but it's very difficult to design a speech interface that works well. We can't usually tell if a typed message has been generated by a person or a computer, but with speech, it is very easy to identify it as computer-generated. Firstly, most of the speech recognition and delivery engines haven't quite got the logic behind them yet, which results in the computer not responding in a way that makes sense to the user. Secondly, the way you have to speak to the computer or telephone (slow, steady and sometimes in an American accent) may be unnatural. Thirdly, the way the computer talks may not be smooth or have the right intonation and, again, the interaction doesn't 'sound' or feel natural.

There are a number of issues that need to be addressed in relation to the usability of speech interfaces and some of these are outlined below.
The future of speech interfaces

So, what have we got to look forward to? Hopefully not a world where we have to repeat things constantly and speak in an American accent. HAL, in 2001: A Space Odyssey, is always held up as the benchmark for speech interaction, but to get to this level requires faster computers with more processing power, and better speech processing algorithms. There's still plenty of work left for the psychologists and linguists.

Tim Courtright, Inflection Technologies, notes that in the future, voice systems will be networked so that you can surf voice networks in a similar way to the Internet in order to conduct associated searches even though they may be from different countries. For example, you might hire a car from A to B. After this is done the computer might ask if you would like to look for a hotel in the area or a local attraction, or you could just ask, "Is there a Flag Inn around?" This will rely on open standards such as VXML or common connection interfaces between the programs themselves.

One day we'll be able to interact quickly and easily with devices in a natural manner, with the computer able to detect pauses and pronunciation and accurately deliver them to the page. Commands will be simpler through better training or through better recognition systems and, of course, the computer will improve in its ability to learn and adapt to the customer's speech patterns.

"Computer - Log Off"
About The Performance Technologies Group

The Performance Technologies Group improves the experience people have when using technology. Our focus is on creating usable, technically robust and business effective technology through independent usability, technical and business consulting solutions. With a team of psychologists, technologists, and business specialists, we address the key problems associated with the development of technology such as: cost effectiveness; customer satisfaction; quality, performance, and reliability; value creation for customers; business performance; accessibility for people with a disability; and usability. For further information, please call The Performance Technologies Group on (02) 8920 2288, fax (02) 8920 0488, email info@performancetechnologies.com, or visit their Web site at www.performancetechnologies.com.
Copyright © 2009 CBS Interactive, a CBS Company. All Rights Reserved.