Talking to Tomorrow
By Andrea Orr, March 19, 2001
Whether you're outdoors, on the road, or just prefer talking to typing, speech recognition software promises to deliver your spoken command to any device you desire. But challenges exist before the young technology becomes widespread...
New phone services that respond to simple spoken commands and retrieve information on news, sports scores or even the nearest Italian restaurant may seem impressive the first time you use them. But researchers in the area of speech recognition say that technology is relatively simple compared to what they hope to incorporate into mobile devices - some day. For now, however, there are some pretty big problems standing in the way of their visions.
Even under optimal conditions, speech recognition technology will make mistakes, just as the human ear occasionally does. Experts estimate that people typically mishear spoken words about one half of one percent of the time, even when they are standing together in a quiet room with no technology involved. Not perfect, but a success rate people can live with.
Add a cell phone into the equation and the room for error soars. Crisp connections turn blurry, voices fade in and fade out, or a subway passes in the background, drowning out the speaker in the foreground. By most estimates the failure rate over mobile devices is much higher - probably around ten percent.
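The accuracy figures above are usually expressed as word error rate (WER), the standard metric in speech recognition. A minimal sketch of the textbook calculation is below; the sample sentences are invented, and this is not any particular vendor's code.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein (edit) distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("call the nearest italian restaurant",
                      "call the nearest italian restaurant"))  # 0.0
print(word_error_rate("call the nearest italian restaurant",
                      "call a near restaurant"))               # 0.6
```

A half-percent human error rate versus ten percent over a cell phone is a twenty-fold gap by this measure.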
So, even if there were no notion of using cellular phones for anything other than person-to-person conversations, the technology would be in need of some improvement. As it is, companies are pushing a host of new mobile services that depend all the more on computers' ability to understand the spoken word. Voice portals, which accept simple commands like "stocks," "weather," or "traffic," are supposed to bring all the powers of a PC to small devices. But many experts believe they will fail unless companies develop better recognition systems that can understand the spoken word, even when it is spoken from a busy street during rush hour.
Filtering such background, or ambient noise, remains one of the biggest challenges in speech recognition. When computers are able to hear noise, they typically hear it all, whether it is a voice spoken right into the receiver or the collective chatter in a noisy restaurant. And while researchers have started to build fairly good solutions to this problem in some of the biggest and most powerful computers, they are far from bringing the same technology to the mobile devices where it is needed most. Making superior recognition technology compact is just not that easy.
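One classic approach to the ambient-noise problem is spectral subtraction: estimate the noise's frequency spectrum from a noise-only stretch of audio, then subtract it from each frame of the signal. The sketch below makes illustrative assumptions (fixed frame size, a separate noise-only recording, no windowing or overlap); real recognizer front ends are far more sophisticated.

```python
import numpy as np

def spectral_subtract(signal, noise_sample, frame_len=256):
    """Subtract an estimated noise magnitude spectrum from each frame."""
    # Average magnitude spectrum of the noise-only sample.
    noise_frames = noise_sample[: len(noise_sample) // frame_len * frame_len]
    noise_mag = np.abs(np.fft.rfft(noise_frames.reshape(-1, frame_len),
                                   axis=1)).mean(axis=0)

    out = np.zeros(len(signal) // frame_len * frame_len)
    for start in range(0, len(out), frame_len):
        spectrum = np.fft.rfft(signal[start : start + frame_len])
        # Shrink each frequency bin by the noise estimate, flooring at zero,
        # and keep the original phase.
        mag = np.maximum(np.abs(spectrum) - noise_mag, 0.0)
        phase = np.angle(spectrum)
        out[start : start + frame_len] = np.fft.irfft(
            mag * np.exp(1j * phase), n=frame_len)
    return out
```

The catch, as the researchers note, is that even this simple per-frame FFT arithmetic adds up quickly on a device with a small processor and battery.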
"Every time you do something on a computer, a bit of electricity is used up," explains Zulfikar Ramzan, Chief Scientist at Lucira Technologies Inc., a Boston company involved in speech recognition research. "The technology involved in speech recognition is pretty taxing; there are some pretty complicated math formulas involved."
Jordan Cohen, the Chief Technology Officer for Voice Signal Technologies Inc. in Woburn, Mass., agrees. "In mobile devices like cell phones, speech recognition technology comes up against some pretty severe constraints," he says. "I think there are a lot of things we could do with speech recognition which we can't do now because of these constraints."
People who have been led to believe that computers are forever getting smaller and more powerful may be interested to learn that this has not been the case for speech recognition technology. While there have been some improvements over the past ten years, most agree the advances have been incremental, not exponential. And a real breakthrough remains elusive.
"The fundamental technology, the fundamental algorithms really have not changed a lot," explains Marc Cygnus, Senior Technologist at Arial Phone, another speech recognition company. "Until we get to the point where the algorithms have changed we won't have a good way to deal with ambient noise."
In other words, Moore's Law has really not applied to speech recognition processing. "On devices that have a small amount of memory and computing power, there are special issues with speech recognition technology," says Jim Glass, a researcher in the Spoken Language Systems Group at the Massachusetts Institute of Technology in Cambridge, Mass. "I can't see any great breakthroughs on the horizon, to be honest. As the state of the art improves, there will probably be some trickle-down effect in mobile devices, but that may be about it."
Sobering words for people who had thought all technology was on a steep and steady curve of improvement.
There is, of course, an alternative to embedding the speech recognition technology in the device itself. Many systems work by locating the technology on a network server, which frees up a lot of room for processing. But most experts say the server solution creates new problems. For starters, these server-based speech recognizers will fail when the consumer moves out of network range, which would prevent them from offering an "anytime, anywhere" kind of service.
There are other problems with network-based speech recognizers as well. Voice Signal's Jordan Cohen recently outlined some of these problems in an editorial for the computer trade publication eWeek. Localized speech engines, he argued, are far better than network-based systems at learning individual language quirks and adapting to different noise environments. In addition, when speech is translated into data locally, only that data needs to be sent across the network to retrieve a digital response. For consumers who are truly on the go, this is not a minor distinction.
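A back-of-the-envelope comparison shows why that distinction matters. If recognition happens on the device, only a short text query crosses the network instead of the audio itself. The numbers below are illustrative assumptions (8 kHz, 16-bit telephone-quality audio and an invented query), not measurements from any real system.

```python
SAMPLE_RATE = 8000    # samples per second, telephone quality (assumed)
BYTES_PER_SAMPLE = 2  # 16-bit audio (assumed)

def audio_payload_bytes(seconds):
    """Raw audio a server-based recognizer would have to receive."""
    return seconds * SAMPLE_RATE * BYTES_PER_SAMPLE

query = "stock quote for acme"    # what a local recognizer would send instead
audio = audio_payload_bytes(2)    # roughly two seconds of speech

print(audio, "bytes of audio vs", len(query.encode()), "bytes of text")
# → 32000 bytes of audio vs 20 bytes of text
```

On the slow, flaky wireless links of the day, a three-orders-of-magnitude difference in payload is the difference between a usable service and an unusable one.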
"You can ask for a stock quote while running to catch a cab and get the answer in digits, on your screen and in the memory of the phone," explains Cohen.
If such services are still not fine-tuned, Cohen and most other researchers say they have good reason to continue looking for a breakthrough. If and when they find a way to squeeze more speech recognition computing power into the smallest portable devices, they say they will be able to offer services infinitely more sophisticated than the voice portals of today, which generally accept only the simplest commands. Consumers may be impressed the first time they successfully locate an Italian restaurant using a voice portal, but the speech recognition technology supporting such services is nothing compared to the services people like Glass and Cohen see on the distant horizon.
Cohen, for instance, argues that the technology needs to become good enough that all Web interaction from a mobile phone can be conducted by voice. "I'd like to make the technology in cell phones good enough so that you can talk your way through all the applications," he says. "If you've ever typed a long message using a number pad, you'd understand it's something you don't want to do. The cell phone is an audio device."
And many other devices could also be made into audio devices, insists Arial Phone's Marc Cygnus. "Oh yes, there is much more that could be done," he says.
"My vision of speech recognition really doesn't even include picking up the phone. It has my house and my entire environment responding to me with voice. And clearly, that is decades off."
A promising future
Maybe. But it is this future vision that keeps researchers enthusiastic. Jordan Cohen says his company is involved in speech recognition in embedded systems like light switches, toys and cars, and believes that unless such services are developed, speech recognition technology could remain practically irrelevant in people's everyday lives.
"If I could do all my email by dictation, that would really be a big win," says Cohen. "But the current protocols that offer this have been a bomb because they are just not very good."
Other researchers are looking much further into the future. Jim Glass of M.I.T. notes that his research group is named the Spoken Language Systems Group because it aims to eventually make computers that interact with humans in a far more natural fashion than any computer does today. Speech recognition technology, he says, is only the first step in achieving that goal. He thinks machines will eventually need to understand the meaning of spoken words, and respond in voices that are less robotic than those of today's computers.
"Right now, people have speech recognition on the brain and they forget everything else required for humans to interact with a computer naturally. Just think of all the things that go on in a dialogue between two people."
Not easy to duplicate, but Glass believes researchers must try if speech recognition is ever to be more than a fun but marginal computer application.
"I think there is a vast potential use for speech in the context of human interaction to extract information and conduct transactions, but speech recognition is only one part of that. There are a lot of other speech synthesis technologies that will have to come into play."
"Right now, what good is speech recognition, aside from dictating letters?" Glass argues. "What part of your day do you spend dictating letters?"
From Silicon Valley, Andrea Orr covers developments in the mobile world for TheFeature. She is also a correspondent for Reuters in the Palo Alto, California, bureau.