Say Anything (Within Reason)
By Mark Frauenfelder, Mon Oct 28 00:00:00 GMT 2002
Call it wishful thinking. For centuries, people have dreamed of machines that could converse as easily as people do.
Imagine the value of a
technology that could convert your utterances into text, extract the
gist of what you are saying, and respond appropriately. Such a device
could completely change the way we interact with computers, with the
potential of making keyboards, touch-tone pads, and mice obsolete.
High-quality, inexpensive natural language processing
could also be very useful for mobile phone users, who could use voice
recognition systems to navigate voice portals or get other information
without having to resort to using the ridiculously tiny keypads that
come with wireless devices.
I get my first taste of the natural language
processing future on a warm Wednesday morning in Menlo Park, California.
I'm sitting in front of Starbucks making a call to Sprint PCS. I
have a question about the phone bill. A perky service representative
answers, identifying herself as Claire. "Briefly tell me how I can
help you today," she says. "I lost my bill and I don't
know how much I owe," I tell Claire. "Account
information!" she enthuses. She then asks me to key in my zip code
for verification, and after a couple of seconds, announces, "OK,
Claire is remarkably effervescent, especially at
this hour of the morning. In fact, she's just as bubbly to every
one of the 20 million callers she talks to every month. That's
because Claire isn't real. She's a "virtual service
representative" designed to direct Sprint PCS subscribers through
various account functions. Callers talk to Claire as they would to a
real customer support representative, and she does everything her human
counterpart is supposed to do except cop an attitude. And if she gets
stumped, she'll turn you over to a non-virtual assistant.
A company called Nuance designed Claire's underlying technology.
Located in a bland one-story building in a nondescript business park
near Stanford University, Nuance spun off from SRI International
(formerly the Stanford Research Institute) in 1994 to develop speech
recognition, voice authentication, and text-to-speech software for
customer support systems.
Claire, who started taking calls for
Sprint PCS in November, uses Nuance's "Say Anything"
technology, which is the latest development in the evolution of
automated telephone support systems. "Say Anything" goes a
step beyond "directed dialogue," the all-too-familiar voice
prompt system that constrains what callers can say by asking for
specific answers to questions: "To pay your bill, say 'pay
bill.' To request a copy of your statement, say
Nuance's natural language system
typically starts out by asking callers, "What can I do for
you?" and then listens for certain key words, ignoring everything
else. Wally Brill, whose bleached and buzzed pate makes him look more
like an electronica pop star than Nuance's director of persona
design and production, explains that the system was developed, in part,
to overcome the fact that people don't use perfect grammar when
they talk. For instance, he says, a customer calling in to report a lost
credit card could conceivably say something like "Uh, a UFO landed
in our backyard and my wife was abducted by aliens! She had her purse
with her and when they let her go she noticed her credit card was
missing!" Say Anything will look through the database of previously
recorded conversations, discover that the words "card"
and "missing" often show up in requests to report a lost or
stolen card, and direct the person to the proper department. "It
ignores the stutters, ums and uhs," says John Shae, Nuance's
vice president of product marketing and management. "Earlier
systems would get screwed up," trying to process the superfluous
Before Say Anything can do anything, however, the
system must first digitize callers' speech. It doesn't
actually store the words themselves. It converts them into acoustical
models made of phonemes (phonetic bits of speech, such as "ah"
or "buh") that make up each word. "The system looks at
the energy levels of speech and compares them to the dictionary of
energy patterns already loaded into the database," says Shae.
"There's a unique energy pattern associated with each
This particular way of converting speech
isn't really anything new. It's the way in which Say Anything
processes the speech that's different. To explain how Nuance's
system differs from traditional systems, Shae gives the example of an
airline reservation system. The standard method, he explains, would be
to construct a database that contains hand-coded grammatical rules that
can detect every conceivable variation of the phrase "I want to fly
to" followed by the name of a city. With Nuance's "Say
Anything," the system developer starts out by letting a large
number of test users call the system and make requests, recording what
they say. The requests are sorted into different categories: itinerary
changes, cancellation, frequent flier account, etc., and stored in the
database. "There's no structured grammar file," says
Shae. "We just throw them into the system." Then, when real
users call into the system, their requests are compared to the recorded
requests, and using statistical models, the callers are directed to the
(hopefully) proper service area.
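
The routing step Shae describes can be sketched in a few lines of Python. This is only an illustration, not Nuance's implementation: the sample requests, the category labels, the filler-word list, and the choice of a naive Bayes word-count model are all assumptions made for the sake of the example. The article says only that recorded requests are sorted into categories and that new requests are compared against them with statistical models.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical transcripts of test-caller requests, each hand-sorted into
# one of the service categories mentioned in the article.
TRAINING = [
    ("i need to change my flight to an earlier one", "itinerary_change"),
    ("can i move my trip to friday", "itinerary_change"),
    ("i want to cancel my reservation", "cancellation"),
    ("please cancel my flight", "cancellation"),
    ("how many miles do i have in my account", "frequent_flier"),
    ("check my frequent flier miles balance", "frequent_flier"),
]

# Assumed list of filler words to drop before scoring.
FILLERS = {"uh", "um", "er", "ah"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop filler words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in FILLERS]


def train(examples):
    """Count how often each word appears in each category."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    vocabulary = set()
    for text, category in examples:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        category_counts[category] += 1
        vocabulary.update(tokens)
    return word_counts, category_counts, vocabulary


def route(text, model):
    """Return the category whose recorded requests best match the text."""
    word_counts, category_counts, vocabulary = model
    total = sum(category_counts.values())
    tokens = [w for w in tokenize(text) if w in vocabulary]  # ignore unseen words
    best_category, best_score = None, -math.inf
    for category, count in category_counts.items():
        score = math.log(count / total)
        denominator = sum(word_counts[category].values()) + len(vocabulary)
        for word in tokens:
            # Add-one smoothing so a word missing from one category
            # does not zero out that category's score.
            score += math.log((word_counts[category][word] + 1) / denominator)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = train(TRAINING)
print(route("Um, I uh want to cancel my reservation please", model))
```

Nothing here resembles a grammar file. Adding a new way of asking for something means adding another recorded request to the pile, which is the trade-off Shae is pointing to: less hand-coding, more data collection.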
"This is the direction we
are headed in," says Dan Hawkins, a managing analyst for voice
business at Datamonitor in New York. Hawkins says that Nuance's Say
Anything (along with competitor Speechworks' "How May I Help
You?" technology) is, in certain cases, a big improvement over
directed dialogue systems. Take phone-based betting. Rather than having
to go through a large number of steps ("What race would you like to
bet on?" "What kind of bet would you like to place?"
"Which horses?" "In what order?" "How much
would you like to bet?"), you can just tell the system, "$10
on horse three to place, in the second race." The betting system
would analyze the sentence, making sure that all the empty
"buckets" of information were filled by the request, and if
not, it would ask the caller for the missing information: "OK, $10
on horse number three, second race. Matinee or evening
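
The "buckets" idea is easy to sketch in Python. What follows is a toy, not how any real betting line works: the slot names, the regular expressions, the small number-word table, and the prompt wording (borrowed from the directed-dialogue questions above) are all assumptions for illustration.

```python
import re

# Assumed mapping from spoken numbers to integers; a real system would need more.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
}

# Hypothetical pattern for each bucket the bet needs filled.
SLOT_PATTERNS = {
    "amount": r"\$(\d+)",
    "horse": r"horse\s+(?:number\s+)?(\w+)",
    "bet_type": r"\bto\s+(win|place|show)\b",
    "race": r"(\w+)\s+race",
    "session": r"\b(matinee|evening)\b",
}
NUMERIC_SLOTS = {"amount", "horse", "race"}

# Follow-up question to ask when a bucket is still empty.
PROMPTS = {
    "amount": "How much would you like to bet?",
    "horse": "Which horse?",
    "bet_type": "What kind of bet would you like to place?",
    "race": "What race would you like to bet on?",
    "session": "Matinee or evening?",
}


def to_number(token):
    """Convert '10' or 'three' to an int; return None if unrecognized."""
    return int(token) if token.isdigit() else NUMBER_WORDS.get(token)


def fill_slots(utterance, slots=None):
    """Fill any still-empty buckets from the caller's utterance."""
    slots = dict(slots or {})
    for name, pattern in SLOT_PATTERNS.items():
        if name in slots:
            continue  # bucket already filled by an earlier turn
        match = re.search(pattern, utterance, re.IGNORECASE)
        if not match:
            continue
        value = match.group(1).lower()
        if name in NUMERIC_SLOTS:
            value = to_number(value)
        if value is not None:
            slots[name] = value
    return slots


def next_prompt(slots):
    """Return the question for the first empty bucket, or None if all are filled."""
    for name in SLOT_PATTERNS:
        if name not in slots:
            return PROMPTS[name]
    return None


bet = fill_slots("$10 on horse three to place, in the second race.")
print(bet)
print(next_prompt(bet))
bet = fill_slots("Evening", bet)
print(next_prompt(bet))
```

The caller's one sentence fills four of the five buckets, so the system only has to ask about the one that is left, rather than walking through every question in order.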
As it turns out, a well-designed user
interface can go a long way in making up for shortcomings in the kind of
speech recognition technology that Nuance uses. It can convey the
impression that the system understands what a caller is saying, even
though it's merely looking for key words it can use to determine
what callers want. Really, Nuance's system isn't much
different than a touch-tone menu, in which users are trapped in voice
The trick is to make users think that they
aren't trapped, but are in an open, free-form conversation. Nuance
and Speechworks (an MIT Media Lab spin-off) achieve this by combining
their statistical-based language processing systems with cleverly
constructed user interfaces designed to elicit utilizable information
from callers. "We are faking an open interaction, even though it is
pretty closed," says Brill. "We want to make the customer feel
they are in charge, rather than being led through the nose by the
To add more realism to this useful illusion, the
virtual assistants in Nuance and Speechworks use carefully scripted
personas, designed to allow the caller to interact with them in much the
same way that they would with a real person. "Speech recognition
systems are social actors," says Blade Kotelly, Speechworks'
manager of worldwide solutions marketing. "While we never try to
pretend a real person is on the other end of the line, we do try to
clearly make sure that people are working with a system that is treating
them in a certain way, and that people feel that they are being treated
well by the system."
Nuance even goes so far as to give its
personas "back stories," describing the personas'
hobbies, occupation, and interests, so that voice actors can imbue the
personas with consistent behavior.
All this makes for entertaining phone
conversation. But how useful is it, really? Can a company improve its
bottom line by using a virtual assistant to handle its customer
Yes, say the experts, but only in certain cases.
Natural speech processing is still too young (Sprint PCS is Say
Anything's only customer so far, and Speechworks has yet to
announce any "How May I Help You?" deployments) to yield any
useful metrics. But on the basis of "anecdotal feedback,"
says Hawkins, callers using natural language systems are able to get to
the information they need more quickly than callers using directed
dialogue systems. But he warns that deploying a natural speech system is
"a tall order. It requires lots of work for months and
months," mainly because it is "very difficult to predict what
people are going to say. The variety of ways that people can ask for the
same information is remarkable."
Aurica Yen, a senior
analyst who covers enhanced productivity applications and services for
the Yankee Group, agrees that while natural speech applications are
"a good thing," they are limited. "It's not going to
apply for everything. There's a limit to the applications that will
result in efficient services, and there's some concern that this
type of technology could be inappropriately sold to companies that
don't need it."
Nuance and Speechworks say
they're only interested in using natural language in places where
it makes sense to do so. And Hawkins says there are very few places
right now where it does make sense. "99 times out of 100, directed
dialogue is better. The increased work [required to deploy a natural
speech system] is going to add to the cost." However, he says, the
cost of a natural speech system is less than that of employing a warehouse full
of human support staffers.
Which is why there's a future
for natural speech technology. Claire, who can handle 4,000 calls at
once, never gets grumpy and, more importantly, won't be asking her
boss for a raise anytime soon.
Frauenfelder is a writer and illustrator from Los