[guispeak] Fwd: Synthesizing human emotions

  • From: Andy Baracco <wq6r@xxxxxxxxxxxxxx>
  • To: guispeak@xxxxxxxxxxxxx
  • Date: Thu, 09 Dec 2004 21:35:05 -0800

Things have come a long way from the days of the Apple Echo synthesizer (I had one of those) to modern natural-sounding synths like NeoSpeech and the AT&T Natural Voices. Though sometimes they still sound annoyed to be forced into reading for us, smile. This would be quite neat.

    AP Worldstream
Monday, November 29, 2004

Synthesizing human emotions

By Michael Stroh, Sun Staff

Speech: Melding acoustics, psychology and linguistics, researchers teach
computers to laugh and sigh, express joy and anger.

Shiva Sundaram spends his days listening to his computer laugh at him.
Someday, you may know how it feels.

The University of Southern California engineer is one of a growing number of
researchers trying to crack the next barrier in computer speech synthesis -
emotion. In labs around the world, computers are starting to laugh and sigh,
express joy and anger, and even hesitate with natural ums and ahs.

Called expressive speech synthesis, "it's the hot area" in the field today,
says Ellen Eide of IBM's T.J. Watson Research Center in Yorktown Heights,
N.Y., which plans to introduce a version of its commercial speech
synthesizer that incorporates the new technology.

It is also one of the hardest problems to solve, says Sundaram, who has
spent months tweaking his laugh synthesizer. And the sound? Mirthful, but
still machine-made.

"Laughter," he says, "is a very, very complex process."

The quest for expressive speech synthesis - melding acoustics, psychology,
linguistics and computer science - is driven primarily by a grim fact of
electronic life: The computers that millions of us talk to every day as we
look up phone numbers, check portfolio balances or book airline flights
might be convenient but, boy, can they be annoying.

Commercial voice synthesizers speak in the same perpetually upbeat tone
whether they're announcing the time of day or telling you that your
retirement account has just tanked. David Nahamoo, overseer of voice
synthesis research at IBM, says businesses are concerned that as the
technology spreads, customers will be turned off. "We all go crazy when we
get some chipper voice telling us bad news," he says.

And so, in the coming months, IBM plans to roll out a new commercial speech
synthesizer that feels your pain. The Expressive Text-to-Speech Engine took
two years to develop and is designed to strike the appropriate tone when
delivering good and bad news.

The goal, says Nahamoo, is "to really show there is some sort of feeling
there." To make it sound more natural, the system is also capable of
clearing its throat, coughing and pausing for a breath.

Scientist Juergen Schroeter, who oversees speech synthesis research at AT&T
Labs, says his organization wants not only to generate emotional speech but
to detect it, too.

"Everybody wants to be able to recognize anger and frustration
automatically," says Julia Hirschberg, a former AT&T researcher now at
Columbia University in New York.

For example, an automated system that senses stress or anger in a caller's
voice could automatically transfer a customer to a human for help, she says.
The technology also could power a smart voice mail system that prioritizes
messages based on how urgent they sound.
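The routing and prioritization ideas above are simple to express in code. This sketch is purely illustrative: the article names no actual API, so the detector score, threshold, and function names here are all hypothetical stand-ins.

```python
def route_call(stress_score, threshold=0.7):
    """Escalate a caller to a human agent when detected stress or anger
    exceeds a threshold; otherwise stay with the automated system.
    `stress_score` (0.0-1.0) would come from a hypothetical emotion
    detector of the kind Hirschberg describes."""
    return "human_agent" if stress_score >= threshold else "automated_system"

def prioritize_voicemail(messages):
    """Order voicemail most-urgent-first, given (message_id,
    urgency_score) pairs from the same hypothetical detector."""
    return sorted(messages, key=lambda m: m[1], reverse=True)

# An angry caller gets escalated; a calm one stays automated.
angry = route_call(0.9)
calm = route_call(0.2)
inbox = prioritize_voicemail([("msg1", 0.1), ("msg2", 0.8), ("msg3", 0.4)])
```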

Hirschberg is developing tutoring software that can recognize frustration
and stress in a student's voice and react by adopting a more soothing tone
or by restating a problem. "Sometimes, just by addressing the emotion, it
makes people feel better," says Hirschberg, who is collaborating with
researchers at the University of Pittsburgh.

So, how do you make a machine sound emotional?

Nick Campbell, a speech synthesis researcher at the Advanced
Telecommunications Research Institute in Kyoto, Japan, says it first helps
to understand how the speech synthesis technology most people encounter
today is created.

The technique, known as "concatenative synthesis," works like this:
Engineers hire human actors to read into a microphone for several hours.
Then they dice the recording into short segments. Measured in milliseconds,
each segment is often barely the length of a single vowel.

When it's time to talk, the computer picks through this audio database for
the right vocal elements and stitches them together, digitally smoothing any
rough transitions.
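The stitch-and-smooth step can be sketched in a few lines. This toy version uses made-up constant-amplitude "segments" and a simple linear crossfade at each seam; it shows the concatenation idea only, not any vendor's actual unit-selection engine.

```python
import numpy as np

def concatenate(segments, overlap=32):
    """Join audio segments end to end, crossfading `overlap` samples
    at each seam to smooth rough transitions between units."""
    out = segments[0].astype(float)
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    for seg in segments[1:]:
        seg = seg.astype(float)
        # Blend the tail of the output with the head of the next segment.
        out[-overlap:] = out[-overlap:] * fade_out + seg[:overlap] * fade_in
        out = np.concatenate([out, seg[overlap:]])
    return out

# Tiny demo: three fake vowel-length segments of differing amplitude.
segs = [np.ones(100), 0.5 * np.ones(100), np.ones(100)]
speech = concatenate(segs)
```

A real engine would search a large database for the best-matching units before this joining step; here the "selection" is hard-coded.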

Commercialized in the 1990s, concatenative synthesis has greatly improved
the quality of computer speech, says Campbell. And some companies, such as
IBM, are going back to the studio and creating new databases of emotional
speech from which to work.

But not Campbell.

"We wanted real happiness, real fear, real anger, not an actor in the
studio," he says.

So, under a government-funded project, he has spent the past four years
recording Japanese volunteers as they go about their daily lives.

"It's like people donating their organs to science," he says.

His audio archive, with about 5,000 hours of recorded speech, holds samples
of subjects experiencing everything from earthquakes to childbirth, from
arguments to friendly phone chat. The next step will be using those sounds
in a software-based concatenative speech engine.

If he succeeds, the first customers are likely to be Japanese auto and toy
makers, who want to make their cars, robots and other gadgets more
expressive. As Campbell puts it, "Instead of saying, 'You've exceeded the
speed limit,' they want the car to go, 'Oy! Watch it!'"

Some researchers, though, don't want to depend on real speech. Instead, they
want to create expressive speech from scratch using mathematical models.
That's the approach Sundaram uses for his laugh synthesizer, which made its
debut this month at the annual meeting of the Acoustical Society of America
in San Diego.

Sundaram started by recording the giggles and guffaws of colleagues. When he
ran them through his computer to see the sound waves represented
graphically, he noticed that the sound waves trailed off as the person's
lungs ran out of air. It reminded him of how a weight behaves as it bounces
to a stop on the end of a spring. Sundaram adopted the mathematical
equations that explain that action for his laugh synthesizer.
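The article doesn't publish Sundaram's equations, but the behavior he describes, an oscillation whose amplitude dies away like a weight bouncing to rest on a spring, is the classic damped oscillator. This sketch uses illustrative decay and frequency values, not his actual parameters:

```python
import math

def laugh_envelope(t, amp=1.0, decay=3.0, freq=4.0):
    """Amplitude of a damped oscillation at time t (seconds): an
    exponential decay (the lungs running out of air) modulating a
    periodic burst pattern (the repeated laugh notes)."""
    return amp * math.exp(-decay * t) * abs(math.cos(2 * math.pi * freq * t))

# Sampled over one second, the envelope trails off like a recorded laugh.
envelope = [laugh_envelope(t / 10) for t in range(11)]
```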

But Sundaram and others know that synthesizing emotional speech is only part
of the challenge. Yet another is determining when and how to use it.

"You would not like to be embarrassing," says Jurgen Trouvain, a linguist at
Saarland University in Germany who is working on laughter synthesis.

Researchers are turning to psychology for clues. Robert R. Provine, a
psychologist at the University of Maryland, Baltimore County who pioneered
modern laughter research, says the truth is sometimes counterintuitive.

In one experiment, Provine and his students listened in on discussions to
find out when people laughed. The big surprise?

"Only 10 to 15 percent of laughter followed something that's remotely
jokey," says Provine, who summarized his findings in his book Laughter: A
Scientific Investigation.

The one-liners that elicited the most laughter were phrases such as "I see
your point" or "I think I'm done" or "I'll see you guys later." Provine
argues that laughter is an unconscious reaction that has more to do with
smoothing relationships than with stand-up comedy.

Provine recorded 51 samples of natural laughter and studied them with a
sound spectrograph. He found that a typical laugh is composed of expelled
breaths chopped into short, vowel-like "laugh notes": ha, ho and he.

Each laugh note lasted about one-fifteenth of a second, and the notes were
spaced one-fifth of a second apart.
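Provine's timing figures translate directly into a pulse schedule. This sketch (my own illustration, not code from his study) places one-fifteenth-second notes with onsets one-fifth of a second apart:

```python
def laugh_note_schedule(n_notes, note_len=1 / 15, spacing=1 / 5):
    """Return (start, end) times in seconds for a run of laugh notes:
    each "ha" lasts about 1/15 s, with onsets about 1/5 s apart,
    leaving a short silence between notes."""
    return [(i * spacing, i * spacing + note_len) for i in range(n_notes)]

schedule = laugh_note_schedule(4)  # four "ha" notes: ha-ha-ha-ha
```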

In 2001, psychologists Jo-Anne Bachorowski of Vanderbilt University and
Michael Owren of Cornell found more surprises when they recorded 1,024
laughter episodes from college students watching the films Monty Python and
the Holy Grail and When Harry Met Sally.

Men tended to grunt and snort, while women generated more songlike laughter.
When some subjects cracked up, they hit pitches in excess of 1,000 hertz,
roughly high C for a soprano. And those were just the men.

Even if scientists can make machines laugh, the larger question is how will
humans react to machines capable of mirth and other emotions?

"Laughter is such a powerful signal that you need to be cautious about its
use," says Provine. "It's fun to laugh with your friends, but I don't think
I'd like to have a machine laughing at me."


To hear clips of synthesized laughter and speech, visit

The first computer speech synthesizer was created in the late 1960s by
Japanese researchers. AT&T wasn't far behind. To hear how the technology
sounded in its infancy, visit

Today's most natural sounding speech synthesizers are created using a
technique called "concatenative synthesis," which starts with a prerecorded
human voice that is chopped up into short segments and reassembled to form
speech. To hear an example of what today's speech synthesizers can do, all
you need to do is dial 411. Or visit this AT&T demo for its commercial
speech synthesizer: http://www.naturalvoices.com/demos/

Many researchers are now working on the next wave of voice technology,
called expressive speech synthesis. Their goal: to make machines that can
sound emotional. In the coming months, IBM will roll a new expressive speech
technology. To hear an early demo, visit http://www.research.ibm.com/tts/

For general information on speech synthesis research, visit

Copyright © 2004, The Baltimore Sun


