[bct] Re: Speech Synthesis types

  • From: "Neal Ewers" <neal.ewers@xxxxxxxxxxxxxx>
  • To: <blindcooltech@xxxxxxxxxxxxx>
  • Date: Wed, 15 Feb 2006 09:29:28 -0600

Tim, thanks for an excellent explanation of complicated stuff in a way
that is easy to understand.  This makes me think that someone who has
access to a lot of different speech synthesizers should do a podcast.
It may not be designed to show people which is the best for them,
because that is a personal decision, but it would be interesting to have
them all lined up and saying the same thing.  We have a lot of them at
Trace so perhaps I will see if I can get their help in lining up a few.
Perhaps another way to go is to have people send me, or someone else,
snippets of the synthesizers they use reading a paragraph we come up
with as a template.  Of course, then we would have to worry a bit about
the different sampling rates people might use on their recordings, but
that may not be as important as hearing the different synthesizers
themselves.  What do people think?


-----Original Message-----
From: blindcooltech-bounce@xxxxxxxxxxxxx
[mailto:blindcooltech-bounce@xxxxxxxxxxxxx] On Behalf Of Tim Cross
Sent: Wednesday, February 15, 2006 2:25 AM
To: blindcooltech@xxxxxxxxxxxxx
Subject: [bct] Speech Synthesis types

Hi All,

As there has been some discussion on the various qualities of speech
synthesis, I thought some might appreciate a brief, high-level outline
of how the different speech synthesis techniques work, why some are
better at handling high speech rates, and why some sound more
"natural".

There are essentially two different techniques to generate synthetic
speech, formant-based and concatenative. The older technique is
formant-based and is what you find in synthesizers like the DECtalk
Express. Concatenative speech is the more modern approach and tends to
be what is used in synthesizers with the more "natural" speech. 

The formant-based technique is a fully artificial speech generation
approach: a mathematical model of speech is used to generate the sounds
from nothing. To synthesize a vowel, the synthesizer produces a periodic
excitation waveform at the appropriate fundamental frequency, much as
the vocal cords would, and filters it with a dynamic filter that
emulates the resonances of the vocal tract and articulatory organs
(nasal cavity, mouth, lips, etc.). Formants are the peaks in the filter
response curve. This type of speech generation tends to be compact,
robust and able to handle faster speaking rates without degradation in
how the words are announced. The formant-based model is, by necessity, a
simplification of what really goes on, and produces speech that's way
too "regular" to be human.  What you hear was never spoken by a human
speaker.
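
To make the source-filter idea concrete, here is a minimal sketch in
Python of formant-style synthesis of a sustained vowel. This is an
illustration, not how any particular product does it: the sample rate,
fundamental frequency, and the formant frequencies and bandwidths below
are assumed values for a rough /a/-like vowel, and it relies on the
numpy and scipy libraries.

import numpy as np
from scipy.io import wavfile
from scipy.signal import lfilter

fs = 16000    # sample rate in Hz
f0 = 120      # fundamental frequency, the "vocal cords"
n = fs * 1    # one second of audio

# Excitation: a periodic impulse train at the fundamental frequency.
excitation = np.zeros(n)
excitation[::int(fs / f0)] = 1.0

# Vocal tract model: a cascade of two-pole resonators, one per formant.
# Each pair is (formant frequency in Hz, bandwidth in Hz) -- assumed
# values, roughly an /a/ vowel.
formants = [(700, 130), (1220, 70), (2600, 160)]
signal = excitation
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)                # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs               # pole angle from frequency
    a = [1.0, -2.0 * r * np.cos(theta), r * r]  # resonator denominator
    b = [1.0 - r]                               # rough gain normalisation
    signal = lfilter(b, a, signal)              # apply the resonance

signal /= np.abs(signal).max()                  # normalise to [-1, 1]
wavfile.write("vowel.wav", fs, (signal * 32767).astype(np.int16))

Raising the speaking rate in a model like this just means moving the
model's parameters faster, which is part of why formant output stays
crisp at high rates.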

Concatenative speech is actually created from real speech. Essentially,
human speakers are recorded for hours speaking specially designed
scripts. The recorded voice is then sliced up into sub-phonetic
elements and stuck in a database. At speech generation time, the
database is searched for the most appropriate sounds, which are joined
together to produce the final output. The quality depends a lot on how
well the algorithms identify the most appropriate sounds and how closely
the synthesized speech rate matches the speech rate of the original
speaker when the initial recordings were produced. The final signal has
to be post-processed with some signal processing to smooth out the
waveform to hide the join points and to add some distortion to get the
prosody correct. Once you increase the speech rate above that of the
original recording, you lose quality, and this is why concatenative
speech synthesizers don't handle faster speaking rates very well. If the
synthesized speech rate is close to the original recording rate, the
synthesized speech can sound remarkably human-like.
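
As a toy illustration of the select-and-join idea, here is a short
Python sketch. Real engines work with sub-phonetic units, score many
candidate units with target and join costs, and use far larger
databases; here unit_db is an assumed dictionary mapping a phone label
to a list of recorded snippets (numpy arrays), we naively take the
first match, and we smooth each join point with a short crossfade.

import numpy as np

def synthesize(phone_sequence, unit_db, fade=64):
    """Join one recorded unit per phone, crossfading at each join.
    Assumes every unit is longer than fade samples."""
    out = np.zeros(0)
    for phone in phone_sequence:
        # "Search the database": take the first matching unit here;
        # a real engine would score many candidates.
        unit = unit_db[phone][0]
        if out.size == 0:
            out = unit.copy()
        else:
            # Hide the join point by overlapping and crossfading.
            ramp = np.linspace(0.0, 1.0, fade)
            out[-fade:] = out[-fade:] * (1 - ramp) + unit[:fade] * ramp
            out = np.concatenate([out, unit[fade:]])
    return out

Notice there is no way to speed this output up except by resampling or
time-stretching the recorded units themselves, which is exactly where
the quality loss at high rates comes from.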

The current marketplace is very much after more natural sounding
synthetic speech, because most of the applications it is being used for
(automated phone systems, auditory feedback for consumer products like
GPS, lifts which tell you what floor you're on, vending machines, etc.)
want natural sounding voices. As most of the speaking these systems do
is short, well defined phrases, speed is less critical than sounding
natural. This could pose a problem for users like us who want to listen
to large amounts of synthesized speech, but not necessarily at a slow
(normal) speaking rate. It could mean that in a few years' time, it will
be difficult or expensive to obtain good quality formant-based
synthesizers, and we will either be forced to learn how to listen to
speech which has poor phrasing at high rates or get used to having to
take much longer to get through the material we can process currently.

