[bct] Speech Synthesis types

  • From: Tim Cross <tcross@xxxxxxxxxxxxxxx>
  • To: blindcooltech@xxxxxxxxxxxxx
  • Date: Wed, 15 Feb 2006 19:25:14 +1100

Hi All,

As there has been some discussion on the various qualities of speech
synthesis, I thought some might appreciate a brief, high-level outline
of how the different speech synthesis techniques work, and why some
handle high speech rates better while others sound more "natural".

There are essentially two different techniques for generating synthetic
speech: formant-based and concatenative. Formant-based synthesis is the
older technique and is what you find in synthesizers like the DECtalk
Express. Concatenative synthesis is the more modern approach and tends
to be what is used in synthesizers with more "natural" speech.

The formant-based technique is fully artificial speech generation: a
mathematical model of speech is used to generate the sounds from
nothing. To synthesize a vowel, a periodic excitation waveform of the
appropriate fundamental frequency is produced, as would be done by the
vocal cords, and filtered with a dynamic filter that emulates the
resonances of the vocal tract and articulatory organs (nasal cavity,
mouth, lips etc.). Formants are the peaks in the filter's response
curve. This type of speech generation tends to be compact and robust,
and it handles faster speaking rates without degradation in how the
words are pronounced. The formant-based model is, by necessity, a
simplification of what really goes on, and it produces speech that is
far too "regular" to be human. What you hear was never spoken by a
human speaker.
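To make the idea concrete, here is a minimal sketch in Python (purely
illustrative, not how any real synthesizer is coded): an impulse train
at the fundamental frequency stands in for the vocal cords, and a
cascade of two-pole resonators, one per formant, stands in for the
vocal-tract filter. The formant frequencies and bandwidth below are
assumed example values, roughly those often quoted for an /a/ vowel.

```python
import math

def formant_synth(f0=120.0, formants=(730.0, 1090.0, 2440.0),
                  sample_rate=16000, duration=0.5):
    """Toy formant synthesis: periodic excitation filtered by resonators."""
    n = int(sample_rate * duration)
    # Excitation: impulse train at the fundamental frequency (vocal cords).
    period = int(sample_rate / f0)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Cascade of two-pole resonators, one centred on each formant peak.
    for freq in formants:
        bw = 80.0  # assumed formant bandwidth in Hz
        r = math.exp(-math.pi * bw / sample_rate)        # pole radius
        a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / sample_rate)
        a2 = -r * r
        gain = 1.0 - r                                   # rough level normalisation
        out = [0.0] * n
        for i in range(n):
            y = gain * signal[i]
            if i >= 1:
                y += a1 * out[i - 1]
            if i >= 2:
                y += a2 * out[i - 2]
            out[i] = y
        signal = out
    return signal
```

Because every sample is computed from the model, raising the speaking
rate just means changing the model's timing parameters; nothing is
stretched or resampled, which is why articulation survives high rates.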

Concatenative speech is actually created from real speech.
Essentially, human speakers are recorded for hours reading specially
designed scripts. The recorded voice is then sliced up into
sub-phonetic elements and stored in a database. At speech generation
time, the database is searched for the most appropriate sounds, which
are joined together to produce the final output. The quality depends a
lot on how well the algorithms identify the most appropriate sounds,
and on how closely the synthesized speech rate matches the rate of the
original speaker when the initial recordings were made. The final
signal has to be post-processed with some signal processing to smooth
out the waveform, hide the join points, and add some distortion to get
the prosody correct. Once you increase the speech rate above that of
the original recording, you lose quality, which is why concatenative
speech synthesizers don't handle faster speaking rates very well. If
the synthesized speech rate is close to the original recording rate,
the synthesized speech can sound remarkably human.
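As a toy illustration of the selection-and-join step (Python, with a
tiny hand-built "database" and made-up costs; real systems use far
richer features), each target unit is scored by a target cost (pitch
mismatch with what we want) plus a join cost (pitch discontinuity with
the previously chosen unit), and the winning waveform fragments are
crossfaded at the joins to hide them:

```python
def select_units(targets, database):
    """Greedily pick, for each target (phone, desired_pitch), the database
    candidate minimising target cost + join cost with the previous pick."""
    chosen = []
    prev_pitch = None
    for phone, want in targets:
        best = min(
            database[phone],
            key=lambda unit: abs(unit[0] - want)              # target cost
            + (abs(unit[0] - prev_pitch) if prev_pitch is not None else 0.0),
        )
        chosen.append(best)
        prev_pitch = best[0]
    return chosen

def concatenate(units, overlap=4):
    """Join the selected units, crossfading `overlap` samples at each seam."""
    out = list(units[0][1])
    for _, samples in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)
            out[-overlap + i] = out[-overlap + i] * (1.0 - w) + samples[i] * w
        out.extend(samples[overlap:])
    return out
```

The crossfade is the "smoothing at the join points" mentioned above in
miniature; note that nothing here can speed the speech up, short of
resampling the stored fragments, which is exactly where the quality
loss at high rates comes from.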

The current marketplace very much wants more natural-sounding
synthetic speech, because most of the applications it is being used
for (automated phone systems, auditory feedback for consumer products
like GPS units, lifts which tell you what floor you're on, vending
machines etc.) call for natural-sounding voices. As most of the
speaking these systems do consists of short, well-defined phrases,
speed is less critical than sounding natural. This could pose a
problem for users like us who want to listen to large amounts of
synthesized speech, but not necessarily at a slow (normal) speaking
rate. It could mean that in a few years' time it will be difficult or
expensive to obtain good quality formant-based synthesizers, and we
will either be forced to learn how to listen to speech which has poor
phrasing at high rates, or get used to taking much longer to get
through the material we can process currently.
