That's exactly what I've observed listening to computerized speech. A lot
of the time, the natural stuff sounds like bits joined together no matter
what. As long as there are blind people using screen readers, there will
probably be a use for the artificially generated stuff somewhere. It would
be too hard to put natural speech into a small box like a DECtalk.
----- Original Message -----
From: "Tim Cross" <tcross@xxxxxxxxxxxxxxx>
Sent: Wednesday, February 15, 2006 2:25 AM
Subject: [bct] Speech Synthesis types
As there has been some discussion of the various qualities of speech synthesis, I thought some might appreciate a brief, high-level outline of how the different speech synthesis techniques work, why some are better at handling high speech rates, and why some sound more "natural".
There are essentially two techniques for generating synthetic speech: formant-based and concatenative. Formant-based synthesis is the older technique and is what you find in synthesizers like the DECtalk Express. Concatenative synthesis is the more modern approach and tends to be what is used in synthesizers with the more "natural" sounding speech.
The formant-based technique is a fully artificial speech generation
approach. A mathematical model of speech is used to generate the
sounds from nothing. To synthesize a vowel, you produce a periodic
excitation waveform at the appropriate fundamental frequency, as the
vocal cords would, and filter it with a dynamic filter that
emulates the resonances of the vocal tract and articulatory organs
(nasal cavity, mouth, lips etc.). Formants are the peaks in the filter
response curve. This type of speech generation tends to be compact,
robust and able to handle faster speaking rates without degradation in
how the words are pronounced. The formant-based model is, by necessity,
a simplification of what really goes on, and it produces speech that's
way too "regular" to be human. What you hear was never spoken by a
human speaker.
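
For anyone who likes to see things in code, here is a minimal Python sketch of the excitation-plus-filter idea described above. It is only an illustration, not how the DECtalk or any real synthesizer is implemented. The function names are made up for this example, the excitation is a bare impulse train rather than a realistic glottal pulse, and the formant frequencies and bandwidths for the "ah" vowel are rough illustrative values.

```python
import math

def impulse_train(f0_hz, duration_s, sample_rate):
    """Periodic excitation: one pulse per pitch period (a crude stand-in for the vocal cords)."""
    n = round(duration_s * sample_rate)
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonant filter that puts one peak (formant) at freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    c = -r * r
    a = 1.0 - b - c  # scales the filter so its gain at 0 Hz is 1
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(f0_hz, formants, duration_s=0.3, sample_rate=16000):
    """Pass the excitation through one resonator per formant, then normalise the peak to 1."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]

# Assumed (frequency, bandwidth) pairs in Hz, roughly an "ah" sound.
AH_FORMANTS = [(700, 80), (1200, 90), (2600, 120)]
samples = synthesize_vowel(120, AH_FORMANTS)
```

Nothing in this sketch depends on a recording, which is why speeding it up costs nothing in quality: a faster rate simply means generating fewer samples per sound. It also shows why the result sounds too regular, since every pitch period is identical.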
Concatenative speech is actually created from real speech. Essentially, human speakers are recorded for hours speaking specially designed scripts. The recorded voice is then sliced up, down to sub-phonetic elements, and stored in a database. At speech generation time, the database is searched for the most appropriate sounds, which are joined together to produce the final output. The quality depends a lot on how well the algorithms identify the most appropriate sounds and on how closely the synthesized speech rate matches the rate at which the original speaker was recorded. The final signal has to be post-processed with some signal processing to smooth out the waveform, hide the join points, and add some distortion to get the prosody correct. Once you increase the speech rate above that of the original recording, you lose quality, and this is why concatenative speech synthesizers don't handle faster speaking rates very well. If the synthesized speech rate is close to the original recording rate, the synthesized speech can sound remarkably human-like.
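
The search-and-join step described above can also be sketched in a few lines of Python. Again, this is a toy under stated assumptions rather than any product's algorithm: the database layout, the function names, the cost weighting and the greedy one-unit-at-a-time search are all invented for illustration, and real systems search over whole sequences and use far richer costs.

```python
def crossfade_join(left, right, overlap):
    """Join two sample lists, blending `overlap` samples to hide the join point."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return left + right
    start = len(left) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight shifts from the left unit to the right unit
        blended.append((1.0 - w) * left[start + i] + w * right[i])
    return left[:start] + blended + right[overlap:]

def select_unit(candidates, target_pitch, previous_tail):
    """Pick the recorded unit with the lowest combined target and join cost."""
    def cost(unit):
        target_cost = abs(unit["pitch"] - target_pitch)
        if previous_tail is None:
            join_cost = 0.0
        else:
            join_cost = abs(unit["samples"][0] - previous_tail)
        return target_cost + 100.0 * join_cost  # weighting is an arbitrary assumption
    return min(candidates, key=cost)

def synthesize(labels, database, target_pitch, overlap=2):
    """Greedily choose one unit per label and crossfade the units together."""
    output = []
    for label in labels:
        tail = output[-1] if output else None
        unit = select_unit(database[label], target_pitch, tail)
        if output:
            output = crossfade_join(output, unit["samples"], overlap)
        else:
            output = list(unit["samples"])
    return output

# Toy database: each label maps to a few "recorded" candidates.
toy_database = {
    "h": [{"pitch": 100, "samples": [0.0, 0.1, 0.2, 0.3]}],
    "i": [{"pitch": 100, "samples": [0.3, 0.2, 0.1, 0.0]},
          {"pitch": 150, "samples": [-0.5, -0.4, -0.3, -0.2]}],
}
speech = synthesize(["h", "i"], toy_database, target_pitch=100)
```

The units here are fixed recordings, so the only way to speak faster is to shorten or resample them after the fact, which is where the loss of quality at high rates comes from.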
The current marketplace is very much after more natural sounding synthetic speech, because most of the applications it is being used for (automated phone systems, auditory feedback for consumer products like GPS, lifts which tell you what floor you're on, vending machines etc.) want natural sounding voices. As most of the speaking these systems do is short, well-defined phrases, speed is less critical than sounding natural. This could pose a problem for users like us who want to listen to large amounts of synthesized speech, but not necessarily at a slow (normal) speaking rate. It could mean that in a few years' time, it will be difficult or expensive to obtain good quality formant-based synthesizers, and we will either be forced to learn how to listen to speech which has poor phrasing at high rates or get used to taking much longer to get through the material we can process currently.