[bct] Re: Speech Synthesis types

  • From: "Neal Ewers" <neal.ewers@xxxxxxxxxxxxxx>
  • To: <blindcooltech@xxxxxxxxxxxxx>
  • Date: Wed, 15 Feb 2006 10:54:04 -0600

Jake, my guess is that there has been a lot of research into what kinds
of voices people would like to hear.  After all, these voices will be
used on everything from telephone answering services to product sales
and promotions.  You can be sure that advertising companies have some
ideas for what kinds of voices they would like to hear.  As someone who
does voice overs, narrations, and commercials, I know that people have
particular voices in mind for almost everything they want to advertise.
So, next time you are on the bus and happen to sit down next to
Crystal...  Boy! Wouldn't that be a trip?


-----Original Message-----
From: blindcooltech-bounce@xxxxxxxxxxxxx
[mailto:blindcooltech-bounce@xxxxxxxxxxxxx] On Behalf Of Jake Joehl
Sent: Wednesday, February 15, 2006 10:35 AM
To: blindcooltech@xxxxxxxxxxxxx
Subject: [bct] Re: Speech Synthesis types

Hi Tim. This is interesting. I'd be very curious to find out who these 
speakers actually are. I wonder, for example, if companies just assign
duties to different employees or what.
----- Original Message ----- 
From: "Tim Cross" <tcross@xxxxxxxxxxxxxxx>
To: <blindcooltech@xxxxxxxxxxxxx>
Sent: Wednesday, 15 February, 2006 2:25 AM
Subject: [bct] Speech Synthesis types

> Hi All,
> As there has been some discussion on the various qualities of speech
> synthesis, I thought some might appreciate a brief, high-level outline
> of the differences between how different speech synthesis techniques
> work and why some are better at handling high speech rates and some
> sound more "natural".
> There are essentially two different techniques to generate synthetic
> speech: formant-based and concatenative. The older technique is
> formant-based and is what you find in synthesizers like the DECtalk
> Express. Concatenative speech is the more modern approach and tends to
> be what is used in synthesizers with the more "natural" speech.
> The formant-based technique is a fully artificial speech generation
> approach. A mathematical model of speech is used to generate the
> sounds from nothing. To synthesize a vowel, you produce a periodic
> excitation waveform at the appropriate fundamental frequency, as the
> vocal cords would, and filter it with a dynamic filter that emulates
> the resonances of the vocal tract and articulatory organs (nasal
> cavity, mouth, lips, etc.). Formants are the peaks in the filter
> response curve. This type of speech generation tends to be compact,
> robust and able to handle faster speaking rates without degradation in
> how the words are pronounced. The formant-based model is, by
> necessity, a simplification of what really goes on, and produces
> speech that's way too "regular" to be human. What you hear was never
> spoken by a human speaker.
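
As a rough illustration of the source-filter idea described above, here
is a minimal Python sketch. It is not how the DECtalk or any particular
synthesizer is implemented. It assumes a plain impulse train as the
excitation and a cascade of second-order resonators as the filter, one
per formant. The function names, the bandwidths and the 120 Hz pitch are
made up for the example, and the formant frequencies are only
approximate textbook values for an "ah" vowel.

```python
import math


def impulse_train(f0_hz, duration_s, sample_rate):
    """Periodic excitation: one unit pulse per pitch period."""
    period = int(round(sample_rate / f0_hz))
    n = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator: one peak (formant) in the response."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz=120.0,
                     formants=((730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)),
                     duration_s=0.5,
                     sample_rate=16000):
    """Filter the excitation through one resonator per (frequency, bandwidth).

    The default formant values are assumptions chosen for illustration.
    """
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # scale into the range -1.0 .. 1.0
```

Calling synthesize_vowel() returns a list of samples that could be
written out with Python's standard wave module. Every pitch period comes
out identical, which is one way to hear why this kind of speech sounds
too "regular". Nothing here depends on a recording, so changing the
pitch or the duration costs nothing in quality.
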
> Concatenative speech is actually created from real speech.
> Essentially, human speakers are recorded for hours speaking specially
> designed scripts. The recorded voice is then sliced up, down to
> sub-phonetic elements, and stored in a database. At speech generation
> time, the database is searched for the most appropriate sounds, which
> are joined together to produce the final output. The quality depends a
> lot on how well the algorithms identify the most appropriate sounds
> and how closely the synthesized speech rate matches the speech rate of
> the original speaker when the initial recordings were produced. The
> final signal has to be post-processed with some signal processing to
> smooth out the waveform, to hide the join points, and to add some
> distortion to get the prosody correct. Once you increase the speech
> rate above that of the original recording, you lose quality, and this
> is why concatenative speech synthesizers don't handle faster speaking
> rates very well. If the synthesized speech rate is close to the
> original recording rate, the synthesized speech can sound remarkably
> human-like.
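
The search-and-join step described above can be sketched in a few lines
of Python. This is only a toy under stated assumptions, not the
algorithm of any real product. The unit labels, the pitch-only target
cost, the sample-mismatch join cost, the join_weight value and the tiny
database are all invented for the example. Real systems store hours of
recorded audio and use much richer cost functions.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    label: str       # hypothetical unit name, e.g. a diphone like "h-e"
    pitch_hz: float  # pitch of the speaker when this slice was recorded
    samples: list    # the recorded audio slice (tiny made-up values here)


def target_cost(unit, wanted_pitch_hz):
    """How far the recorded unit is from what we want to say."""
    return abs(unit.pitch_hz - wanted_pitch_hz)


def join_cost(left, right):
    """How badly two units mismatch at the point where they meet."""
    return abs(left.samples[-1] - right.samples[0])


def select_units(database, labels, wanted_pitch_hz, join_weight=100.0):
    """Dynamic-programming search for the cheapest sequence of units."""
    prev = [(target_cost(u, wanted_pitch_hz), [u]) for u in database[labels[0]]]
    for label in labels[1:]:
        current = []
        for u in database[label]:
            # Cheapest way to reach this candidate from any earlier path.
            cost, path = min(
                ((c + join_weight * join_cost(p[-1], u), p) for c, p in prev),
                key=lambda item: item[0],
            )
            current.append((cost + target_cost(u, wanted_pitch_hz), path + [u]))
        prev = current
    return min(prev, key=lambda item: item[0])[1]


def crossfade_concat(units, overlap=2):
    """Join the units, blending `overlap` samples to hide each join point."""
    out = list(units[0].samples)
    for u in units[1:]:
        n = min(overlap, len(out), len(u.samples))
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight ramps toward the incoming unit
            out[-n + i] = (1 - w) * out[-n + i] + w * u.samples[i]
        out.extend(u.samples[n:])
    return out


# Made-up database: two recorded candidates for each unit label.
TOY_DATABASE = {
    "h-e": [Unit("h-e", 110.0, [0.0, 0.2, 0.4, 0.1]),
            Unit("h-e", 125.0, [0.0, 0.3, 0.5, 0.6])],
    "e-l": [Unit("e-l", 120.0, [0.1, 0.0, -0.2, -0.1]),
            Unit("e-l", 118.0, [0.6, 0.4, 0.1, 0.0])],
}
```

For example, select_units(TOY_DATABASE, ["h-e", "e-l"], 120.0) picks the
pair of candidates whose edges line up, even though neither is the
closest pitch match on its own, and crossfade_concat() then blends the
join. The sketch never changes the speed of the recorded slices, which
is exactly the hard part: playing the units faster than they were
spoken means stretching or squeezing real audio, and that is where the
quality goes.
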
> The current marketplace is very much after more natural-sounding
> synthetic speech because most of the applications it is being used for
> (automated phone systems, auditory feedback for consumer products like
> GPS, lifts which tell you what floor you're on, vending machines,
> etc.) want natural-sounding voices. As most of the speaking these
> systems do is short, well-defined phrases, speed is less critical than
> sounding natural. This could pose a problem for users like us who want
> to listen to large amounts of synthesized speech, but not necessarily
> at a slow (normal) speaking rate. It could mean that in a few years'
> time it will be difficult or expensive to obtain good-quality
> formant-based synthesizers, and we will either be forced to learn how
> to listen to speech which has poor phrasing at high rates or get used
> to taking much longer to get through the material we can process
> currently.
> Tim