A system or algorithm that produces a speech output from an orthographic text input. A set of rules is employed to convert the orthographic input into appropriate low-level parameters to drive the synthesizer. Various methods for speech synthesis exist, including formant synthesis and diphone (or concatenative) synthesis.
In formant synthesis, the acoustic resonances of the human vocal tract are modelled using 4-pole band-pass filters whose centre frequencies and bandwidths are varied to reproduce the effects of moving the articulators (jaw, tongue, lips, etc.). A voiced excitation is produced using a quasi-periodic waveform that mimics the acoustic excitation of the vibrating vocal folds, and a voiceless excitation is produced with random noise.
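The source-filter arrangement described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation. It makes several assumptions that are not in the entry: a 16 kHz sample rate, a simple two-pole digital resonator per formant (a simplified stand-in for the 4-pole band-pass filters mentioned above), formants connected in cascade, an impulse train as the quasi-periodic voiced source, and illustrative formant values for an open vowel. All function names are hypothetical.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz; an assumed value for this sketch


def resonator(signal, centre_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    Simplified stand-in for one formant filter; the pole radius is set by
    the bandwidth and the pole angle by the centre frequency.
    """
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * centre_hz * t))
    a = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def voiced_excitation(n_samples, pitch_hz, sample_rate=SAMPLE_RATE):
    """Impulse train at the pitch period: a crude quasi-periodic source."""
    period = round(sample_rate / pitch_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def voiceless_excitation(n_samples, seed=0):
    """Uniform random noise as the voiceless source."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def formant_synthesize(excitation, formants):
    """Pass the excitation through one resonator per (centre, bandwidth) pair."""
    signal = excitation
    for centre_hz, bandwidth_hz in formants:
        signal = resonator(signal, centre_hz, bandwidth_hz)
    return signal


# Illustrative (assumed) formant settings for an open vowel, in Hz.
VOWEL_FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]

vowel = formant_synthesize(voiced_excitation(1600, 120.0), VOWEL_FORMANTS)
hiss = formant_synthesize(voiceless_excitation(1600), [(4000.0, 600.0)])
```

The sketch holds the formant settings fixed for the whole signal. To imitate moving articulators, a fuller version would update the centre frequencies and bandwidths from frame to frame.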
In diphone synthesis, recordings of a human speaker are stored, covering all the possible sound transitions of the language. Each recorded unit spans the link between two adjacent steady-state portions of speech (or ‘phones’), and is therefore known as a ‘diphone’; any spoken message can be created by joining diphones together. This process is also known as concatenative synthesis, since the diphones are concatenated.
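The joining step can be sketched as below, under stated assumptions. The phone labels, the `_` silence marker, the `a-b` naming of diphones, and the optional linear crossfade at each join are all illustrative choices, not part of the entry. The inventory here holds toy sample lists rather than real recordings.

```python
def to_diphones(phones, silence="_"):
    """Turn a phone sequence into the diphone names that link its members.

    The sequence is padded with silence so that the first and last phones
    also get a transition unit.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def concatenate(units, overlap=0):
    """Join sample lists end to end, crossfading `overlap` samples per join."""
    out = []
    for unit in units:
        n = min(overlap, len(out), len(unit))
        base = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[base + i] = (1.0 - w) * out[base + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def synthesize_from_diphones(phones, inventory, overlap=0):
    """Look up each required diphone in the inventory and join the results."""
    return concatenate([inventory[name] for name in to_diphones(phones)], overlap)


# Toy inventory: each entry stands in for a stored recording (hypothetical data).
toy_inventory = {
    "_-k": [0.0, 0.1],
    "k-ae": [0.2, 0.3],
    "ae-t": [0.4, 0.5],
    "t-_": [0.6, 0.0],
}
message = synthesize_from_diphones(["k", "ae", "t"], toy_inventory)
```

With `overlap=0` the units are simply butted together; a small non-zero overlap smooths each join at the cost of shortening the output by that many samples per join.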