KLSYN D.H. Klatt

KLSYN: A FORMANT SYNTHESIZER PROGRAM

The KLSYN speech synthesis program accepts user commands to create parametric data to control a digital speech synthesizer, and it produces an output waveform file with a user-specified name. The synthesizer is the same as the one documented in some detail in Klatt (1980) <1>, except that the voicing source has been augmented to permit a choice between two glottal waveforms. The new voicing source waveform is intended to be more flexible and thus capable of producing more natural changes in voice quality over the duration of a sentence, if controlled properly. The theory of control and the new control parameters are described herein.

------------
<1> "Software for a Cascade/Parallel Formant Synthesizer", J. Acoust. Soc. Am. 67, 971-995.

STARTING KLSYN

The program is started simply by typing:

    klsyn

    KLSYN Version x.x D. H. Klatt Date-created
    Default utterance duration = 500 ms, 5 ms per frame
    therefore, there are 100 frames to be synthesized
    >

The program begins by typing out a line identifying itself. The program then reads in the default synthesizer configuration file "default.kpr". This file specifies the utterance duration and the parameter update interval in ms, among other things, all of which can be changed by user commands. (This version of KLSYN expects this file to be in one of two places: (1) the current working directory, or (2) 'c:\windows'.) Finally, the program types the user prompt symbol ">". The default configuration is listed in Table 1. Default values are assigned to all of the constants and variable parameters of synthesis in this way.

-- How to Begin with a Special Synthesizer Configuration.

Alternatively, you can start up KLSYN with your own previously prepared set of defaults. In particular, a command is available for changing the default values in the file default.kpr (the 'C' command). When a waveform is finally synthesized, the synthesizer configuration (set of default values) is saved in a file with the same first name as the waveform file. For example, if the waveform file were called "baa", then the configuration file would be named "baa.kpr". The next time you want to run the synthesizer using these special defaults as a starting point, simply type:

    klsyn baa

and the configuration file baa.kpr will be read instead of default.kpr. If the file baa.kpr does not exist, the system will complain and abort.

-- Other Optional Arguments.

The program KLSYN can take arguments of two types. The first type is an argument preceded by a minus sign. There are currently two arguments of this type.

'-n' (novice user) indicates that you would like as much help as you can get from the program while using it. This option simply prints more "help" messages to the screen during the course of your KLSYN session.

'-c' (control file) allows you to run KLSYN non-interactively from an external file. When it is specified, KLSYN takes its input from the file whose name immediately follows the option (no spaces should appear between the -c and the filename). The default extension for this file is '.ctl'. If no extension is given, KLSYN looks for a file with the basename that you specified and the '.ctl' extension. If the filename already has an extension, KLSYN leaves the name alone and looks for exactly that file. An example usage of this option looks like this:

    klsyn -cFilename
or
    klsyn baa -cFilename

In the above example, KLSYN will try to get its input from the file named Filename.ctl. If the specified file cannot be found, an error message will be printed to the screen and the program will abort.

[NOTE (JH): THIS -c OPTION DOES NOT SEEM TO WORK. I CHECKED THE SOURCE CODE AND THERE DOES NOT SEEM TO BE ANYTHING THERE THAT LOOKS FOR THIS OPTION. KLSYN CAN BE RUN IN BATCH MODE, THOUGH.
SEE THE examples directory -- 'syndaga.bat' and 'daga01.cmd'-'daga10.cmd'.]

The second type of argument, identified by the absence of a minus sign, is the name of a configuration file, as discussed above.

SYNTHESIS COMMANDS

After startup initialization, the program types the prompt ">", indicating that it is ready to accept user commands from the terminal. The permitted commands, all case insensitive, are listed in Table 2 and described more fully in the following paragraphs.

    COMMAND   ACTION
    H         HELP, list legal KLSYN commands
    C         CHANGE default value for a synthesis parameter
    D         DEPOSIT parameters (save current parameters in file)
    F         FETCH a parameter file from disk
    S         SYNTHESIZE waveform file, save everything
    T         INTERPOLATE parameters
    V         VIEW current parameters
    Q         QUIT program, save nothing

    TABLE 2: KLSYN synthesizer commands acceptable when the prompt ">" appears.

-- 'H' HELP

Typing "H" at the ">" prompt (or any illegal command) will cause KLSYN to print a help menu (Table 2).

-- 'C', CHANGE Parameter Default Value

To change the default value of any synthesis constant or variable parameter, type "C". The system will respond:

    Par:

and you should type the 2-character symbol for the parameter to be changed (or '?' to get a listing of acceptable 2-character names). If a legal 2-character name is typed, the system will then respond:

    Change value of xx from yy to Value:

and you should type the desired value. If your requested value falls outside the range specified by the minimum and maximum values listed in the configuration, the system will ask whether you really want to use this value. Accepted responses are "y" (yes) or "n" (no). Do not respond "y" unless you are reasonably certain that such a request makes logical sense.

Consider an example. To change the constant "duration of the stimulus" to 300 msec, one would type:

    > C
    Par: du
    Change value of du from 500 to Val: 300
    >

The default values for variable parameters are used throughout the synthesis unless a time function is specified for each parameter using the "t" command described below.

-- 'D', DEPOSIT parameters

The 'D' command can be used to save the current synthesis parameters to a file without synthesizing. When this command is used, KLSYN asks for the first name of the ".kpr" file to which the parameters will be written. Only the basename of the ".kpr" file should be entered; KLSYN automatically appends the ".kpr" extension to this name. If you put an extension on this filename yourself, KLSYN will still append the ".kpr" extension, leaving you with a filename with two extensions (filename.ext.kpr).

-- 'F', FETCH a parameter file

The 'F' command is used to retrieve a different parameter file. If you are currently working with one parameter file and wish to change to a different one, the 'F' command allows you to do this without exiting the program and restarting it. When this command is issued, KLSYN asks for the name of the file to retrieve. If a filename with no extension is entered, KLSYN will append a ".kpr" extension to it and look for a file with the resulting name. If an extension is given, KLSYN does nothing to it and looks for a file bearing exactly the name that was entered. If the file specified cannot be found, an error message will be printed to the screen and the program will abort.

WARNING: The current parameter file at the time that the 'F' command is issued is NOT saved. Therefore, if you wish to save any changes that you might have made to this file, you must deposit them ('D' command) before you fetch another file.

-- 'S', SYNTHESIZE waveform

When all of the variable parameters have been given appropriate default values or appropriate time functions, a waveform can be synthesized using the "s" command. The first name of the waveform file is requested upon entering this command.
From this name, two filenames are formed. The first is formed by appending a ".b" extension to the given name, and the second by appending a ".kpr" extension. The ".b" file stores the actual waveform that is synthesized, while the ".kpr" file stores the parameter configuration that was used for the synthesis. This ".kpr" file can then be used for future KLSYN activity (it can either be specified on the command line or fetched in). Any extension given with the requested filename is simply stripped off to form the base name to which the ".b" and ".kpr" extensions are appended.

The peak output level is printed at the end of the synthesis in dB, where any number greater than zero dB indicates that the signal has exceeded the bits available to the digital-to-analog converter, and will have to be synthesized again with source amplitudes reduced (see the 'g0' configuration parameter in Table 1). On the other hand, any level below about -12 dB will not use the two highest order bits of the d/a converter, and might be profitably resynthesized at a higher level, if this is consistent with the experiment (i.e. the other stimuli in the set would have to be raised in level as well).

A new auto-scaling feature (sc) has been added to overcome problems with the peak output level. When this switch is set (sc = 1), the entire signal is scaled such that the peak output level is always zero dB. (The maximum value is tracked and, after synthesis is complete, the signal is scaled to maximum amplitude (32767 on a 16 bit machine) and written to the output file. This can be turned off by changing the value of 'sc' to zero.) It should be noted, however, that the peak output level that is reported at the end of the synthesis will not reflect this scaling. This reported level is the peak prior to the scaling.
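
The relation between peak level and converter bits described above can be checked with a short Python sketch. It is an illustration only, not KLSYN's code: the function names are hypothetical, and it assumes that 0 dB corresponds to the 16-bit full-scale value 32767 quoted in the text.

```python
import math

FULL_SCALE = 32767  # 16-bit maximum quoted in the text; taking it as 0 dB is an assumption


def peak_level_db(samples):
    """Peak level in dB relative to full scale (0 dB = FULL_SCALE)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")  # silence has no finite level
    return 20.0 * math.log10(peak / FULL_SCALE)


def autoscale(samples):
    """Scale the whole signal so that its peak lands on FULL_SCALE,
    in the spirit of the 'sc' = 1 behaviour described above."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    gain = FULL_SCALE / peak
    return [int(round(s * gain)) for s in samples]


# A peak of 8191 is one quarter of full scale, i.e. two bits unused.
print(round(peak_level_db([0, 8191, -4000]), 2))
print(max(abs(s) for s in autoscale([0, 1000, -500])))
```

Each unused high-order bit costs about 6 dB, which is why a peak near -12 dB leaves the two highest-order bits idle.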

Following synthesis of an important waveform, it is good practice to obtain a listing of the default parameter values and the variable parameter data for future reference.

-- 'T', INTERPOLATE parameters

All variable parameters (those not followed by a "C" in the "V/C" column of Table 1) can be varied by specifying values for each time increment (5 msec default update interval), using the "T" command. The system responds with:

    > t
    Par:

and you should type the appropriate 2-character symbol. The system then asks for a time in msec to begin the interpolation. For example, the following dialog sets the first formant to values appropriate for the syllable [ba]:

    > t
    Par: F1
    Time: 0
    Val: 180
    Time: 100        (instant of [b] release)
    Val: 180
    Time: 105
    Val: 400
    Time: 142
    Val: 750
    Time: 495
    Val: 750
    Time:
    Par:
    >

The resulting parameter time function is drawn in Figure 1. Several aspects of this dialog require explanation.

- Requested times are rounded down to the nearest multiple of the update interval. In this example, the "time=142" request was rounded down (silently) to 140 ms.

- Values exceeding suggested limits are detected and the user is asked if this is really what was intended.

- When the prompt is "Time:", typing a carriage return signals the end of the interpolation for that parameter.

- One need not specify values for all times; the default value is used for time frames not covered by the specified range of interpolation.

- The last time typed was 495 rather than the default utterance duration of 500 ms, because the last frame of parameter data starts at time 495 (the first frame starts at t=0). The program will also accept '500' as a legal time, and round it down to the logically correct '495' value. Times in excess of 'du' will not be accepted.

- When "Par:" is the prompt, typing a carriage return indicates that no more variable parameters are to be modified at the moment. The computer types '>' and waits for a new command.
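
The rounding and interpolation rules listed above can be sketched in Python. This is an illustrative reconstruction, not KLSYN's own code: the function name build_track, the placeholder default value of 500 in the usage lines, and the decision to keep interpolated values as floats are all assumptions.

```python
def build_track(breakpoints, default, du=500, ui=5):
    """Return one parameter value per frame from (time_ms, value) breakpoints.

    Times are rounded down to a multiple of 'ui'; a time equal to 'du' is
    mapped to the last frame; frames outside the specified range keep
    'default'. Consecutive breakpoints are joined by straight lines and may
    run forward or backward in time.
    """
    n_frames = du // ui
    track = [float(default)] * n_frames
    points = []
    for t_ms, value in breakpoints:
        if t_ms < 0 or t_ms > du:
            raise ValueError("time outside 0..du is not accepted")
        frame = min(t_ms // ui, n_frames - 1)  # round down, clamp 'du' itself
        points.append((frame, float(value)))
    if len(points) == 1:
        track[points[0][0]] = points[0][1]
    for (fa, va), (fb, vb) in zip(points, points[1:]):
        if fa == fb:
            track[fb] = vb
            continue
        step = 1 if fb > fa else -1
        for f in range(fa, fb + step, step):
            track[f] = va + (vb - va) * (f - fa) / (fb - fa)
    return track


# The [ba] dialog above; 500 is a placeholder default, not the Table 1 value.
f1 = build_track([(0, 180), (100, 180), (105, 400), (142, 750), (495, 750)],
                 default=500)
print(len(f1), f1[20], f1[21], f1[28], f1[99])
```

Frame 28 corresponds to 140 ms, which is where the request for 142 ms lands after rounding down.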

Interpolation can be specified forward in time, as in the above example, or backward in time, as in the example below, which modifies the time function for F1 according to the dashed line in Figure 1:

    > t
    Par: F1
    Time: 160
    Value: 750
    Time: 105
    Value: 400
    Time:
    Par:

-- 'V', VIEW parameters

The 'V' command is used to view the synthesis parameters. When this command is issued, KLSYN asks for a starting time from which to start viewing the parameters. If a 'c' (for constant) is entered, the default values for the synthesis parameters will be displayed. If a time is entered (in msec), only those parameters which have been varied over time will be displayed, starting from the time that was entered. If a carriage return is entered, the varied parameters will be displayed starting from time zero. The display of the varied parameters has a built-in paging feature which shows the parameters a page at a time. At the end of each page, KLSYN will ask if it should continue and display the next page. Anything other than a 'y' response to this inquiry will cause KLSYN to leave the view mode.

PARAMETERS AND CONSTANTS

A list of the constants and variable parameter time functions that control the synthesizer is shown in Table 1. The two-character symbols stand for full names given in column 2. Each control variable has been assigned a default value, which is indicated in column 3. It will be used during synthesis unless changed by the user. Parameters that must remain constant throughout an utterance are indicated by a 'C' in column 4; all others may be varied using the 't' interpolate command. Minimum and maximum values are also indicated in columns 5 and 6. These are "soft" limits that can be overridden; they suggest the normal range of variation, and help detect typing errors.

The following paragraphs define each of the constants and variable parameters of Table 1.
Though nominally variable, these parameters take on the default value for all time unless the user employs the 't' interpolate command to specify a parameter time function in the form of a sequence of straight-line segments.

'sr' (Constant)

The constant 'sr', "sampling rate", is the number of output samples computed per second of synthetic speech. It is suggested that the default value of 10,000 samples/sec not be changed unless the user understands the digital signal processing implications of such a change (for example, if only 'sr' is increased, the spectrum of the synthetic speech will tilt down). However, if a sampling rate of 16,000 samples/sec is desired, one can change 'nf', the number of formants in the cascade branch, to 8 and obtain synthesis that is nearly identical below 5 kHz to that generated at 10,000 samples/sec (see description below of parameter 'nf').

'du' (Constant)

The constant 'du', "duration", of the utterance to be synthesized, is the number of msec from beginning to end of the current synthetic utterance, including at least 25 msec at the end to allow the waveform to decay naturally after you have turned off all the sound sources. The current maximum value for 'du' is 1000 (one second). (Actually, the maximum utterance duration is 200 frames times 'ui'.) The specified value for 'du' will be rounded up to the nearest multiple of 'ui', the number of msec in a parameter update time interval.

'ui' (Constant)

The constant 'ui', "update interval", is the number of msec of waveform generated between times when parameter values are updated. The default value of 5 ms is frequent enough to mimic most rapid parameter changes that occur in speech (in fact, 10 ms updates may be often enough). Under special circumstances, a shorter update interval, e.g. 1 ms, might be desirable, but note the qualification given in the next paragraph.

Parameters involved in generating the glottal source waveform ('f0' 'av' 'no' 'tl' 'sk') are not changed at the exact time specified by the update interval. Instead, their change in value is delayed to the next waveform sample at which glottal opening occurs. For low values of fundamental frequency, this delay may be as much as 10 ms (the average delay is 5 ms when 'f0' is 100 Hz, and 2.5 ms when f0=200 Hz). If this were not done, it would be as if spurious excitation occurred at the update rate, resulting in perceptible auditory distortion <2>. Delaying changes to the voicing source control parameters in order to synchronize them with the time of primary excitation of the vocal tract both removes the update interval periodicity of the distortions, and better hides them under the signal.

------------
<2> The fact that formant frequencies and bandwidths change at the update time means that small waveform distortions synchronized to the update rate are unavoidable.

'nf' (Constant)

The constant 'nf', "number of formants in cascade vocal tract", specifies how many formants, counting from F1 up to a maximum of F8, are actually in the cascade vocal tract. The default value is 5, which is an appropriate number if the sampling rate is 10,000 samples/sec and the speaker has a vocal tract length of 17 cm (i.e. the average spacing between formants will then be 1000 Hz). If the speaker that you are trying to model has a vocal tract length significantly different from 17 cm, or if the 'sr' sampling rate parameter has been changed, you may wish to modify 'nf'. For example, to model a typical female voice with a vocal tract length about 20% shorter than the average male, one would set 'nf' to four. If the sampling rate is changed to 16,000 samples/sec, then a male voice should have 8 formants in the frequency range from 0 to 8 kHz, and thus 'nf' should be set to 8.
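
The 'nf' values quoted above follow from simple arithmetic, sketched here in Python. The sketch assumes a uniform-tube model in which the average formant spacing is c/(2L), with a speed of sound of 34,000 cm/sec (the value implied by a 1000 Hz spacing for a 17 cm tract); the function name is hypothetical and nothing here is taken from KLSYN's source.

```python
SPEED_OF_SOUND_CM_PER_SEC = 34000.0  # assumed; gives 1000 Hz spacing at 17 cm


def formants_below_nyquist(sr, tract_length_cm):
    """Estimate how many formants fall between 0 Hz and sr/2.

    Average formant spacing is taken as c / (2 * L), a uniform-tube
    approximation. The result is a float; round it to choose 'nf'.
    """
    spacing_hz = SPEED_OF_SOUND_CM_PER_SEC / (2.0 * tract_length_cm)
    return (sr / 2.0) / spacing_hz


print(formants_below_nyquist(10000, 17.0))        # default male voice
print(formants_below_nyquist(16000, 17.0))        # higher sampling rate
print(formants_below_nyquist(10000, 0.8 * 17.0))  # tract 20% shorter
```

The three calls correspond to the three cases in the text (nf = 5, 8, and 4). Lengths that give a result between two integers show why 'nf' is only a crude control of vocal tract length.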

Only the lower 6 formant frequencies and bandwidths are settable by the user; the frequency and bandwidth of the seventh and eighth formants are fixed at F7=6500, B7=500, F8=7500, B8=600. The parallel vocal tract has only 6 formants, so that one would have to move F6 up in frequency to generate noise spectra with peaks above the default value of F6=4990 Hz when 'sr' is increased.

It should be clear that 'nf' only crudely approximates variations in vocal tract length. If, for example, a speaker had a vocal tract length 10% shorter than the typical male, one would have to use five formants in the cascade branch, setting the higher formants appropriately higher in frequency, and then use the 'tl' tilt parameter to achieve the correct general spectral tilt for this voice.

'ss' (Constant)

The constant 'ss', "source switch", is a switch that determines which of two voicing source waveforms is used for synthesis. The default value, 1, causes a low-pass filtered impulse train to be generated, while the value 2 causes a more natural waveform with a definite sharp closing time to be invoked. Each has its own set of advantages and disadvantages.

Impulse Train. A train of impulses is filtered by a critically damped second-order low-pass digital filter, resulting in an approximation to the glottal waveform such as is shown in Figure 2. The spectrum falls off at -12 dB per octave for low and mid frequencies and then flattens out <3>. The primary advantage of the filtered impulse train is that the source spectrum is perfectly regular, with no 'glottal zeros'. The 2-pole low-pass filter has a nominal cutoff frequency of zero Hz, and a bandwidth (which determines the width of the open portion before the time waveform asymptotically approaches zero) that is proportional to the synthesis parameter 'no', the nominal number of samples in the open portion of the waveform. The spectrum of this source can be tilted down to simulate a mode of vibration where the vocal folds do not meet at the midline, using the 'tl' tilt-of-the-glottal-source parameter described below.

------------
<3> Above 4 kHz, harmonics are further attenuated by a down-sampling low-pass filter, but this should have little effect on the perceived quality of a vowel.

The disadvantage of this waveform is that primary excitation of the vocal tract occurs at glottal opening time, and there is no excitation at glottal closing time. Thus the phase of the source is incorrect <4>, even though the source magnitude spectrum is probably to be preferred for its regularity, at least in some psychophysical tests.

Natural Pulse Train. The advantages of the natural glottal source are that the glottal volume velocity waveform has well-defined open and closing times, with an asymmetrical shape such that closing velocity is more rapid than opening velocity. The voicing volume velocity waveform obeys the equation <5>:

    U_g(t) = a t^2 - b t^3

during the open phase of 'no' samples, and is zero for the remainder of the period. The spectrum of the natural source is somewhat irregular, with a weak zero at about 600 Hz (assuming default settings for all of the glottal source parameters except 'ss', which is set to 2). Waveforms and spectra for the impulsive and natural voicing sources are compared in Figure 2. The natural glottal waveform can also be modified so as to tilt the spectrum down, using either 'no' or 'tl', in order to mimic the effects of incomplete glottal closure and the concomitant rounding of the corner of the waveform at closure.
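
The polynomial pulse above is easy to tabulate. The following Python sketch is illustrative only: the text does not give values for a and b, so the sketch assumes b = a/'no', which is the choice that brings the waveform back to zero exactly at the end of the open phase. The function name and the unit amplitude are likewise assumptions.

```python
def natural_pulse(no, period, a=1.0):
    """One period of U_g(n) = a*n^2 - b*n^3 for n < 'no', zero afterwards.

    'no' and 'period' are in samples. b = a / no is an assumed choice that
    makes U_g return to zero at n = no (glottal closure).
    """
    if no > period:
        no = period  # the open phase cannot exceed the period
    b = a / no
    return [a * n * n - b * n ** 3 if n < no else 0.0 for n in range(period)]


pulse = natural_pulse(no=30, period=100)  # 3 ms open phase, 100 Hz at 10 kHz
peak_index = max(range(len(pulse)), key=lambda n: pulse[n])
print(peak_index)  # rise time in samples; the fall takes no - peak_index
```

With this choice of b the peak falls two thirds of the way through the open phase, so the flow rises for about twice as long as it falls, consistent with the faster closing velocity described above.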

The disadvantage of the natural source waveform is that the magnitude spectrum is somewhat irregular, so that a formant will be slightly attenuated as it approaches a frequency of about 600 Hz (the actual zero locations depend on 'no', the number of samples in the open phase). This formant amplitude variation seems to occur in natural speech, but may not be desirable for particular synthesis stimulus sets.

------------
<4> Fortunately, the phase of the source spectrum is not of great perceptual importance, especially under listening conditions where room acoustics impose their own phase distortions on the sound reaching the ears.

------------
<5> The choice of synthesis waveform shape is based on suggestions contained in Rosenberg, A. (1971), "Effect of Glottal Pulse Shape on the Quality of Natural Vowels", J. Acoust. Soc. Am. 53, 1632-1645, and in Fant, G. (1983), "The Voice Source: Acoustic Modeling", Speech Transmission Laboratory QPSR 4/1982, Royal Institute of Technology, Stockholm, Sweden, 28-48.

'rs' (Constant)

The constant 'rs', "random seed", is the seed value given to the random number generator routine. Any number from 0 to 99999 can be specified. For each, you will get a quite different random number sequence (different frication and aspiration noises from those used to generate the previous stimuli). On the other hand, stimuli all generated with the same value for 'rs' will have identical frication source and aspiration source waveforms. This is sometimes desirable if stimuli on a continuum are not to differ due to random fluctuations in e.g. a burst of frication noise.

'os' (Constant)

The constant 'os', "output waveform selector", determines which waveform is saved in the output file. If 'os' has the default value of zero, the normal final output of synthesis is saved. Other output options are given in Table 3.
For example, if you wished to see and spectrally analyze the voicing source waveform of the synthesizer by itself for a particular synthetic utterance, you would set 'os' to four. Note that the radiation characteristic is applied if 'os' is greater than 4 <6>, but not if 'os' is less than 4. Thus, setting 'os'=4 results in the actual voicing source waveform being generated, while setting 'os'=5 produces the first difference of the voicing source waveform that ordinarily is routed to the parallel vocal tract model.

------------
<6> Due to computational considerations, the derivative of the voicing source is usually computed directly, so that the actual source waveform that is displayed when requested is approximated by sending the computed source waveform through a leaky integrator.

    TABLE 3: KLSYN output waveform options using 'os'

    'os'  WAVEFORM SAVED
     0.   Normal synthesis output
     1.   Voicing periodic component alone
     2.   Aspiration alone
     3.   Frication alone
     4.   Glottal source (voicing, turbulence, and aspiration)
     5.   Glottal source sent to parallel vocal tract (AP) + radiation char
     6.   Cascade vocal tract, output of nasal zero resonator      "
     7.   Cascade vocal tract, output of nasal pole resonator      "
     8.   Cascade vocal tract, output of fifth formant             "
     9.   Cascade vocal tract, output of fourth formant            "
    10.   Cascade vocal tract, output of third formant             "
    11.   Cascade vocal tract, output of second formant            "
    12.   Cascade vocal tract, output of first formant             "
    13.   Parallel vocal tract, output of sixth formant alone      "
    14.   Parallel vocal tract, output of fifth formant alone      "
    15.   Parallel vocal tract, output of fourth formant alone     "
    16.   Parallel vocal tract, output of third formant alone      "
    17.   Parallel vocal tract, output of second formant alone     "
    18.   Parallel vocal tract, output of first formant alone      "
    19.   Parallel vocal tract, output of nasal formant alone      "
    20.   Parallel vocal tract, output of bypass path alone        "

'f0'

The variable 'f0', "fundamental frequency", is the rate at which the vocal folds are currently vibrating, in Hz times 10. That is, if a fundamental frequency of 100 Hz is desired, then 'f0' is set to 1000. The additional accuracy resulting from a specification of fundamental frequency to 0.1 Hz adds some naturalness to a slowly changing pitch glide.

A new fundamental period is computed each time the vocal folds begin to open. The value of 'f0' existing at that time instant is used to determine the new period. Several other parameters of the voicing source ('av', 'no', 'tl', 'sk') change value at this time rather than changing at the nominal update time; otherwise discontinuities could occur in the voicing waveform.

The fundamental period is quantized in a digital speech synthesizer. In this simulation, the period (time between instants when glottal opening occurs) is quantized to increments of 1/40000 sec <7>. This means that at 100 Hz, 'f0' is effectively specified in 0.25 Hz steps (0.25% quantization error), while at 200 Hz, 'f0' is quantized in steps of about 1 Hz (a 0.5% quantization error, since for a fixed period increment the step in Hz grows as the square of 'f0'). This accuracy is necessary to avoid perceptible "staircase pitch" problems for slowly gliding 'f0' in the higher pitch ranges; it is achieved by running the glottal source simulation at a sampling rate four times that specified by 'sr', and lowpass/downsampling this waveform before sending it to the vocal tract model.

------------
<7> Patent pending by Digital Equipment Corporation

'av'

The variable 'av', "amplitude of voicing", is the amplitude in dB of the voicing source waveform sent through the cascade vocal tract. A value of 0 dB turns off (zeros) the signal. A value of about 60 dB produces a level for vowel synthesis that is close to the maximum non-overloading level; such values should be used to keep the signal in the higher-order bits of the digital-to-analog converter.
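
Returning to the period quantization described under 'f0' above, the size of the resulting frequency steps can be worked out in a few lines of Python. This is a sketch under the stated assumption that the period is an integer number of ticks at four times 'sr' (1/40000 sec at the default rate); the function names are hypothetical and nearest-tick rounding is an assumption.

```python
def quantized_f0_hz(f0_tenths_hz, sr=10000):
    """Fundamental frequency actually realized when the period is rounded
    to a whole number of ticks at 4 * sr. 'f0' is given in Hz times 10."""
    ticks_per_sec = 4 * sr
    ticks = round(ticks_per_sec / (f0_tenths_hz / 10.0))
    return ticks_per_sec / ticks


def f0_step_hz(f0_hz, sr=10000):
    """Spacing in Hz between adjacent realizable f0 values near f0_hz."""
    ticks_per_sec = 4 * sr
    ticks = round(ticks_per_sec / f0_hz)
    return ticks_per_sec / ticks - ticks_per_sec / (ticks + 1)


for hz in (100, 200):
    step = f0_step_hz(hz)
    print(hz, round(step, 3), round(100.0 * step / hz, 3))
```

The spacing works out to about 0.25 Hz near 100 Hz and about 1 Hz near 200 Hz, because a fixed 25 microsecond period increment corresponds to a frequency step of roughly f0 squared divided by 40000.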

The synthesizer does not necessarily turn voicing on and off at exactly the time specified by the 'av' time function. The effect of a change in 'av' is delayed until the instant of the next glottal waveform opening. If the natural source, 'ss'=2, is used, the primary excitation of the vocal tract actually begins even later, at glottal closure some 'no' (number of samples in the open phase of the glottal period) output samples following the time of glottal opening.

If 'av' is suddenly turned off, no more glottal pulses will be issued, and the vocal tract response to the previous pulse will continue to die out, taking 10 to 20 msec to become totally inaudible.

If 'av' is suddenly turned ON, and you wish a glottal pulse to be issued at exactly that time, it is necessary to have set 'f0' to zero for a period of time prior to this event, and to turn 'f0' on simultaneously with 'av'. This procedure should be followed in order to specify voice onset time for a plosive as an exact number of update intervals later than burst onset.

'ah'

The variable 'ah', "amplitude of aspiration", is the amplitude in dB of the aspiration noise sound source that is combined with periodic voicing, if present ('av'>0), to constitute the glottal sound source that is sent to the cascade vocal tract <8>. A value of zero turns off the aspiration source, while a value of 60 results in an output aspirated speech sound with levels in formants above F1 roughly equal to the levels obtained by setting 'av' to 60. The spectrum of the aspiration noise source is nearly flat, actually falling slightly with increasing frequency. To best approximate an aspirated speech sound, one should probably increase 'b1', the first formant bandwidth, to anywhere from 200 to 400 Hz, thus simulating the effect of additional low-frequency losses incurred when the glottis is partially open.

------------
<8> Voicing can be sent to the parallel vocal tract by making 'ap' non-zero, but aspiration cannot be sent to the parallel vocal tract. Instead, one would use 'af', the amplitude of frication noise.

'at'

The variable 'at', "amplitude of turbulence", is the amplitude in dB of turbulence noise generated at the glottis during the open phase of a glottal vibration. The noise is identical to aspiration except (1) the source is turned off during the closed phase of a glottal cycle, and (2) the output level rises and falls with changes to the variable 'av'. Thus this breathiness dimension of voicing is zero when 'av' is set to zero, whereas aspiration noise is not influenced by the setting of 'av'. Usually 'ah' is used to generate aspiration for voiceless aspirated plosives and [h] sounds, while 'at' is used to add a breathiness quality to the voicing source. A value of 60 will make the voice quite breathy. To achieve a good match to natural breathiness, however, one should probably also tilt down the source spectrum, using 'tl', increase the open phase of a glottal cycle, 'no', to a little more than half the period, and perhaps increase 'b1'.

'no'

The spectrum of a voicing source pulse train can vary in two fairly distinct ways. The relative amplitude of the first harmonic can increase or decrease, or the general tilt of the spectrum can go up and down. To change primarily just the first harmonic amplitude, the 'no' parameter is varied, while the parameter 'tl' affects the general spectral tilt (see Figure 3).

The variable 'no', "number of samples in the open period", is a nominal indicator of the width of the glottal pulse when using the default impulse train glottal source, and it is the exact number of samples in the open period when using the natural voicing source ('ss'=2). A value of 'no'=30, the default value, corresponds to a 3 msec open portion of the fundamental period at a sampling rate of 10000 samples/sec, see Figure 2.
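
The conversion between 'no' and open-phase duration quoted above is a one-line calculation, sketched here in Python. The function names are hypothetical. The second helper goes the other way and picks a 'no' that makes the open phase a chosen fraction of the period; it assumes 'f0' is expressed in Hz times 10 as described earlier, and rounding to the nearest sample is an assumption.

```python
def open_phase_ms(no, sr=10000):
    """Duration of the open phase, in msec, for 'no' samples at rate 'sr'."""
    return 1000.0 * no / sr


def no_for_open_fraction(f0_tenths_hz, fraction, sr=10000):
    """Value of 'no' giving an open phase that is 'fraction' of the period.

    f0_tenths_hz is 'f0' in Hz times 10 (e.g. 1000 for 100 Hz).
    """
    period_samples = sr / (f0_tenths_hz / 10.0)
    return int(round(fraction * period_samples))


print(open_phase_ms(30))                # default 'no' at 10000 samples/sec
print(no_for_open_fraction(1000, 0.5))  # half the period at 100 Hz
print(no_for_open_fraction(2000, 0.5))  # half the period at 200 Hz
```

Holding the open fraction fixed means 'no' has to be recomputed whenever 'f0' changes, whereas a fixed 'no' lets the fraction drift with pitch.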

There are many male speakers for whom the duration of the open portion of the fundamental period does not change as fundamental frequency changes over a fairly wide range. Thus, it is not necessary to change 'no' during synthesis when generating many kinds of speech stimuli. Other speakers tend to produce speech with 'no' being a constant fraction of the total period, e.g. about half of the period. To simulate the behavior of this kind of speaker, one must adjust 'no' to be inversely proportional to the fundamental frequency parameter 'f0', which is rather a bother.

The effect of changes in 'no' on the spectrum is illustrated in Figure 3. A narrow glottal pulse, as may occur in creaky voice, or when trying to speak loudly, results in a spectrum relatively rich in higher-frequency components, while a wider glottal pulse, as may occur in a breathy offset to speaking, results in a spectrum rich in energy below the first formant. Thus to match an observed strong first harmonic in the spectrum of a natural utterance, increase 'no'.

The synthesizer routine checks to see that 'no' does not exceed the duration of the period, and silently truncates requests that exceed the duration of the current period.

'tl'

The variable 'tl', "spectral tilt of voicing", is the (additional) downward tilt of the spectrum of the voicing source, in dB, as realized by a soft one-pole low-pass filter. The effect of changes in 'tl' on the voicing source spectrum is illustrated in Figure 3. A value of zero has no effect on the source spectrum, while a value of 24 tilts the spectrum down gradually such that frequency components above about 3 kHz are attenuated by about 24 dB relative to a more normal source spectrum.
The tilt parameter is an attempt to simulate the spectral effect of a "rounding of the corner" at the time of closure in the glottal volume velocity waveform, due either to an incomplete closure, as in breathiness, or to an asynchronous closure such that the anterior portions of the vocal folds meet at the midline before the posterior portions come together. The tilt parameter is also useful in simulating a voicebar, wherein only lower-frequency components are radiated from the closed vocal tract. For many speech synthesis situations, this would be the only use for 'tl'. However, 'tl' is a good parameter to use in attempts at matching the spectral details of a particular natural utterance.

'sk'

The variable 'sk', "skew to alternate periods", is the number of 25 microsecond increments to be added to and subtracted from successive fundamental period durations in order to simulate one aspect of vocal fry, the tendency for alternate periods to be more similar in duration than adjacent periods. Such aperiodicities, when introduced, have fairly strong perceptual consequences. This kind of change to normal voicing occurs throughout speech for some voices, and at the initiation and cessation of voicing in a sentence for many others. There is no need to play with this parameter in most synthesis situations.

'F1' 'F2' 'F3' 'F4' 'F5' 'f6'

The "formant frequency" variables determine the frequency in Hz of up to six resonators of the cascade vocal tract model, and the frequency in Hz of each of six additional parallel formant resonators. Normally, the cascade branch of 'nf'=5 formants is used to generate voiced and aspirated sounds, while the parallel branches are used to generate fricatives and plosive bursts. Since formants are the natural resonant frequencies of the vocal tract, and frequency locations are independent of source location, the formant frequencies of cascade and corresponding parallel resonators must be identical.
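A minimal sketch of the 'sk' arithmetic described above: successive periods are alternately lengthened and shortened by 'sk' increments of 25 microseconds. The function name and the choice to lengthen the first period are assumptions; KLSYN may order the alternation differently.

```python
def skewed_periods_us(f0_hz, sk, count):
    """Durations in microseconds of `count` successive periods.

    Assumption: even-numbered periods are lengthened and odd-numbered
    periods shortened by sk * 25 microseconds.
    """
    nominal_us = 1_000_000.0 / f0_hz
    offset_us = 25.0 * sk
    return [nominal_us + offset_us if i % 2 == 0 else nominal_us - offset_us
            for i in range(count)]


print(skewed_periods_us(100, 4, 4))
```

Each pair of adjacent periods still sums to two nominal periods, so the average fundamental frequency is unchanged while alternate periods match each other.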
Suggested values for formant frequencies of a number of English sounds were published in Klatt (J. Acoust. Soc. Am. 67, p.971, 1980). The tables are reproduced below as Table 4 and Table 5 for easy reference, although it is recommended that synthesis parameter values be based on analysis, synthesis, and comparison of a real utterance, rather than just on theory and matches to the idiolect of D. Klatt.

Formant frequencies generally move continuously and slowly in time (relative to the default 5 msec parameter update interval 'ui'). An exception is the closure and release of a stop consonant. During closure, the first formant 'F1' is typically at a frequency of about 180 Hz <9>. Upon release, the first formant frequency may rise quite rapidly over the first 5 to 10 msec, giving the appearance of a discontinuous jump to a frequency as high as 400 Hz at the time of the first visible glottal pulse following the burst in a syllable such as [ba], see

------------
<9> The first formant frequency does not go below about 180 Hz under any circumstances due to the mass and compliance of cavity walls and air trapped in the closed vocal tract.

'b1' 'b2' 'b3' 'b4' 'b5' 'b6'

The "formant bandwidth" variables determine the bandwidths of resonators in the cascade vocal tract model. Since formant bandwidths depend in part on source impedance, and turbulence sources contribute more losses, the synthesizer provides separate control of bandwidths 'p1' 'p2' 'p3' 'p4' 'p5' 'p6' for the parallel formants. If the number of formants in the cascade branch is left at the default value of 'nf' = 5, then the 'b6' variable has no meaning and no effect on the synthetic waveform.

The resonator bandwidth variable has two effects on the frequency-domain shape of the vocal tract transfer function. An increase in bandwidth reduces the amplitude of the formant peak and simultaneously increases the width of the peak as measured 3 dB down from the peak. Perceptual experiments indicate that both of these changes have perceptual consequences, but that the change in peak height is much more audible than the width change.

In a cascade synthesizer, adjustments to formant peak heights in order to match the spectrum of a recorded voice can be achieved either by changing the general slope of the voicing source spectrum (using 'tl') or by changing individual formant bandwidths. Changing formant bandwidths is an effective way to mimic quite closely the voice quality of a speaker, but some guidelines are offered to help avoid the perceptual problems of aberrant bandwidth specification:

1. If a bandwidth is set to a value less than the soft limits given in Table 1, there is a danger that whistle-like harmonics will be heard when a harmonic of the fundamental sweeps past the formant frequency.

2. If the bandwidths of the lower formants are wider than the suggested guidelines of Table 1, the synthetic voice will begin to sound buzzy. In this case, all bandwidths should be reduced, and then 'av' can be reduced to get back to an appropriate overall spectral level.

'fp' 'fz'

The variable 'fp', "frequency nasal pole", in consort with the variable 'fz', "frequency nasal zero", can mimic the primary spectral effects of nasalization in vowel-like spectra. In a typical nasalized vowel, the first formant is split into peak-valley-peak (pole-zero-pole) such that 'fp' is at about 300 Hz, 'F1' is higher than it would be if the vowel were non-nasalized, and 'fz' is at a frequency approximately halfway between 'fp' and 'F1'. When returning to a non-nasalized vowel, 'fz' is moved down gradually to a frequency exactly the same as 'fp'. The nasal pole and nasal zero then cancel each other out, and it is as if they were not present in the cascade vocal tract model.

'bp' 'bz'

The variables 'bp', "bandwidth nasal pole", and 'bz', "bandwidth nasal zero", are set to default values of 90 Hz.
It is difficult to determine appropriate synthesis bandwidths for individual nasalized vowels, but, fortunately, one can achieve good synthesis results without changing these default values in most cases.

'af'

The variable 'af', "amplitude frication", determines the level of frication noise sent to the various parallel formants and bypass path. The variable should be turned on gradually for fricatives (e.g. straight line from 0 to 60 dB in 90 msec), and abruptly to about 60 dB for plosive bursts.

'a1' 'a2' 'a3' 'a4' 'a5' 'a6' 'ab'

The variables 'a1' 'a2' 'a3' 'a4' 'a5' 'a6' 'ab', "amplitudes parallel formants", determine the spectral shape of a fricative or plosive burst. If a formant is a front cavity resonance for a particular fricative articulation, one might set the formant amplitude to 60 dB as a first guess. Formants associated with the cavity in back of the constriction should have their amplitudes set to zero initially <10>, and then all parallel formant amplitudes should be adjusted on a trial-and-error basis, comparing synthesized frication spectra with a natural frication spectrum. The bypass path amplitude is used when the vocal tract resonance effects are negligible because the cavity in front of the main fricative constriction is too short, as in [f], [v], [th], [dh], [p], [b].

'p1' 'p2' 'p3' 'p4' 'p5' 'p6'

The variables 'p1' 'p2' 'p3' 'p4' 'p5' 'p6', "bandwidths parallel formants", are set to default values that are wider than the bandwidths used in the cascade vocal tract model. It is difficult to measure formant bandwidths accurately in noise spectra, even when a fairly long sustained fricative is available for analysis. However, these default values can be used in most situations. The only adjustment is then made to the parallel formant amplitudes in order to match details in a natural frication spectrum.
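To see how bandwidth trades off against formant peak height, the sketch below evaluates the gain of a single second-order digital resonator at its own formant frequency, using the difference equation y[n] = A x[n] + B y[n-1] + C y[n-2] and the coefficient formulas given in Klatt (1980). The function names and the example frequency and bandwidth values are illustrative assumptions, not KLSYN defaults.

```python
import cmath
import math


def resonator_coefficients(freq_hz, bw_hz, sample_rate=10000.0):
    """Coefficients A, B, C of y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to unity
    return a, b, c


def resonator_gain_db(freq_hz, bw_hz, at_hz, sample_rate=10000.0):
    """Gain in dB of the resonator, evaluated at frequency at_hz."""
    a, b, c = resonator_coefficients(freq_hz, bw_hz, sample_rate)
    z_inv = cmath.exp(-2j * math.pi * at_hz / sample_rate)
    return 20.0 * math.log10(abs(a / (1.0 - b * z_inv - c * z_inv * z_inv)))


# Peak gain of a 1500 Hz formant for three example bandwidths.
for bw in (50.0, 100.0, 200.0):
    print(bw, round(resonator_gain_db(1500.0, bw, 1500.0), 1))
```

In this sketch each doubling of the bandwidth lowers the peak by roughly 6 dB, which is the peak-height effect described under 'b1' through 'b6'.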
------------
<10> The amplitude of the first parallel formant, 'a1', is therefore zero for all English fricatives.

ALL-PARALLEL SYNTHESIS Using 'ap' and 'an'

The variable 'ap', "amplitude voicing parallel", is the amplitude, in dB, of voiced excitation of the parallel vocal tract. Normally, this would be allowed to remain at the default value of zero since the cascade vocal tract would be used for generating the voicing component of all voiced sounds (even voicebars and voiced fricatives). However, there are circumstances where a vowel with special characteristics (e.g. two-formant vowels) can only be generated using the greater flexibility (individual control of formant amplitudes) of the parallel vocal tract. A value of 'ap' = 60 would be a good choice to synthesize a typical vowel using the parallel vocal tract model. Of course, 'av' would be set to zero.

The parallel formant amplitude variables must then be adjusted to get the right spectral shape for the vowel. A good starting point is to set parallel formant amplitudes 'a1' 'a2' 'a3' 'a4' 'a5' to 60 dB. This will give exactly the right relative formant amplitudes for a non-nasalized vowel with formant frequencies at 500, 1500, 2500, 3500 and 4500 Hz. However, as formant frequencies are changed from these values (appropriate for a uniform tube), formant amplitudes can quickly diverge from those in a corresponding cascade vocal tract model <11>. Trial-and-error adjustment of parallel formant amplitudes is then necessary.

'an'

The variable 'an', "amplitude parallel nasal formant", is normally not used. However, when employing the parallel vocal tract to synthesize vowels, as discussed above, 'an' can be used to simulate the effects of nasalization on vowels and nasal murmurs. To achieve nasalization, one would set 'fp' to about 280 Hz (the default value) and adjust both 'an' and 'a1' to levels matching a nasalized vowel spectrum.
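The formant frequencies quoted in the 'ap' discussion above (500, 1500, 2500, 3500 and 4500 Hz) are the resonances of a uniform tube closed at one end and open at the other, F_n = (2n - 1) c / (4 L). The sketch below reproduces them; the tube length of 17.5 cm and the sound speed of 350 m/s are conventional textbook assumptions, not values taken from KLSYN, and the function name is hypothetical.

```python
def uniform_tube_formants(count=5, length_m=0.175, sound_speed=350.0):
    """Resonances in Hz of a uniform tube closed at one end, open at the other.

    Assumption: length_m and sound_speed are textbook values for an adult
    male vocal tract, chosen so that the first resonance falls at 500 Hz.
    """
    return [(2 * n - 1) * sound_speed / (4.0 * length_m)
            for n in range(1, count + 1)]


print([round(f) for f in uniform_tube_formants()])
```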
------------
<11> The formant amplitude will increase/decrease as formant frequency is increased/decreased, but there is no automatic adjustment such that formants "riding on the skirt" of a lower-frequency formant are attenuated as this formant frequency is lowered.

'g0'

An overall gain control, 'g0', is included to permit the user to adjust the output level without having to modify each source amplitude time function. The nominal value is 60 dB. To increase the output by, e.g., 3 dB, one would simply use the 'C' command to set 'g0' to 63. In unusual circumstances, it might be desirable to make 'g0' a variable, and control it as a function of time. This is permitted, although I can't think of a very good example of when such a procedure would be advantageous.

'sc'

The 'sc' parameter is a constant parameter which switches the auto-scaling feature on (sc = 1) or off (sc = 0). When on, the waveform samples (in a temporary floating-point format) are scaled relative to the maximum signal value. This brings the peak output level to zero dB. The scaling factor is computed by simply dividing the maximum absolute signal value into 32767 (the maximum short integer). The output values are then multiplied by this factor before being written to the output file. If this feature is turned off, the output values are simply written to the output file as they were originally computed (this often leads to very quiet signals or peak-clipped signals).

It should be noted that the auto-scaling parameter plays no part in the reporting of the peak output level at the end of synthesis. The peak that is reported is calculated from the original signal values, not the scaled ones. Therefore, even if the signal has been scaled to a peak level of zero dB, the peak level will still be reported as the pre-scaled one.
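A minimal sketch of the scaling arithmetic described above, assuming the samples are held in a Python list of floats. The function name is hypothetical, and rounding to the nearest integer is an assumption; the text does not say how KLSYN converts the scaled values to short integers.

```python
MAX_SHORT = 32767  # maximum 16-bit signed integer


def auto_scale(samples, sc=1):
    """Return (output samples as ints, peak of the original signal).

    With sc=1 the samples are multiplied by MAX_SHORT / peak; with sc=0
    they are passed through as computed. The returned peak is always the
    pre-scaling one, mirroring the reporting behaviour described above.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if sc == 1 and peak > 0:
        factor = MAX_SHORT / peak
    else:
        factor = 1.0
    # Assumption: round to nearest; clipping of unscaled output is not modelled.
    return [round(s * factor) for s in samples], peak


scaled, peak = auto_scale([0.0, 250.0, -1000.0])
print(scaled, peak)
```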