Comput Speech Lang. 2013 Jun 1;27(4):989-1010.
doi: 10.1016/j.csl.2012.10.005.

Phrase-level speech simulation with an airway modulation model of speech production

Brad H Story. Comput Speech Lang. 2013.

Abstract

Artificial talkers and speech synthesis systems have long been used as a means of understanding both speech production and speech perception. The development of an airway modulation model is described that simulates the time-varying changes of the glottis and vocal tract, as well as acoustic wave propagation, during speech production. The result is a type of artificial talker that can be used to study various aspects of how sound is generated by humans and how that sound is perceived by a listener. The primary components of the model are introduced, and simulation of words and phrases is demonstrated.

Keywords: modulation; speech simulation; speech synthesis; vocal folds; vocal tract.

Figures

Figure 1
Diagram of the four-tier model. Tier 0 controls the kinematic vocal folds model; L0 and T0 are the initial length and thickness of the vocal folds, respectively. Tier I produces a vowel substrate and Tier II generates a superposition function for a consonant constriction. Time-dependent nasal coupling can be specified in Tier III. The base structural components are dependent only on a spatial dimension (i.e., i and j are indexes corresponding to spatial location), whereas the final outputs are dependent on both space and time.
Figure 2
Kinematic model of the medial surfaces of the vocal folds. The Post-Ant. dimension is the vocal fold length, the Inf.-Sup. dimension is the vocal fold thickness, and the Lateral-Medial dimension is vocal fold displacement. ξ02 and ξ01 are the prephonatory adductory settings of the upper and lower portions, respectively, of the posterior portion of the vocal folds. ξb is the surface bulging, and zn is a nodal point around which the rotational mode of vibration pivots.
Figure 3
Mapping of mode coefficients [q1, q2] (see Eqn. 3) to formant frequencies [F1, F2]. (a) Coefficient space, where the grid indicates the range of q1 (x-axis) and q2 (y-axis); the white dots are coefficient pairs that would produce area functions representative of [i, ɑ, u] and the neutral vowel (at the origin); the red trajectory indicates the variation in coefficient values to generate a time-varying area function for the word “Ohio.” (b) Vowel space plot, where the grid, white dots, and red trajectory represent the [F1, F2] values corresponding to the respective coefficient pairs in part (a).
Figure 4
Area function portion of the model. The glottis is located at 0 cm, the trachea extends from the glottis in the negative direction, and the vocal tract extends in the positive direction. The black line in the positive direction indicates the area function for a neutral vowel, (π/4)Ω(x)²; the blue line is a perturbation of the neutral shape based on Eqn. (1); and the red line demonstrates an area function with an occlusion located at the lips. The nasal coupling location is indicated by the dashed line and upward-pointing arrow located at about 8.8 cm from the glottis. The stair-step nature of each area function indicates the concatenated tubelet structure of the model. The waveforms shown are samples of glottal flow and radiated sound pressures at the lips, nares, and skin surfaces.
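The neutral-vowel expression in the Figure 4 caption converts a diameter function Ω(x) into a cross-sectional area, with one value per tubelet. The snippet below is only a minimal sketch of that conversion. The function name tubelet_areas and the sample diameters are hypothetical and are not taken from the paper, and the perturbation of Eqn. (1) is not reproduced because it does not appear on this page.

```python
import math


def tubelet_areas(diameters_cm):
    """Return cross-sectional areas (cm^2) for sampled diameters (cm).

    Applies A = (pi / 4) * d**2 to each tubelet, i.e. the area of a
    circular cross-section with diameter d.
    """
    return [(math.pi / 4.0) * d ** 2 for d in diameters_cm]


# Hypothetical diameter samples (cm), one per tubelet. These are
# illustrative values, not the neutral shape used in the paper.
omega = [1.0, 1.5, 2.0, 1.5, 0.0]

# A diameter of 0.0 gives an area of 0.0, i.e. a complete occlusion
# like the one the caption describes at the lips.
areas = tubelet_areas(omega)
```

Because each tubelet has a single constant area, plotting these values against distance from the glottis gives the stair-step outline mentioned in the caption.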
Figure 5
Baseline settings for all simulations. (a) F0 contour. (b) Neutral vocal tract configuration on which all shape perturbations were imposed. The inset plot shows the calculated frequency response of this shape, where the first three formant frequencies are indicated by the dashed lines.
Figure 6
Simulation of the word Ohio. Time-dependent control parameters are shown in the upper two panels; for this word, only ξ02 and the mode coefficients [q1, q2] were varied. Glottal flow ug, intraoral pressure Poral, total radiated pressure Pout, and the wide-band spectrogram of Pout are shown in the lower four panels.
Figure 7
Vocal tract modulation for Ohio. The red lines represent the configuration at the beginning of the utterance, the blue lines represent the utterance end, and the dotted black lines indicate the variation in shape that occurs during the utterance. The inset plot shows the variation of the first three formant frequencies calculated directly from the changing vocal tract shape.
Figure 8
Simulation of the word Abracadabra. Time-dependent control parameters are shown in the upper three panels: vocal process separation ξ02, mode coefficients [q1, q2], and consonant magnitude functions mck. Additional parameters associated with the mck functions are provided in Table 1. Glottal flow ug, intraoral pressure Poral, total radiated pressure Pout, and the wide-band spectrogram of Pout are shown in the lower four panels.
Figure 9
Vocal tract modulation for Abracadabra. The red lines represent the configuration at the beginning of the utterance, the blue lines represent the utterance end, and the dotted black lines indicate the variation in shape that occurs during the utterance. The solid black lines denote tract shapes at points in time where an imposed constriction is fully expressed. The phonetic symbols are shown at the approximate locations where the associated constrictions occur. The first three formant frequencies are shown in the inset plot as a series of dark points and the gray lines are time-variations of the three formants that would occur in the absence of any imposed constrictions.
Figure 10
Simulation of the phrase He had a rabbit. Parameters and waveforms are displayed in the same order as in Fig. 8. Additional parameters associated with the mck functions are provided in Table 1.
Figure 11
Vocal tract modulation and formant frequencies for He had a rabbit shown in the same format as in Fig. 9.
Figure 12
Simulation of the phrase The brown cow. Time-dependent control parameters are shown in the upper four panels as in previous figures but with the addition of nasal port area anp. Parameters associated with the mck functions are provided in Table 1. Glottal flow ug, intraoral pressure Poral, total radiated pressure Pout, and the wide-band spectrogram of Pout are shown in the lower four panels.
Figure 13
Vocal tract modulation for The brown cow shown in the same format as in Figs. 9 and 11.
