Visualizations by acoustics of voice stress: Is there an optical mucosal wave correlate?

Grażyna Demenko a,b, Magdalena Jastrzębska b,
Krzysztof Izdebski c,d, Yuling Yan d
a Poznan Supercomputing and Networking Center, Poznan, (Poland)
b Department of Phonetics, A. Mickiewicz University of Poznan, Poznan, (Poland)
c Pacific Voice and Speech Foundation, San Francisco, CA (USA)
d Santa Clara University, Santa Clara (USA)


We present data on how stress is manifested in the human voice by analyzing the acoustic and phonetic structure of live recordings from the police 997 emergency call center in Poland. From a data corpus comprising thousands of authentic Police 997 emergency phone calls, a few hundred were automatically selected according to a duration criterion (calls shorter than 3-4 seconds were omitted). From this subset, the voices of 45 speakers were chosen for acoustic analysis and contrasted with neutral samples. Statistical measurements for stressed and neutral speech samples showed the relevance of the arousal dimension in stress processing. The MDVP analysis confirmed the statistical significance of the following parameters in voice stress detection: fundamental frequency and pitch variation, noise-to-harmonic ratio, sub-harmonics and voice irregularities. In highly stressful conditions a systematic shift in pitch of over one octave was observed. Linear Discriminant Analysis based on nine acoustic features showed that it is possible to categorize the speech samples into one of the following classes: male stressed or neutral, or female stressed or neutral.

Keywords: call centers interfaces, detection of vocal stress, stress visualization, physiological correlates


Assessing whether or not a speaker is under stress is important in many civilian and military applications1-3. Automatic detection of vocal stress is also becoming increasingly important in multilingual communication, security systems, banking and law enforcement, particularly since emergency call centers and police departments all over the world are overloaded with different kinds of calls, only some of which represent real danger and need an immediate response. Hence, to improve decision effectiveness and to save lives, it is of particular and pragmatic interest to automatically detect those speech signals that are marked by stress4-6.

Several investigations1,7 showed direct applications of emotion recognition to stress recognition1,8-9 by discussing the differences in acoustic features between neutral and stressed speech signals brought about by a variety of emotions4,10. A number of these studies have focused on the effects of emotions on stress because of the close relation between emotion and stress recognition, e.g. the use of similar acoustic features (F0, intensity, speech unit duration) and of the arousal dimension2,11-12. Their results indicate that the speech correlates depend on physiological constraints and correspond to broad classes of basic emotions, but the studies disagree on the specific differences between the acoustic correlates of particular classes of emotions11,13. Certain emotional states can be correlated with physiological states, which in turn have predictable effects on speech and on its prosodic features. For instance, when a person is in a state of anger, fear or joy, the sympathetic nervous system is aroused and speech becomes loud, fast and enunciated with strong high-frequency energy. When one is bored or sad, the parasympathetic nervous system is aroused, which results in slow, low-pitched speech with little high-frequency energy2. Apart from these differences, other studies showed an increase in intensity and in fundamental frequency, a stronger concentration of energy above 500 Hz and an increase in speech rate in stressed speech2.

While some progress has been made in the area of stress definition and assessment from acoustic or visual signals2,8,11, visual correlates of how the vocal folds or the supraglottic larynx generate these affected signals are essentially non-existent2,14. Our study focuses on the analysis of vocal stress produced in response to events in people’s surroundings that they perceive as unusual and impossible to control. We analyzed third-order stressors, the psychological ones, which have their effect at the highest level of speech production4. External stimuli such as a threat are subject to individual mental evaluation, and the emotional states they may bring about (e.g. fear, anger, irritation) affect speech production at its highest level. Due to methodological difficulties in the analysis of speech under stress, the literature presents results that are somewhat at variance with each other. The validity of such studies, however, depends heavily on the experimental material.

We assumed that separate models trained on speech from both stressful and neutral environments should make it possible to determine acoustic stress indicators more reliably and, in particular, should help answer the question of which F0 derivatives are the most valuable stress predictors.

2. Speech corpus construction and annotation

The 997 – Emergency Calls Database is a collection of spontaneous speech recordings that comprises crime/offence notifications and police intervention requests. All recordings were automatically grouped into sessions according to the phone number from which the call was made, comprising over 8 000 sessions in all.

A six-level preliminary phonetic annotation of this corpus was performed. The annotation included: (1) background acoustics, (2) types of dialog act, (3) suprasegmental features such as speech rate (fast, slow, rising, decreasing), loudness (low voice or whisper, loud voice, decreasing or increasing voice loudness) and intonation (rising, falling or sudden break of melody and unusually flat intonation), (4) context (threat, complaint and depression), (5) time (past, immediate and potential), (6) emotional coloring (up to 3 categorical labels and values for 3 dimensions: potency, valency and arousal, where potency is the level of control that a person has over the situation causing the emotion, valency states whether the emotion is positive or negative and arousal refers to the level of intensity of an emotion)9,15. Voice stress detection was performed only on those speakers who manifested different arousal levels in two or more dialogs. For each of the 45 selected speakers, two speech samples were collected and analyzed.
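The six annotation levels and the speaker selection rule can be pictured as a small record type plus a filter. The following is only an illustrative sketch: the class name, field names, speaker identifiers, the numeric scale of the three dimensions and the example values are all assumptions, since the paper does not specify how the annotations were stored.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DialogAnnotation:
    """One annotated dialog; every field name here is hypothetical."""
    speaker_id: str          # assumed identifier, e.g. derived from the session
    background: str          # (1) background acoustics
    dialog_act: str          # (2) type of dialog act
    speech_rate: str         # (3) suprasegmental: fast, slow, rising, decreasing
    loudness: str            # (3) suprasegmental: whisper, loud, ...
    intonation: str          # (3) suprasegmental: rising, falling, flat, ...
    context: str             # (4) threat, complaint or depression
    time: str                # (5) past, immediate or potential
    emotion_labels: list = field(default_factory=list)  # (6) up to 3 labels
    potency: float = 0.0     # (6) assumed numeric scale
    valency: float = 0.0     # (6) assumed numeric scale
    arousal: float = 0.0     # (6) assumed numeric scale

    def __post_init__(self):
        if len(self.emotion_labels) > 3:
            raise ValueError("at most 3 categorical emotion labels")


def speakers_with_varying_arousal(annotations):
    """Return speakers whose dialogs show at least two distinct arousal levels."""
    levels = defaultdict(set)
    for ann in annotations:
        levels[ann.speaker_id].add(ann.arousal)
    return sorted(s for s, seen in levels.items() if len(seen) >= 2)


# Invented example records, not data from the corpus.
annotations = [
    DialogAnnotation("s1", "street", "request", "fast", "loud", "rising",
                     "threat", "immediate", ["fear"], 0.2, -0.8, 0.9),
    DialogAnnotation("s1", "quiet", "cancel", "slow", "low", "falling",
                     "threat", "past", ["relief"], 0.7, 0.3, 0.2),
    DialogAnnotation("s2", "quiet", "report", "slow", "low", "flat",
                     "complaint", "past", ["irritation"], 0.5, -0.3, 0.4),
]
selected = speakers_with_varying_arousal(annotations)
```

In this sketch only speaker "s1" would be retained, because "s2" has a single annotated dialog.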

3. Pitch variability

3.1 Pitch register
A key issue in stress detection by machine is defining an utterance segmentation that results in clear units with respect to perceptual and acoustic homogeneity. Vocal register, which divides the voice range into registers, is an important perceptual category. There are many approaches to defining vocal register. For example, a vocal register is perceptually a distinct region of vocal quality that can be maintained over some ranges of pitch and loudness, across consecutive voice frequencies, without a break16. However, we need to bear in mind that the definition of vocal register and the terminology of register classification are among the most controversial problems and that the physiological correlates of register are not defined17. Three cases have been presupposed: (1) different pitch position, same compass, (2) different pitch position, different compass, (3) same pitch position, different compass.

3.2 Pitch ranges
3.2.1 Different pitch position same compass
Three pitch position settings were observed in the utterances: (1) relatively constant pitch position within the utterance, (2) dynamic change of pitch position within the utterance, with the pitch position shifted upward, (3) dynamic change of pitch position within the utterance, with the pitch position shifted upward and downward.

1) Relatively constant pitch position within the phrase
Figure 1a is an acoustic representation of an utterance informing about a burglary and a life threat, whereas Figure 1b illustrates an utterance from the same person calling off the intervention (informing that the burglar has left the apartment), recorded 1 hour after the first call. The follow-up call shows a downward F0 shift in pitch position of approximately 40 Hz (Figure 1b) compared with the F0 contour of the utterance in Figure 1a.

Figure 1a: F0 contour of constant stress in the utterance: “Please, come over, there’s a house-breaking. She’s scared to death” (Fmin =240 Hz, Fmax=352 Hz).


Figure 1b: F0 contour of neutral speech in the utterance: “I called one hour ago, I want to call off the intervention” (Fmin=167 Hz, Fmax=264 Hz).

The utterances in Figure 1a and 1b have similar pitch compasses but different pitch positions (probably caused by stress).
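The comparison of pitch position and compass can be made concrete with a short calculation. The sketch below takes pitch position to be Fmin, as the later sections do, and makes two assumptions that are not stated in the paper: compass is measured as the Fmax/Fmin interval in semitones, and two values count as "the same" when they differ by at most 2 semitones. The function names are illustrative.

```python
import math


def semitones(f_low_hz, f_high_hz):
    """Musical interval between two frequencies, in semitones."""
    return 12.0 * math.log2(f_high_hz / f_low_hz)


def describe(fmin_hz, fmax_hz):
    """Pitch position (taken as Fmin) and compass (Fmin-to-Fmax interval)."""
    return {"position_hz": fmin_hz, "compass_st": semitones(fmin_hz, fmax_hz)}


def compare(a, b, tolerance_st=2.0):
    """Label a pair of utterances with one of the presupposed cases.

    tolerance_st is an assumed threshold, not a value from the paper.
    """
    position_gap = abs(semitones(a["position_hz"], b["position_hz"]))
    compass_gap = abs(a["compass_st"] - b["compass_st"])
    position = "same" if position_gap <= tolerance_st else "different"
    compass = "same" if compass_gap <= tolerance_st else "different"
    return f"{position} pitch position, {compass} compass"


# Fmin and Fmax values taken from the captions of Figures 1a and 1b.
fig_1a = describe(240.0, 352.0)
fig_1b = describe(167.0, 264.0)
result = compare(fig_1a, fig_1b)
# result == "different pitch position, same compass"
```

Under the same assumptions, the values given in the captions of Figures 5a and 5b (92–137 Hz and 86–252 Hz) fall into the "same pitch position, different compass" case.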

2) Dynamic change of pitch position within the utterance. Pitch position shifted upward.

In cases of high stress levels, F0 can reach extreme values. For example, female voices may rise to as high as 700 Hz. Figure 2a illustrates an utterance of a female speaker under extreme stress as she reports to the police, “a masked person has entered my apartment”. Vocal stress decreases only slightly at the end of the recording, after the dispatcher asks her to calm down. As the stress of the speaker increases, the following are noted: 1) an upward shift in voice pitch, 2) a prominence of the higher frequencies in the spectrum, 3) an increase in the signal’s energy and 4) rate changes.

Figure 2a: A gradual stress increase in the utterances: a) “Someone is entering the apartment” (Fmin =220Hz), b) “He’s masked”(Fmin =260 Hz), c) “he is somewhere

In cases of high levels of stress, F0 values can reach extremes (even up to 750 Hz). Figure 2b illustrates an utterance marked by an extreme increase in stress, ending with a scream and an excessive lengthening of some syllables.

Figure 2b: A gradual increase in stress in the utterances: (a)

In this case F0 changes lie in the range of 220 Hz – 750 Hz. As the stress of the speaker increases, the following are observed: 1) an upward shift in voice pitch, 2) a prominence of the higher frequencies in the spectrum, 3) an increase in the signal’s energy and 4) rate changes.

3) Dynamic change of pitch position within the utterance. Pitch register shifted upward and downward.

The shaded part of Figure 3 shows an utterance by a male voice characterized by a significant upward shift of F0 position of over 50 Hz.

Figure 3: a) “I keep trying to get through…” (Fmin =121Hz), b) “I’ve reported it so many times already…”-- clearly audible irritation (Fmin =173 Hz) c) “… so I don’t know anything anymore…”-- the answer after being asked by a police officer to calm down (Fmin =115 Hz).

3.2.2 Different pitch position different compass

In cases of anger and mixed emotions significant changes of both pitch position and pitch compass were observed. Figure 4 illustrates F0 contour for an utterance in a female voice classified as the voice of indignation. The speaker can easily control her emotional state so that her message is clearly perceived by the listener. Each syllable that is lexically permissible is clearly stressed.

Figure 4: F0 contour for an expressive utterance (indignation): “I’ve got here such a drunkard, he’s maltreating me, I am going to trash him…” (Fmax=675Hz, Fmin =139 Hz, first part of the utterance), “But what can I do…” (Fmax=275Hz, Fmin =206 Hz, second part of the utterance).

As a result of the discourse, the final part of the recording (the beginning of which is marked by the cursor) has a different Fmin and pitch range width than its first part.

3.2.3 Same pitch position different compass

Figures 5a and 5b illustrate utterances of the same male speaker, in a neutral state and in anger respectively. Both utterances have a similar Fmin; however, their ranges of F0 fluctuation differ significantly.

Figure 5a: F0 contour for a neutral utterance: “Hi, I live on XXX street…” (Fmax=137Hz, Fmin =92Hz).


Figure 5b: F0 contour for an expressive utterance of indignation: “I hear some shouting and name-calling… him…” (Fmax=252Hz, Fmin =86 Hz).

4. Stress detection

The material was divided into four groups: G1: male – stress, G2: male – neutral/mild irritation, G3: female – stress, G4: female – neutral. Of the 32 MDVP features19 available for acoustical analysis, only 9 were used for Linear Discriminant Analysis (LDA): Average (F0), Highest (Fhi) and Lowest (Flo) Fundamental Frequency, Fundamental Frequency Variation (vF0) /%/, Jitter (Jitt), Amplitude Perturbation Quotient (sAPQ) /%/, Degree of Subharmonic Segments (DSH) /%/, Noise-to-Harmonic Ratio (NHR) and Degree of Voiceless (DUV) /%/.

The LDA of the nine parameters classified the four groups with an average accuracy of 80%; for two groups (neutral and stressed speech, males and females together) the accuracy was somewhat higher, at 84%. The results showed that extreme stress can be clearly identified using only the amplitude information together with the mean and minimum F0 values.
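To illustrate the principle behind the discriminant analysis, the following sketch fits a two-class Fisher linear discriminant on two features, mean F0 and minimum F0. It is a deliberately reduced stand-in for the analysis reported here, which used nine MDVP features and four groups; the training values are invented for illustration and are not data from the study, and the function names are hypothetical.

```python
from statistics import mean


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def centroid(rows):
    return [mean(r[0] for r in rows), mean(r[1] for r in rows)]


def scatter(rows, mu):
    """2x2 within-class scatter matrix around the class centroid."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_fisher_lda(class_a, class_b):
    """Return (weights, threshold); scores above threshold indicate class_a."""
    mu_a, mu_b = centroid(class_a), centroid(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    weights = [dot(inv[0], diff), dot(inv[1], diff)]
    # Equal priors assumed: cut halfway between the projected centroids.
    threshold = 0.5 * (dot(weights, mu_a) + dot(weights, mu_b))
    return weights, threshold


# Invented (mean F0, minimum F0) pairs in Hz, for illustration only.
stressed = [(310.0, 240.0), (295.0, 220.0), (340.0, 260.0), (280.0, 215.0)]
neutral = [(210.0, 167.0), (200.0, 150.0), (225.0, 175.0), (190.0, 160.0)]
weights, threshold = fit_fisher_lda(stressed, neutral)


def classify(sample):
    return "stressed" if dot(weights, sample) > threshold else "neutral"
```

A call such as `classify((320.0, 250.0))` then projects the new sample onto the discriminant axis and compares it with the threshold. A full replication would need all nine features, four classes and the original measurements.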

Figure 6 shows z-normalized Fmin (Flo) values for the 4 groups: G1, G2, G3, G4. The highest pitch position (Fmin) values are shown by groups G1 and G3 (speech under stress), whereas Fmin values for groups G2 and G4 are statistically substantially lower.

Figure 6: Z-normalized values Fmin for G1, G2, G3, G4.
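For readers unfamiliar with z-normalization, the sketch below shows the computation on a pair of groups. The paper does not state over which samples the mean and standard deviation were computed, so pooling the two male groups is an assumption made here, and the Flo values are invented for illustration.

```python
from statistics import mean, stdev


def z_normalize(groups):
    """Z-score each value against the mean and sample SD of all groups pooled."""
    pooled = [v for values in groups.values() for v in values]
    mu, sigma = mean(pooled), stdev(pooled)
    return {name: [(v - mu) / sigma for v in values]
            for name, values in groups.items()}


# Invented Flo values in Hz for the two male groups; not data from the study.
flo_hz = {"G1": [173.0, 160.0, 181.0], "G2": [115.0, 121.0, 108.0]}
z_scores = z_normalize(flo_hz)
group_means = {name: mean(values) for name, values in z_scores.items()}
```

After normalization the pooled scores have zero mean and unit standard deviation, so a group whose mean z-score is positive lies above the pooled average.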

Table 1 shows classification results. Utterances by male voices affected by stress obtained better results than those of female voices affected by stress.

Table 1: Classification matrix: rows – classification observed, columns – classification expected


An approach to characterizing vocal fold (VF) vibrations from high-speed digital imaging (HSDI) recordings using the Nyquist plot was detailed in Yan et al. (2005, 2007)22-23, while automatic and robust procedures for generating the glottal area waveform (GAW) from HSDI images were presented in Yan et al. (2006)24. The principles underlying this approach are summarized below and illustrated in Figure 7. The HSDI-derived GAW is normalized for all of our analyses to a range of 0 to 1, with 0 corresponding to complete closure and 1 corresponding to maximum opening. This operation allows for standardized dynamic measurements of VF vibration. The Nyquist plot and associated analyses are used to represent the instantaneous property of the VF vibration, rather than a time-averaged one. This property is revealed by the amplitude and phase of the complex analytic signal (e.g. in the form of a Nyquist plot) that we generate from the Hilbert transform of the GAW, as illustrated in Figure 7 (A, B, C). This operation is applied to as many as 200 glottal cycles taken from 4000 frames of a 2-second HSDI recording (i.e. at a 2000 Hz acquisition rate).

Figure 7: Concept of the Nyquist plot approach to characterize vocal fold vibrations.

A) A normalized GAW, representing 50 sequential frames (5 vibratory cycles) from a 2000 f/s HSDI recording; the open (0°) and closed (90°) glottal cycles are determined from automatic tracing of HSDI images (Yan et al., 2006b). B) One vibratory cycle is mapped onto the complex plane, where the magnitude-phase of the analytic signal is graphed; the complex analytic signal is constructed from the Hilbert transform of the GAW. C) Overlays of subsequent vibratory cycles generate a Nyquist plot – deviation of the points from the circle (scatter and shape distortion) reflects the effects of shimmer, jitter and nonlinearity.
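The normalization and analytic-signal steps described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure of Yan et al.: the GAW is synthetic (a pure cosine with 5 cycles in 50 frames, echoing panel A), the mean is removed before the transform so that the loop is centred on the origin, and a plain discrete Fourier transform is used to keep the example self-contained.

```python
import cmath
import math


def normalize(gaw):
    """Rescale so that 0 is complete closure and 1 is maximum opening."""
    lo, hi = min(gaw), max(gaw)
    return [(v - lo) / (hi - lo) for v in gaw]


def dft(x, inverse=False):
    """Direct O(n^2) discrete Fourier transform (adequate for short signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def analytic_signal(x):
    """x + j*Hilbert(x): keep DC and Nyquist, double positive frequencies,
    and zero the negative frequencies."""
    n = len(x)
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
    for k in range(1, (n + 1) // 2):
        gain[k] = 2.0
    spectrum = dft(x)
    return dft([g * s for g, s in zip(gain, spectrum)], inverse=True)


# Synthetic glottal area: 50 frames containing 5 identical cycles.
frames = 50
raw_area = [30.0 + 20.0 * math.cos(2 * math.pi * 5 * t / frames)
            for t in range(frames)]
gaw = normalize(raw_area)
gaw_mean = sum(gaw) / len(gaw)
centered = [v - gaw_mean for v in gaw]   # assumed step: centre the loop

z = analytic_signal(centered)
nyquist_points = [(c.real, c.imag) for c in z]   # points of the Nyquist plot
amplitude = [abs(c) for c in z]                  # instantaneous amplitude
phase = [cmath.phase(c) for c in z]              # instantaneous phase
```

For this perfectly periodic input every point lies on a circle of constant radius; cycle-to-cycle variation in a real GAW would scatter the points or distort the loop, which is the deviation the Nyquist plot is meant to reveal.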

Figures 8a and 8b show Nyquist plots for the vowels “a” and “i” in neutral speech (upper plots) and speech under stress (bottom plots). Figures 9a and 9b show Nyquist plots for the vowel “o” from two different contexts, both in neutral speech (upper plots) and speech under stress (bottom plots). All vowels were extracted from continuous speech.

Figures 8 and 9

The differences between the Nyquist plots for vowels in neutral speech (upper plots) and in speech under stress (bottom plots) are obvious and very distinctive. Overall, more structured Nyquist patterns (for vowels “a”, “i”, “o”) are observed in speech under stress than in neutral speech.


Despite restricting the study to 45 speakers, a clear tendency in acoustic characterization of speech under stress was observed.

The results of this study confirm the crucial role of the F0 parameter in investigating stress. Our results agree with the literature20 that points to Fmax as a particularly important parameter in the detection of emotional stress. However, this and our previous work21 showed that a shift in the F0 contour is also a crucial stress indicator; thus an increase in Fmax in stressed speech results from a shift in the F0 register. This holds specifically for vocalizations caused by fear. A systematic increase in the range of F0 variability was observed for stress related to anger and to irritation. The results also confirmed the need to include the shift of pitch position and the change in pitch register width in the segmentation of prosodic structures. We are now preparing to correlate these findings with visual (optical) observations of vocal fold activity using HSDI during the production of various emotional vocal components. This will, in our opinion, enable an improved explanation of the factors that influence pitch register changes in utterances that differ linguistically and in situational context.


This project is supported by The Polish Ministry of Sciences and Higher Education (project no O R00 0170 12) and in part by PVSF funding. We are grateful to Ms. Emma Marriott of Palo Alto, CA, for her editing of this text.


  1. Shipp, T., Izdebski, K., “Current evidence for the existence of laryngeal macrotremor and microtremor,” J. Forensic Sciences, 26, 501-505 (1981).
  2. Izdebski, K. (ed.), Emotions in the Human Voice, [Vol. 1-3], Plural Publishing, San Diego, CA (2008-2009).
  3. Eisenberg, A, “Software that listens for lies,” The New York Times, Sunday December 4, 2011.
  4. Hansen, J., et al., “The Impact of Speech Under `Stress’ on Military Speech Technology,” NATO report, http://www, (2007).
  5. Lefter, J., Rothkrantz, L., Leeuwen, D., Wiggers, P., “Automatic stress detection in emergency (telephone) calls,” International Journal of Intelligent Defence Support Systems 4(2), 148-168 (21) (2011).
  6. Vidrascu, L., Devillers, L., “Detection of real-life emotions in call centers,” Proc. of Interspeech, 1841–1844 (2005).
  7. Cowie, R., Cornelius, R.R., “Describing the emotional states that are expressed in speech,” Speech Communication, 40, 5-32 (2003).
  8. Alter, K., Rank, E., Kotz, S. A., Toepel, U., Besson, M., Schirmer, A., Friederici, A. D., “Affective encoding in the speech signal and in event-related brain potential,” Speech Communication, 40 (1–2), 61–70 (2003).
  9. Oudeyer, P.-Y., “The production and recognition of emotions in speech: features and algorithms,” Int. J. of Human-Computer Studies 59 (1–2), 157–183 (2003).
  10. Huber, R., Batliner, A., Buckow, J., Noth, E., Warnke, V., Niemann, H., “Recognition of emotion in a realistic dialogue scenario,” Proc. of the Int. Conf. on Spoken Language Processing, Beijing, China, 665-668 (2000).
  11. Ekman, P., “An argument for basic emotions,” Cognition and Emotion 6, 169-200 (1992).
  12. Scherer, K.R., “What are emotions? And how can they be measured?” Social Science Information 44 (4), 695–729 (2005).
  13. Batliner, A., Fischer, K., Huber, R., Spilker, J., Noth, E., “Desperately seeking emotions or: Actors, wizards, and human beings,” Speech Emotion-2000, 195-200 (2000).
  14. Izdebski, K., Yan, Y., “Preliminary observations of vocal fold vibratory cycle with HSDI as a function of emotional load,” In progress (ePhonscope, 2012).
  15. Fontaine, R.J., Scherer, K.R., Roesch, E.B., Ellsworth, P.C., “The World of Emotions is not Two-Dimensional,” Psychological Science 18 (12), 1050-1057 (2007).