Causal inference of asynchronous audiovisual speech

John F Magnotti et al. Front Psychol. 2013 Nov 13;4:798. doi: 10.3389/fpsyg.2013.00798. eCollection 2013.
Abstract

During speech perception, humans integrate auditory information from the voice with visual information from the face. This multisensory integration increases perceptual precision, but only if the two cues come from the same talker; this requirement has been largely ignored by current models of speech perception. We describe a generative model of multisensory speech perception that includes this critical step of determining the likelihood that the voice and face information have a common cause. A key feature of the model is that it is based on a principled analysis of how an observer should solve this causal inference problem using the asynchrony between the two cues and the reliability of the cues. This allows the model to make predictions about the behavior of subjects performing a synchrony judgment task, predictive power that other approaches, such as post-hoc fitting of Gaussian curves to behavioral data, do not provide. We tested the model predictions against the performance of 37 subjects performing a synchrony judgment task while viewing audiovisual speech under a variety of manipulations, including varying asynchronies, intelligibility, and visual cue reliability. The causal inference model outperformed the Gaussian model across two experiments, providing a better fit to the behavioral data with fewer parameters. Because the causal inference model is derived from a principled understanding of the task, its parameters are directly interpretable in terms of stimulus and subject properties.

Keywords: Bayesian observer; causal inference; multisensory integration; speech perception; synchrony judgments.
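
As a rough illustration of the causal inference step described in the abstract, the sketch below computes the posterior probability that a measured audiovisual asynchrony arose from a common cause, assuming Gaussian physical-asynchrony distributions for the one-talker (C = 1) and two-talker (C = 2) cases plus additive Gaussian sensory noise. The function name and all parameter values are illustrative placeholders, not the fitted values reported in the paper.

    import numpy as np
    from scipy.stats import norm

    def p_common_cause(x, mu1=70.0, sigma1=30.0, sigma2=250.0,
                       sigma_noise=60.0, prior_c1=0.5):
        """Posterior probability that a measured asynchrony x (ms) came from
        a single talker (C = 1) rather than two independent sources (C = 2).

        Physical asynchrony: narrow Gaussian around a small visual lead for
        C = 1, broad zero-centered Gaussian for C = 2; sensory noise adds in
        quadrature to each. All numbers are illustrative, not fitted values.
        """
        like_c1 = norm.pdf(x, loc=mu1, scale=np.hypot(sigma1, sigma_noise))
        like_c2 = norm.pdf(x, loc=0.0, scale=np.hypot(sigma2, sigma_noise))
        post_c1 = prior_c1 * like_c1
        post_c2 = (1.0 - prior_c1) * like_c2
        return post_c1 / (post_c1 + post_c2)

    # Report "synchronous" when a single cause is the more probable explanation.
    for x in (-300, 0, 100, 400):
        print(x, "sync" if p_common_cause(x) > 0.5 else "async")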


Figures

Figure 1
Causal structure of audiovisual speech. (A) Causal diagram for audiovisual speech emanating from a single talker (C = 1) or two talkers (C = 2). (B) Difference between auditory and visual speech onsets showing a narrow distribution for C = 1 (navy) and a broad distribution for C = 2 (gray). The first x-axis shows the onset difference in the reference frame of physical asynchrony. The second x-axis shows the onset difference in the reference frame of the stimulus (audio/visual offset created by shifting the auditory speech relative to the visual speech). A recording of natural speech without any manipulation corresponds to zero offset in the stimulus reference frame and a positive offset in the physical asynchrony reference frame because visual mouth opening precedes auditory voice onset. (C) For any given physical asynchrony (Δ) there is a distribution of measured asynchronies (with standard deviation σ) because of sensory noise. (D) Combining the likelihood of each physical asynchrony (B) with sensory noise (C) allows calculation of the measured asynchrony distributions across all physical asynchronies. Between the dashed lines, the likelihood of C = 1 is greater than the likelihood of C = 2.
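
A minimal numerical sketch of the likelihood comparison in panels (B–D): convolve each physical-asynchrony distribution with the sensory noise and find the interval of measured asynchronies (the region between the dashed lines) where C = 1 is the more likely cause. All numeric values are placeholders, not the paper's estimates.

    import numpy as np
    from scipy.stats import norm

    mu1, sigma1 = 70.0, 30.0    # narrow C = 1 distribution (visual leads audio)
    mu2, sigma2 = 0.0, 250.0    # broad C = 2 distribution
    sigma_noise = 60.0          # sensory noise on the measured asynchrony

    grid = np.linspace(-1000, 1000, 20001)   # candidate measured asynchronies (ms)
    like_c1 = norm.pdf(grid, mu1, np.hypot(sigma1, sigma_noise))
    like_c2 = norm.pdf(grid, mu2, np.hypot(sigma2, sigma_noise))

    # The region between the "dashed lines" of panel (D).
    favored = grid[like_c1 > like_c2]
    print(f"C = 1 favored between {favored[0]:.0f} and {favored[-1]:.0f} ms")
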
Figure 2
Synchrony judgments under the causal inference model. (A) On each trial, observers obtain a measurement of the audiovisual asynchrony by differencing measurements of the auditory (magenta) and visual (green) onsets. Because of sensory noise, the measured asynchrony (x, blue) differs from the physical asynchrony (Δ, purple). (B) For a given physical asynchrony, Δ = 100 ms, there is a range of possible measured asynchronies (x, blue). The shaded region indicates values of x for which C = 1 is more probable than C = 2 (Figure 1D). The area of the shaded region is the probability of a synchronous percept, p(Sync). (C) For a different physical asynchrony, Δ = −100 ms, there is a different distribution of measured asynchronies, with a lower probability of a synchronous percept. (D) The probability of a synchronous percept for different physical asynchronies. Purple markers show the predictions for Δ = 100 ms and Δ = −100 ms.
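
The probability of a synchronous percept in panels (B–D) is the mass of the measured-asynchrony distribution that falls inside the interval where C = 1 is more probable, which reduces to a difference of normal CDFs. A sketch, with placeholder values for the interval bounds and sensory noise:

    from scipy.stats import norm

    def p_sync(delta, lo=-120.0, hi=260.0, sigma_noise=60.0):
        """Probability of a synchronous percept for physical asynchrony delta (ms).

        The measured asynchrony is Gaussian around delta with sd sigma_noise;
        the percept is synchronous when it falls inside [lo, hi], the interval
        where C = 1 is more probable than C = 2 (Figure 1D). Bounds here are
        placeholders, not values estimated from the data.
        """
        return (norm.cdf(hi, loc=delta, scale=sigma_noise)
                - norm.cdf(lo, loc=delta, scale=sigma_noise))

    for delta in (-100.0, 100.0):   # the example asynchronies in panels (B) and (C)
        print(f"delta = {delta:+.0f} ms -> p(Sync) = {p_sync(delta):.2f}")
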
Figure 3
Model fits to behavioral data for experiment 1 (high visual intelligibility words). (A) Black circles show the behavioral data from 16 subjects performing a synchrony judgment task (mean ± standard error) for each stimulus asynchrony with visual reliable, high visual intelligibility stimuli. Curves show the model predictions for the CIMS model (orange) and Gaussian model (blue). (B) Fit error measured with Bayesian Information Criterion (BIC) for the CIMS and Gaussian models; lower values indicate a better fit for the CIMS model (**p = 0.006). Error bars show within-subject standard error (Loftus and Masson, 1994). (C) Mean proportion of synchrony responses and model predictions for visual blurred, high visual intelligibility stimuli. (D) Fit error for the CIMS and Gaussian models, showing better fit for the CIMS model (**p = 0.007).
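
One way the BIC comparison in panels (B) and (D) could be scored, assuming each model outputs a predicted probability of a "synchronous" response at every tested asynchrony and the response counts are binomial. The counts, predicted probabilities, and parameter counts below are made-up illustrations; the constant binomial coefficient is omitted because it cancels in the comparison.

    import numpy as np

    def bic_binomial(p_pred, k_sync, n_trials, n_params):
        """BIC from binomial response counts and a model's predicted
        probability of a 'synchronous' response at each asynchrony."""
        p = np.clip(np.asarray(p_pred, dtype=float), 1e-9, 1 - 1e-9)
        k = np.asarray(k_sync, dtype=float)
        n = np.asarray(n_trials, dtype=float)
        log_lik = np.sum(k * np.log(p) + (n - k) * np.log(1.0 - p))
        return n_params * np.log(n.sum()) - 2.0 * log_lik

    # Hypothetical counts of "synchronous" responses out of 20 trials per asynchrony.
    k_sync   = np.array([1, 4, 14, 19, 18, 12, 3])
    n_trials = np.full(7, 20)
    p_cims   = np.array([0.05, 0.22, 0.70, 0.95, 0.90, 0.58, 0.15])  # illustrative
    p_gauss  = np.array([0.08, 0.30, 0.60, 0.85, 0.85, 0.60, 0.30])  # illustrative

    print("BIC, CIMS:     %.1f" % bic_binomial(p_cims, k_sync, n_trials, n_params=2))
    print("BIC, Gaussian: %.1f" % bic_binomial(p_gauss, k_sync, n_trials, n_params=3))
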
Figure 4
Model fits to behavioral data for experiment 1 (low visual intelligibility words). (A) Black circles show the behavioral data with visual reliable, low visual intelligibility stimuli; curves show model predictions for the CIMS (orange) and Gaussian (blue) models. (B) Fit error showing significantly better fit for the CIMS model (**p < 0.001). (C) Mean proportion of synchrony responses and model predictions for visual blurred, low visual intelligibility stimuli. (D) Fit error showing significantly better fit for the CIMS model (**p = 0.001).
Figure 5
Model comparison across experiments. (A) Total fit error (Bayesian Information Criterion; BIC) across conditions averaged over the 16 subjects in experiment 1 showing better fit for the CIMS model (**p < 0.001). Error bars are within-subject standard error (Loftus and Masson, 1994). (B) Difference in fit error (BIC) for each individual subject (across all conditions). (C) Total fit error across conditions averaged over the 21 subjects in experiment 2 showing better fit for the CIMS model (**p < 10⁻¹⁵). (D) Difference in fit error for each individual subject (across all conditions).
Figure 6
Model estimates of sensory noise across stimuli in experiment 1. (A) Correlation between CIMS model sensory noise (σ) estimates for visual blurred words with high visual intelligibility (high VI) and low VI. Each symbol represents one subject; the dashed line indicates equal sensory noise between the two conditions. There was a strong positive correlation (r = 0.92, p < 10⁻⁶). (B) Correlation between sensory noise estimates for visual reliable words with high and low VI (r = 0.95, p < 10⁻⁷). (C) Mean sensory noise across subjects [± within-subject standard error of the mean (Loftus and Masson, 1994)] for visual blurred words (green line) and visual reliable words (purple line) with low VI (left) or high VI (right).
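
The across-condition relationship in panels (A) and (B) is a standard Pearson correlation over per-subject sensory-noise estimates. A short sketch with made-up σ values (the paper's analysis uses the fitted estimates from the 16 subjects of experiment 1):

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical per-subject sigma estimates (ms) in the two intelligibility conditions.
    sigma_high_vi = np.array([45.0, 60.0, 52.0, 80.0, 95.0, 70.0, 55.0, 66.0])
    sigma_low_vi  = np.array([48.0, 63.0, 58.0, 85.0, 90.0, 75.0, 50.0, 70.0])

    r, p = pearsonr(sigma_high_vi, sigma_low_vi)
    print(f"r = {r:.2f}, p = {p:.3g}")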

