Deep-learning models reveal how context and listener attention shape electrophysiological correlates of speech-to-language transformation

PLoS Comput Biol. 2024 Nov 11;20(11):e1012537. doi: 10.1371/journal.pcbi.1012537. eCollection 2024 Nov.

Abstract

To transform continuous speech into words, the human brain must resolve variability across utterances in intonation, speech rate, volume, accents and so on. A promising approach to explaining this process has been to model electroencephalogram (EEG) recordings of brain responses to speech. Contemporary models typically invoke context-invariant speech categories (e.g., phonemes) as an intermediary representational stage between sounds and words. However, such models may not capture the complete picture because they do not model the brain mechanism that categorizes sounds and consequently may overlook associated neural representations. By providing end-to-end accounts of speech-to-text transformation, new deep-learning systems could enable more complete brain models. We model EEG recordings of audiobook comprehension with the deep-learning speech recognition system Whisper. We find that (1) Whisper provides a self-contained EEG model of an intermediary representational stage that reflects elements of prelexical and lexical representation and prediction; (2) EEG modeling is more accurate when informed by 5-10 s of speech context, which traditional context-invariant categorical models do not encode; (3) deep Whisper layers encoding linguistic structure were more accurate EEG models of selectively attended speech in two-speaker "cocktail party" listening conditions than early layers encoding acoustics. No such layer-depth advantage was observed for unattended speech, consistent with a more superficial level of linguistic processing of unattended speech in the brain.
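The modeling approach the abstract describes (relating a deep network's layer activations to EEG and comparing prediction accuracy across layers) is commonly implemented as a regularized linear mapping from model features to the neural signal. Below is a minimal illustrative sketch of that general technique using ridge regression; the feature matrix and EEG channel here are synthetic stand-ins (random data), not Whisper activations or real recordings, and the variable names (`X`, `eeg`, `ridge_fit`) are my own, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: in the real analysis, X would hold a Whisper layer's
# activations (time samples x feature dimensions, aligned to the audio) and
# eeg would be a preprocessed EEG channel time series.
n_samples, n_features = 1000, 64
X = rng.standard_normal((n_samples, n_features))
true_w = rng.standard_normal(n_features)
eeg = X @ true_w + rng.standard_normal(n_samples)  # noisy linear response

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X'X + alpha*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Fit on the first half of the data, then score held-out prediction
# accuracy as the Pearson correlation between predicted and actual EEG.
half = n_samples // 2
w = ridge_fit(X[:half], eeg[:half])
pred = X[half:] @ w
r = np.corrcoef(pred, eeg[half:])[0, 1]
```

In a layer-depth comparison like the one reported, this fit-and-score step would be repeated with features from each Whisper layer, and the held-out correlations compared across layers and listening conditions.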

MeSH terms

  • Adult
  • Attention* / physiology
  • Brain / physiology
  • Computational Biology
  • Deep Learning*
  • Electroencephalography* / methods
  • Female
  • Humans
  • Language*
  • Male
  • Models, Neurological
  • Speech / physiology
  • Speech Perception* / physiology
  • Young Adult

Grants and funding

AJA and ECL were supported by a Del Monte Institute Pilot Project Program grant. This research was funded in part by the Advancing a Healthier Wisconsin Endowment (AJA). AJA received salary from the Advancing a Healthier Wisconsin Endowment. ECL was supported by National Science Foundation CAREER award 1652127. AJA and ECL received salary from National Science Foundation award 1652127. CD was supported by an Australian Research Council DISCOVERY award DP200102188. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.