Proc Natl Acad Sci U S A. 2018 Apr 3;115(14):E3313-E3322.
doi: 10.1073/pnas.1801614115. Epub 2018 Mar 21.

Schema learning for the cocktail party problem

Kevin J P Woods et al. Proc Natl Acad Sci U S A.

Abstract

The cocktail party problem requires listeners to infer individual sound sources from mixtures of sound. The problem can be solved only by leveraging regularities in natural sound sources, but little is known about how such regularities are internalized. We explored whether listeners learn source "schemas" (the abstract structure shared by different occurrences of the same type of sound source) and use them to infer sources from mixtures. We measured the ability of listeners to segregate mixtures of time-varying sources. In each experiment a subset of trials contained schema-based sources generated from a common template by transformations (transposition and time dilation) that introduced acoustic variation but preserved abstract structure. Across several tasks and classes of sound sources, schema-based sources consistently aided source separation, in some cases producing rapid improvements in performance over the first few exposures to a schema. Learning persisted across blocks that did not contain the learned schema, and listeners were able to learn and use multiple schemas simultaneously. No learning was evident when schemas were presented in the task-irrelevant (i.e., distractor) source. However, learning from task-relevant stimuli showed signs of being implicit, in that listeners were no more likely to report that sources recurred in experiments containing schema-based sources than in control experiments containing no schema-based sources. The results implicate a mechanism for rapidly internalizing abstract sound structure, facilitating accurate perceptual organization of sound sources that recur in the environment.
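
As a rough illustration of the two transformations named above, the following Python sketch derives a variant of a melody from a common template by transposition (a constant pitch shift in semitones) and time dilation (a constant scaling of note durations). The note representation, the function names, and the example values are assumptions made for illustration; they are not taken from the study's stimulus-generation code.

```python
from typing import List, Tuple

# Hypothetical note representation: (pitch in semitones, duration in seconds).
Note = Tuple[float, float]


def transform_schema(schema: List[Note], transpose_st: float, dilation: float) -> List[Note]:
    """Return a variant of `schema` shifted in pitch and stretched in time."""
    if dilation <= 0:
        raise ValueError("dilation must be positive")
    return [(pitch + transpose_st, dur * dilation) for pitch, dur in schema]


def intervals(melody: List[Note]) -> List[float]:
    """Pitch steps between successive notes; unchanged by transposition."""
    return [b[0] - a[0] for a, b in zip(melody, melody[1:])]


def duration_ratios(melody: List[Note]) -> List[float]:
    """Ratios of successive note durations; unchanged by time dilation."""
    return [b[1] / a[1] for a, b in zip(melody, melody[1:])]


# Illustrative template and one variant (values are made up).
schema = [(0.0, 0.25), (4.0, 0.25), (7.0, 0.5), (5.0, 0.25)]
variant = transform_schema(schema, transpose_st=3.0, dilation=1.5)
```

The variant differs from the template in absolute pitch and duration, but its interval pattern and relative rhythm are the same, which is one simple way to read "acoustic variation but preserved abstract structure."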

Keywords: auditory scene analysis; implicit learning; perceptual learning; statistical learning.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Schema learning in melody segregation (paradigm 1). (A) Schematic of the trial structure (Upper) and a spectrogram of a sample stimulus (Lower). A target melody (green line segments) was presented concurrently with two distractor notes (red line segments), followed by a probe melody (green line segments). Listeners judged whether the probe melody matched the target melody in the mixture. The probe melody was transposed up or down in pitch by a random amount. (B) Schematic of the basic experiment structure. On every other trial the target melody was generated from a common schema. On schema-based trials, the melody in the mixture was drawn from the schema 50% of the time, while the probe was always drawn from the schema. (C) Results of experiment 1: recognition of melodies amid distractor tones with and without schemas (n = 160). Error bars throughout this figure denote the SEM. (D) Results of experiment 2: effect of an intervening trial block on learned schema (n = 192). Listeners were exposed to a schema, then completed a block without the schema, and then completed two additional blocks, one containing the original schema and one containing a new schema. The order of the two blocks was counterbalanced across participants. (Lower) The two rows of the schematic depict the two possible block orders. (Upper) The data plotted are from the last two blocks. (E) Results of experiment 3: effect of multiple interleaved schemas (n = 88). Results are plotted separately for the two schemas used for each participant, resulting in 25 and 50 trials per bin for the schema and non-schema conditions, respectively. (F) Spectrogram of a sample stimulus from experiment 4. Stimulus and task were analogous to those of experiment 1, except that noise bursts were used instead of tones. (G) Results of experiment 4: recognition of noise-burst sequences amid distractor bursts, with and without schemas (n = 68).
Fig. 2.
Schema learning in attentive tracking of synthetic voices (paradigm 2). (A) Schematic of the trial structure (Upper) and spectrogram of an example stimulus (Lower). A target voice (green curve) was presented concurrently with a distractor voice (red curve). Both voices varied smoothly but stochastically over time in three feature dimensions: f0, F1, and F2 (the fundamental frequency and first two formants; for clarity the schematic only shows variation in a single dimension). Voices in a mixture were constrained to cross at least once in each dimension. Listeners were cued beforehand with the initial portion of the target voice. Following the mixture, listeners were presented with a probe stimulus that was the ending portion of one of the voices and judged whether this probe came from the target. (B) Schematic of the experiment structure. On every other trial the target voice was generated from a common schema. Voices are depicted in three dimensions. f0, F1, and F2 are plotted in semitones relative to 200, 500, and 1,500 Hz, respectively. (C) Results of experiment 5: effect of schemas on attentive tracking (n = 86). The Inset denotes results with trials binned into 42 trials per condition to maximize power for an interaction test (reported in text). Error bars throughout this figure denote the SEM. (D) Results of experiment 6: a control experiment to ensure listeners could not perform the task with cues and probes alone (n = 146). In the last one-third of trials, the voice mixture was replaced with noise. (E) Schema learning on a finer time scale (n = 402). Data from the first 56 trials of experiments 5 and 6 were combined with new data from experiment 7 and replotted with seven trials per bin. The finer binning reveals similar performance at the experiment’s outset, as expected. n.s., not significant. *P < 0.05.
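
The caption expresses f0, F1, and F2 in semitones relative to 200, 500, and 1,500 Hz. Under the standard definition of a semitone as a frequency ratio of 2^(1/12), a value of s semitones corresponds to f_ref * 2^(s/12) Hz. The Python sketch below shows that conversion; only the reference frequencies come from the caption, and the function and dictionary names are hypothetical.

```python
import math

# Reference frequencies from the caption (Hz); the dictionary name is an assumption.
REFERENCE_HZ = {"f0": 200.0, "F1": 500.0, "F2": 1500.0}


def semitones_to_hz(semitones: float, feature: str) -> float:
    """Convert semitones relative to the feature's reference into Hz."""
    return REFERENCE_HZ[feature] * 2.0 ** (semitones / 12.0)


def hz_to_semitones(hz: float, feature: str) -> float:
    """Convert a frequency in Hz into semitones relative to the feature's reference."""
    return 12.0 * math.log2(hz / REFERENCE_HZ[feature])
```

For example, 12 semitones above the f0 reference is one octave up, so `semitones_to_hz(12, "f0")` gives 400 Hz.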
Fig. 3.
Evidence for implicit learning. Following experiment 5 and separate control experiments, participants were asked if they had noticed a recurring structure in the cued voice. In experiment 5, schema-based sources occurred on every other trial. For comparison, control experiment 1 contained no schema-based sources, while control experiment 2 contained schema-based sources on every trial. Error bars denote the SEM, derived from bootstrap.
Fig. 4.
Dependence of schema benefit on multiple dimensions and similarity to the schema. (A) Schematic of the structure of experiment 8. On each trial, listeners were cued to a target voice, heard a target-distractor voice pair, and judged if a subsequent probe was from the end of the target or the distractor (paradigm 2). Schema-based trials alternated with non–schema-based trials, but the formant trajectories on schema-based trials were randomized halfway through the experiment. (B) Results of experiment 8: effect of multiple dimensions on schema learning (n = 86). The Inset denotes results with trials binned into 42 trials per condition to maximize power. Error bars throughout this figure denote the SEM. (C) Results of subdividing non-schema trials from experiment 9 (n = 146). Performance was computed separately for non-schema trials whose feature trajectories were most and least correlated with those of the schema. (D) Results of subdividing schema trials from experiment 9. Performance was computed separately for schema trials in the middle and extremes of the dilation/transposition range.
Fig. 5.
Schema learning in the segregation of resynthesized speech utterances (paradigm 3). (A) Schematic of the trial structure (Upper) and a spectrogram of an example stimulus from experiments 10 and 11 (Lower). A target utterance (green curve) was presented concurrently with a distractor utterance (red curve) and was followed by a probe utterance (second green curve). The utterances were synthesized from the pitch and formant contours of speech excerpts. For clarity the schematic only shows variation in a single dimension. Because only the first two formants were used, and because unvoiced speech segments were replaced with silence, the utterances were unintelligible. Listeners judged whether the probe utterance had also appeared in the mixture. When this was the case, the probe utterance was transposed in pitch and formants from the target utterance in the mixture and was time-dilated or compressed. (B) Schematic of the structure of experiment 10. On every other trial the target utterance was generated from a common schema. Utterances are depicted in three dimensions. (C) Results of experiment 10: effect of schemas on the segregation of speech-like utterances (n = 89). Error bars here and in E denote the SEM. (D) Schematic of structure of experiment 11. On every other trial, the distractor utterance was generated from a common schema. (E) Results of experiment 11: effect of schema-based distractors (n = 202).
