Neuron 61(2), 317-329.

Temporal Coherence in the Perceptual Organization and Cortical Representation of Auditory Scenes



Mounya Elhilali et al. Neuron.


Just as the visual system parses complex scenes into identifiable objects, the auditory system must organize sound elements scattered in frequency and time into coherent "streams." Current neurocomputational theories of auditory streaming rely on the tonotopic organization of the auditory system to explain the observation that sequential, spectrally distant sound elements tend to form separate perceptual streams. Here, we show that spectral components that are well separated in frequency are no longer heard as separate streams if presented synchronously rather than consecutively. In contrast, responses from neurons in primary auditory cortex of ferrets show that both synchronous and asynchronous tone sequences produce comparably segregated responses along the tonotopic axis. The results argue against tonotopic separation per se as a neural correlate of stream segregation. Instead, we propose a computational model of stream segregation that can account for the data by using temporal coherence as the primary criterion for predicting stream formation.


Figure 1
Schematic spectrograms of stimuli used to study the perceptual formation of auditory streams. (A) The typical stimulus used in the vast majority of psychophysical and physiological studies of auditory streaming: a sequence of tones alternating between two frequencies, A and B. The percept evoked by such sequences depends primarily on the frequency separation between the A and B tones, ΔF, and on the inter-tone interval, ΔT: for small ΔFs and relatively long ΔTs, the percept is that of a single stream of tones alternating in pitch (ABAB); for large ΔFs and relatively short ΔTs, the percept is that of two separate streams of tones of constant pitch (A-A vs. B-B). (B) A variation on the traditional stimulus, used in this study: here, the A and B tones are synchronous rather than alternating. Such sequences usually evoke the percept of a single stream, regardless of ΔF and ΔT. (C) An alternating sequence of tones that partially overlap (40-ms onset asynchrony, or about 50% overlap). This sequence is usually heard in the same way as the non-overlapping tone sequence in (A).
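For concreteness, the three sequence types can be sketched as amplitude envelopes in which the B-tone onset lag is swept from a full tone duration (alternating) through a partial lag (overlapped) down to zero (synchronous). All durations and the 1-kHz envelope grid below are assumptions for illustration, not the paper's exact stimulus parameters:

```python
import numpy as np

def two_tone_sequence(n_reps=4, tone_ms=80, gap_ms=80, delta_t=1.0, fs=1000):
    """Amplitude envelopes of the A and B tone trains on a 1-kHz time grid.

    delta_t scales the B-tone onset lag in units of the tone duration:
    1.0 = fully alternating (B starts as A ends), 0.5 = roughly 50% overlap
    (as in panel C), 0.0 = synchronous. Durations are illustrative only.
    """
    tone = int(tone_ms * fs / 1000)          # tone duration in samples
    period = tone + int(gap_ms * fs / 1000)  # onset-to-onset period
    b_delay = int(round(delta_t * tone))     # B-tone onset lag
    total = n_reps * period + b_delay
    a = np.zeros(total)
    b = np.zeros(total)
    for k in range(n_reps):
        a[k * period : k * period + tone] = 1.0
        b[b_delay + k * period : b_delay + k * period + tone] = 1.0
    return a, b

a_alt, b_alt = two_tone_sequence(delta_t=1.0)  # alternating: no overlap
a_syn, b_syn = two_tone_sequence(delta_t=0.0)  # synchronous: full overlap
print(np.sum(a_alt * b_alt), np.sum(a_syn * b_syn))  # → 0.0 320.0
```

The product of the two envelopes counts samples of simultaneous A/B energy: zero in the alternating case, the full tone duration per repetition in the synchronous case.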
Figure 2
Thresholds for the detection of a temporal shift imposed on the last B tone in various types of stimulus sequences. The different symbols indicate different sequence types, which are represented schematically on the right. Polygonal symbols correspond to sequences of A and B tones, with the duration of the silent gap between consecutive A tones set to 30, 50, or 70 ms, as indicated in the legend. Note that since the duration of the silent gap between consecutive B tones (excluding the last two) was kept constant at 50 ms, the use of a 50-ms gap for the A tones yielded synchronous A and B tones with identical tempi; in contrast, when the gap between consecutive A tones was 30 or 70 ms, these tones were not synchronous with the B tones and had a different (slower or faster) tempo. Crosses indicate the results of a control condition in which the A tones were turned off and the listener's task was to indicate in which of the two presented sequences of B tones the last tone was shifted in time, creating a heterochrony. The numbers on the abscissa indicate the frequency separation between the A and B tones, in semitones. For the control condition in which only the B tones were present, this parameter determined the frequency of the B tones, which was set equal to that used in the corresponding conditions where the A tones were also present.
Figure 3
Schematic of the tone frequencies and conditions used in the physiological experiments. Both alternating and synchronous tone sequences were tested in all conditions. (A) Experiment I: The two tone frequencies were held fixed at one of three separations (ΔF = 0.25, 0.5, or 1 octave) and were then shifted together through five equally spaced positions relative to the BF of the isolated cell. (B) Experiment II: Tone-A is fixed at the BF of the isolated unit, while tone-B is shifted closer to BF in several steps.
Figure 4
Responses of single units to alternating (non-overlapping and partially-overlapping) and synchronous two-tone sequences at three different separations (ΔF = 0.25, 0.5, 1 octaves). The two tones were shifted relative to the BF of the cell in five equal steps, from tone-B at BF (position 1) to tone-A at BF (position 5), as described in the Experiment I paradigm. (A) Average firing rates from a total of 122 single units at the five frequency positions in the synchronous and non-overlapping modes. Overlapping tones were tested in only 75/122 units. Responses in all presentation modes exhibited a significant dip when the tones were farther apart (0.5 and 1 octaves) and neither was at BF (shaded positions 2–4). (B) The percentage of cells that exhibited a significant dip in their responses was similar in the two extreme presentation modes (synchronous and non-overlapping alternating). Only the 66 single units that were tested at all five positions were included in this analysis (since responses from all positions are necessary to compile such histograms). The magnitude of the dip differed significantly across ΔF but not across presentation modes.
Figure 5
Averaged responses from a total of 64 units tested with alternating, synchronous, and overlapping sequences (overlapping tested in only 41/64 units) using the paradigm of Experiment II. (A) The tuning near the BF, averaged over all units. The average iso-intensity response curve is shown in black for comparison. To increase the number of cells included in the average, we folded the responses from above and below BF, but included only units that were tested with the entire 2-octave range from BF. All presentation modes show some suppression of responses as tone-A approaches the BF (1 to 1.5 octaves away), and a significant increase closer to BF (about 1 octave; marked by the asterisks). (B) The histogram of the difference in the bandwidth of interactions between the tones during the two extreme presentation modes (synchronous and alternating) is roughly symmetric, indicating no systematic bias in the scatter.
Figure 6. Schematic of the Coherence Analysis Model
(A) The model takes as input a time-frequency spectrographic representation of sound. Each channel yi(t) is then processed through a temporal integration stage, implemented via a bank of filters (Ψ) operating at different time constants. Finally, the output of each rate analysis is correlated across channels, yielding a coherence matrix that evolves over time. (B) Stimuli consisting of an alternating (left) and a synchronous (right) tone sequence are generated with the two tones located at channels 1 and 5 of a 5-channel spectrogram. The correlation matrices corresponding to these two sequences are computed and averaged over time (rightmost panels).
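A minimal sketch of this pipeline, assuming a single leaky integrator in place of the filter bank Ψ and a simple non-centered (coincidence-style) zero-lag correlation; all durations and time constants below are illustrative assumptions, not the model's actual parameters:

```python
import numpy as np

def coherence_matrix(channels, tau=50.0):
    """Zero-lag correlation between temporally smoothed channel envelopes.

    `channels` is an (n_channels, n_samples) spectrogram-like array. A single
    leaky integrator with time constant `tau` (in samples) stands in for the
    bank of temporal filters used in the actual model.
    """
    n, T = channels.shape
    smoothed = np.zeros((n, T))
    y = np.zeros(n)
    for t in range(T):                      # leaky temporal integration
        y += (channels[:, t] - y) / tau
        smoothed[:, t] = y
    z = smoothed / (np.linalg.norm(smoothed, axis=1, keepdims=True) + 1e-12)
    return z @ z.T                          # n x n coherence matrix

# Two tones in channels 1 and 5 of a 5-channel spectrogram, as in panel (B)
T, period, tone = 2000, 200, 80
alt = np.zeros((5, T))
syn = np.zeros((5, T))
for k in range(0, T - period, period):
    alt[0, k:k + tone] = 1.0               # tone A
    alt[4, k + tone:k + 2 * tone] = 1.0    # tone B follows A: alternating
    syn[0, k:k + tone] = 1.0
    syn[4, k:k + tone] = 1.0               # tone B with A: synchronous

C_alt, C_syn = coherence_matrix(alt), coherence_matrix(syn)
print(C_alt[0, 4] < C_syn[0, 4])  # → True: synchronous channels cohere more
```

The synchronous pair drives the cross-channel coherence entry toward 1 (a rank-1 block), while the alternating pair yields a markedly lower value; that contrast is the distinction the model exploits.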
Figure 7. Coherence Analysis from the Neural Population
The neural responses of N = 66 neurons are averaged for each tone configuration (alternating (A) and synchronous (B) tones) and each frequency separation (ΔF = 0.25, 0.5, and 1 octaves). For each condition, a coherence matrix is derived for each neuron and averaged across the population. The final population coherence matrix has a resolution of 5×5 (5 stimulus positions along the spectral axis); for display purposes only, each matrix is interpolated to 500 × 500 points using MATLAB® (The MathWorks Inc., Massachusetts, USA). The singular value decomposition of each matrix (from left to right) yields the values (0.97, 0.14, 0.11, 0.10, 0.10), (0.97, 0.15, 0.12, 0.12, 0.10), (0.93, 0.25, 0.21, 0.15, 0.13) for the alternating sequences and (0.92, 0.30, 0.17, 0.15, 0.13), (0.78, 0.55, 0.19, 0.16, 0.15), (0.78, 0.52, 0.23, 0.21, 0.17) for the synchronous sequences. The noise floor is estimated at about 0.45.
Figure 8. Simulation of Two-Tone Sequences with Varying Asynchrony
(A) A sequence of two alternating tones is presented to the model. The coherence analysis and singular value decomposition of the matrix C reveal a rank-2 matrix, as indicated by the two nonzero singular values (lower panel). (B) Ratio of the second to the first singular value (λ2/λ1) as ΔT is changed from 100% (alternating) to 0% (synchronous). (C) A sequence of two synchronous tones is presented to the model. The coherence analysis and singular value decomposition of the matrix C reveal a rank-1 matrix, as indicated by the single nonzero singular value (lower panel).
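Under the same simplifying assumptions as before (a single leaky integrator in place of the model's filter bank, illustrative durations and time constants), the ΔT sweep in (B) can be sketched as:

```python
import numpy as np

def lambda_ratio(delta_t, n_ch=5, T=2000, period=200, tone=80, tau=50.0):
    """lambda2/lambda1 for the coherence matrix of a two-tone sequence.

    The tone-B onset lags tone-A by delta_t * tone samples: delta_t = 1.0
    is fully alternating, 0.0 is synchronous. A leaky integrator stands in
    for the model's temporal filter bank; all constants are illustrative.
    """
    x = np.zeros((n_ch, T))
    lag = int(round(delta_t * tone))
    for k in range(0, T - period, period):
        x[0, k:k + tone] = 1.0               # tone A in channel 1
        x[4, k + lag:k + lag + tone] = 1.0   # tone B in channel 5, delayed
    s = np.zeros_like(x)
    y = np.zeros(n_ch)
    for t in range(T):                        # leaky temporal integration
        y += (x[:, t] - y) / tau
        s[:, t] = y
    z = s / (np.linalg.norm(s, axis=1, keepdims=True) + 1e-12)
    svals = np.linalg.svd(z @ z.T, compute_uv=False)
    return svals[1] / svals[0]

# The ratio falls toward 0 (rank 1) as the tones become synchronous
print(lambda_ratio(1.0) > lambda_ratio(0.5) > lambda_ratio(0.0))  # → True
```

As the onset lag shrinks, the two channel envelopes become increasingly correlated, the coherence matrix approaches rank 1, and λ2/λ1 collapses toward zero, mirroring the trend in panel (B).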
