, 34 (3), 114-23

Temporal Coherence and Attention in Auditory Scene Analysis


Shihab A Shamma et al. Trends Neurosci.


Humans and other animals can attend to one of multiple sounds and follow it selectively over time. The neural underpinnings of this perceptual feat remain mysterious. Some studies have concluded that sounds are heard as separate streams when they activate well-separated populations of central auditory neurons, and that this process is largely pre-attentive. Here, we argue instead that stream formation depends primarily on temporal coherence between responses that encode various features of a sound source. Furthermore, we postulate that only when attention is directed towards a particular feature (e.g. pitch) do all other temporally coherent features of that source (e.g. timbre and location) become bound together as a stream that is segregated from the incoherent features of other sources.


Figure 1
A spectrogram of a complex scene with multiple objects. The figure shows a time-frequency analysis of an acoustic recording of a scene consisting of a flute, a human voice and a hammer. The hammer hits are immediately visible as repetitive and transient broadband strips of energy spanning all frequencies. Both the flute and the human voice contain a rich harmonic structure that changes over time. The human voice reveals clear pitch variations and formant transitions, shown as time-course changes in both the pitch and formant locations. Note that the flute and speech give rise to clearly distinct acoustic events that are uncorrelated in time.
Figure 2
Schematic of the proposed model of auditory stream formation. From left to right: Multiple sound sources constitute an auditory scene, which is initially analyzed through a feature-analysis stage. This stage consists of a cochlear frequency analysis followed by arrays of feature-selective neurons that create a multi-dimensional representation along different feature axes. The figure depicts timbre, pitch and spatial location channels. Note that for computational convenience and illustration purposes, these feature maps are shown with ordered axes when in fact such orderly representations are neither known nor essential for the model. The outcome of this analysis is a rich set of cortical responses that explicitly represent the different sound features, as well as their timing relationships. The second stage of the model performs coherence analysis by correlating the temporal outputs of the different feature-selective neurons and arranging them based on their degree of coherence, hence giving rise to distinct perceptual streams. Complementing this feed-forward bottom-up view are top-down processes of selective attention that operate by modulating the selectivity of cortical neurons. This feature-based selective attention translates into object-based attentional mechanisms by virtue of the fact that selected features are coherent with other features that are part of the same stream.
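The second stage and the attentional binding described in this caption can be illustrated with a toy sketch. It is not the authors' model: the channel layout, the random on/off envelopes, the noise level, the 0.5 threshold and the function name `bind_to_attended` are all assumptions made for illustration. Each row stands for one feature-selective channel, and attending to one channel binds every channel whose response is temporally coherent with it.

```python
import numpy as np

def bind_to_attended(responses, attended, threshold=0.5):
    """Return indices of channels whose zero-lag correlation with the
    attended channel exceeds `threshold` (a hypothetical binding rule)."""
    corr = np.corrcoef(responses)  # channel-by-channel coherence matrix
    return [i for i in range(responses.shape[0]) if corr[attended, i] > threshold]

rng = np.random.default_rng(0)
n = 1000

# Assumed temporal envelopes of two independent sources (random on/off patterns).
source_a = (rng.random(n) < 0.5).astype(float)
source_b = (rng.random(n) < 0.5).astype(float)

# Hypothetical channels: 0 = pitch A, 1 = timbre A, 2 = location A,
# 3 = pitch B, 4 = timbre B. Each follows its source's envelope plus noise.
responses = np.stack([source_a, source_a, source_a, source_b, source_b])
responses = responses + rng.normal(0.0, 0.1, size=responses.shape)

# Attending to the pitch channel of source A binds the channels coherent with it.
stream = bind_to_attended(responses, attended=0)
print(stream)
```

In this sketch, selecting a single feature is enough to recover the other channels driven by the same source, which is the sense in which feature-based attention becomes object-based in the caption.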
Figure 3
Schematic of the influence of attention on the cortical selectivity of sound features and the representation of coherent features of an attended stream. (a) A schematic of the time-frequency distribution of an acoustic mixture with a regularly repeating tone sequence (target) amidst a background of random tones (maskers). The perception of the target depends critically on a number of parameters, including the frequency separation between the target and closest masker components, the repetition rate of the target, and the overall sequence duration. (b) An illustration of the frequency-response curve of a single unit recorded in the primary auditory cortex of a behaving ferret and the changes that are observed under two different behavioral tasks. When the animal attends to the repeating target tone (“Target task” - red curve), the receptive field tuned to the target frequency sharpens in a direction that enhances the segregation of the target from the background of the maskers. When the animal performs a listening task that involves attending to the entire sound mixture (“Global task” - grey curve), the tuning curve is much broader than in the selective attention state (adapted from [78]). (c) The phase coherence between distinct neural populations as measured by distributed MEG (magnetoencephalography) channels recording neural activity in human subjects. The phase coherence contrasts a selective-attention task (where the subjects attended to the repeating target tone) versus a global-attention task (where the subjects paid attention to the background maskers). Such recordings reveal that an enhancement in phase coherence occurs exclusively at the attended target repetition rate (in this case 4 Hz) (adapted from [70]). The inset represents an example of the MEG magnetic field distribution for a single listener, illustrating the MEG channel pairs with robust phase coherence in response to the rate of the target tone sequence. Channel pairs with enhanced phase coherence are shown in green, while channel pairs with reduced coherence are shown in pink.
Box 1 Figure I. Principles and examples of auditory streaming: Instantaneous percepts (Tokens)
Examples of acoustic tokens with different attributes are illustrated. (a) Spectra of tonal tokens: a single tone, two tones, a harmonic complex, and an inharmonic complex. Tokens are relatively brief and their constituents have a common onset. (b-d) Complex tokens. Sound tokens can have various attributes such as (b) the pitch of musical notes or chords, (c) location along the azimuth, or (d) the timbre of a vowel with a specific spectral shape (right panel). In each of these panels, the feature value is represented by the pattern of activation along the ordinate. For example, each note in (b) represents the place of activation along the low-to-high pitch values; the activation pattern in (c) has a peak on the Right along the Left-to-Right ordinate; the vowel in (d) is represented by its spectral shape along the frequency axis. All these features occur over a brief time interval.
Box 2 Figure II. Streaming with pure tones
Examples of sequential organization of pure-tone sequences: (a) Two alternating tones of widely separated frequencies are usually perceived as two separate streams. The green color indicates a separate stream. The shaded regions denote two hypothetical neural auditory channels activated by the tones. The A,B channels are incoherent. (b) Two synchronous sequences are perceived as a single stream because the A,B channels are coherent. (c) Alternating (asynchronous) tones of nearby frequencies are usually heard as a single perceptual stream that oscillates in frequency regardless of tone presentation rates. The A,B channels here overlap and hence are driven by both tones and carry coherent responses. (d) Two synchronous tone sequences of fixed and variable frequencies. Two streams are predicted since the coherence between the A,B channels is weak. (e) “Release from informational masking” stimulus: when a target tone sequence is embedded in masker tones (surrounded by an empty or a protected zone), it evokes responses in channel A that are incoherent with channel B, and hence is heard as streamed apart from the complex. (f) Capture and streaming of a simultaneous tone pair. A pair of simultaneous tones is normally heard as a single complex sound when presented in isolation. However, a preceding sequence of low tones (as illustrated in channel B) can perceptually “capture” the low tone, separating it from the high tone (illustrated in channel A), which is now heard clearly against the background of the low tones.
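Cases (a) and (b) above can be made concrete with a small numerical sketch. It assumes, purely for illustration, that each channel's response is a binary on/off envelope following its tone and that coherence is measured as the zero-lag Pearson correlation between envelopes; the names `tone_envelope` and `coherence` and all parameter values are hypothetical and not taken from the article.

```python
import numpy as np

def tone_envelope(n_steps, period, offset, width):
    """Binary envelope: 1 while a tone is sounding in the channel, else 0."""
    t = np.arange(n_steps)
    return (((t - offset) % period) < width).astype(float)

def coherence(x, y):
    """Zero-lag Pearson correlation between two channel envelopes."""
    return float(np.corrcoef(x, y)[0, 1])

n_steps, period, width = 400, 20, 8

# Channel A responds to tone A; channel B responds to tone B.
a = tone_envelope(n_steps, period, offset=0, width=width)
b_alternating = tone_envelope(n_steps, period, offset=10, width=width)  # case (a)
b_synchronous = tone_envelope(n_steps, period, offset=0, width=width)   # case (b)

alt = coherence(a, b_alternating)  # negative: the channels are never active together
syn = coherence(a, b_synchronous)  # close to 1: the channels always co-vary

print(f"alternating: {alt:.2f}, synchronous: {syn:.2f}")
```

Under these assumptions the alternating sequences yield incoherent channels (two streams predicted) and the synchronous sequences yield coherent channels (one stream predicted), matching panels (a) and (b).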
Box 3 Figure III. Streaming with complex sounds
Principles of sequential organization apply equally well to complex stimuli that evoke responses in feature-selective channels (analogous to the frequency-tuned channels for tones). Examples illustrated are: (a) Streaming with harmonic complexes. Harmonic complexes are usually perceived as a fused sound with a pitch at the frequency of the fundamental (bottom) component of each complex. (i) Two alternating complexes (green and black) stream apart just like alternating pure tones [14]. (ii) A harmonic complex is perceptually fractured when one component begins earlier (e.g. the green harmonic). Because of its temporal incoherence, this component forms a separate stream from the rest of the complex (the black tones). (iii) A harmonic complex also becomes perceptually fractured when one component (the grey tone) is mistuned from a harmonic relationship and pops out from the complex. However, in this case, the two percepts within the token continue to belong to a single stream as they remain temporally coherent. (b) Streaming of vowels. A sequence of vowel pairs is perceived either as two streams or one depending on the temporal coherence of the vowels; (i) The alternating pair of vowels, /i/ and /u/, are represented schematically by different spectra. These vowels (just like the alternating tones) segregate into two streams [3,15,17]; (ii) as with the synchronous tones, when the vowels are played simultaneously they may still be individually recognized but are nevertheless heard as a single stream. (c) Streaming of sounds from different locations. Two sounds from the left (L) and right (R) stream apart when (i) they are played alternately [15,102], but form a single stream when (ii) played coherently. In the latter case, we predict that the sound is heard as a single stream from (indeterminate) multiple locations. (d) Streaming of musical instruments. The beginning of Mozart’s Concerto K299 is illustrated here. The first two bars are heard as a single rich stream as all instruments are playing coherently despite the distinct timbres of the oboe and the violin, and the different notes (pitches) played by the two violins. In the subsequent bars, two streams diverge as the oboe and the violins play incoherently. (e) Streaming of two simultaneous talkers. When the waveforms from two different spoken sentences (represented by pink and green) are overlaid, they often appear as alternating sound tokens. This incoherence between the two waveforms (each with its own distinct timbre, pitch, or even location) facilitates their streaming apart. In a choir singing in unison, the waveforms from all the singers would completely overlap and hence are heard as one rich stream (analogous to a piano playing a sequence of chords).
