PLoS Comput Biol. 2018 Aug 31;14(8):e1006437. doi: 10.1371/journal.pcbi.1006437. eCollection 2018 Aug.

An automated approach to the quantitation of vocalizations and vocal learning in the songbird

David G Mets et al. PLoS Comput Biol. 2018.

Abstract

Studies of learning mechanisms critically depend on the ability to accurately assess learning outcomes. This assessment can be impeded by the often complex, multidimensional nature of behavior. We present a novel, automated approach to evaluating imitative learning. Conceptually, our approach estimates how much of the content present in a reference behavior is absent from the learned behavior. We validate our approach through examination of songbird vocalizations, complex learned behaviors the study of which has provided many insights into sensory-motor learning in general and vocal learning in particular. Historically, learning has been holistically assessed by human inspection or through comparison of specific song features selected by experimenters (e.g. fundamental frequency, spectral entropy). In contrast, our approach uses statistical models to broadly capture the structure of each song, and then estimates the divergence between the two models. We show that our measure of song learning (the Kullback-Leibler divergence between two distributions corresponding to specific song data, or, Song DKL) is well correlated with human evaluation of song learning. We then expand the analysis beyond learning and show that Song DKL also detects the typical song deterioration that occurs following deafening. Finally, we illustrate how this measure can be extended to quantify differences in other complex behaviors such as human speech and handwriting. This approach potentially provides a framework for assessing learning across a broad range of behaviors like song that can be described as a set of discrete and repeated motor actions.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Quantification of song learning is complicated by variety in both learning and failure to learn.
(A) Typical sample of song from an adult Bengalese finch. Song is composed of a set of categorically distinct syllable types (labeled ‘A’, ‘B’, ‘C’…) that are organized into larger, repeated sequences (gray bars). Both the spectral structure of syllables and their sequencing are learned features of song. Hence, song is a complex, high-dimensional behavior that differs across individuals. (B) Song of an adult male ‘tutor’ and (C) songs of four juvenile ‘tutees’ that were all exposed to the same tutor song, illustrating variation in the quality of song learning. (Ci) Song from a tutee that learned the spectral content of the tutor song well, producing a song with accurate copies of all syllables. (Cii) Song from a tutee that copied all syllables, but with noisier versions than those present in the tutor song. (Ciii) Song from a tutee that failed to copy some of the syllables from the tutor song. (Civ) Song from a tutee that included ‘new syllables’ that were not clearly present in the tutor song.
Fig 2
Fig 2. Transformation of song data into syllable similarity-space.
(A) To transform data from a given bird into similarity-space we first segment all syllables from a set of songs produced by that bird and compute their corresponding PSDs. (B) Three examples of segmented syllables, each of a different type, and their corresponding PSDs. (C) For each of 3000 sample syllables from the song to be analyzed, similarity of PSDs is calculated relative to a basis set of PSDs for 50 syllables randomly drawn from the same song. This creates an M (number of basis syllables) by N (number of sample syllables) similarity matrix. (D) Visualization of how transformation of raw syllable data into the syllable similarity-space results in a clustering of syllables by type. Each point in the plot indicates the similarity between the PSD for one sample syllable and two basis PSDs (‘basis PSD1’ and ‘basis PSD2’) from the set of 50 basis PSDs. For clarity of exposition, only data that fall into one of three regions of high density are plotted here. Each of these regions corresponds approximately to multiple instances of one syllable type (which cluster near each other because of the similarity in their PSDs). In practice, there were more than three regions of syllable clustering (corresponding approximately to the number of distinct syllable types in the bird’s song), and these regions were represented in the 50-dimensional space defined by the basis set of PSDs (only two of which are illustrated here). The regions of high density in this similarity-space were fit with a Gaussian mixture model, in which the optimal number of Gaussian mixtures was determined by the Bayesian Information Criterion. Individual data points here are color-coded by their assignment to one of three Gaussian mixtures. For clarity of presentation, data from only three of the 9 total Gaussian mixtures are shown. In any single dimension (top and right), data points assigned to each Gaussian mixture were approximately normally distributed.
(E) Similarity matrix shown in C, reordered so that data are grouped by assignment to each of 9 Gaussian mixtures fit to the data (represented by colored blocks at the right of the similarity matrix). In this reordered representation, it is apparent that syllables assigned to each Gaussian mixture have a shared ‘bar code’ reflecting a shared pattern of PSD similarity values relative to the basis PSDs. The spectrograms at the right illustrate that syllables assigned to a given Gaussian mixture tend to be of the same type.
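The pipeline described in this caption (project syllable PSDs into a similarity-space against a random basis set, then fit a Gaussian mixture model whose component count is chosen by BIC) can be sketched as below. This is an illustrative reconstruction, not the authors' code: the choice of cosine similarity as the PSD similarity measure, and the use of scikit-learn's `GaussianMixture`, are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def similarity_space(sample_psds, n_basis=50, seed=0):
    """Represent each syllable by its similarity to a randomly drawn
    basis set of PSDs, yielding an N-samples x M-basis matrix."""
    rng = np.random.default_rng(seed)
    basis = sample_psds[rng.choice(len(sample_psds), n_basis, replace=False)]
    # Cosine similarity between PSDs (a stand-in similarity measure).
    a = sample_psds / np.linalg.norm(sample_psds, axis=1, keepdims=True)
    b = basis / np.linalg.norm(basis, axis=1, keepdims=True)
    return a @ b.T

def fit_gmm_bic(features, k_max=15, seed=0):
    """Fit GMMs with 1..k_max components; keep the lowest-BIC model,
    mirroring the BIC-based choice of mixture count in the caption."""
    best, best_bic = None, np.inf
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(features)
        bic = gmm.bic(features)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best
```

Each syllable's Gaussian-mixture assignment (`best.predict(features)`) then plays the role of the color-coded cluster labels in panels D and E.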
Fig 3
Fig 3. Syllables with similar spectral structure have overlapping distributions in syllable similarity-space.
For this analysis, instances of a specific syllable type, corresponding to ‘harmonic stacks’, were identified by human inspection for each of three birds. (A) Spectrograms of exemplar syllables produced by each of the three birds (Ref, Copy 1, Copy 2). (B-E) Distributions in similarity-space corresponding to syllables of this type (exemplars in panel A) produced by the three birds (Ref (red), Copy 1 (purple), Copy 2 (blue)). Consistent with the human perception that Copy 1 and Ref are more similar to each other than either is to Copy 2, in all four panels the distributions produced by Ref (red) and Copy 1 (purple) overlap more with each other than either does with the distribution produced by Copy 2 (blue). For all panels, the marginal distributions in each single dimension are depicted above and to the right, and the basis syllables are depicted above and below. Ellipses are 80% confidence intervals (1.28 standard errors) derived from a multivariate Gaussian fit to each set of syllable similarities. Throughout, colors indicate bird identity.
Fig 4
Fig 4. Estimation of the amount of spectral content present in the reference (tutor) song that is absent from the comparison (tutee) song.
(A) Example reference and comparison songs. To compute the DKL for these songs, we first fit Gaussian mixture models (GMMs) to the data from each song. (B) Representation in one dimension of the GMMs fit to song spectral content for both the reference song (left, blue) and the comparison song (right, red). (C) Superimposed mixture models for the reference song (blue) and comparison song (red). Regions of the reference-song mixture model which are not shared with the comparison-song model (red arrow) correspond to reference song content which is absent in the comparison song and will result in a higher DKL. However, regions of the comparison-song model which are not shared with the reference-song model (green arrow) will not impact the DKL.
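The asymmetric divergence described in this caption (reference-only content raises the DKL, comparison-only content does not) can be estimated by Monte Carlo: sample from the reference GMM and average the log-likelihood ratio between the two models. The sketch below is an assumption about implementation, built on scikit-learn's fitted `GaussianMixture` objects, not the authors' exact code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def song_dkl(gmm_ref, gmm_cmp, n_samples=10000):
    """Monte Carlo estimate of D_KL(reference || comparison) between
    two fitted GMMs. Because samples are drawn from the reference model,
    reference content missing from the comparison model inflates the
    estimate, while comparison-only content leaves it unaffected."""
    x, _ = gmm_ref.sample(n_samples)
    return float(np.mean(gmm_ref.score_samples(x) - gmm_cmp.score_samples(x)))
```

Two models fit to songs with matching spectral content should yield a DKL near zero; a comparison model that lacks regions of the reference model's density yields a larger value.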
Fig 5
Fig 5. Song DKL closely parallels human assessment of learning outcomes.
The quality of learning for individuals from five cohorts, each with a distinct tutor song, was evaluated by song DKL and human inspection. (A) Example spectrograms of the tutor song from one cohort and the songs of 5 tutees from the same cohort (cohort B). Also shown, for comparison, are the song of one isolate bird raised without tutor song exposure (isolate song) and the song from one bird raised with a different tutor (unrelated bird song). Numbers at left indicate the DKL and human similarity scores for each song relative to the tutor song from cohort B. (B) There was a good correspondence between song DKL and human evaluations of learning across a broad range of song similarities. Here, human scores are the average of four human judges. Across all five cohorts, DKL and human scores were well correlated (p < 0.01, r = 0.722, OLS). (C) Comparison of song DKL and human scores for each of the five cohorts. Human-computer correlation (left) shows the correlation between DKL values and average human scores for each of the five cohorts. Human-human correlation (right) indicates the correlation between the scores of each of 4 individual humans and the average of the remaining human scores for each cohort. Medians are indicated as gray bars. (D) Summary of song DKL scores for the five cohorts (gray), which were significantly lower than scores from a cohort of unrelated birds (blue, p < 0.01, Wilcoxon rank test) and from a cohort of ‘isolate birds’ raised without a tutor (red, p < 0.01, Wilcoxon rank test). Across all panels, bird cohort identity is indicated by color.
Fig 6
Fig 6. Quantification of changes to song following deafening.
(A) Spectrograms from before deafening (Pre) and from two and six weeks post deafening for three zebra finches demonstrate the typical disruption to the spectral content of song caused by deafening. (B) Song DKL values for post-deafening songs relative to baseline reference songs for nine birds at two, four, six, and eight weeks following deafening. Song DKL values indicated by ‘Pre’ were calculated by separating the baseline reference data into two groups and comparing one group to the other. Colors indicate bird identity, with green, yellow, and blue in panels A and B illustrating data from birds that had small, intermediate, and large changes to song spectral structure following deafening.
Fig 7
Fig 7. Establishment of baseline parameter values for song DKL calculation.
(A) Plot of r² values for correlations between DKL calculated using a range of input data sizes and DKL calculated using 3000 syllables of input data. (B) Plot of r² values for correlations between DKL calculated using a range of basis set sizes and DKL calculated using a 160-syllable basis set. (C) Plot of r² values for correlations between DKL calculated using the number of mixture components (k) determined by BIC (nBIC) and DKL calculated using a number of mixture components ranging from nBIC−4 to nBIC+4. (D) Plot of r² values for correlations between DKL calculated using 1, 2, or 5 PSD representations of each syllable and DKL calculated using a 10-PSD representation.
Fig 8
Fig 8. GMM derived syllable classifications are correlated with human syllable classifications.
(A) Examples of labels assigned to two songs by human inspection (black) and GMM (red). For many birds, there were no differences between human-assigned and GMM-assigned labels (e.g. upper panel). However, for some birds, there were discrepancies (e.g. gray box, lower panel). (B) Erroneous GMM classifications can be identified by inspection of spectrograms for groups of syllables assigned to a given Gaussian mixture. Illustrated here are two examples of groups of syllables assigned to individual Gaussian mixtures where it is apparent in each case that a single syllable (gray boxes) is misclassified relative to human assignment. For 90 animals, the number of misclassified syllables was determined by such human inspection of groups of syllables that were assigned to each Gaussian mixture. (C) Distribution of the percent of correctly classified syllables (per bird) is shown in red, with a gamma distribution fit to these data shown in black. 50% of animals had greater than 96% correctly classified syllables (blue line), while 80% had more than 93% correctly classified syllables (purple line). (D) Distribution of the percent of correctly classified syllables per bird is shown as in C, but here with categorization carried out in which the input representation of each syllable to the GMM includes 10 PSDs evenly spaced over the duration of the syllable, rather than a single PSD for the entire syllable. Using this richer representation of a syllable, 50% of animals had more than 99% correctly classified syllables (blue line), while 80% had more than 96% correctly classified syllables (purple line).
Fig 9
Fig 9. Quantification of differences between human vocalizations.
(A) Schematic of experimental design. Subjects spoke the alphabet 40 times both without (Control, Day 1) and with (Constrained, Day 1) a constraint on jaw movement. Seven days after the initial recording, subjects again spoke the alphabet 40 times (Control, Day 7). (B) Spectrograms of example 'A' vocalizations. Examples are drawn from data collected on day one under control conditions (top panel), day one under constraint (middle panel), and day 7 under control conditions (bottom panel). (C) Distributions of 'A' (red), 'F' (green), and 'J' (blue) vocalizations from a single participant plotted in similarity-space. In each pair of dimensions, renditions of 'A', 'F', and 'J' are well separated, though renditions of 'A' are closer to renditions of 'J' than to 'F'. Basis vocalizations are shown above each panel. (D) Distributions of 'A' vocalizations from each of three subjects plotted in similarity-space. Consistent with inter-individual differences in vocalizations, renditions of 'A' from each speaker are well separated. Basis vocalizations are shown above. (E) Distributions of 'A', 'F', and 'J' vocalizations from one subject in similarity-space. For each vocalized letter, the distribution of control-day-1 vocalizations (blue) overlaps more extensively with the distribution of control-day-7 vocalizations (yellow) than with the distribution of constrained vocalizations (red). The basis vocalizations are shown at left. (F) Song DKL values for all individuals (denoted by data color) captured differences in spectral content between control-day-1 and control-day-7 vocalizations (left column), control-day-1 and constrained vocalizations (middle column), and control-day-1 and control-day-7 vocalizations from the other subjects (right column, median value plotted for other subjects). For each data set, gray bars indicate means and black bars indicate standard errors. * = p < 0.01.
For panels C-E, ellipses represent 80% confidence intervals (1.28 standard error) on Gaussian distributions fit to each set of vocalization similarities.
