PLoS One. 2016;11(11):e0166866.

Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring

Justin Salamon et al. PLoS One.

Abstract

Automatic classification of animal vocalizations has great potential to enhance the monitoring of species movements and behaviors. This is particularly true for monitoring nocturnal bird migration, where automated classification of migrants' flight calls could yield new biological insights and conservation applications for birds that vocalize during migration. In this paper we investigate the automatic classification of bird species from flight calls, and in particular the relationship between two different problem formulations commonly found in the literature: classifying a short clip containing one of a fixed set of known species (N-class problem) and the continuous monitoring problem, the latter of which is relevant to migration monitoring. We implemented a state-of-the-art audio classification model based on unsupervised feature learning and evaluated it on three novel datasets, one for studying the N-class problem including over 5000 flight calls from 43 different species, and two realistic datasets for studying the monitoring scenario comprising hundreds of thousands of audio clips that were compiled by means of remote acoustic sensors deployed in the field during two migration seasons. We show that the model achieves high accuracy when classifying a clip to one of N known species, even for a large number of species. In contrast, the model does not perform as well in the continuous monitoring case. Through a detailed error analysis (that included full expert review of false positives and negatives) we show the model is confounded by varying background noise conditions and previously unseen vocalizations. We also show that the model needs to be parameterized and benchmarked differently for the continuous monitoring scenario. Finally, we show that despite the reduced performance, given the right conditions the model can still characterize the migration pattern of a specific species. The paper concludes with directions for future research.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Block diagram of the classification framework comprising three main blocks: (1) feature learning (learn a codebook from the training data), (2) feature encoding (use the learned codebook to encode the training and test data), and (3) classification (use the encoded training and test data to fit and evaluate a discriminative classifier).
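To make the three blocks concrete, below is a minimal sketch of such a pipeline in Python. It is not the authors' implementation: the log-mel front end, the patch and codebook sizes, the function names (tf_patches, learn_codebook, encode, fit_classifier), and the use of scikit-learn's MiniBatchKMeans as a stand-in for the paper's codebook-learning step are all illustrative assumptions.

    # Minimal sketch, not the authors' code: log-mel TF patches with
    # scikit-learn's MiniBatchKMeans standing in for the codebook learner.
    import numpy as np
    import librosa
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.svm import SVC

    def tf_patches(y, sr, n_mels=40, patch_frames=8):
        """Slice a log-mel spectrogram into overlapping TF patches."""
        S = librosa.power_to_db(
            librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
        return np.stack([S[:, t:t + patch_frames].ravel()
                         for t in range(S.shape[1] - patch_frames + 1)])

    # Block 1 -- feature learning: fit a k-entry codebook on training patches.
    def learn_codebook(train_clips, sr, k=256):
        patches = np.vstack([tf_patches(y, sr) for y in train_clips])
        return MiniBatchKMeans(n_clusters=k, random_state=0).fit(patches)

    # Block 2 -- feature encoding: summarize patch-to-codeword distances.
    def encode(y, sr, codebook):
        d = codebook.transform(tf_patches(y, sr))  # distance to each codeword
        return np.concatenate([d.mean(axis=0), d.std(axis=0)])

    # Block 3 -- classification: fit a discriminative classifier (here an SVM).
    def fit_classifier(train_clips, labels, sr, codebook, C=1.0):
        X = np.vstack([encode(y, sr, codebook) for y in train_clips])
        return SVC(C=C).fit(X, labels)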
Fig 2
Fig 2. Classification accuracy of the proposed model for the N-class problem using CLO-43SD.
The proposed model is compared against a baseline method that uses standard MFCC features. For additional context, the preliminary result reported in [15] for a flight call dataset with a similar number of species (42) is also provided; however, it is not directly comparable to the baseline and proposed model, since that study used a smaller dataset of 1180 samples. The error bars represent the standard deviation over the per-fold accuracies (for [15] there is only a single value).
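As a rough illustration of the kind of MFCC baseline compared against in Fig 2, a clip-level representation can be obtained by pooling frame-wise coefficients. The sketch below is a hedged example; the paper's exact MFCC configuration is not reproduced here.

    # Illustrative MFCC baseline: summarize each clip by the mean and standard
    # deviation of its MFCC frames (the settings here are assumptions).
    import numpy as np
    import librosa

    def mfcc_features(y, sr, n_mfcc=20):
        M = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return np.concatenate([M.mean(axis=1), M.std(axis=1)])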
Fig 3
Fig 3. Per-class (per-species) classification accuracy obtained by the proposed model for each of the 43 species in CLO-43SD.
The box plots are derived using 5-fold cross validation, where the red squares represent the mean score for each species. A mapping between the abbreviations used in this plot and the full species names is provided in S1 Table.
Fig 4
Fig 4. Model sensitivity to hyper-parameter values for CLO-43SD.
Each subplot displays the classification accuracy as a function of: (a) the duration of the TF patches, d_patch; (b) the size of the codebook, k; (c) the set of summary statistics used in feature encoding, f_stat; and (d) the penalty parameter C used for training the Support Vector Machine classifier.
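A sensitivity analysis like the one in Fig 4(d) amounts to a simple sweep over C with cross-validated accuracy. In this sketch, X and y stand for the (assumed precomputed) encoded features and species labels.

    # Sketch of a Fig 4(d)-style sweep: 5-fold cross-validated accuracy vs.
    # the SVM penalty C. X (encodings) and y (labels) are assumed precomputed.
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    for C in (0.01, 0.1, 1.0, 10.0, 100.0):
        scores = cross_val_score(SVC(C=C), X, y, cv=5, scoring="accuracy")
        print(f"C={C:g}: {scores.mean():.3f} +/- {scores.std():.3f}")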
Fig 5
Fig 5. Receiver Operating Characteristic (ROC) curves produced by the proposed model for CLO-WTSP: training set (blue, obtained via 5-fold cross validation) and test set (red).
The Area Under the Curve (AUC) score for each set is provided in the figure legend at the bottom right corner.
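For reference, curves like those in Figs 5 and 6 can be produced with scikit-learn from binary ground-truth labels and classifier scores; the variable names y_true and y_score below are illustrative.

    # Sketch: ROC/AUC (as in Fig 5) and precision-recall (as in Fig 6).
    from sklearn.metrics import auc, precision_recall_curve, roc_curve

    fpr, tpr, _ = roc_curve(y_true, y_score)
    print(f"AUC = {auc(fpr, tpr):.3f}")
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)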
Fig 6
Fig 6. Precision-recall (PR) curves for CLO-WTSP: training set (blue, obtained via 5-fold cross validation) and test set (red).
Fig 7
Fig 7. Approximate Signal-to-Noise-Ratio (SNR) computed separately for the true positives and false negatives returned by the proposed model: (a) CLO-WTSP test set, (b) CLO-SWTH test set.
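The paper's exact SNR estimator is not reproduced here; one simple stand-in, sketched below, treats the highest-energy frames of a clip as signal and the lowest-energy frames as background noise. The frame sizes and quantile q are assumptions.

    # Hypothetical approximate-SNR estimator: ratio of high- to low-energy
    # frame power, in dB.
    import numpy as np
    import librosa

    def approx_snr_db(y, frame_length=1024, hop_length=512, q=0.2):
        rms = librosa.feature.rms(y=y, frame_length=frame_length,
                                  hop_length=hop_length)[0]
        power = rms ** 2
        signal = power[power >= np.quantile(power, 1 - q)].mean()
        noise = power[power <= np.quantile(power, q)].mean()
        return 10 * np.log10(signal / max(noise, 1e-12))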
Fig 8
Fig 8. Receiver Operating Characteristic (ROC) curves produced by the proposed model for CLO-SWTH: training set (blue) and test set (red).
The Area Under the Curve (AUC) score for each set is provided in the figure legend at the bottom right corner.
Fig 9
Fig 9. Precision-recall (PR) curves for CLO-SWTH: training set (blue, obtained via 5-fold cross validation) and test set (red).
Fig 10
Fig 10. Detection curves showing the daily number of detected WTSP calls in the CLO-WTSP test set.
The true curve (the reference, computed from the expert annotations) is plotted in black. The other three curves represent detections generated by the proposed model using different threshold values: the default (0.5) in blue, the threshold that maximizes the f1 score (which quantifies the trade-off between precision and recall by computing their harmonic mean) on the training set (0.33) in red, and the “oracle threshold” (0.11) that maximizes the f1 score on the test set in green.
Fig 11
Fig 11. Detection curves showing the daily number of detected SWTH calls in the CLO-SWTH test set.
The true curve (the reference, computed from the expert annotations) is plotted in black. The other three curves represent detections generated by the proposed model using different threshold values: the default (0.5) in blue, the threshold that maximizes the f1 score (which quantifies the trade-off between precision and recall by computing their harmonic mean) on the training set (0.29) in red, and the “oracle threshold” (0.73) that maximizes the f1 score on the test set in green.
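The f1-maximizing thresholds behind the red curves in Figs 10 and 11 can be recovered from a precision-recall sweep. The sketch below assumes training-set labels y_true and classifier scores y_score.

    # Sketch: choose the detection threshold that maximizes the f1 score
    # (harmonic mean of precision and recall) on the training set.
    import numpy as np
    from sklearn.metrics import precision_recall_curve

    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    best = thresholds[np.argmax(f1[:-1])]  # final PR point has no threshold
    print(f"f1-maximizing threshold: {best:.2f}")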


References

    1. Emlen J, Dejong M. Counting birds: the problem of variable hearing abilities. Journal of Field Ornithology. 1992;63:26–31.
    2. Rosenstock SS, Anderson DR, Giesen KM, Leukering T, Carter MF, Thompson F III. Landbird counting techniques: current practices and an alternative. The Auk. 2002;119:46–53. 10.2307/4090011
    3. Hutto RL, Stutzman RJ. Humans versus autonomous recording units: a comparison of point-count results. Journal of Field Ornithology. 2009;80:387–398. 10.1111/j.1557-9263.2009.00245.x
    4. Bas Y, Devictor V, Moussus J, Jiguet F. Accounting for weather and time-of-day parameters when analysing count data from monitoring programs. Biodiversity and Conservation. 2008;17:3403–3416. 10.1007/s10531-008-9420-6
    5. Diefenbach D, Marshall M, Mattice J. Incorporating availability for detection in estimates of bird abundance. The Auk. 2007;124:96–106. 10.1642/0004-8038(2007)124[96:IAFDIE]2.0.CO;2

Grant support

This work was supported by National Science Foundation, NSF IIS-1125098, http://www.nsf.gov/awardsearch/showAward?AWD_ID=1125098, S. Kelling; Leon Levy Foundation, AF; Ingalls Foundation, S. Kelling; Center for Urban Science and Progress, JS; National Science Foundation, NSF IIS-1633259, https://nsf.gov/awardsearch/showAward?AWD_ID=1633259, JPB; and Google Faculty Award, JPB, S. Kelling. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.