Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention

Sensors (Basel). 2022 Sep 9;22(18):6818. doi: 10.3390/s22186818.

Abstract

The complexity of polyphonic sounds imposes numerous challenges on their classification. In particular, real-life polyphonic sound events exhibit discontinuities and unstable time-frequency variations. A single traditional acoustic feature cannot characterize the key feature information of a polyphonic sound event, and this deficiency results in poor classification performance. In this paper, we propose a convolutional recurrent neural network model based on a temporal-frequency (TF) attention mechanism and a feature space (FS) attention mechanism (TFFS-CRNN). The TFFS-CRNN model aggregates Log-Mel spectrograms and MFCC features as input and comprises a TF-attention module, a convolutional recurrent neural network (CRNN) module, an FS-attention module and a bidirectional gated recurrent unit (BGRU) module. In polyphonic sound event detection (SED), the TF-attention module captures the critical temporal-frequency features more effectively, while the FS-attention module assigns dynamically learnable weights to the different feature dimensions. The TFFS-CRNN model thus improves the characterization of key feature information in polyphonic SED: with the two attention modules, the model can focus on semantically relevant time frames, key frequency bands, and important feature spaces. Finally, the BGRU module learns contextual information. Experiments were conducted on the DCASE 2016 Task 3 and DCASE 2017 Task 3 datasets. The results show that the F1-score of the TFFS-CRNN model improved by 12.4% and 25.2% over the winning systems of the respective DCASE challenges, while the error rate (ER) was reduced by 0.41 and 0.37. The proposed TFFS-CRNN model therefore achieves better classification performance and a lower ER in polyphonic SED.
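The abstract describes the pipeline only at a high level, so the following is a minimal PyTorch sketch of the architecture it names (TF-attention over the time-frequency plane, a convolutional block, FS-attention over feature dimensions, and a BGRU with frame-wise multi-label outputs). All sizes are illustrative assumptions not given in the abstract: `n_feats=104` assumes 64 Log-Mel bins concatenated with 40 MFCCs, and the attention modules use the simplest plausible forms (a sigmoid mask and a softmax feature weighting), not necessarily the paper's exact designs.

```python
import torch
import torch.nn as nn

class TFAttention(nn.Module):
    """Temporal-frequency attention sketch: a learned sigmoid mask
    over the time-frequency plane, applied multiplicatively."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                 # x: (B, C, T, F)
        mask = torch.sigmoid(self.conv(x))  # (B, 1, T, F) attention mask
        return x * mask

class FSAttention(nn.Module):
    """Feature-space attention sketch: dynamically learnable weights
    over the feature dimensions of each frame."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (B, T, D)
        w = torch.softmax(self.fc(x), dim=-1)  # per-dimension weights
        return x * w

class TFFSCRNNSketch(nn.Module):
    """Illustrative TFFS-CRNN-style model; layer sizes are assumptions."""
    def __init__(self, n_feats=104, n_classes=6, conv_ch=32, gru_hidden=64):
        super().__init__()
        self.tf_att = TFAttention(1)
        self.conv = nn.Sequential(        # CRNN convolutional front-end
            nn.Conv2d(1, conv_ch, 3, padding=1),
            nn.BatchNorm2d(conv_ch),
            nn.ReLU(),
            nn.MaxPool2d((1, 4)),         # pool frequency only, keep frames
        )
        d = conv_ch * (n_feats // 4)      # flattened per-frame feature size
        self.fs_att = FSAttention(d)
        self.bgru = nn.GRU(d, gru_hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * gru_hidden, n_classes)

    def forward(self, x):                 # x: (B, T, F) aggregated features
        x = x.unsqueeze(1)                # (B, 1, T, F)
        x = self.tf_att(x)
        x = self.conv(x)                  # (B, C, T, F//4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # (B, T, D)
        x = self.fs_att(x)
        x, _ = self.bgru(x)               # BGRU learns contextual information
        return torch.sigmoid(self.out(x)) # frame-wise multi-label probabilities
```

A quick shape check: `TFFSCRNNSketch()(torch.randn(2, 100, 104))` yields a `(2, 100, 6)` tensor of per-frame event probabilities, matching the multi-label (polyphonic) setting in which several events may be active in the same frame.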

Keywords: convolutional recurrent neural networks; feature aggregation; feature space attention; sound event detection; temporal-frequency attention.

MeSH terms

  • Acoustics*
  • Algorithms
  • Hearing
  • Neural Networks, Computer*
  • Sound

Grants and funding

This research was funded by the National Natural Science Foundation of China (No.62071135), the Project of Guangxi Technology Base and Talent Special Project (No.GuiKe AD20159018), the Project of Guangxi Natural Science Foundation (No.2020GXNSFAA159004), the Fund of Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education (No.CRKL200104) and the Opening Project of Guangxi Key Laboratory of UAV Remote Sensing (No.WRJ2016KF01).