Long short-term memory for speaker generalization in supervised speech separation

J Acoust Soc Am. 2017 Jun;141(6):4705. doi: 10.1121/1.4986931.

Abstract

Speech separation can be formulated as learning to estimate a time-frequency mask from acoustic features extracted from noisy speech. For supervised speech separation, generalization to unseen noises and unseen speakers is a critical issue. Although deep neural networks (DNNs) have been successful in noise-independent speech separation, DNNs are limited in modeling a large number of speakers. To improve speaker generalization, a separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for the temporal dynamics of speech. Systematic evaluation shows that the proposed model substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility. Analysis of LSTM internal representations reveals that the LSTM captures long-term speech contexts. The LSTM model is also found to be more advantageous for low-latency speech separation: even without future frames, it performs better than the DNN model with future frames. The proposed model represents an effective approach for speaker- and noise-independent speech separation.
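To illustrate the mask-estimation formulation described above, the following is a minimal sketch (not the authors' code) of an LSTM that maps per-frame acoustic features of noisy speech to a time-frequency mask bounded in [0, 1]. The feature dimension, hidden size, layer count, and the ratio-mask-style training target are illustrative assumptions, not details taken from the abstract.

```python
# Minimal sketch of LSTM-based time-frequency mask estimation (assumes PyTorch).
# Dimensions and the mask target are hypothetical, for illustration only.
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, feat_dim=64, hidden_size=256, num_layers=2):
        super().__init__()
        # A unidirectional LSTM keeps the model causal (no future frames required).
        self.lstm = nn.LSTM(feat_dim, hidden_size, num_layers, batch_first=True)
        # Sigmoid output constrains each time-frequency mask value to [0, 1].
        self.out = nn.Sequential(nn.Linear(hidden_size, feat_dim), nn.Sigmoid())

    def forward(self, noisy_feats):
        # noisy_feats: (batch, frames, feat_dim) features extracted from noisy speech
        h, _ = self.lstm(noisy_feats)
        return self.out(h)  # estimated mask, same shape as the input features

if __name__ == "__main__":
    model = MaskEstimator()
    noisy = torch.randn(4, 100, 64)       # 4 utterances, 100 frames each (dummy data)
    target_mask = torch.rand(4, 100, 64)  # placeholder supervised mask target
    loss = nn.MSELoss()(model(noisy), target_mask)
    loss.backward()                       # one supervised training step
    print(loss.item())
```

Because the recurrence only consumes past and current frames, a model of this form can in principle operate with low latency, consistent with the abstract's observation about performance without future frames.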

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Acoustics*
  • Deep Learning
  • Female
  • Humans
  • Male
  • Memory, Short-Term*
  • Noise / adverse effects*
  • Perceptual Masking*
  • Signal Processing, Computer-Assisted
  • Sound Spectrography
  • Speech Acoustics*
  • Speech Intelligibility
  • Speech Perception*
  • Speech Production Measurement / methods*
  • Time Factors
  • Voice Quality*