
An Efficient and Perceptually Motivated Auditory Neural Encoding and Decoding Algorithm for Spiking Neural Networks


Zihan Pan et al. Front Neurosci. 13:1420. eCollection.

Abstract

The auditory front-end is an integral part of a spiking neural network (SNN) performing auditory cognitive tasks. It encodes temporally dynamic stimuli, such as speech and audio, into efficient, effective, and reconstructable spike patterns to facilitate subsequent processing. However, most auditory front-ends in current studies have not made use of recent findings in psychoacoustics and physiology concerning human listening. In this paper, we propose a neural encoding and decoding scheme that is optimized for audio processing. The neural encoding scheme, which we call Biologically plausible Auditory Encoding (BAE), emulates the functions of the perceptual components of the human auditory system: the cochlear filter bank, the inner hair cells, auditory masking effects from psychoacoustic models, and spike neural encoding by the auditory nerve. We evaluate the perceptual quality of the BAE scheme using PESQ, and its performance through sound classification and speech recognition experiments. Finally, we build and publish two spike versions of speech datasets, Spike-TIDIGITS and Spike-TIMIT, for researchers to use and for benchmarking future SNN research.

Keywords: auditory masking effects; auditory perception; neural encoding; spike database; spiking neural network.

Figures

FIGURE 1
Absolute hearing threshold Ta for simultaneous masking. Human hearing is most sensitive to acoustic stimuli around a few thousand Hz, which covers the majority of sounds in daily life. Sounds below the threshold are completely inaudible.
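As a point of reference, the absolute threshold of hearing is often approximated in psychoacoustic models by Terhardt's formula. The sketch below (Python/NumPy) computes that standard approximation; the exact Ta curve used in the paper may differ.

```python
import numpy as np

def absolute_threshold_db(freq_hz):
    """Terhardt's approximation of the absolute threshold of hearing, in dB SPL.
    A standard textbook formula used here for illustration; the paper's exact
    Ta curve may differ."""
    f = np.asarray(freq_hz, dtype=float) / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# Example: the threshold is lowest (hearing is most sensitive) around 3-4 kHz.
print(absolute_threshold_db([100, 1000, 3500, 10000]))
```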
FIGURE 2
The frequency masking thresholds acting on a maskee (the acoustic event being masked), generated by acoustic events in the neighboring critical bands, are shown as a surface in a 3-D plot. The acoustic events refer to the spectral power of the frames in a spectrogram. The spectral energy axis is the sound level of a maskee; the critical band axis indexes the frequency bins of the cochlear filter bank, as introduced in section "Spike-TIDIGITS and Spike-TIMIT Databases"; the masking threshold axis indicates the overall masking level on maskees of different sound levels from the various critical bands. For example, an acoustic event at a 20 dB level in the 10th critical band is masked off by a masking threshold of nearly 23 dB, which is introduced by the other auditory components in its neighboring frequency channels.
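A common way to model this inter-band (frequency) masking in MPEG-style psychoacoustic models (cf. Ambikairajah et al., 1997) is to spread each band's level to its neighbours with a spreading function and keep the strongest contribution as the masking threshold. The sketch below uses the Schroeder et al. (1979) spreading function as an illustrative stand-in; the paper's spreading model may differ.

```python
import numpy as np

def spreading_db(delta_bark):
    """Schroeder et al. (1979) spreading function (in dB) as a function of the
    Bark distance between maskee and masker bands; an illustrative stand-in
    for the paper's spreading model."""
    dz = np.asarray(delta_bark, dtype=float)
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474) ** 2)

def frequency_masking_threshold(band_levels_db, band_bark):
    """Masking threshold (dB) in each critical band induced by the other bands:
    every band's level is spread to its neighbours and the strongest
    contribution is kept (a simple max-combination)."""
    levels = np.asarray(band_levels_db, dtype=float)
    bark = np.asarray(band_bark, dtype=float)
    thresholds = np.full_like(levels, -np.inf)
    for j, (level, z) in enumerate(zip(levels, bark)):
        contribution = level + spreading_db(bark - z)  # masker j spread to all bands
        contribution[j] = -np.inf                      # exclude a band masking itself
        thresholds = np.maximum(thresholds, contribution)
    return thresholds
```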
FIGURE 3
The overall simultaneous masking effects on a speech utterance of "one," in a 3-D spectrogram. Combining the two kinds of frequency-domain masking effects (refer to Figures 1, 2), the gray surface shows the overall masking thresholds on the speech utterance (the colorful surface). All spectral energy under the thresholds is imperceptible.
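A minimal sketch of how the two frequency-domain thresholds can be combined into the overall simultaneous masking surface. Combining by element-wise maximum in dB is a common choice; the paper's exact combination rule may differ.

```python
import numpy as np

def overall_simultaneous_threshold(abs_threshold_db, freq_masking_db):
    """Combine the absolute hearing threshold (Figure 1) with the inter-band
    masking threshold (Figure 2), here by element-wise maximum in dB."""
    return np.maximum(abs_threshold_db, freq_masking_db)

def audible_map(spectrogram_db, overall_threshold_db):
    """Boolean map of perceptible components: True where the spectral level
    exceeds the overall masking threshold (arrays of matching shape)."""
    return spectrogram_db > overall_threshold_db
```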
FIGURE 4
Illustration of temporal masking: each bar represents an acoustic event received by the auditory system. In this paper, acoustic events generally refer to frame-level spectral power, which are the elements passed to the auditory neural encoding scheme. A local peak event (red bar) sets a masking threshold represented by an exponentially decaying curve. Subsequent events weaker than this decaying threshold are inaudible until another event exceeds the masking threshold.
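A minimal sketch of forward (temporal) masking along one frequency channel: a local peak sets a threshold that decays over subsequent frames, and weaker events below that decaying threshold are treated as masked. The linear-in-dB decay rate here is an illustrative assumption, not the paper's value.

```python
import numpy as np

def temporal_masking_threshold(frame_levels_db, decay_db_per_frame=3.0):
    """Forward (temporal) masking along one frequency channel: a local peak
    sets a threshold that decays frame by frame (linear in dB, i.e. exponential
    in power).  The decay rate is an illustrative assumption."""
    levels = np.asarray(frame_levels_db, dtype=float)
    thresholds = np.empty_like(levels)
    current = -np.inf
    for t, level in enumerate(levels):
        current -= decay_db_per_frame   # the running threshold decays
        if level > current:             # a stronger event resets the threshold
            current = level
        thresholds[t] = current
    return thresholds

# A frame is temporally masked when its level is strictly below `thresholds`.
```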
FIGURE 5
Both simultaneous and temporal masking effects acting on the 3-D spectrogram of a speech utterance of "one." The gray-shaded parts of the spectrogram are masked.
FIGURE 6
(A) A speech signal of M samples; (B) a time-domain filter bank with K neurons that act as filters; (C) the output spectrogram of dimension K × M.
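A toy version of the analysis stage in panel (B): each of the K channels is a band-pass filter applied to the M-sample signal, giving a K × M array of sub-band signals. Butterworth band-pass filters with constant-Q bandwidths stand in here for the paper's cochlear filters.

```python
import numpy as np
from scipy.signal import butter, lfilter

def analysis_filterbank(signal, fs, centre_freqs_hz, q_factor=9.26):
    """Toy analysis filter bank: K band-pass filters applied to an M-sample
    signal give a K x M array of sub-band signals.  Second-order Butterworth
    filters with constant-Q bandwidths are stand-ins for the paper's
    cochlear filters."""
    signal = np.asarray(signal, dtype=float)
    sub_bands = np.zeros((len(centre_freqs_hz), len(signal)))
    for k, fc in enumerate(centre_freqs_hz):
        bandwidth = fc / q_factor
        low = max(fc - bandwidth / 2.0, 1.0)
        high = min(fc + bandwidth / 2.0, fs / 2.0 - 1.0)
        b, a = butter(2, [low, high], btype="band", fs=fs)
        sub_bands[k] = lfilter(b, a, signal)
    return sub_bands  # shape (K, M)
```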
FIGURE 7
The BAE scheme for temporal learning algorithms in auditory cognitive tasks. The raw auditory signal (a) is filtered by the CQT-based event-driven cochlear filter bank, resulting in parallel streams of sub-band signals. Each sub-band signal is logarithmically framed, which corresponds to the processing in the auditory hair cells. The framed spectral signals are then further subjected to simultaneous and temporal masking.
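Continuing the pipeline, a simple stand-in for the framing stage: each sub-band signal from the filter bank is cut into frames and its log power computed, yielding the spectrogram that the masking and spike-encoding stages operate on. The frame length, hop, and use of plain log power are assumptions for illustration; the paper's logarithmic framing may differ.

```python
import numpy as np

def frame_log_power(sub_bands, frame_len=400, hop=160):
    """Frame each sub-band signal and take its log power per frame, producing
    a spectrogram for the masking and spike-encoding stages.  Frame length,
    hop and plain log power are illustrative assumptions for the hair-cell
    stage; the paper's framing may differ."""
    num_channels, num_samples = sub_bands.shape
    num_frames = 1 + (num_samples - frame_len) // hop
    spectrogram_db = np.empty((num_channels, num_frames))
    for t in range(num_frames):
        segment = sub_bands[:, t * hop: t * hop + frame_len]
        spectrogram_db[:, t] = 10.0 * np.log10(np.mean(segment ** 2, axis=1) + 1e-12)
    return spectrogram_db  # shape (K, number of frames)
```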
FIGURE 8
An illustration of the intermediate results of the BAE process. The raw speech signal (A) of an utterance of "three" is filtered and framed into a spectrogram (B), corresponding to stages (a) and (c) in Figure 7. By applying the neural threshold code, a precise spike pattern (C) is generated from the spectrogram. The masking map described in Eq. 5 is illustrated in (D), where yellow and dark blue blocks represent the values 1 and 0, respectively. The masking map (D) is applied to the spike pattern (C), and the auditory-masked spike pattern is obtained in (E).
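An illustrative reading of the threshold coding and masking steps in panels (C)-(E): an encoding neuron emits a spike at the frame where its channel's level rises across one of a set of preset threshold levels, and the binary masking map then zeroes out spikes at masked time-frequency bins. The crossing rule and threshold levels below are assumptions, not necessarily the paper's exact encoder.

```python
import numpy as np

def threshold_code(spectrogram_db, threshold_levels_db):
    """Illustrative level-crossing ('threshold') coding: encoding neuron k fires
    at frame t whenever its level rises across one of the preset threshold
    levels.  A sketch, not necessarily the paper's exact encoder."""
    num_channels, num_frames = spectrogram_db.shape
    levels = np.asarray(threshold_levels_db, dtype=float)
    spikes = np.zeros((num_channels, num_frames), dtype=bool)
    for k in range(num_channels):
        previous = -np.inf
        for t in range(num_frames):
            current = spectrogram_db[k, t]
            # fire if any threshold lies between the previous and current level
            spikes[k, t] = np.any((levels > previous) & (levels <= current))
            previous = current
    return spikes

def apply_masking_map(spikes, masking_map):
    """Keep only the spikes at time-frequency bins the masking map marks
    audible (value 1), as in panels (C)-(E)."""
    return spikes & masking_map.astype(bool)
```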
FIGURE 9
Encoded spike patterns produced by threshold coding with and without masking. The two spike patterns are encoded from a speech utterance of "five" in the TIDIGITS dataset. The x-axis and y-axis represent time and encoding neuron index. The positions of the colorful dots indicate the spike timings of the corresponding encoding neurons, and the colors distinguish the centre frequencies of the cochlear filter bank. With auditory masking, the number of spikes is reduced by nearly 50%, which is close to the 55% reduction in coding pulses reported in Ambikairajah et al. (1997).
FIGURE 10
The reconstruction of a speech signal from a spike pattern. Parallel streams of threshold-encoded spike trains that represent the dynamics of multiple frequency channels are first decoded into sub-band digital signals. The sub-band signals are then fed into a series of synthesis filters, built as the inverses of the corresponding analysis cochlear filters in Figure 6. The synthesis filters compensate for the gains of the analysis filters in each frequency bin. Finally, the outputs of the synthesis filter bank are summed to generate the reconstructed speech signal.
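A rough sketch of the synthesis half of this decoding path: once the spike trains have been decoded back into per-channel sub-band estimates, each estimate is passed through a band-pass filter matched to its analysis channel and the channel outputs are summed. Real synthesis filters would also compensate the analysis gain and phase, as the caption notes; the Butterworth filters here are stand-ins.

```python
import numpy as np
from scipy.signal import butter, lfilter

def synthesize(sub_band_estimates, fs, centre_freqs_hz, q_factor=9.26):
    """Sketch of the synthesis stage: re-filter each decoded sub-band estimate
    with a band-pass filter matched to its analysis channel and sum the
    outputs.  The Butterworth filters are stand-ins for synthesis filters that
    would also compensate the analysis gain and phase."""
    reconstructed = np.zeros(sub_band_estimates.shape[1])
    for k, fc in enumerate(centre_freqs_hz):
        bandwidth = fc / q_factor
        low = max(fc - bandwidth / 2.0, 1.0)
        high = min(fc + bandwidth / 2.0, fs / 2.0 - 1.0)
        b, a = butter(2, [low, high], btype="band", fs=fs)
        reconstructed += lfilter(b, a, sub_band_estimates[k])
    return reconstructed
```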
FIGURE 11
Classification accuracy on the Spike-TIDIGITS dataset under different signal-to-noise ratios, with and without masking effects. The accuracy with masking effects is slightly higher than without.
FIGURE 12
Free membrane potentials of trained leaky integrate-and-fire neurons when fed spike patterns encoded with and without masking. The upper (A,B,C), middle (D,E,F), and lower (G,H,I) panels correspond to three different speech utterances, "six," "seven," and "eight." The spike patterns with and without masking are visibly different, but the output neuron follows similar membrane potential trajectories.
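For context, the "free" membrane potential of a leaky integrate-and-fire neuron is its potential trajectory with the firing threshold and reset removed, so the response driven by the input spikes can be inspected in full. The sketch below uses a double-exponential post-synaptic kernel with illustrative time constants; the paper's neuron model and parameters may differ.

```python
import numpy as np

def free_membrane_potential(spike_times_s, weights, tau_m=20e-3, tau_s=5e-3,
                            duration_s=1.0, dt=1e-3):
    """'Free' membrane potential of a leaky integrate-and-fire neuron: the
    trajectory driven by input spikes with the firing threshold and reset
    removed.  Each input spike adds a weighted double-exponential post-synaptic
    kernel; the time constants are illustrative, not the paper's values."""
    t = np.arange(0.0, duration_s, dt)
    v = np.zeros_like(t)
    for t_i, w in zip(spike_times_s, weights):
        elapsed = t - t_i
        kernel = np.zeros_like(t)
        causal = elapsed > 0.0
        kernel[causal] = (np.exp(-elapsed[causal] / tau_m)
                          - np.exp(-elapsed[causal] / tau_s))
        v += w * kernel
    return t, v
```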


References

    1. Abdollahi M., Liu S.-C. (2011). "Speaker-independent isolated digit recognition using an AER silicon cochlea," in Proceedings of the 2011 IEEE Biomedical Circuits and Systems Conference (BioCAS) (Piscataway, NJ: IEEE), 269–272.
    2. Ambikairajah E., Davis A., Wong W. (1997). Auditory masking and MPEG-1 audio compression. Electron. Commun. Eng. J. 9, 165–175. doi: 10.1049/ecej:19970403
    3. Ambikairajah E., Epps J., Lin L. (2001). "Wideband speech and audio coding using gammatone filter banks," in Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (Piscataway, NJ: IEEE), Vol. 2, 773–776.
    4. Amir A., Taba B., Berg D., Melano T., McKinstry J., Di Nolfo C., et al. (2017). "A low power, fully event-based gesture recognition system," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Piscataway, NJ: IEEE), 7243–7252.
    5. Anumula J., Neil D., Delbruck T., Liu S.-C. (2018). Feature representations for neuromorphic audio spike streams. Front. Neurosci. 12:23. doi: 10.3389/fnins.2018.00023
