Speech emotion recognition (SER) in health applications can offer several benefits by providing insights into the emotional well-being of individuals. In this work, we propose a method for SER using time-frequency representations of speech signals and neural networks. In particular, we divide each speech signal into overlapping segments and transform each segment into a Mel-spectrogram. Each Mel-spectrogram forms the input to YAMNet, a pretrained convolutional neural network for audio classification, which learns the spectral characteristics within that segment. In addition, we utilize a long short-term memory network, a type of recurrent neural network, to learn the temporal dependencies across the sequence of Mel-spectrograms in each speech signal. The proposed method is evaluated on the angry, happy, sad, and neutral expressions on two SER datasets, achieving average accuracies of 0.711 and 0.780, respectively. These results improve on baseline methods and demonstrate the potential of our method for detecting emotional states from speech signals.
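For concreteness, the sketch below shows one way such a pipeline could be assembled in Python with TensorFlow, using the TensorFlow Hub release of YAMNet as a per-segment feature extractor (this variant takes the raw waveform and computes the Mel-spectrogram internally) followed by an LSTM classifier over the segment sequence. The segment length, overlap, LSTM width, and training setup are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal sketch: per-segment YAMNet embeddings feeding an LSTM classifier
# for four emotion classes (angry, happy, sad, neutral). Segment length,
# hop size, and LSTM width are hypothetical choices for illustration.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

SAMPLE_RATE = 16000          # YAMNet expects 16 kHz mono audio in [-1, 1]
SEGMENT_LEN = SAMPLE_RATE    # 1 s segments (assumed)
HOP_LEN = SEGMENT_LEN // 2   # 50% overlap between segments (assumed)

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def segment_embeddings(waveform: np.ndarray) -> np.ndarray:
    """Split a waveform into overlapping segments and return one 1024-d
    YAMNet embedding per segment (averaged over YAMNet's internal patches)."""
    embeddings = []
    for start in range(0, max(1, len(waveform) - SEGMENT_LEN + 1), HOP_LEN):
        segment = waveform[start:start + SEGMENT_LEN].astype(np.float32)
        # YAMNet computes the log-Mel spectrogram internally and returns
        # (class scores, embeddings, spectrogram).
        _, emb, _ = yamnet(segment)
        embeddings.append(tf.reduce_mean(emb, axis=0).numpy())
    return np.stack(embeddings)          # shape: (num_segments, 1024)

# LSTM over the sequence of per-segment embeddings, followed by a
# softmax over the four emotion classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 1024)),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

In this arrangement YAMNet captures the spectral content of each segment, while the LSTM models how those segment-level features evolve over the utterance, mirroring the two-stage structure described above.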