Introduction
Speech-to-text transcription is an important tool for mission analytics and many natural language processing tasks. In most cases, the language must be manually set before speech-to-text transcription can be applied. For scenarios where data is aggregated in large quantities and in various languages, automatic language detection would greatly streamline data processing.
Novetta has developed a technique, Language Identification For Audio Spectrograms (LIFAS), that uses spectrograms of raw audio signals as input to a convolutional neural network (CNN) for language identification. The approach requires minimal pre-processing: raw audio signals are fed to the network, and spectrograms are generated for each batch as it is input during training. The technique performs effective classification on short audio segments (approximately 4 seconds), as would be necessary for voice assistants that need to identify the language immediately after a speaker begins to talk.
Audio Data Sources
Finding a dataset of audio clips in various languages suitable for training a network was an initial challenge for this task, especially since many datasets of this type are not open sourced [7]. We decided to work with VoxForge [10], an open-source corpus that consists of user-submitted audio clips in various languages. Audio clips were gathered in English, Spanish, French, German, Russian, and Italian. Speakers had various accents and were of different genders.
Residual and Convolutional Neural Networks
CNNs have been shown to give state-of-the-art results for image classification and a variety of other tasks. As neural networks trained with backpropagation are made deeper, they run into the vanishing gradient problem [2]. A network updates its weights based on partial derivatives of the error function propagated backward through the layers; these derivatives can become very small, making the weight updates insignificant and leading to a degradation in performance.
One way to mitigate this problem is the use of Residual Neural Networks (ResNets) [5]. ResNets utilize skip connections, which connect two non-adjacent layers and allow gradients to bypass the intermediate layers during backpropagation. ResNets have shown excellent performance on image recognition tasks, which makes them a natural choice of architecture for this task [6].
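As a concrete illustration of a skip connection (a minimal sketch in PyTorch, not the exact block structure of ResNet-50), the input to a pair of convolutional layers is added back to their output so that gradients can flow through the identity path:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Minimal residual block: the block's input is added back to the
    output of two convolutional layers (the skip connection)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: the identity path gives gradients a short route backward.
        return F.relu(out + x)
```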
Spectrogram Generation
A spectrogram is an image representation of the frequencies present in a signal over time. The frequency spectrum of a signal can be generated from a time series signal using a Fourier Transform.
In practice, the Fast Fourier Transform (FFT) is applied to a section of the time series data to calculate the magnitude of the frequency spectrum at a fixed moment in time; this corresponds to a single time slice of the spectrogram. The time series data is windowed, usually in overlapping chunks, and the FFT results are strung together to form the spectrogram image, which shows how the frequencies change over time.
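Conceptually, this amounts to sliding a window along the signal, taking the FFT of each windowed chunk, and stacking the magnitudes; the sketch below is illustrative, with window length and hop size chosen for clarity rather than taken from our pipeline:

```python
import numpy as np

def simple_spectrogram(signal, n_fft=1024, hop=512):
    """Stack the FFT magnitudes of overlapping windows into a 2-D array
    of shape (frequency bins, time frames)."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft, hop):
        chunk = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(chunk)))  # magnitude spectrum of one chunk
    return np.array(frames).T
```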
Since we are generating spectrograms of audio data, the data is converted to the mel scale, generating “melspectrograms”. These images will be referred to simply as “spectrograms” below. The conversion from f hertz to m mels that we use is given by

m = 2595 log10(1 + f/700).
An example of a spectrogram generated from an English-language clip is shown in Figure 1.
Data Preparation
Each audio signal was sampled at a rate of 16 kHz with a length of 60,000 samples (a sample being a single data point in the audio clip), which equates to 3.75 seconds of audio. The audio files, referred to as “clips”, were saved as WAV files and loaded into Python using the librosa library.
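For reference, loading a clip at 16 kHz and fixing its length at 60,000 samples can be done with librosa roughly as follows; the zero-padding of short clips is an assumption, since clips shorter than 60,000 samples are not discussed above:

```python
import librosa
import numpy as np

def load_clip(path, sr=16000, n_samples=60000):
    """Load a WAV file at 16 kHz and pad or trim it to exactly
    60,000 samples (3.75 seconds)."""
    signal, _ = librosa.load(path, sr=sr, mono=True)
    if len(signal) < n_samples:
        signal = np.pad(signal, (0, n_samples - len(signal)), mode="constant")
    return signal[:n_samples]
```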
The training set consisted of 5,000 clips per language, and the validation set consisted of 2,000 clips per language. The same speaker may appear in more than one clip, but no clip or speaker was shared between the training and validation sets.
Spectrograms were generated using parameters similar to the process discussed in [1]: a frequency range of 20 Hz to 8,000 Hz and 40 mel frequency bins, with each FFT computed on a window of 1,024 samples. No other pre-processing was done on the audio files. Spectrograms were generated on the fly, one batch of 64 images at a time, while the network was training (i.e. spectrograms were not saved to disk).
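A melspectrogram with these parameters can be generated with librosa along the following lines; this is a sketch rather than the exact call used in our pipeline, and the hop length shown is an assumption (it is not specified above):

```python
import librosa
import numpy as np

def melspectrogram(signal, sr=16000):
    """Compute a 40-bin melspectrogram over 20 Hz - 8,000 Hz using a
    1,024-sample FFT window, then convert power to decibels."""
    spec = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=1024,
        hop_length=256,  # assumed value; not specified in the text
        n_mels=40, fmin=20, fmax=8000)
    return librosa.power_to_db(spec, ref=np.max)
```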
Network
We utilized the fast.ai [3] deep learning library built on PyTorch [9]. The network used was a pretrained ResNet-50. Each image was 432 × 288 pixels in size.
During training, we used the 1-cycle policy described in [11], in which the learning rate is increased linearly to a maximum value and then decreased again over the course of one cycle [4]. The learning rate finder within the fast.ai library was used to determine the maximum learning rate for the cycle, which was set to 1 × 10⁻². The length of the cycle was set to 8 epochs, meaning the network trained for 8 epochs over the full cycle.
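Assuming the fast.ai v1 API that was current at the time, training reduces to a few lines; the `data` object below stands in for a DataBunch of spectrogram images labeled by language, whose construction (including the on-the-fly spectrogram generation) is omitted:

```python
from fastai.vision import *  # fastai v1-style import

# `data` is assumed to be an ImageDataBunch of spectrogram images labeled by language.
learn = cnn_learner(data, models.resnet50, metrics=accuracy)  # pretrained ResNet-50
learn.lr_find()                        # learning rate finder used to pick the maximum rate
learn.fit_one_cycle(8, max_lr=1e-2)    # one cycle of 8 epochs, maximum learning rate 1e-2
```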
Experiments
Two types of experiments were conducted: binary language classification and multiple-language classification.
Binary Language Classification with Varying Number of Samples
Binary language classification was performed on two languages using clips of 60,000 samples; English and Russian were chosen for training and validation. To test whether longer clips impacted performance, binary classification was also performed on clips of 100,000 samples (6.25 seconds of audio).
Multiple-Language Classification
For each language (English, Spanish, German, French, Russian, and Italian), 5000 clips were placed in the training set. Each clip was 60,000 samples in length. 2000 clips per language were placed in the validation set, and no speakers or clips appeared in both the training and validation sets.
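One way to enforce that constraint, sketched below under the assumption that each clip is tagged with a speaker identifier, is to partition speakers before assigning clips to the two sets; this is an illustration rather than the exact procedure used:

```python
import random
from collections import defaultdict

def speaker_disjoint_split(clips, train_frac=0.7, seed=0):
    """Split (speaker_id, clip_path) pairs so that no speaker's clips
    appear in both the training and validation sets."""
    by_speaker = defaultdict(list)
    for speaker, path in clips:
        by_speaker[speaker].append(path)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    cut = int(train_frac * len(speakers))
    train = [p for s in speakers[:cut] for p in by_speaker[s]]
    valid = [p for s in speakers[cut:] for p in by_speaker[s]]
    return train, valid
```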
Results
Accuracy was calculated for both binary language classification and multiple-language classification as the number of correctly classified clips divided by the total number of clips in the validation set.
Binary language classification accuracy for Russian and English was 94% on clips of 60,000 samples and 97% on clips of 100,000 samples.
Accuracy was essentially the same for the English clips as for the Russian clips. To confirm that network performance was not dependent on English and Russian data specifically, binary language classification was also tested on other language pairs; little to no impact on validation accuracy was observed.
Multiple-Language Classification accuracy across six languages was 89%. The confusion matrix for the multi-class classification is shown in Figure 2.
The highest rate of incorrect classifications occurred when Spanish clips were classified as Russian and when Russian clips were classified as Spanish. Oddly, almost no other language was misclassified as Russian or Spanish. One hypothesis for this observation is that Russian is the only Slavic language in the training set; the network may be applying a threshold at some layer that separates Russian from the other languages, and Spanish clips may by chance fall near that threshold.
Discussion
While these results are highly promising, their extensibility may be limited by the fact that all data came from the same dataset. Since audio formats can have a wide variety of parameters such as bit rate, sampling rate, and bits per sample, we would expect clips from other datasets collected in different formats to potentially confuse the network. Performance in this case may be a function of the degree to which appropriate pre-processing steps were taken for the audio signals, so that spectrograms do not contain dataset-specific artifacts. This is an aspect of LIFAS performance that we plan to evaluate in the near future using test data from additional (e.g. non-VoxForge) datasets.
Additionally, we would want the network to perform well in environments with varying levels of noise. VoxForge data is comprised of user-submitted audio clips, so the noise profiles of the clips already vary, but more structured tests could be done to measure how robust the network is to different levels of noise. Additive white Gaussian noise could be added to the training data to simulate low-quality audio, though this still might not fully mimic background noise such as car horns or multiple speakers in a real-life environment.
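As a sketch of that kind of augmentation, white Gaussian noise can be added at a chosen signal-to-noise ratio as follows; the function name and default SNR are illustrative:

```python
import numpy as np

def add_white_noise(signal, snr_db=20.0, rng=None):
    """Add white Gaussian noise to an audio signal at a target SNR in dB."""
    if rng is None:
        rng = np.random.default_rng()
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```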
Another way to potentially increase the robustness of the model would be to implement SpecAugment [8], a method that distorts spectrogram images to reduce overfitting and improve network performance by training on deliberately corrupted images. This may help add scalability and robustness to the network, assuming the distortions generated by SpecAugment accurately represent the distortions observed in real-world audio signals.
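A minimal form of the frequency- and time-masking distortions from SpecAugment (simplified from [8], omitting time warping; the mask widths here are illustrative) could look like:

```python
import numpy as np

def spec_augment(spec, max_freq_mask=8, max_time_mask=20, rng=None):
    """Zero out one random band of frequency bins and one random band of
    time frames in a spectrogram (a simplified form of SpecAugment)."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_mels, n_frames = spec.shape

    f = rng.integers(0, max_freq_mask + 1)      # width of the frequency mask
    f0 = rng.integers(0, max(1, n_mels - f))    # starting mel bin
    spec[f0:f0 + f, :] = 0.0

    t = rng.integers(0, max_time_mask + 1)      # width of the time mask
    t0 = rng.integers(0, max(1, n_frames - t))  # starting frame
    spec[:, t0:t0 + t] = 0.0
    return spec
```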
Conclusion
Results show the viability of using deep network architectures commonly applied to image classification to identify languages from images generated from audio data. Robust performance can be achieved on relatively short clips with minimal pre-processing. We believe this model can be extended to classify more languages as long as sufficient, representative training and validation data are available.
Sources
[1] Audio Classification using FastAI and On-the-Fly Frequency Transforms. url: https://towardsdatascience.com/audio-classification-using-fastai-and-on-the-fly-frequency-transforms-4dbe1b540f89.
[2] The Vanishing Gradient Problem. url: https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484.
[3] fast.ai. url: www.fast.ai.
[4] The 1cycle Policy. url: https://sgugger.github.io/the-1cycle-policy.html.
[5] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: CoRR abs/1512.03385 (2015). arXiv: 1512.03385. url: http://arxiv.org/abs/1512.03385.
[6] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016.
[7] Mozilla. Common Voice by Mozilla. url: voice.mozilla.org/en.
[8] Daniel S. Park et al. “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”. In: arXiv (2019). url: http://arxiv.org/abs/1904.08779.
[9] Adam Paszke et al. “Automatic differentiation in PyTorch”. In: NIPS-W. 2017.
[10] VoxForge. url: voxforge.org.
[11] Leslie N. Smith. “A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay”. In: CoRR abs/1803.09820 (2018). url: http://arxiv.org/abs/1803.09820.