ABSTRACT
A serious setback in modern Automatic Speech Recognition (ASR) technology is posed by intonation differences, which result in acoustic and durational variability; this problem has hindered the full exploration of this area of research and the development of speech technology products globally. This work developed an algorithm capable of adapting to the accents of four major Nigerian tribes (Hausa, Yoruba, Ibo, and Fulani), thereby supporting the development of local Speech Technology Products (STP) in Nigeria. Speech recognition is a wide area of technological research that is currently being explored. Although there has been huge advancement in developed nations, developing nations are still lagging behind. There are approximately 130 million English speakers in Nigeria, who use it either as a second language or as an official language. ASR system performance is in serious need of improvement with respect to the acoustic and durational variability of accented speakers. A speech accent adaptation system offers the opportunity not only to study the phonological and prosodic representation of speech but also to develop a system that is robust to the accent variability of speakers from any tribe or race.
This work focused on modelling the accented speech of the major Nigerian tribes: Hausa, Ibo, Yoruba and Fulani. The speech recognizer takes untranscribed speech and adapts it to a targeted accent by picking out the matching utterance from a large corpus comprising the datasets of the above-mentioned tribes. The method employed is simple: the algorithm takes as input speech samples recorded from speakers of the four Nigerian accents (Hausa, Yoruba, Ibo, and Fulani) and stored as “.wav” files using Audacity, set to 16 bits and an 8 kHz sampling frequency; that is, the sampling rate is twice the 4 kHz bandwidth assumed for the voice signal, as required by the Nyquist criterion. These speech samples are labelled so that the algorithm can differentiate between Speaker “A” and Speaker “B”, and all of them are stored in the MATLAB folder, from which MATLAB can load them during simulation.
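As a rough illustration of the recording format described above, the following Python sketch writes and reads back a 16-bit, 8 kHz mono WAV using only the standard library. It is not the MATLAB code used in this work; the synthetic tone, the function names and the in-memory buffer are assumptions that stand in for a real Audacity recording.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000   # Hz, as stated in the text
SAMPLE_WIDTH = 2     # bytes per sample, i.e. 16-bit PCM


def write_tone(target, freq_hz=440.0, seconds=0.1):
    """Write a mono 16-bit test tone (hypothetical stand-in for a recording)."""
    n = int(SAMPLE_RATE * seconds)
    samples = [int(12000 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
               for i in range(n)]
    with wave.open(target, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(struct.pack("<%dh" % n, *samples))


def read_samples(source):
    """Return (sample_rate, samples scaled to [-1, 1)) from a 16-bit mono WAV."""
    with wave.open(source, "rb") as w:
        if w.getsampwidth() != SAMPLE_WIDTH or w.getnchannels() != 1:
            raise ValueError("expected 16-bit mono audio")
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return rate, [v / 32768.0 for v in ints]


buf = io.BytesIO()          # in-memory file instead of a path on disk
write_tone(buf)
buf.seek(0)
rate, signal = read_samples(buf)
nyquist_hz = rate / 2       # highest frequency representable at this rate
```

At 8 kHz the Nyquist frequency is 4 kHz, so content above that cannot be represented in the stored samples.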
The algorithm takes in each speech sample and pre-processes it to facilitate extraction of the features of the speech signal. The first pre-processing step is pre-emphasis, which filters the speech signal and boosts the high-frequency formants, which are susceptible to noise. Next comes framing. This is done because the speech signal is not stationary (that is, it changes continuously), which makes it difficult to extract features from the signal as a whole. The speech signal was therefore broken into overlapping frames of 25 ms; the overlap caters for any information that could otherwise be lost between frames. The framed signal is then windowed to emphasise the main part of each frame and to suppress the spectral distortion introduced by framing.
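The three pre-processing steps can be sketched in Python as follows. This is a minimal illustration, not the MATLAB implementation used in this work. The text fixes only the 25 ms frame length; the pre-emphasis coefficient of 0.97, the 10 ms hop between frames and the choice of a Hamming window are assumptions, chosen because they are common defaults.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """High-pass filter y[n] = x[n] - alpha * x[n-1]; alpha is an assumed value."""
    if not signal:
        return []
    return [signal[0]] + [signal[i] - alpha * signal[i - 1]
                          for i in range(1, len(signal))]


def frame_signal(signal, sample_rate=8000, frame_ms=25, hop_ms=10):
    """Split the signal into overlapping frames; a trailing partial frame is dropped."""
    frame_len = sample_rate * frame_ms // 1000   # 200 samples at 8 kHz
    hop_len = sample_rate * hop_ms // 1000       # 80 samples (assumed hop)
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop_len)]


def hamming(n):
    """Hamming window of length n (assumed window type)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def window_frames(frames):
    """Multiply every frame by the window to taper its edges."""
    if not frames:
        return []
    w = hamming(len(frames[0]))
    return [[s * c for s, c in zip(frame, w)] for frame in frames]


# Usage on half a second of a synthetic 300 Hz tone
sr = 8000
tone = [math.sin(2 * math.pi * 300 * t / sr) for t in range(sr // 2)]
frames = window_frames(frame_signal(pre_emphasis(tone), sr))
```

With a 25 ms frame and a 10 ms hop, consecutive frames share 15 ms of signal, which is how the overlap preserves information at frame boundaries.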
Feature extraction started with the Fast Fourier Transform (FFT), which converted each frame from the time domain to the frequency domain for easier manipulation of the signal's spectrum. The Mel filter bank was then applied to model the property of human hearing that frequency resolution is approximately linear below 1 kHz and approximately logarithmic above 1 kHz. We then took the logarithm of the filter bank energies (the “Log Energy” step) because the human ear responds logarithmically to signal level. The final step of feature extraction is the Discrete Cosine Transform (DCT). At the DCT stage, the outputs of the overlapping triangular filters, which are correlated, are de-correlated so that they can be modelled with a diagonal covariance matrix. The result of this stage is 13 coefficients per frame, called the Mel Frequency Cepstral Coefficients (MFCC).
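The feature extraction chain for a single frame can be sketched as below. This is an illustrative Python outline under stated assumptions, not the MATLAB code of this work: a naive DFT stands in for the FFT (it yields the same spectrum, only more slowly), and the 256-point transform length, the 20 triangular filters, the mel formula 2595·log10(1 + f/700) and the unnormalised DCT-II are assumed, commonly used choices. Only the 13 output coefficients per frame come from the text.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame, n_fft=256):
    """Power spectrum of a zero-padded frame via a naive DFT."""
    padded = list(frame[:n_fft]) + [0.0] * (n_fft - len(frame))
    spec = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(padded))
        spec.append(abs(acc) ** 2 / n_fft)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(mel_max * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges]
    bank = []
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        filt = [0.0] * n_bins
        for k in range(left, centre):          # rising slope
            filt[k] = (k - left) / (centre - left)
        for k in range(centre, right):         # falling slope
            filt[k] = (right - k) / (right - centre)
        bank.append(filt)
    return bank


def dct_ii(values, n_out):
    """Unnormalised DCT-II, keeping the first n_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_out)]


def mfcc_frame(frame, sample_rate=8000, n_fft=256, n_filters=20, n_ceps=13):
    spec = power_spectrum(frame, n_fft)
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    # Floor the energies so the logarithm is always defined
    log_e = [math.log(max(sum(w * p for w, p in zip(filt, spec)), 1e-10))
             for filt in bank]
    return dct_ii(log_e, n_ceps)


# Usage: one 25 ms frame (200 samples at 8 kHz) of a synthetic 1 kHz tone
sr = 8000
frame = [math.sin(2 * math.pi * 1000 * n / sr) for n in range(200)]
coeffs = mfcc_frame(frame, sr)
```

Because neighbouring triangular filters overlap, their log energies are correlated; the DCT concentrates most of that information in the first few coefficients, which is why only 13 are kept.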
The last stage of this work is feature matching using Vector Quantisation (VQ), which has both an encoder and a decoder. The feature vectors from the feature extraction stage (MFCC) are fed as input to the VQ algorithm, which clusters them and finds the mean of each cluster for each speaker; these means are stored in the database as that speaker's template. This is the training phase. When speech from an unknown speaker (the test speaker) is fed into the VQ, the algorithm matches it to the closest trained speaker template and sends the index of that template to the decoder, which reconstructs the matched trained speaker, and the pronounced word is displayed at the output.
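The training and matching phases can be sketched as below. This Python outline is an assumption-laden illustration rather than the implementation used in this work: it uses plain k-means with a deterministic initialisation to build each codebook (the text does not name the clustering procedure), squared Euclidean distance as the distortion measure, and toy two-dimensional vectors with hypothetical labels in place of real 13-dimensional MFCC vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, codebook):
    """Encoder step: index of the closest code vector."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def train_codebook(vectors, size=2, iters=20):
    """Cluster training vectors with k-means; requires len(vectors) >= size."""
    step = len(vectors) // size
    codebook = [list(vectors[i * step]) for i in range(size)]  # deterministic init
    for _ in range(iters):
        clusters = [[] for _ in codebook]
        for v in vectors:
            clusters[nearest(v, codebook)].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old code vector if a cluster is empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def distortion(vectors, codebook):
    """Average distance from each vector to its nearest code vector."""
    return sum(sq_dist(v, codebook[nearest(v, codebook)])
               for v in vectors) / len(vectors)


def identify(test_vectors, templates):
    """Return the label of the template with the lowest average distortion."""
    return min(templates, key=lambda name: distortion(test_vectors, templates[name]))


# Toy training data (hypothetical labels and values)
speaker_a = [[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [1.1, 1.0]]
speaker_b = [[5.0, 5.1], [5.1, 5.0], [6.0, 6.1], [6.1, 6.0]]
templates = {
    "speaker_a": train_codebook(speaker_a),
    "speaker_b": train_codebook(speaker_b),
}
match = identify([[0.05, 0.05], [1.0, 1.0]], templates)
```

In this sketch every test utterance is compared against every stored template, so matching cost grows with the number of trained speakers.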
Analysis of the results shows very good performance in terms of adaptation and speaker recognition. The main challenge in this work is that the algorithm tends to slow down as the number of speech samples increases. Continuous speech also posed a great challenge due to the concatenation of phonemes, speaking rate, and acoustic and durational variability. This work is expected to serve as a bedrock for ASR research in Nigeria, and it can be used by communication companies to develop user-friendly applications.
Arowosafe, M. & Mayowa, A. (2019, February 20). Development of speech accent adaptation algorithm. Afribary. Retrieved from https://track.afribary.com/works/development-of-speech-accent-adaptation-algorithm