Building a Keyword Spotting Model for the Arabic Language Using a Self-Supervised Learning Approach
Keywords:
Contextual Representation, Self-Supervised Learning, Keyword Spotting, HuBERT, Dataset
Abstract
This paper presents a comprehensive investigation into the efficiency of contextual representation models trained via self-supervised
learning for keyword spotting (KWS) in the Arabic language, with the goal of reducing the amount of training data required while maintaining high KWS accuracy. We employed Hidden Unit Bidirectional Encoder Representations from Transformers (HuBERT), a model pre-trained on Arabic data, to extract contextual representations of the speech signal, and developed a head model for the KWS downstream task. This head model was fine-tuned on the Arabic Speech Command dataset, and multiple experiments were conducted to determine the minimum number of training samples required to attain a given level of accuracy.
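To illustrate the described pipeline, the following minimal sketch shows how a frozen HuBERT encoder can feed a lightweight classification head. The checkpoint name, head dimensions, and keyword count here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import HubertModel

class KWSHead(nn.Module):
    """Lightweight classification head on top of frozen HuBERT features."""
    def __init__(self, hidden_size: int, num_keywords: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, num_keywords),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Mean-pool frame-level contextual representations over time,
        # then classify the pooled utterance-level embedding.
        pooled = hidden_states.mean(dim=1)
        return self.classifier(pooled)

# Placeholder checkpoint; substitute the Arabic HuBERT model actually used.
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
hubert.eval()  # the pre-trained encoder stays frozen; only the head is trained
for p in hubert.parameters():
    p.requires_grad = False

head = KWSHead(hubert.config.hidden_size, num_keywords=40)

waveform = torch.randn(1, 16000)  # one second of 16 kHz audio
with torch.no_grad():
    features = hubert(waveform).last_hidden_state  # (batch, frames, hidden)
logits = head(features)  # keyword scores for the utterance
```

Freezing the encoder is what makes the few-sample regime plausible: the head has only a few hundred thousand parameters to fit, while the contextual representations carry the phonetic structure learned during self-supervised pre-training.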
Remarkably, with only ten training samples per word, the detection accuracy exceeded 98.5%, and increasing the number of samples beyond 11 raised the accuracy to 99.7%. The model was also evaluated on English-language data and yielded similar results in terms of accuracy and the number of training samples required. These results demonstrate the effectiveness of self-supervised learning for the KWS task in Arabic in reducing the number of required training samples, and suggest the potential for broader applications in speech processing.
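A sketch of the few-shot fine-tuning loop implied by these experiments follows; the data shapes, keyword count, and hyperparameters are assumptions for illustration, with random tensors standing in for pre-extracted HuBERT features.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative few-shot setup: 10 labeled samples per keyword, 40 keywords.
# These tensors stand in for features (batch, frames, hidden) that would be
# pre-extracted once with the frozen HuBERT encoder.
features = torch.randn(10 * 40, 49, 768)
labels = torch.arange(40).repeat_interleave(10)

# Only this small head is trained; the encoder's weights never change.
head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 40))
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)
for epoch in range(20):
    for x, y in loader:
        optimizer.zero_grad()
        # Mean-pool frame-level features into one utterance embedding,
        # then classify it with the head.
        loss = criterion(head(x.mean(dim=1)), y)
        loss.backward()
        optimizer.step()
```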