Self-supervised learning with a contrastive VideoMoCo framework for Saudi Arabic sign language recognition using 3D convolutional networks

Sci Rep. 2025 Nov 13;15(1):39827. doi: 10.1038/s41598-025-23494-x.

Abstract

Saudi Arabic Sign Language (SArSL) recognition poses significant challenges due to its complex spatio-temporal structure and the scarcity of annotated datasets. This paper introduces a self-supervised learning framework built upon the Video Momentum Contrast (VideoMoCo) paradigm integrated with a 3D ResNet-50 backbone, designed to jointly capture spatial and temporal gesture dependencies. The proposed model is pretrained on 18,000 unlabeled gesture videos and subsequently fine-tuned on the KARSL-502 dataset containing 15,400 labeled samples covering 502 distinct classes. Experimental evaluation shows that the model attains an F1-score of 92.7%, outperforming the CNN-LSTM (86.0%) and Two-Stream CNN (84.5%) baselines, an improvement of nearly 9 percentage points. Beyond accuracy, the framework demonstrates strong robustness to class imbalance, motion variation, and visual noise, while maintaining efficient deployment performance with an inference latency of 12 ms per batch. The ablation study verifies the contribution of the momentum encoder and large negative sample queue in achieving stable and discriminative feature learning. Overall, the VideoMoCo-ResNet-50 framework establishes a scalable and inclusive foundation for real-time SArSL recognition, advancing accessibility for the Saudi Deaf community and supporting future multimodal extensions.
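The momentum encoder and negative-sample queue credited in the ablation study follow the generic MoCo recipe. The sketch below illustrates those two mechanisms in plain NumPy: an exponential-moving-average (EMA) update of the key encoder's parameters, and an InfoNCE loss computed against a queue of negative embeddings. This is a minimal illustration of the standard MoCo formulation, not the authors' implementation; the hyperparameters (momentum m = 0.999, temperature 0.07) are common MoCo defaults assumed here, and the video-specific encoder is abstracted away as precomputed embedding vectors.

```python
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    """EMA update of the key (momentum) encoder:
    theta_k <- m * theta_k + (1 - m) * theta_q.
    Parameters are given as dicts of name -> ndarray."""
    return {name: m * key_params[name] + (1 - m) * query_params[name]
            for name in key_params}

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """InfoNCE loss for one query embedding.

    q       : (D,)   query embedding from the query encoder
    k_pos   : (D,)   positive key from the momentum encoder
    queue   : (K, D) queue of negative key embeddings
    Returns the scalar contrastive loss (cross-entropy with the
    positive pair at logit index 0)."""
    # Cosine similarity via L2 normalization, as in MoCo.
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    l_pos = q @ k_pos          # 1 positive logit
    l_neg = queue @ q          # K negative logits from the queue
    logits = np.concatenate(([l_pos], l_neg)) / temperature
    logits -= logits.max()     # numerical stability
    return -logits[0] + np.log(np.exp(logits).sum())
```

A larger queue supplies more negatives per query without enlarging the batch, which is what lets contrastive pretraining on the 18,000 unlabeled clips remain memory-efficient; the slow EMA update keeps the keys in the queue consistent with each other across iterations.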

Keywords: 3D convolutional neural networks; Arabic sign language recognition; Contrastive learning; Saudi Arabic sign language (SArSL); Self-supervised learning; VideoMoCo.

MeSH terms

  • Deep Learning
  • Gestures
  • Humans
  • Neural Networks, Computer*
  • Saudi Arabia
  • Sign Language*
  • Supervised Machine Learning*
  • Video Recording