Bengali cyberbullying detection: A comprehensive dataset for advanced analysis

Data Brief. 2025 Oct 21:63:112205. doi: 10.1016/j.dib.2025.112205. eCollection 2025 Dec.

Abstract

Cyberbullying has become a major concern in the digital world, with social media platforms facilitating harmful interactions that target individuals. Although cyberbullying has been researched extensively, studies focusing specifically on the Bengali language remain scarce. This paper addresses this gap by analyzing over 70,000 social media comments in Bengali. The dataset is preprocessed for sentiment analysis to classify comments as positive or negative, followed by topic modeling of the negative comments using Latent Dirichlet Allocation (LDA) to extract features from age-, gender-, ethnicity-, and religion-based comments, as well as miscellaneous comments. Several models are applied, including Support Vector Machine, XGBoost, CNN+BiLSTM+GRU, mBERT, and XLM-R. Among the evaluated models, mBERT achieved the highest accuracy, 92%, while the CNN+BiLSTM+GRU hybrid model reached 91%. Incorporating BERT embeddings into CNN and ANN models improved performance further, achieving an accuracy of 93%. To enhance model transparency and trust, Local Interpretable Model-agnostic Explanations (LIME) is applied to interpret predictions.

Keywords: BERT embeddings; Cyberbullying; Explainable AI (XAI); Sentiment analysis; Topic modeling.