Addressing the class imbalance problem in Twitter spam detection using ensemble learning

Liu, Shigang, Wang, Yu, Zhang, Jun, Chen, Chao, and Xiang, Yang (2017) Addressing the class imbalance problem in Twitter spam detection using ensemble learning. Computers & Security, 69. pp. 35-49.

In recent years, microblogging sites such as Twitter have become an important and popular source of real-time information and news, and they have inevitably become a prime target for spammers. A series of incidents has shown that the security threats posed by Twitter spam can reach far beyond the social media platform and affect the real world. To mitigate the threat, many recent studies apply machine learning techniques to classify Twitter spam and report promising results. However, most of these studies overlook the class imbalance problem in real-world Twitter data. In this paper, we experimentally demonstrate that the unequal distribution between the spam and non-spam classes has a great impact on the spam detection rate. To address the problem, we propose FOS, a fuzzy-based oversampling method that generates synthetic data samples from a limited number of observed samples, based on the idea of fuzzy-based information decomposition. Moreover, we develop an ensemble learning approach that learns more accurate classifiers from imbalanced data in three steps. In the first step, the class distribution of the imbalanced data set is adjusted using several strategies: random oversampling, random undersampling and FOS. In the second step, a classification model is built on each of the redistributed data sets. In the final step, a majority voting scheme combines the predictions of all the classification models. We evaluate the approach experimentally on real-world Twitter data. The results indicate that the proposed learning approach can significantly improve the spam detection rate on data sets with an imbalanced class distribution.
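
The three steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on several assumptions: the abstract does not specify FOS, so a simple linear-interpolation oversampler (the hypothetical `interpolate_oversample`) stands in for it; the base classifier is a nearest-centroid model chosen only to keep the example dependency-free; and all function names, labels and toy data are invented for illustration.

```python
import random
from collections import Counter


def split_by_class(X, y):
    """Split rows into minority and majority groups (assumes exactly two classes)."""
    (maj_label, _), (min_label, _) = Counter(y).most_common(2)
    minority = [x for x, label in zip(X, y) if label == min_label]
    majority = [x for x, label in zip(X, y) if label == maj_label]
    return minority, majority, min_label, maj_label


def random_oversample(X, y, rng):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    minority, majority, min_label, maj_label = split_by_class(X, y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    rows = majority + minority + extra
    labels = [maj_label] * len(majority) + [min_label] * (len(minority) + len(extra))
    return rows, labels


def random_undersample(X, y, rng):
    """Keep a random subset of majority rows the same size as the minority class."""
    minority, majority, min_label, maj_label = split_by_class(X, y)
    kept = rng.sample(majority, len(minority))
    return kept + minority, [maj_label] * len(kept) + [min_label] * len(minority)


def interpolate_oversample(X, y, rng):
    """Stand-in for FOS (NOT the paper's method): create synthetic minority rows
    by linear interpolation between two randomly chosen minority rows."""
    minority, majority, min_label, maj_label = split_by_class(X, y)
    extra = []
    for _ in range(len(majority) - len(minority)):
        a, b = rng.choice(minority), rng.choice(minority)
        t = rng.random()
        extra.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    rows = majority + minority + extra
    labels = [maj_label] * len(majority) + [min_label] * (len(minority) + len(extra))
    return rows, labels


def fit_centroids(X, y):
    """Placeholder base learner: store the mean feature vector of each class."""
    sums, counts = {}, Counter(y)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_one(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


def fit_ensemble(X, y, seed=0):
    """Steps 1 and 2: rebalance the data with each strategy, then fit one model per copy."""
    rng = random.Random(seed)
    strategies = [random_oversample, random_undersample, interpolate_oversample]
    return [fit_centroids(*strategy(X, y, rng)) for strategy in strategies]


def predict_majority(models, x):
    """Step 3: combine the individual predictions by majority vote."""
    votes = Counter(predict_one(model, x) for model in models)
    return votes.most_common(1)[0][0]


# Toy imbalanced data set: eight non-spam rows and two spam rows (invented values).
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
     [0.5, 0.5], [0.0, 0.5], [0.5, 0.0], [1.0, 0.5],
     [5.0, 5.0], [6.0, 6.0]]
y = ["non-spam"] * 8 + ["spam"] * 2

models = fit_ensemble(X, y)
print(predict_majority(models, [5.5, 5.5]))
```

Using an odd number of rebalancing strategies avoids tied votes in the two-class case. In practice the placeholder learner would be replaced by whichever classifier is under evaluation, and the stand-in oversampler by the FOS procedure defined in the paper.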

Item ID: 64425
Item Type: Article (Research - C1)
ISSN: 1872-6208
Keywords: online social networks, Twitter, spam detection, machine learning, class imbalance
Copyright Information: © 2016 Elsevier Ltd. All rights reserved.
Date Deposited: 22 Sep 2020 19:43
FoR Codes: 46 INFORMATION AND COMPUTING SCIENCES > 4604 Cybersecurity and privacy > 460499 Cybersecurity and privacy not elsewhere classified @ 100%
SEO Codes: 89 INFORMATION AND COMMUNICATION SERVICES > 8902 Computer Software and Services > 890299 Computer Software and Services not elsewhere classified @ 100%