A performance evaluation of machine learning-based streaming spam tweets detection

Chen, Chao, Zhang, Jun, Xie, Yi, Xiang, Yang, Zhou, Wanlei, Hassan, Mohammad Mehedi, AlElaiwi, Abdulhameed, and Alrubaian, Majed (2015) A performance evaluation of machine learning-based streaming spam tweets detection. IEEE Transactions on Computational Social Systems, 2 (3). pp. 65-76.

View at Publisher Website: https://doi.org/10.1109/TCSS.2016.251603...


Twitter's popularity attracts a growing number of spammers, who send unwanted tweets to promote websites or services and thereby harm normal users. To stop spammers, researchers have proposed a number of mechanisms, and recent work focuses on applying machine learning techniques to Twitter spam detection. However, tweets are retrieved as a stream, and Twitter provides the Streaming API for developers and researchers to access public tweets in real time. A performance evaluation of existing machine learning-based streaming spam detection methods has been lacking. In this paper, we bridge this gap by carrying out a performance evaluation from three aspects: data, feature, and model. A large ground-truth dataset of over 600 million public tweets was created using a commercial URL-based security tool. For real-time spam detection, we further extracted 12 lightweight features for tweet representation. Spam detection was then transformed into a binary classification problem in the feature space, which can be solved by conventional machine learning algorithms. We evaluated the impact of different factors on spam detection performance, including the spam-to-nonspam ratio, feature discretization, training data size, data sampling, time-related data, and the choice of machine learning algorithm. The results show that streaming spam tweet detection remains a big challenge and that a robust detection technique should take into account all three aspects of data, feature, and model.
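
Two of the factors named in the abstract, the spam-to-nonspam ratio and feature discretization, can be illustrated with a minimal sketch. The abstract does not say how the authors sampled their data or binned their features, so the random undersampling and equal-width binning below are assumptions, and the function names `sample_with_ratio` and `discretize` and the toy feature vectors are hypothetical.

```python
import random


def sample_with_ratio(spam, nonspam, nonspam_per_spam, seed=0):
    """Build a labelled set with about `nonspam_per_spam` nonspam rows per spam row.

    Assumption: plain random undersampling of the nonspam class.
    Labels: 1 = spam, 0 = nonspam.
    """
    rng = random.Random(seed)
    n_nonspam = min(len(nonspam), len(spam) * nonspam_per_spam)
    chosen = rng.sample(nonspam, n_nonspam)
    data = [(x, 1) for x in spam] + [(x, 0) for x in chosen]
    rng.shuffle(data)
    return data


def discretize(values, n_bins):
    """Equal-width binning: map each value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy feature vectors (hypothetical; not the paper's 12 features).
spam = [[float(i), 1.0] for i in range(10)]
nonspam = [[float(i), 0.0] for i in range(100)]

train = sample_with_ratio(spam, nonspam, nonspam_per_spam=2)
bins = discretize([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4)
```

Varying `nonspam_per_spam` or `n_bins` and retraining any conventional classifier on the result is one simple way to observe how these two factors change detection performance.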

Item ID: 64419
Item Type: Article (Research - C1)
ISSN: 2329-924X
Copyright Information: © 2016 IEEE.
Funders: Australian Research Council (ARC), Natural Science Foundation of China (NSFC), Natural Science Foundation of Guangdong Province (NSFGP)
Projects and Grants: ARC LP120200266, NSFC61401371, NSFGP Grant 2014A030313130
Date Deposited: 30 Sep 2020 22:02
FoR Codes: 08 INFORMATION AND COMPUTING SCIENCES > 0803 Computer Software > 080303 Computer System Security @ 100%
SEO Codes: 89 INFORMATION AND COMMUNICATION SERVICES > 8902 Computer Software and Services > 890299 Computer Software and Services not elsewhere classified @ 100%