A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter

Naseem, Usman, Razzak, Imran, and Eklund, Peter W. (2020) A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter. Multimedia Tools and Applications, 80. pp. 35239-35266.

[img] PDF (Published Version) - Published Version
Restricted to Repository staff only

View at Publisher Website: https://doi.org/10.1007/s11042-020-10082...


Abstract

Pre-processing plays an essential role in disambiguating the meaning of short-texts, not only in applications that classify short-texts but also for clustering and anomaly detection. Pre-processing can have a considerable impact on overall system performance; however, it is less explored in the literature in comparison to feature extraction and classification. This paper analyzes twelve different pre-processing techniques on three pre-classified Twitter datasets on hate speech and observes their impact on the classification tasks they support. It also proposes a systematic approach to text pre-processing to apply different pre-processing techniques in order to retain features without information loss. In this paper, two different word-level feature extraction models are used, and the performance of the proposed package is compared with state-of-the-art methods. To validate gains in performance, both traditional and deep learning classifiers are used. The experimental results suggest that some pre-processing techniques impact negatively on performance, and these are identified, along with the best performing combination of pre-processing techniques.

Item ID: 79235
Item Type: Article (Research - C1)
ISSN: 1573-7721
Copyright Information: © Springer Science+Business Media, LLC, part of Springer Nature 2020.
Date Deposited: 11 Jul 2023 02:41
FoR Codes: 46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460208 Natural language processing @ 100%
SEO Codes: 22 INFORMATION AND COMMUNICATION SERVICES > 2204 Information systems, technologies and services > 220403 Artificial intelligence @ 100%
More Statistics

Actions (Repository Staff Only)

Item Control Page Item Control Page