A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter
Naseem, Usman, Razzak, Imran, and Eklund, Peter W. (2020) A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter. Multimedia Tools and Applications, 80. pp. 35239-35266.
Abstract
Pre-processing plays an essential role in disambiguating the meaning of short-texts, not only in applications that classify short-texts but also for clustering and anomaly detection. Pre-processing can have a considerable impact on overall system performance; however, it is less explored in the literature in comparison to feature extraction and classification. This paper analyzes twelve different pre-processing techniques on three pre-classified Twitter datasets on hate speech and observes their impact on the classification tasks they support. It also proposes a systematic approach to text pre-processing to apply different pre-processing techniques in order to retain features without information loss. In this paper, two different word-level feature extraction models are used, and the performance of the proposed package is compared with state-of-the-art methods. To validate gains in performance, both traditional and deep learning classifiers are used. The experimental results suggest that some pre-processing techniques impact negatively on performance, and these are identified, along with the best performing combination of pre-processing techniques.
| Item ID: | 79235 |
|---|---|
| Item Type: | Article (Research - C1) |
| ISSN: | 1573-7721 |
| Copyright Information: | © Springer Science+Business Media, LLC, part of Springer Nature 2020. |
| Date Deposited: | 11 Jul 2023 02:41 |
| FoR Codes: | 46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460208 Natural language processing @ 100% |
| SEO Codes: | 22 INFORMATION AND COMMUNICATION SERVICES > 2204 Information systems, technologies and services > 220403 Artificial intelligence @ 100% |