Cost Effective Annotation Framework Using Zero-Shot Text Classification

Kasthuriarachchy, Buddhika, Chetty, Madhu, Shatte, Adrian, and Walls, Darren (2021) Cost Effective Annotation Framework Using Zero-Shot Text Classification. In: Proceedings of the 2021 International Joint Conference on Neural Networks. From: IJCNN: 2021 International Joint Conference on Neural Networks, 18-22 July 2021, Shenzhen, China.


View at Publisher Website: https://doi.org/10.1109/IJCNN52387.2021....


Abstract

High-quality manual annotation of social media data has enabled companies and researchers to develop improved natural language processing applications. However, human text annotation is expensive and time-consuming. Crowdsourcing platforms such as Amazon's Mechanical Turk (MTurk) can be leveraged to create large training corpora for text classification tasks using social media data. Nevertheless, the quality of annotations can vary significantly, depending on the interpretations and motivations of the annotators completing the tasks. Further, the cost of labelling data through MTurk increases when target messages are few and the data contain a significant amount of noise (e.g. promotional messages on Twitter). In this work, we propose a new annotation framework for creating high-quality human-annotated datasets for text classification from social media data. We present a pre-annotation technique based on zero-shot text classification that reduces the adverse effects of a highly skewed distribution of data across target classes. The proposed framework significantly reduces cost and time while maintaining the quality of the annotations. Being generic, it can be applied to annotating text data from any discipline. Our experiment annotating Twitter data with the proposed framework shows a cost reduction of 80% with no compromise in quality.
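
The pre-annotation idea in the abstract can be illustrated with a minimal sketch. It is one possible reading of the approach, not the authors' implementation: the abstract does not describe the pipeline, model, or thresholds. All names here (pre_annotate, toy_scorer, the threshold value, the example labels) are hypothetical, and toy_scorer is a simple keyword stand-in so the example runs without a model; in practice it would be replaced by a real zero-shot classifier.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer type: maps (text, candidate labels) to one score per label.
Scorer = Callable[[str, List[str]], Dict[str, float]]


def toy_scorer(text: str, labels: List[str]) -> Dict[str, float]:
    """Stand-in for a zero-shot model (assumption, not a real classifier):
    scores 1.0 if the label word occurs in the text, else 0.0."""
    words = set(text.lower().split())
    return {label: 1.0 if label.lower() in words else 0.0 for label in labels}


def pre_annotate(
    messages: List[str],
    labels: List[str],
    scorer: Scorer,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split messages into those worth sending to human annotators
    (with a suggested label) and those filtered out as likely noise."""
    to_annotate: List[Tuple[str, str]] = []
    discarded: List[str] = []
    for text in messages:
        scores = scorer(text, labels)
        best_label = max(scores, key=scores.get)
        if scores[best_label] >= threshold:
            to_annotate.append((text, best_label))
        else:
            discarded.append(text)
    return to_annotate, discarded


messages = [
    "my flight was delayed again",
    "buy cheap followers now",
    "lost my luggage at the airport",
]
kept, dropped = pre_annotate(messages, ["flight", "luggage"], toy_scorer)
```

Under these assumptions, only the messages in kept would be sent to crowd annotators, which is where a cost saving would come from when most of the raw data is noise.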

Item ID: 81637
Item Type: Conference Item (Research - E1)
ISBN: 978-1-6654-3900-8
Keywords: Training; Costs; Annotations; Social networking (online); Text categorization; Blogs; Training data
Copyright Information: © 2021 IEEE
Date Deposited: 13 Feb 2024 00:58
FoR Codes: 46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460208 Natural language processing @ 100%
SEO Codes: 22 INFORMATION AND COMMUNICATION SERVICES > 2204 Information systems, technologies and services > 220403 Artificial intelligence @ 100%