RHMD: A Real-World Dataset for Health Mention Classification on Reddit

Naseem, Usman, Khushi, Matloob, Kim, Jinman, and Dunn, Adam G. (2022) RHMD: A Real-World Dataset for Health Mention Classification on Reddit. IEEE Transactions on Computational Social Systems. (In Press)

View at Publisher Website: http://doi.org/10.1109/TCSS.2022.3186883


Abstract

People on social media share their thoughts and experiences using disease and symptom words for purposes other than describing their health, which can introduce biases in data-driven public health applications. To advance research on health mention classification (HMC), in this study we present the Reddit health mention dataset (RHMD), a new dataset of multi-domain Reddit data for HMC. RHMD is composed of 10 015 manually annotated Reddit posts that include 15 common disease or symptom terms and are labeled with four labels: personal health mentions (HMs), nonpersonal HMs, figurative HMs, and hyperbolic HMs. Empirical evaluation using recently proposed methods demonstrates the challenge of labeling user-generated text across these four types. Contributions of this work include the public release of a robustly annotated Reddit dataset (RHMD) for HM tasks and a comprehensive performance analysis of baseline methods. We expect that the release of the dataset and the evaluations will help facilitate the development of new methods for detecting HMs in user-generated text.

Item ID: 79227
Item Type: Article (Research - C1)
ISSN: 2329-924X
Copyright Information: © 2022 IEEE.
Date Deposited: 03 Aug 2023 03:00
FoR Codes: 46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460208 Natural language processing @ 100%
SEO Codes: 22 INFORMATION AND COMMUNICATION SERVICES > 2204 Information systems, technologies and services > 220403 Artificial intelligence @ 100%