WETM: A word embedding-based topic model with modified collapsed Gibbs sampling for short text
Rashid, Junaid, Kim, Jungeun, Hussain, Amir, and Naseem, Usman (2023) WETM: A word embedding-based topic model with modified collapsed Gibbs sampling for short text. Pattern Recognition Letters, 172. pp. 158-164.
PDF (Published Version)
Restricted to Repository staff only
Abstract
Short texts are a common source of knowledge, and the extraction of such valuable information is beneficial for several purposes. Traditional topic models are incapable of analyzing the internal structural information of topics. They are mostly based on the co-occurrence of words at the document level and are often unable to extract semantically relevant topics from short text datasets due to their limited length. Although some traditional topic models are sensitive to word order due to the strong sparsity of data, they do not perform well on short texts. In this paper, we propose a novel word embedding-based topic model (WETM) for short text documents to discover the structural information of topics and words and eliminate the sparsity problem. Moreover, a modified collapsed Gibbs sampling algorithm is proposed to strengthen the semantic coherence of topics in short texts. WETM extracts semantically coherent topics from short texts and finds relationships between words. Extensive experimental results on two real-world datasets show that WETM achieves better topic quality, topic coherence, classification, and clustering results. WETM also requires less execution time compared to traditional topic models.
Item ID: | 79246 |
---|---|
Item Type: | Article (Research - C1) |
ISSN: | 1872-7344 |
Keywords: | Classification, Short text, Topic modeling, Topic coherence |
Copyright Information: | © 2023 Elsevier B.V. All rights reserved. |
Date Deposited: | 13 Dec 2023 01:04 |
FoR Codes: | 46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460208 Natural language processing @ 100% |
SEO Codes: | 22 INFORMATION AND COMMUNICATION SERVICES > 2204 Information systems, technologies and services > 220403 Artificial intelligence @ 100% |