From General Language Understanding to Noisy Text Comprehension

Kasthuriarachchy, Buddhika, Chetty, Madhu, Shatte, Adrian, and Walls, Darren (2021) From General Language Understanding to Noisy Text Comprehension. Applied Sciences, 11 (17). 7814.

View at Publisher Website: https://doi.org/10.3390/app11177814


Abstract

Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. To this end, we propose a new generic methodology to derive a diverse set of sentence vectors by combining and extracting various linguistic characteristics from the latent representations of multi-layer, pre-trained language models. Further, we clearly establish how BERT, a state-of-the-art pre-trained language model, comprehends the linguistic attributes of Tweets, in order to identify appropriate sentence representations. Five new probing tasks are developed for Tweets, which can serve as benchmark probing tasks for studying noisy text comprehension. Experiments measure classification accuracy using sentence vectors derived from GloVe-based pre-trained models, Sentence-BERT, and different hidden layers of the BERT model. We show that the initial and middle layers of BERT capture the key linguistic characteristics of noisy texts better than its later layers. With complex predictive models, we further show that sentence vector length is less important for capturing linguistic information, and that the proposed sentence vectors for noisy texts outperform existing state-of-the-art sentence vectors.
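The abstract describes deriving sentence vectors from different hidden layers of a pre-trained BERT model. A minimal sketch of the underlying operation — mean pooling token representations from a chosen layer, masking out padding — assuming NumPy and synthetic hidden states of shape (num_layers, seq_len, hidden_dim); the function name and shapes are illustrative, not taken from the paper:

```python
import numpy as np

def layer_sentence_vector(hidden_states, layer, attention_mask):
    """Mean-pool token embeddings from one hidden layer into a sentence
    vector, ignoring padded positions. hidden_states has shape
    (num_layers, seq_len, hidden_dim); attention_mask has shape (seq_len,)."""
    layer_out = hidden_states[layer]               # (seq_len, hidden_dim)
    mask = attention_mask[:, None].astype(float)   # (seq_len, 1)
    return (layer_out * mask).sum(axis=0) / mask.sum()

# Toy example: 13 "layers" (embedding layer + 12 transformer layers),
# 5 token positions, hidden size 8 — stand-ins for real BERT outputs.
rng = np.random.default_rng(0)
states = rng.normal(size=(13, 5, 8))
mask = np.array([1, 1, 1, 0, 0])   # last two positions are padding

vec_early = layer_sentence_vector(states, layer=2, attention_mask=mask)
vec_late = layer_sentence_vector(states, layer=12, attention_mask=mask)
print(vec_early.shape)   # (8,)
```

In practice the per-layer hidden states would come from a real model (e.g. requesting all hidden states from a BERT implementation), and vectors pooled from early/middle layers could then be compared against later-layer vectors on the probing tasks, in line with the paper's finding.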

Item ID: 81640
Item Type: Article (Research - C1)
ISSN: 2076-3417
Keywords: sentence representation; probing tasks; language understanding; noisy text
Copyright Information: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Date Deposited: 06 Feb 2024 03:11
FoR Codes: 46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460208 Natural language processing @ 100%
SEO Codes: 22 INFORMATION AND COMMUNICATION SERVICES > 2299 Other information and communication services > 229999 Other information and communication services not elsewhere classified @ 100%
