ChemicalTagger: A tool for semantic text-mining in chemistry

Hawizy, Lezan, Jessop, David M., Adams, Nico, and Murray-Rust, Peter (2011) ChemicalTagger: A tool for semantic text-mining in chemistry. Journal of Cheminformatics, 3. 17.

Available under License Creative Commons Attribution.

View at Publisher Website: https://doi.org/10.1186/1758-2946-3-17
 

Abstract

Background

The primary method for scientific communication is in the form of published scientific articles and theses which use natural language combined with domain-specific terminology. As such, they contain free-flowing unstructured text. Given the usefulness of data extraction from unstructured literature, we aim to show how this can be achieved for the discipline of chemistry. The highly formulaic style of writing most chemists adopt makes their contributions well suited to high-throughput Natural Language Processing (NLP) approaches.

Results

We have developed the ChemicalTagger parser as a medium-depth, phrase-based semantic NLP tool for the language of chemical experiments. Tagging is based on a modular architecture and uses a combination of OSCAR, domain-specific regex and English taggers to identify parts-of-speech. An ANTLR grammar is then used to structure the tagged tokens into tree-based phrases. Using a metric that allows for overlapping annotations, we achieved machine-annotator agreements of 88.9% for phrase recognition and 91.9% for phrase-type identification (Action names).
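
The tag-then-group idea described above can be illustrated with a minimal Python sketch. It is not ChemicalTagger's implementation: the tag names, the regex rules, the tiny `CHEMICAL_LEXICON` (a stand-in for a chemical entity recogniser such as OSCAR), the priority order of the taggers, and the verb-triggered grouping (a stand-in for a formal grammar) are all assumptions made for illustration.

```python
import re

# Hypothetical lexicon standing in for a chemical named-entity recogniser.
CHEMICAL_LEXICON = {"ethanol", "water", "toluene"}

# Hypothetical domain-specific rules: (tag, pattern). Not the tool's real tag set.
REGEX_RULES = [
    ("CD", re.compile(r"\d+(\.\d+)?")),                       # plain numbers
    ("NN-VOL", re.compile(r"mL|ml|L")),                       # volume units
    ("VB-ADD", re.compile(r"add(ed|s)?", re.IGNORECASE)),     # "add" action verbs
    ("VB-STIR", re.compile(r"stir(red|s)?", re.IGNORECASE)),  # "stir" action verbs
]


def tag_token(token):
    """Tag one token: chemical lexicon first, then regex rules, then fallback."""
    if token.lower() in CHEMICAL_LEXICON:
        return "CM"
    for tag, pattern in REGEX_RULES:
        if pattern.fullmatch(token):
            return tag
    # A general English part-of-speech tagger would handle these in practice.
    return "UNK"


def tag(sentence):
    """Whitespace-tokenise a sentence and return (token, tag) pairs."""
    return [(tok, tag_token(tok)) for tok in sentence.split()]


def group_phrases(tagged):
    """Open a new phrase at every action verb (tag prefix 'VB-').

    Tokens that appear before the first action verb are dropped in this sketch.
    """
    phrases = []
    current = None
    for tok, t in tagged:
        if t.startswith("VB-"):
            current = {"action": t[3:], "tokens": [tok]}
            phrases.append(current)
        elif current is not None:
            current["tokens"].append(tok)
    return phrases


if __name__ == "__main__":
    for phrase in group_phrases(tag("added 10 mL ethanol then stirred")):
        print(phrase["action"], phrase["tokens"])
```

A real grammar-based parser would build nested phrase trees rather than the flat, verb-delimited groups produced here; the sketch only shows how tagged tokens can feed a later structuring step.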

Conclusions

It is possible to parse chemical experimental text using rule-based techniques in conjunction with a formal grammar parser. ChemicalTagger has been deployed for over 10,000 patents and has identified solvents from their linguistic context with >99.5% precision.

Item ID: 74839
Item Type: Article (Research - C1)
ISSN: 1758-2946
Copyright Information: © 2011 Hawizy et al; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Date Deposited: 26 Jun 2024 00:47
FoR Codes: 34 CHEMICAL SCIENCES > 3499 Other chemical sciences > 349999 Other chemical sciences not elsewhere classified @ 100%
SEO Codes: 28 EXPANDING KNOWLEDGE > 2801 Expanding knowledge > 280105 Expanding knowledge in the chemical sciences @ 100%