Annotating and Inferring Compositional Structures in Numeral Systems Across Languages

Rubehn, Arne, Rzymski, Christoph, Ciucci, Luca, Bocklage, Katja, Kučerová, Alžběta, Snee, David, Abishek, Stephen, van Dam, Kellen Parker, and List, Johann-Mattis (2025) Annotating and Inferring Compositional Structures in Numeral Systems Across Languages. In: Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP. pp. 29-42. From: SIGTYP 2025: 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, 1 August 2025, Vienna, Austria.

Available under License Creative Commons Attribution.

View at Publisher Website: https://doi.org/10.18653/v1/2025.sigtyp-...
 


Abstract

Numeral systems across the world’s languages vary in fascinating ways, both in their synchronic structure and in the diachronic processes that shaped how they evolved into their current form. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find allomorphy to be the major cause of segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.

Item ID: 86613
Item Type: Conference Item (Research - E1)
ISBN: 979-8-89176-281-7
Copyright Information: © 2025 Association for Computational Linguistics. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.
Date Deposited: 12 Aug 2025 02:40
FoR Codes: 46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460208 Natural language processing @ 50%
47 LANGUAGE, COMMUNICATION AND CULTURE > 4704 Linguistics > 470409 Linguistic structures (incl. phonology, morphology and syntax) @ 50%
SEO Codes: 28 EXPANDING KNOWLEDGE > 2801 Expanding knowledge > 280116 Expanding knowledge in language, communication and culture @ 70%
28 EXPANDING KNOWLEDGE > 2801 Expanding knowledge > 280115 Expanding knowledge in the information and computing sciences @ 30%
