Comparing hard and overlapping clusterings

Horta, Danilo, and Campello, Ricardo (2015) Comparing hard and overlapping clusterings. Journal of Machine Learning Research, 16. pp. 2949-2997.

PDF (Published Version)
Available under License Creative Commons Attribution.

View at Publisher Website: http://www.jmlr.org/papers/volume16/hort...
 


Abstract

Similarity measures for comparing clusterings are an important component of, e.g., evaluating clustering algorithms, consensus clustering, and clustering stability assessment. These measures have been studied for over 40 years in the domain of exclusive hard clusterings (exhaustive and mutually exclusive object sets). In recent years, the literature has proposed measures to handle more general clusterings (e.g., fuzzy/probabilistic clusterings). This paper provides an overview of these new measures and discusses their drawbacks. We ultimately develop a corrected-for-chance measure (13AGRI) capable of comparing exclusive hard, fuzzy/probabilistic, non-exclusive hard, and possibilistic clusterings. We prove that 13AGRI and the adjusted Rand index (ARI, by Hubert and Arabie) are equivalent in the exclusive hard domain. The reported experiments show that only 13AGRI could provide both a fine-grained evaluation across clusterings with different numbers of clusters and a constant evaluation between random clusterings, thereby exhibiting all four desirable properties considered here. We identified a high correlation between 13AGRI applied to fuzzy clusterings and ARI applied to hard exclusive clusterings over 14 real data sets from the UCI repository, which corroborates the validity of 13AGRI for fuzzy clustering evaluation. 13AGRI also showed good results as a clustering stability statistic for solutions produced by the expectation maximization algorithm for Gaussian mixtures.
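
For orientation, the ARI baseline that the abstract refers to can be sketched for the exclusive hard case as below. This is a minimal illustration of the standard Hubert and Arabie formula computed from a contingency table; it is not the paper's 13AGRI measure and does not handle fuzzy, non-exclusive, or possibilistic clusterings. The function name `adjusted_rand_index` and the choice to return 1.0 in the degenerate case are assumptions made for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two exclusive hard clusterings.

    Each argument is a sequence giving the cluster label of every object.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same objects")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two objects counts as agreement

    # Pairs of objects placed together in both clusterings (contingency cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each clustering separately (row and column sums).
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2           # upper bound of the index

    if maximum == expected:
        # Assumption: happens only when both clusterings are a single cluster
        # or both are all singletons, i.e. they are identical.
        return 1.0
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    # Same partition under different label names.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    # Partitions that disagree on every co-clustered pair.
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index is 1 for identical partitions regardless of label names and can be negative when agreement is below the chance level.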

Item ID: 47066
Item Type: Article (Research - C1)
ISSN: 1533-7928
Keywords: overlapping, fuzzy, probabilistic, clustering evaluation
Additional Information:

This article is openly available to the public under a CC-BY license. The copyright remains with the author.

Funders: CNPq-Brazil, São Paulo Research Foundation (FAPESP)
Projects and Grants: CNPq grant #304137/2013-8, FAPESP grant #2009/17469-6, FAPESP grant #2013/18698-4
Date Deposited: 04 Jan 2017 08:04
FoR Codes: 01 MATHEMATICAL SCIENCES > 0104 Statistics > 010401 Applied Statistics @ 100%
SEO Codes: 97 EXPANDING KNOWLEDGE > 970101 Expanding Knowledge in the Mathematical Sciences @ 100%
