Supervised hierarchical clustering using CART
Hancock, T.P., Coomans, D.H., and Everingham, Y.L. (2003) Supervised hierarchical clustering using CART. In: Proceedings of MODSIM 2003 International Congress on Modelling and Simulation (4), pp. 1880-1885. From: MODSIM 2003 International Congress on Modelling and Simulation, 14-17 July 2003, Townsville, QLD, Australia.
The size and complexity of current data mining datasets have eclipsed the limits of traditional statistical techniques. Such large datasets frequently require some form of cluster analysis, usually a hierarchical cluster analysis. However, applying a traditional hierarchical scheme to large datasets requires an additional cluster validation analysis. Classification and Regression Trees (CART) is a non-parametric regression and classification technique that has become popular within the biotechnology and ecological fields. CART's intuitive interpretation and ability to handle large datasets make it easily accessible to the non-statistician, because it presents the statistical relationships it finds in the form of a binary tree. This paper proposes a supervised clustering algorithm capable of finding real clusters within large datasets by using CART to filter the clusters found by any hierarchical technique. The supervision performed by CART filters the results of a hierarchical cluster analysis by merging or removing poorly defined groups. It is common practice to validate a cluster analysis using discriminant analysis; however, this assumes that the correct number of clusters is known. CART implements a selective classification of groups, allowing some groups not to be explicitly classified, a feature not supported by standard discriminant analysis. This selective classification serves two purposes: first, it filters or merges clusters that are not validated by the data; second, it acts as a relationship model for the clusters found and provides statistical measures of certainty for the analysis. An example of this method is presented using Sea Surface Temperatures (SST). This is an ideal choice, as very little statistical cluster analysis has been implemented on this dataset, yet knowledge of such structure is in high demand.
The analysis is performed for a single month, November, for the years 1940 through 2002, where some of the most useful variation is expected. The supervised clustering technique successfully extracted seven meaningful clusters, which were predicted with a cross-validated classification rate of 0.50.
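The general idea in the abstract (cluster hierarchically, then let a classification tree decide which clusters the data can actually support) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes Ward linkage for the hierarchical step, uses scikit-learn's `DecisionTreeClassifier` with a `max_leaf_nodes` limit as a stand-in for CART pruning, and treats any initial cluster that the tree never predicts as "not validated", merging its members into the cluster the tree predicts for them. The function name, parameters, and synthetic data are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.tree import DecisionTreeClassifier


def supervised_cluster_filter(X, n_initial=10, max_leaf_nodes=7, random_state=0):
    """Hypothetical sketch: filter hierarchical clusters with a small tree."""
    # Step 1: unsupervised hierarchical clustering.
    # Ward linkage is an assumption; the abstract allows any hierarchical method.
    Z = linkage(X, method="ward")
    initial = fcluster(Z, t=n_initial, criterion="maxclust")

    # Step 2: fit a size-limited classification tree that predicts the
    # cluster labels from the data (a stand-in for a pruned CART model).
    tree = DecisionTreeClassifier(
        max_leaf_nodes=max_leaf_nodes, random_state=random_state
    )
    tree.fit(X, initial)

    # Step 3: relabel each observation with the tree's prediction.
    # Clusters that no leaf predicts disappear, i.e. they are merged
    # into the clusters the tree does classify explicitly.
    filtered = tree.predict(X)
    return initial, filtered, tree


# Synthetic demonstration data (an assumption, not the SST dataset):
# three well-separated groups in two dimensions.
rng = np.random.default_rng(0)
X_demo = np.vstack(
    [rng.normal(center, 0.2, size=(30, 2)) for center in (0.0, 5.0, 10.0)]
)
initial, filtered, tree = supervised_cluster_filter(
    X_demo, n_initial=6, max_leaf_nodes=3
)
print("initial clusters:", len(set(initial)))
print("clusters kept by the tree:", len(set(filtered)))
```

Because the tree has at most `max_leaf_nodes` leaves, it can keep at most that many clusters, so over-split groups from the hierarchical step are absorbed. A cross-validated classification rate like the one reported above could be estimated in this setting with scikit-learn's `cross_val_score`, although the paper's exact validation procedure is not described here.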
|Item Type:||Conference Item (Refereed Research Paper - E1)|
|Keywords:||cartography; sea surface temperatures; supervised clustering|
|Date Deposited:||20 Jun 2010 23:22|
|FoR Codes:||01 MATHEMATICAL SCIENCES > 0104 Statistics > 010401 Applied Statistics @ 100%|
|SEO Codes:||96 ENVIRONMENT > 9603 Climate and Climate Change > 960399 Climate and Climate Change not elsewhere classified @ 51%; 97 EXPANDING KNOWLEDGE > 970104 Expanding Knowledge in the Earth Sciences @ 49%|