Proximity measures for clustering gene expression microarray data: a validation methodology and a comparative analysis

Jaskowiak, Pablo A., Campello, Ricardo J.G.B., and Costa, Ivan G. (2013) Proximity measures for clustering gene expression microarray data: a validation methodology and a comparative analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10 (4). pp. 845-857.


View at Publisher Website: http://dx.doi.org/10.1109/TCBB.2013.9
 


Abstract

Cluster analysis is usually the first step taken to extract information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance for achieving satisfactory clustering results. Nevertheless, to date there are no comprehensive guidelines on how to choose proximity measures for clustering microarray data. Pearson correlation is the most widely used proximity measure, whereas the characteristics of other measures remain unexplored. In this paper, we investigate the choice of proximity measures for clustering microarray data by evaluating the performance of 16 proximity measures on 52 data sets from time-course and cancer experiments. Our results indicate that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out in the time-course and cancer data evaluations, the choice of measure should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature into a benchmark, along with a new methodology called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
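
For readers unfamiliar with the three commonly employed measures named in the abstract, the following is a minimal illustrative sketch of how Pearson-based, Spearman-based, and Euclidean distances between two expression profiles might be computed. It is not the authors' implementation: the function names, the conversion of a correlation into a distance as 1 minus the correlation, and the toy profiles `gene_a` and `gene_b` are all assumptions made for illustration.

```python
import math

def pearson_distance(x, y):
    """Assumed convention: distance = 1 - Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman_distance(x, y):
    """Spearman correlation is Pearson correlation on the ranks."""
    return pearson_distance(ranks(x), ranks(y))

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles: same shape over time, different magnitude.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]

print(pearson_distance(gene_a, gene_b))
print(spearman_distance(gene_a, gene_b))
print(euclidean_distance(gene_a, gene_b))
```

In this toy case the two correlation-based distances are essentially zero because the profiles have the same shape, while the Euclidean distance is large because the magnitudes differ. This is one reason the choice of proximity measure can change which genes end up clustered together.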

Item ID: 47648
Item Type: Article (Research - C1)
ISSN: 1545-5963
Keywords: cancer; clustering; correlation coefficient; distance; gene expression; proximity measure; similarity; time course
Funders: Coordination for the Improvement of Higher Education Personnel (CAPES, Brazil), Brazilian National Council for Scientific and Technological Development (CNPq), São Paulo Research Foundation (FAPESP), Federação das Associações Comerciais do Estado de São Paulo (FACESP), Interdisciplinary Centre for Clinical Research within the faculty of Medicine at RWTH Aachen University (IZKF Aachen)
Projects and Grants: FACESP process #2011/04247-5
Date Deposited: 10 Mar 2017 00:45
FoR Codes: 01 MATHEMATICAL SCIENCES > 0104 Statistics > 010401 Applied Statistics @ 100%
SEO Codes: 97 EXPANDING KNOWLEDGE > 970101 Expanding Knowledge in the Mathematical Sciences @ 100%
