On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study

Campos, Guilherme O., Zimek, Arthur, Sander, Jörg, Campello, Ricardo, Micenková, Barbora, Schubert, Erich, Assent, Ira, and Houle, Michael E. (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery, 30 (4). pp. 891-927.


View at Publisher Website: http://dx.doi.org/10.1007/s10618-015-044...
 


Abstract

The evaluation of unsupervised outlier detection algorithms is a constant challenge in data mining research. Little is known regarding the strengths and weaknesses of different standard outlier detection models, and the impact of parameter choices for these algorithms. The scarcity of appropriate benchmark datasets with ground truth annotation is a significant impediment to the evaluation of outlier methods. Even when labeled datasets are available, their suitability for the outlier detection task is typically unknown. Furthermore, the biases of commonly-used evaluation measures are not fully understood. It is thus difficult to ascertain the extent to which newly-proposed outlier detection methods improve over established methods. In this paper, we perform an extensive experimental study on the performance of a representative set of standard k nearest neighborhood-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose. Based on the overall performance of the outlier detection methods, we provide a characterization of the datasets themselves, and discuss their suitability as outlier detection benchmark sets. We also examine the most commonly-used measures for comparing the performance of different methods, and suggest adaptations that are more suitable for the evaluation of outlier detection results.

Item ID: 47064
Item Type: Article (Research - C1)
ISSN: 1573-756X
Keywords: unsupervised outlier detection, evaluation, measures, datasets
Funders: São Paulo Research Foundation (FAPESP), CNPq-Brazil, Natural Sciences and Engineering Research Council of Canada (NSERC), Danish Council for Independent Research (DCIR) Technology and Production Sciences (FTP)
Projects and Grants: FAPESP Brazil Grant #2013/18698-4, CNPq Brazil Grant #304137/2013-8, CNPq Brazil Grant #400772/2014-0, DCIR FTP Grant 10-081972
Date Deposited: 04 Jan 2017 08:04
FoR Codes: 49 MATHEMATICAL SCIENCES > 4905 Statistics > 490501 Applied statistics @ 100%
SEO Codes: 97 EXPANDING KNOWLEDGE > 970101 Expanding Knowledge in the Mathematical Sciences @ 100%
