Distributed processing for statistical data mining
Sim, Nigel Graham Donald (2012) Distributed processing for statistical data mining. PhD thesis, James Cook University.
Abstract
The use of information technology (IT) in scientific investigations is now commonplace, largely because IT procedures make it efficient to manage and organise the data sets that technologically aided recording methods can now generate. While such data sets can be advantageous to investigative work, their size and complexity pose special challenges to exploring and revealing their information content.
Data mining procedures offer many general purpose tools for exploring large volumes of data to find patterns and structures that relate response variables to observations. However, data mining techniques need to be matched to the attributes of the input data sets. In general, data sets with larger numbers of input variables require robust and sophisticated techniques to reliably identify patterns and processes within the data, and more sophisticated techniques take more computational time to execute than simple ones. At the same time, many computational problems typical of data mining divide readily into discrete tasks that can be executed independently, in parallel, across many computational resources.
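As a minimal illustration of this kind of divisibility (not drawn from the thesis), the Python sketch below runs a set of hypothetical replicate fits as independent tasks on a pool of workers; the statistic computed is a stand-in for any per-task model fit.

```python
from multiprocessing import Pool
import random


def fit_one_replicate(seed):
    """Hypothetical task: compute a summary statistic from one simulated resample."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for a resampled data set
    return sum(sample) / len(sample)                     # stand-in for a fitted model's output


if __name__ == "__main__":
    # Each replicate depends only on its own seed, so the work divides cleanly
    # into independent tasks that a pool of workers can execute in parallel.
    with Pool(processes=4) as pool:
        estimates = pool.map(fit_one_replicate, range(100))
    print(f"completed {len(estimates)} independent tasks")
```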
The project reported in this thesis generated the components of an eResearch framework. A workflow language was developed to capture the critical aspects of a data mining process, allowing its parallel components to be exploited. A distributed computing framework was then developed to leverage existing data mining tools, such as MATLAB and R, to perform the actual data processing. This framework controls the movement of data and the execution of tasks according to the workflow submitted by the practitioner.
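The abstract does not show the workflow language itself; the following hypothetical Python sketch only illustrates the kind of information such a workflow might capture: tasks, the external tool that executes each step, and the data sets consumed and produced, from which independent (parallelisable) tasks can be identified.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    tool: str                                     # external tool that runs the step, e.g. "R" or "MATLAB"
    inputs: list = field(default_factory=list)    # data sets the task consumes
    outputs: list = field(default_factory=list)   # data sets the task produces


# A toy workflow: one preprocessing step feeding two independent model fits.
workflow = [
    Task("clean", tool="R", inputs=["raw.csv"], outputs=["clean.rda"]),
    Task("fit_a", tool="MATLAB", inputs=["clean.rda"], outputs=["model_a.mat"]),
    Task("fit_b", tool="MATLAB", inputs=["clean.rda"], outputs=["model_b.mat"]),
]


def ready_tasks(tasks, available, done=()):
    """Tasks not yet run whose inputs are all available; these can run in parallel."""
    return [t.name for t in tasks
            if t.name not in done and all(i in available for i in t.inputs)]


print(ready_tasks(workflow, {"raw.csv"}))                               # ['clean']
print(ready_tasks(workflow, {"raw.csv", "clean.rda"}, done={"clean"}))  # ['fit_a', 'fit_b']
```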
The coordinating element within the distributed computing framework is a new task scheduling algorithm, termed "Neglected", which is the major research contribution of this project. "Neglected" is a task matching algorithm that optimises the total execution time of an experiment by minimising unnecessary movement of data. It does so by assigning a task to a resource whose estimated completion time for that task is within a margin of the task's best matching option.
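The sketch below is one plausible reading of that matching rule, not the algorithm as specified in the thesis: the cost model, the 10% margin, and the tie-break by data moved are illustrative assumptions only.

```python
def estimate(task, resource):
    """Estimated completion time: compute cost plus the cost of transferring
    any of the task's data not already local to the resource."""
    transfer = sum(size for name, size in task["data"].items()
                   if name not in resource["local_data"]) / resource["bandwidth"]
    return task["work"] / resource["speed"] + transfer


def match(task, resources, margin=0.1):
    """Among resources whose estimate is within `margin` of the task's best option,
    prefer the one that needs the least data moved (an assumed tie-break)."""
    estimates = {r["name"]: estimate(task, r) for r in resources}
    best = min(estimates.values())
    candidates = [r for r in resources if estimates[r["name"]] <= best * (1 + margin)]
    return min(candidates,
               key=lambda r: sum(size for name, size in task["data"].items()
                                 if name not in r["local_data"]))["name"]


# A faster but data-remote resource versus a slightly slower one that already
# holds the input: both fall within the margin, so the data-local resource wins.
task = {"work": 100.0, "data": {"clean.rda": 50.0}}
fast_remote = {"name": "fast", "speed": 2.0, "bandwidth": 10.0, "local_data": set()}
slow_local = {"name": "local", "speed": 1.8, "bandwidth": 10.0, "local_data": {"clean.rda"}}
print(match(task, [fast_remote, slow_local]))  # 'local'
```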
The "Neglected" task scheduling algorithm was tested in simulation against a commonly used distributed computing scheduling algorithm, the "MinMin" greedy scheduler. The new algorithm significantly outperformed "MinMin" in terms of data transfer, and in most scenarios it also outperformed in terms of total compute time. This is attributed to the reduced transfer overhead required to satisfy the tasks assigned to each resource.
The "Neglected" scheduling algorithm offers improved efficiency in the use of resources and improved time to solution for workflow execution. This, together with the data mining workflow and execution framework, extend and improve overall efficiency, robustness and repeatability in the analysis of new and existing data sets by computationally intensive data mining techniques.