Robust cross-validation of linear regression QSAR models

Konovalov, Dmitry A., Llewellyn, Lyndon E., Vander Heyden, Yvan, and Coomans, Danny (2008) Robust cross-validation of linear regression QSAR models. Journal of Chemical Information and Modeling, 48 (10). pp. 2081-2094.

View at Publisher Website: http://dx.doi.org/10.1021/ci800209k

Abstract

A quantitative structure−activity relationship (QSAR) model is typically developed to predict the biochemical activity of untested compounds from the compounds' molecular structures. The "gold standard" of model validation is blindfold prediction, in which the model's predictive power is assessed from how well the model predicts the activity values of compounds that were not considered in any way during model development/calibration. However, during the development of a QSAR model, it is necessary to obtain some indication of the model's predictive power. This is often done by some form of cross-validation (CV). In this study, the concepts of the predictive power and fitting ability of a multiple linear regression (MLR) QSAR model were examined in the CV context, allowing for the presence of outliers. Commonly used predictive power and fitting ability statistics were assessed via Monte Carlo cross-validation when applied to percent human intestinal absorption, blood-brain partition coefficient, and toxicity values of saxitoxin QSAR data sets, as well as three benchmark data sets with known outlier contamination. It was found that (1) a robust version of MLR should always be preferred over ordinary-least-squares MLR, regardless of the degree of outlier contamination, and that (2) the model's predictive power should only be assessed via robust statistics. The Matlab and Java source code used in this study is freely available from the QSAR-BENCH section of www.dmitrykonovalov.org for academic use. The Web site also contains the Java-based QSAR-BENCH program, which can be run online via Java Web Start (supporting Windows, Mac OS X, Linux/Unix) to reproduce most of the reported results or to apply the reported procedures to other data sets.
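
As an illustration of the kind of comparison the abstract describes, the following is a minimal sketch of Monte Carlo cross-validation applied to an ordinary-least-squares fit and to a robust fit, with prediction error summarised by both a non-robust statistic (RMSE) and a robust one (median absolute error). It is not the QSAR-BENCH code. The Huber M-estimator is used here only as a stand-in for "a robust version of MLR", the data are synthetic, and all function names and parameter values (fit_huber, monte_carlo_cv, the 10% contamination level, the 80/20 split) are assumptions made for this example.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares; returns [intercept, slope_1, ..., slope_p]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def fit_huber(X, y, k=1.345, n_iter=50):
    """Huber M-estimate via iteratively reweighted least squares.

    Assumption: a stand-in robust regression, not necessarily the
    estimator used in the paper.
    """
    A = np.column_stack([np.ones(len(X)), X])
    coef = fit_ols(X, y)  # start from the OLS solution
    for _ in range(n_iter):
        r = y - A @ coef
        # Robust residual scale: normalised median absolute deviation.
        scale = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12
        u = np.abs(r) / scale
        # Huber weights: 1 for small residuals, k/u for large ones.
        w = np.where(u <= k, 1.0, k / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def monte_carlo_cv(X, y, fit, n_splits=200, test_frac=0.2, seed=0):
    """Repeated random train/test splits; pools the test-set errors."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        coef = fit(X[train], y[train])
        errors.append(y[test] - predict(coef, X[test]))
    errors = np.concatenate(errors)
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "median_abs_error": float(np.median(np.abs(errors))),
    }


# Synthetic data (assumption): a linear model with 10% outlier contamination.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=60)
y[:6] += 10.0  # contaminate six responses

ols_stats = monte_carlo_cv(X, y, fit_ols)
huber_stats = monte_carlo_cv(X, y, fit_huber)
print("OLS  :", ols_stats)
print("Huber:", huber_stats)
```

In this sketch the outliers remain in the test sets, so the pooled RMSE is dominated by them for either fit, whereas the median absolute error reflects how well each fit predicts the uncontaminated majority.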

Item ID: 8715
Item Type: Article (Refereed Research - C1)
Keywords: QSAR; cross-validation
ISSN: 1549-960X
Date Deposited: 02 Mar 2010 03:25
FoR Codes: 01 MATHEMATICAL SCIENCES > 0104 Statistics > 010401 Applied Statistics @ 100%
SEO Codes: 97 EXPANDING KNOWLEDGE > 970101 Expanding Knowledge in the Mathematical Sciences @ 100%
Citation Count from Web of Science: 23
Downloads: Total: 2