Cross-validation and normalization for high-dimensional microarray data
Background: In the context of small-sample, high-dimensional microarray data analysis, the prediction accuracy of classifiers is often estimated via resampling procedures (such as cross-validation) that consist of repeated splits of the data into training and test sets. Normalization of microarray data sets is suspected to affect statistical analyses in general and the fitted classifiers in particular. Intuitively, we expect a positively biased accuracy estimate if normalization is performed on all data simultaneously, i.e. without separating training and test data, because information from the test set then leaks into training. Conversely, normalizing training and test sets separately is likely to yield poor predictions, since the two sets may end up on incompatible scales. The goal of this project is to quantitatively assess these two normalization approaches as well as a novel intermediate "addon strategy" recently proposed in the literature.
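The three normalization strategies described above can be illustrated with a minimal Python sketch (the project itself used R; the toy data, standardization as the normalization step, and all variable names here are illustrative assumptions, not the project's actual procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))            # 30 samples, 5 hypothetical genes
train_idx, test_idx = np.arange(20), np.arange(20, 30)

# 1) Simultaneous normalization: statistics computed on ALL samples,
#    so information about the test set leaks into the training data.
mu_all, sd_all = X.mean(axis=0), X.std(axis=0)
X_simultaneous = (X - mu_all) / sd_all

# 2) Separate normalization: each part uses only its own statistics,
#    so training and test data may land on incompatible scales.
X_train = (X[train_idx] - X[train_idx].mean(axis=0)) / X[train_idx].std(axis=0)
X_test_sep = (X[test_idx] - X[test_idx].mean(axis=0)) / X[test_idx].std(axis=0)

# 3) Addon strategy: estimate parameters on the training set only,
#    then apply those *training* parameters to the new test samples.
mu_tr, sd_tr = X[train_idx].mean(axis=0), X[train_idx].std(axis=0)
X_test_addon = (X[test_idx] - mu_tr) / sd_tr
```

The addon variant keeps training and test data on a common scale without letting test samples influence the normalization parameters, which is why it sits between the two extremes.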
Results: In our first cross-validation analyses, simultaneous normalization indeed yielded smaller error rates than separate normalization, while the addon strategy produced intermediate error rates. The preliminary conclusion of the study is thus that normalization issues affect estimated prediction accuracy and that the addon strategy should be recommended. We also implemented a parallel version of the CMA package for cross-validation of classifiers. Parallelization was especially beneficial for computationally intensive classifiers such as lasso or support vector machines and for methods requiring parameter tuning via internal cross-validation; increasing the number of processors up to 250 reduced the computation time substantially.
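Cross-validation parallelizes naturally because the folds are independent. A minimal Python sketch of fold-level parallelism follows (CMA itself is an R package; the nearest-centroid classifier, the toy data, and all function names here are illustrative assumptions):

```python
from multiprocessing import Pool

import numpy as np


def evaluate_fold(args):
    """Fit a nearest-centroid classifier on one fold's training part
    and return the misclassification rate on its test part."""
    X, y, train_idx, test_idx = args
    centroids = np.stack(
        [X[train_idx][y[train_idx] == c].mean(axis=0) for c in (0, 1)]
    )
    # Distance of each test sample to each class centroid.
    d = np.linalg.norm(X[test_idx][:, None, :] - centroids[None, :, :], axis=2)
    pred = d.argmin(axis=1)
    return float(np.mean(pred != y[test_idx]))


def parallel_cv(X, y, n_folds=5, n_workers=2):
    """Distribute the independent cross-validation folds over worker
    processes and average their error rates."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, n_folds)
    jobs = [(X, y, np.setdiff1d(idx, f), f) for f in folds]
    with Pool(n_workers) as pool:
        errors = pool.map(evaluate_fold, jobs)
    return float(np.mean(errors))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = np.repeat([0, 1], 25)
    X = rng.normal(size=(50, 10)) + y[:, None]  # class-shifted toy data
    print(parallel_cv(X, y))
```

Because each fold is an independent job, the speedup scales with the number of workers until the per-fold cost (e.g. internal tuning loops for lasso or SVMs) no longer dominates the scheduling overhead.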
- KONWIHR funding: two months during Multicore-Software-Initiative 2009/2010
- Prof. Dr. Anne-Laure Boulesteix, Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, Ludwig-Maximilians-Universität München