Assessing the performance of PLS regression and CCA with high-dimensional two-block data structure
C. Lalanne (a), E. Duchesnay (a,b,c), V. Frouin (a), B. Thyreau (a), A. Tenenhaus (d), B. Thirion (a,e), J.-B. Poline (a,b,c,e)
(a) CEA, I2BM, Neurospin, Gif-sur-Yvette, France
(b) IFR49, Institut d’Imagerie Neurofonctionnelle, Paris, France
(c) INSERM-CEA U1000, Neuroimaging & Psychiatry Unit, SHFJ, Orsay, France
(d) Supélec, Department of Signal Processing and Electronic Systems, Gif-sur-Yvette, France
(e) INRIA Saclay-Ile-de-France, Parietal project, France
Introduction
Partial Least Squares (PLS) regression and Canonical Correlation Analysis (CCA) make it possible to study co-varying networks of variables between two-block multivariate structures [Wegelin, 2000]. Such reduction methods are particularly useful when dealing with high-dimensional data sets, such as omics data (e.g., sequence polymorphisms or gene products; Waaijenborg & Zwinderman, 2007; Parkhomenko, Tritchler & Beyene, 2009; Lê Cao, Rossouw, Robert-Granié & Besse, 2008) or anatomical neuroimaging studies [Hardoon et al., 2009]. Comparing the performance of these two techniques is, however, challenging, because they give rise neither to a common statistic whose p-values might be compared nor to a likelihood allowing model selection. We propose here a new way to assess the covariation between two data sets of dimensions p and q measured on the same n subjects (p+q>>n), based on the simultaneous reduction of both data sets and the estimation of a common indicator of their degree of association. The statistical significance of the two-block link is assessed with empirical p-values obtained by permutation.
Methods
We generated an artificial data set of mixed categorical (X, n x p) and continuous (Y, n x q) variables using a generative latent variable model, in which we varied both the reliability of the measurements and the intra- and inter-block correlations. In this way, we were able to generate different kinds of cross-covariance matrices whose sparsity reflects the amount of variance shared between the two blocks. Next, we selected the top k ranked features in the X block that were maximally correlated with any one of the q variables in the Y block, using optimized F-tests; this feature selection was carried out within a 10-fold cross-validation (CV) scheme. We then estimated, on the training subjects, the X and Y canonical variates from the PLS and CCA models under varying levels of L1 regularization applied to each block, namely {0, 25, 50, 75, 100}% of penalization on the variable loadings. Factorial scores for the test subjects were then computed from these canonical variates, and a test correlation was computed as the mean cross-product of the first X and Y factor scores for the k-th fold. To estimate the distribution of this statistic under the null hypothesis, the whole procedure---feature selection, regularization and computation of the test correlation---was repeated over 1000 permutations, which ensures that we can draw conclusions at a nominal 5% level when assessing the significance of the results.
Results
Results of our simulations indicate that PLS and CCA are good candidates for dealing with two-block data in the case n << p+q.