Correlated Component Regression (CCR)

Correlated Component Regression (CCR) is part of the XLSTAT-CCR module.
Easy and user-friendly
XLSTAT is seamlessly integrated with Microsoft Excel, the most popular spreadsheet worldwide. This integration makes it one of the simplest available tools to work with, as it follows the same philosophy as Microsoft Excel. The program is accessible in a dedicated XLSTAT tab, and the analyses are grouped into functional menus. The dialog boxes are user-friendly, and setting up an analysis is straightforward.
Data and results shared seamlessly
One of the greatest advantages of XLSTAT is the way you can share data and results seamlessly. As the results are stored in Microsoft Excel, anyone can access them: the recipient needs no XLSTAT license or additional viewer, which makes teamwork easier and more affordable. In addition, results are easily integrated into other Microsoft Office applications such as PowerPoint, so that you can create striking presentations in minutes.
Modular
XLSTAT is a modular product. XLSTAT-Pro is the core statistical module of XLSTAT and includes all the mainstream functionalities in statistics and multivariate analysis. More advanced features contained in add-on modules can be added for specific applications. This way you can adapt the software to your needs, making it more cost-efficient.
Didactic
The results of XLSTAT are organized by analysis and are easy to navigate. Moreover, useful information is provided along with the results to assist you in your interpretation.
Affordable
XLSTAT is a complete and modular analytical solution that can suit any analytical business need. It is very reasonably priced, so the return on your investment is almost immediate. Any XLSTAT license comes with top-level support and assistance.
Accessible - Available in many languages
We have ensured XLSTAT is accessible to everyone by making the program available in many languages, including Chinese, English, French, German, Italian, Japanese, Polish, Portuguese and Spanish.
Automatable and customizable
Most of the statistical functions available in XLSTAT can be called directly from the Visual Basic window of Microsoft Excel. They can be modified and integrated into larger programs to fit the specifics of your domain. Adding tables and plots, as well as modifying existing outputs, becomes easy. Furthermore, the XLSTAT dialog boxes include special tools to automatically generate the VBA code needed to reproduce your analysis in the VBA editor, or simply to load pre-set settings. This effortless automation of routine analyses will be a huge time saver.
The XLSTAT-CCR module is based on the CORExpress® technology developed by Statistical Innovations. Statistical Innovations, a Boston-based firm specializing in innovative applications of statistical modeling, was founded in 1981 by Dr. Jay Magidson. It has since pioneered many techniques that have become standard in data analysis and data mining.
For more information about Statistical Innovations, please visit www.statisticalinnovations.com.
The four regression methods available in the Correlated Component Regression (CCR) module use fast cross-validation to determine the amount of regularization to produce reliable predictions from data with P correlated explanatory (X) variables, where multicollinearity may exist and P can be greater than the sample size N. The methods are based on Generalized Linear Models (GLM). As an option, the CCR step-down algorithm may be activated to exclude irrelevant Xs.
The linear part of the model is a weighted average of K components S = (S1, S2, …, SK), where each component is itself a linear combination of the predictors. For a dichotomous Y, these methods provide an alternative to logistic regression (CCR-Logistic) and linear discriminant analysis (CCR-LDA). For a continuous Y, these procedures provide an alternative to traditional linear regression methods, with components that may be correlated (CCR-LM procedure) or restricted to be uncorrelated, the latter obtained by PLS regression techniques (CCR-PLS). Typically K < P, resulting in model regularization that reduces prediction error.
Traditional maximum likelihood regression methods, which employ no regularization at all, can be obtained as a special case of these models when K=P (the saturated model). Regularization, inherent in the CCR methods, reduces the variance (instability) of prediction and also lowers the mean squared error of prediction when the predictors have moderate to high correlation. The smaller the value for K, the more regularization is applied. Typically, K will be less than 10 (quite often K = 2, 3 or 4) regardless of P. M-fold cross-validation techniques are used to determine the amount of regularization K* to apply, and the number of predictors P* to include in the model when the step-down algorithm is utilized.
When the CCR step-down option is activated with M-fold cross-validation, output includes a table of predictor counts, reporting the number of times each predictor is included in a model estimated with one omitted fold. The counts can be used as an alternative measure of variable importance (Tenenhaus, 2010), as a supplement to the standardized regression coefficients. Additional options can limit the number of predictors to be included in the model.
The regression methods in the XLSTAT-CCR module differ according to the assumptions made about the scale type of the dependent variable Y (continuous vs. dichotomous), and the distributions (if any) assumed about the predictors.
Linear regression (CCR.LM, PLS)
Predictions for the dependent variable Y based on the linear regression model are obtained as follows:
Pred(Y) = S(S'DS)⁻¹S'DY
where D is a diagonal matrix with case weights as the diagonal entries.
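This formula is simply weighted least squares of Y on the component scores. A minimal NumPy sketch with made-up data and weights (in practice the component matrix S comes from the CCR estimation itself):

```python
import numpy as np

# Hypothetical data: N = 5 cases, K = 2 components (columns of S).
S = np.array([[1.0, 0.5],
              [2.0, 1.0],
              [3.0, 0.0],
              [4.0, 2.0],
              [5.0, 1.5]])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1])
w = np.array([1.0, 1.0, 2.0, 1.0, 1.0])   # case weights
D = np.diag(w)

# Pred(Y) = S (S'DS)^{-1} S'DY: solve the weighted normal equations
# for the component weights, then predict.
b = np.linalg.solve(S.T @ D @ S, S.T @ D @ y)
pred = S @ b
```

With equal case weights this reduces to ordinary least squares on the components; an intercept can be handled by appending a constant column to S.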
For example, with K=2 components we have:
Pred(Y) = α + b1.2S1 + b2.1S2
where b1.2 and b2.1 are the component weights, the components defined as:
S1 = Σg=1:P(λg.1Xg) and S2 = Σg=1:P(λg.2Xg)
where λg.1 and λg.2 are component coefficients (loadings) for the gth predictor on components S1 and S2, respectively. The component weights and loadings are obtained from traditional OLS regression. By substitution we get the reduced-form expression:
Pred(Y) = α + Σg=1:P(b1.2λg.1 + b2.1λg.2) Xg
where βg = b1.2λg.1 + b2.1λg.2 is the (regularized) regression coefficient associated with predictor Xg.
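By construction, predicting from the components and predicting from the reduced-form coefficients give identical results. A sketch that checks this algebra, using random loadings purely as placeholders (in CCR the loadings come from the estimation itself):

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 4, 30
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, -0.5, 0.25, 0.0]) + rng.normal(scale=0.1, size=N)

# Placeholder loadings for K = 2 components: columns are (lambda_g.1, lambda_g.2).
lam = rng.normal(size=(P, 2))
S = X @ lam                      # S1, S2 as linear combinations of the Xg

# OLS of y on the components gives the intercept and component weights b1.2, b2.1.
Sc = np.column_stack([np.ones(N), S])
coef, *_ = np.linalg.lstsq(Sc, y, rcond=None)
alpha, b = coef[0], coef[1:]

# Reduced-form coefficients: beta_g = b1.2 * lambda_g.1 + b2.1 * lambda_g.2
beta = lam @ b
pred_components = alpha + S @ b
pred_reduced = alpha + X @ beta
```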
Regardless of which linear regression model (CCR-LM or PLS) is used to generate the predictions, when the number of components K equals the number of predictors P, the results are identical to those obtained from traditional least squares (OLS or WLS) regression. Traditional least squares regression produces unbiased predictions, but such predictions may have large variance and hence a higher mean squared error than regularized solutions (K < P). Thus, predictions obtained from the CCR module are typically more reliable than those obtained from a traditional regression model.
Methods CCR.LM and PLS assume that the dependent variable Y is continuous:
- CCR.LM is invariant to standardization and also allows the components to be correlated (recommended)
- PLS produces different results depending upon whether or not the predictors are standardized to have variance 1. By default, the PLS ‘standardize’ option is activated.
Logistic Regression (CCR.Logistic) and Linear Discriminant Analysis (CCR.LDA)
Logistic regression is the standard regression (classification) approach for predicting a dichotomous dependent variable. Both linear and logistic regression are GLMs (Generalized Linear Models) in that a linear combination of the explanatory variables (the ‘linear predictor’) is used to predict a function of the dependent variable. In the case of linear regression, the mean of Y is predicted as a linear function of the X variables. For logistic regression, the logit of Y is predicted as a linear function of X:
Logit(Y|S) = α + b1.2S1 + b2.1S2
which in reduced form yields:
Logit(Y|X) = α + Σg=1:P(b1.2λg.1 + b2.1λg.2) Xg
Logit(Y), defined as the natural logarithm of the probability of being in dependent variable group 1 (say Y=1) divided by the probability of being in group 2 (say Y=0), can easily be transformed to yield the probability of being in either category. For example, the conditional probability of being in group 1 can be expressed as:
Prob(Y=1|X) = exp(Logit(Y|X)) / (1+exp(Logit(Y|X))) = 1 / [1+exp(-Logit(Y|X))] and Prob(Y=0|X) = 1 / [1+exp(Logit(Y|X))]
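These two expressions are straightforward to compute; a minimal sketch:

```python
import math

def prob_group1(logit):
    # Prob(Y=1|X) = 1 / (1 + exp(-Logit(Y|X)))
    return 1.0 / (1.0 + math.exp(-logit))

def prob_group0(logit):
    # Prob(Y=0|X) = 1 / (1 + exp(Logit(Y|X)))
    return 1.0 / (1.0 + math.exp(logit))
```

Note that the two probabilities sum to 1 for any value of the logit, and a logit of 0 corresponds to a 50/50 split between the groups.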
Thus, the logistic regression model is a model for predicting the probability of being in a particular group. Predictions are reported for group 1, which is defined as the category of Y associated with the higher of the 2 numeric values taken on by Y. Linear Discriminant Analysis (LDA) is another model used commonly to obtain predicted probabilities for a dichotomous Y:
- CCR.LDA assumes that the X variables follow a multivariate normal distribution within each Y group, with different group means but common variances and covariances.
- CCR.Logistic makes no distributional assumptions.
The CCR.LM method applies CCR techniques to obtain a regularized linear regression based on the Correlated Component Regression (CCR) model for a continuous Y (Magidson, 2010; Magidson and Wassmann, 2010). It is recommended especially in cases where several explanatory variables have moderate to high correlation.
Method CCR.LM differs from Method PLS in that the components are allowed to be correlated, there is no need to deflate (and then restore) predictors, and similar to traditional OLS regression, predictions are invariant to linear transformations applied to the predictors. Thus, the explanatory variables do not need to be standardized prior to estimation.
The PLS method applies CCR techniques to obtain a regularized linear regression based on the PLS regression (PLS) model for a continuous Y. For an introduction to PLS regression see Tenenhaus (1998). For a comparison of the CCR.LM and PLS methods see Tenenhaus (2011).
Unlike CCR.LM which is invariant with respect to the scale of the predictors, when K < P, PLS regression can yield substantially different predictions depending upon whether the predictors are standardized or not. For a detailed comparison of CCR.LM, PLS with unstandardized Xs and PLS with standardized Xs, see Magidson (2011).
CCR.LDA and CCR.Logistic
The CCR.LDA and CCR.Logistic methods apply CCR techniques to obtain regularized regressions based on the Correlated Component Regression (CCR) model for a dichotomous Y.
Notes on Correlated Component Regression:
Case of K = P
Depending upon which method is selected (CCR.LM, PLS, CCR.LDA, or CCR.Logistic), in the case where P < N, setting K = P yields the corresponding (saturated) regression model:
- Method CCR.LM (or PLS) is equivalent to OLS regression (for K = P)
- Method CCR.Logistic yields traditional Logistic regression (for K = P)
- Method CCR.LDA yields traditional Linear Discriminant Analysis (for K = P)
where prior probabilities are computed from group sizes.
R rounds of M-fold Cross-validation (CV) may be used to determine the number of components K* and number of predictors P* to include in a model. For R>1 rounds, the standard error of the relevant CV statistic is also reported. When multiple records (rows) are associated with the same case ID (in XLSTAT, case IDs are specified using ‘Observation labels’), for each round, the CV procedure assigns all records corresponding to the same case to the same fold.
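The grouped fold assignment described above (all records sharing a case ID placed in the same fold) can be sketched as follows; the function name and shuffling scheme are illustrative, not XLSTAT's actual implementation:

```python
import random

def assign_folds(case_ids, M, seed=0):
    """Assign each record to one of M folds so that all records sharing
    a case ID land in the same fold (a grouped M-fold split)."""
    unique = sorted(set(case_ids))
    rng = random.Random(seed)          # a new seed per round gives a new split
    rng.shuffle(unique)
    fold_of = {cid: i % M for i, cid in enumerate(unique)}
    return [fold_of[cid] for cid in case_ids]
```

Running R rounds then amounts to repeating this split R times with different seeds and averaging the resulting CV statistics.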
The Automatic Option in M-fold Cross-Validation
When CV is performed in Automatic mode (see the ‘Automatic’ option in the Options tab), a maximum number K is specified for the number of components, all K models containing between 1 and K components are estimated, and K* is chosen as the model with the best CV statistic. When the step-down option is also activated, the K models are estimated with all predictors prior to beginning the step-down algorithm.
The CV statistic used to determine K* depends upon the model type as follows:
- For CCR.LM or PLS: The CV-R2 is the default statistic. Alternatively, the Normed Mean Squared Error (NMSE) can be used instead of CV-R2.
- For CCR.LDA or CCR.Logistic: The CV-Accuracy (ACC), based on the probability cut-point of .5, is used by default. In the case of two or more values of K yielding identical values for CV-Accuracy, the one with the higher value for the Area Under the ROC Curve (AUC) is selected.
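The tie-breaking rule for CCR.LDA and CCR.Logistic amounts to a lexicographic comparison of (CV-Accuracy, AUC); a sketch with hypothetical function and argument names:

```python
def select_k(cv_acc, cv_auc):
    """K* = the K with the highest CV-Accuracy; ties are broken in
    favor of the K with the higher Area Under the ROC Curve (AUC).
    cv_acc, cv_auc: dicts mapping K -> CV statistic."""
    return max(cv_acc, key=lambda k: (cv_acc[k], cv_auc[k]))
```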
Predictor Selection Using the CCR/Step-Down Algorithm
In step 1 of the step-down option, a model containing all predictors is estimated with K* components (where K* is specified by the user or determined by the program if the Automatic option is activated), and the relevant CV statistics are computed. In step 2, the model is then re-estimated after excluding the predictor whose standardized coefficient is smallest in absolute value, and CV statistics are computed again. Note that both steps 1 and 2 are performed within each subsample formed by eliminating one of the folds. This process continues until the user-specified minimum number of predictors remains in the model (by default, Pmin = 1). The number of predictors included in the reported model, P*, is the one with the best CV statistic.
In any step of the algorithm, if the number of predictors remaining in the model falls below K*, the number of components is automatically reduced by 1, so that the model remains saturated. For example, suppose that K* = 5, but after a certain number of predictors are eliminated, P = 4 predictors remain. Then K* is reduced to 4 and the step-down algorithm continues. If a maximum number of predictors to be included in a model, Pmax, is specified, the step-down algorithm still begins with all predictors included in the model, but results are reported only for P less than or equal to Pmax, and the CV statistics are only examined for P in the range [Pmin, Pmax].
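The step-down loop can be sketched as follows. This is an illustrative simplification: plain OLS stands in for the CCR fit with K* components, and cv_score is a caller-supplied stand-in for the M-fold CV statistic (higher is better):

```python
import numpy as np

def step_down(X, y, cv_score, p_min=1):
    """Repeatedly drop the predictor whose standardized coefficient is
    smallest in absolute value, scoring each predictor subset with
    cv_score(X_sub, y); return the subset of column indices with the
    best score. (Simplified: OLS replaces the CCR fit.)"""
    active = list(range(X.shape[1]))
    best_score, best_set = -np.inf, list(active)
    while len(active) >= p_min:
        Xs = X[:, active]
        score = cv_score(Xs, y)
        if score > best_score:
            best_score, best_set = score, list(active)
        if len(active) == p_min:
            break
        # OLS with intercept; standardize coefficients by predictor spread.
        coef, *_ = np.linalg.lstsq(
            np.column_stack([np.ones(len(y)), Xs]), y, rcond=None)
        std_coef = coef[1:] * Xs.std(axis=0)
        active.pop(int(np.argmin(np.abs(std_coef))))
    return best_set
```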
Copyright ©2011 Statistical Innovations Inc. All rights reserved. Patent Pending.
- Getting started with Correlated Component Regression (CCR) in XLSTAT-CCR
- Using Correlated Component Regression with a Dichotomous Y and Many Correlated Predictors
- Obtaining Predictions from a 2-class Regression