Using Correlated Component Regression with a Dichotomous Y and Many Correlated Predictors

Dataset for running Correlated Component Regression LDA model (CCR.LDA)

This tutorial is based on data simulated according to the assumptions of Linear Discriminant Analysis (LDA) with 2 groups (ZPC1=1,0). The number of available predictors is P = 84 including 28 valid predictors (listed in Table 1 with their true coefficients), some with high within-group correlation, and 56 irrelevant predictors ‘INDPT1’ – ‘INDPT28’ and ‘extra1’ – ‘extra28’ (with true coefficients equal to 0). We generated 100 simulated samples, each consisting of N=50 cases, with equal group sizes N1 = N2 = 25.
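As a rough illustration of what "simulated according to the assumptions of LDA" means, the sketch below draws one sample of 50 cases in which both groups share the same covariance structure and differ only in their means. The mean shifts, the single shared factor used to induce within-group correlation, and the function name `simulate_lda_sample` are all assumptions made for this example; they are not the values in Table 1 or the generator used for the tutorial dataset.

```python
import random

def simulate_lda_sample(n_per_group=25, n_valid=28, n_irrelevant=56, seed=1):
    """Draw one sample under LDA-style assumptions: Gaussian predictors with the
    same covariance in both groups, differing only in their group means."""
    rng = random.Random(seed)
    # Hypothetical mean shifts for the valid predictors (NOT the Table 1 values).
    shifts = [0.5] * n_valid
    rows = []
    for group in (1, 0):
        for _ in range(n_per_group):
            # A shared latent factor induces within-group correlation
            # among the valid predictors (an assumption of this sketch).
            factor = rng.gauss(0.0, 1.0)
            valid = [group * shifts[j] + 0.7 * factor + 0.7 * rng.gauss(0.0, 1.0)
                     for j in range(n_valid)]
            # Irrelevant predictors are pure noise, unrelated to the group.
            irrelevant = [rng.gauss(0.0, 1.0) for _ in range(n_irrelevant)]
            rows.append([group] + valid + irrelevant)
    return rows

# One sample: 50 rows, each holding the group label followed by 84 predictors.
sample = simulate_lda_sample()
```

Repeating the call with 100 different seeds would give a stacked file of 5,000 rows, which is the layout the validation settings later in this tutorial rely on.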

Table 1: True LDA Logit Model Coefficients

Goal of the CCR.LDA model in this example

CCR applies the proper amount of regularization (through the number of components K) to reduce the confounding effects of high predictor correlation, and the CCR step-down algorithm excludes irrelevant and weak predictors, leaving a relatively small number of predictors P*. The resulting sparse model provides better prediction (better classification) and coefficient estimates closer to the true values than traditional stepwise LDA, which imposes no regularization at all.

For illustration, this tutorial focuses on simulation #1 (N=50). A summary of the results from all 100 simulations can be found in Magidson (2010).

Setting up a Correlated Component Regression LDA

To activate the Correlated Component Regression dialog box, first start XLSTAT by clicking the XLSTAT start button in the Excel toolbar, then select the XLSTAT / Modeling data / Correlated Component Regression command in the Excel menu or click the corresponding button on the Modeling data toolbar.

Once you have clicked the button, the Correlated Component Regression dialog box is displayed with the Method=CCR.LM (linear regression model) selected by default. In the Method section, select the CCR.LDA (linear discriminant analysis model) option.

Figure 1. General Tab

In the Y/ Dependent variables field, use your mouse to select the (column A) variable ‘ZPC1’.

The ZPC1 values are the "Ys" of the model: we want to predict the probability of being in group ZPC1=1 as a function of the 84 predictors. Specifically, Logit(Y) is modeled as a linear function of the predictors, where Logit(Y) = ln(Prob[Y=1|X] / (1 - Prob[Y=1|X])), or equivalently Prob[Y=1|X] = exp(Logit(Y)) / (1 + exp(Logit(Y))).
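The relationship between a linear predictor and a group probability can be sketched in a few lines. This uses the standard definition of the logit, ln(p / (1 - p)), and its inverse; the function names and the example intercept and coefficients are hypothetical and are not taken from the fitted model.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predicted_probability(intercept, coefficients, x):
    """Prob[Y=1|X] when the logit is a linear function of the predictors."""
    z = intercept + sum(b * v for b, v in zip(coefficients, x))
    return inverse_logit(z)

# Hypothetical two-predictor example (values chosen only for illustration).
p = predicted_probability(0.0, [1.0, -1.0], [2.0, 2.0])
```

A logit of 0 corresponds to a probability of 0.5, so cases with a positive predicted logit are the ones assigned to group ZPC1=1 under an equal-cost cutoff.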

In the X/ Predictors field, select the 84 predictors.

The case ID of the subjects (ID) has also been selected as Observation labels.

Figure 2. General Tab

In the Options tab of the dialog box, enter ‘5’ as the number of components and activate the Step-down option. Make sure that the settings match those shown in Figure 3.

Figure 3. Options Tab

In the Validation tab of the dialog box, activate the Validation option and select ‘N last rows’ from the Validation set drop-down menu. In the Number of observations field, type ‘4950’. This designates the first 50 rows of the data file (simulation #1) as the training set and the last 4,950 rows (simulations #2-100) as the validation set. Then activate the Cross-validation option, change the number of folds from the default ‘10’ to ‘5’, and activate the ‘Stratify’ option.

Make sure that the settings are as shown below.

Figure 4. Validation Tab
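The validation settings can be pictured as two separate operations: holding out the last 4,950 rows, and dividing the 50 training cases into 5 folds with the two groups balanced in each. The sketch below shows one simple way to do both. It is an illustration only, not XLSTAT's internal procedure, and the helper names `split_last_n` and `stratified_folds` are hypothetical.

```python
import random

def split_last_n(rows, n_validation):
    """Use the last n_validation rows for validation and the rest for training."""
    return rows[:-n_validation], rows[-n_validation:]

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign case indices to folds so each fold gets a near-equal share of each group."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for value in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == value]
        rng.shuffle(indices)
        # Deal the shuffled cases of this group out to the folds in turn.
        for position, i in enumerate(indices):
            folds[position % n_folds].append(i)
    return folds

# 5,000 stacked rows: simulation #1 first, then simulations #2-100.
rows = list(range(5000))
training, validation = split_last_n(rows, 4950)

# Group labels for the 50 training cases (25 per group, order assumed here).
labels = [1] * 25 + [0] * 25
folds = stratified_folds(labels)
```

With 25 cases per group and 5 folds, each fold holds 10 cases, 5 from each group, so every cross-validation round trains on 40 cases and tests on 10.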

Estimate the 5-component model

Click OK to estimate the model.

Interpreting the Results of a CCR Model with 10 Predictors

The Cross-Validation Step-down Plot shows that for K=5 components the Cross-validation Accuracy (CV-ACC) is best with P=10 predictors.

Figure 5. Plot of Cross-validated Area Under ROC Curve (CV-AUC) and Cross-validated Accuracy (CV-ACC) for K=5, N=50
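Reading the step-down plot amounts to picking the number of predictors with the highest cross-validated accuracy. A minimal sketch of that selection follows. The accuracy values are made up for illustration and are not the tutorial's results, and breaking ties in favour of the smaller model is an assumption of this sketch, not documented XLSTAT behaviour.

```python
def best_predictor_count(cv_accuracy_by_p):
    """Return the predictor count with the highest CV accuracy.
    Ties are broken in favour of fewer predictors (an assumption)."""
    best = max(cv_accuracy_by_p.values())
    return min(p for p, acc in cv_accuracy_by_p.items() if acc == best)

# Hypothetical step-down trace: number of predictors -> CV-ACC.
trace = {84: 0.70, 40: 0.76, 20: 0.82, 10: 0.88, 5: 0.84, 1: 0.66}
p_star = best_predictor_count(trace)
```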

Goodness of fit statistics and unstandardized coefficients for the 5-component model with 10 predictors are given below.

[Table: Goodness of fit statistics]

[Table: Unstandardized coefficients for the 5-component model]

These results obtained from CCR.LDA outperform step-wise linear discriminant analysis in the following respects:

  • More valid predictors included in the model: 10 for CCR.LDA vs. 4 for step-wise LDA.
  • Fewer irrelevant predictors included in the model: 0 for CCR.LDA vs. 2 for step-wise LDA.
  • Higher accuracy as determined from the validation sample: 83.6% for CCR.LDA vs. 77.8% for step-wise LDA.
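The validation accuracy quoted above is the share of validation cases assigned to their true group, which can be read off a confusion matrix as the diagonal total divided by the grand total. The sketch below shows the calculation. The counts are hypothetical and do not reproduce the tutorial's confusion matrix.

```python
def accuracy(confusion):
    """Proportion of correctly classified cases.
    confusion[i][j] = number of cases from true group i predicted as group j."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical 2x2 confusion matrix (rows: true group, columns: predicted group).
hypothetical = [[40, 10],
                [5, 45]]
acc = accuracy(hypothetical)
```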

The results for step-wise LDA are provided below.

[Table: Classification functions and beta]

[Table: Confusion matrix]

Overall, the results based on all simulated samples show that CCR.LDA outperforms step-wise LDA as well as penalized regression on these data (Magidson, 2010).

Reference: Magidson, J. (2010). Correlated Component Regression: A Prediction/Classification Methodology for Possibly Many Features. 2010 Proceedings of the American Statistical Association.