Getting started with Correlated Component Regression (CCR) in XLSTAT-CCR

Dataset for running Correlated Component Regression

This tutorial is based on data provided by Michel Tenenhaus and used in Magidson (2011), “Correlated Component Regression: A Sparse Alternative to PLS Regression”, 5th ESSEC-SUPELEC Statistical Workshop on PLS (Partial Least Squares) Developments.

The data consist of N=24 car models, the dependent variable PRICE (the price of a car), and P=6 explanatory variables (predictors). Each predictor is positively correlated with PRICE:

Correlation with price

However, each predictor is also moderately correlated with the other predictors:

Correlation matrix


Goal of CCR for this example

CCR applies the proper amount of regularization to reduce the confounding effects of high predictor correlation. Compared with traditional OLS regression, this yields more interpretable regression coefficients and better predictions, and allows more significant predictors to be included in the model.

The OLS regression solution maximizes R2 in the training sample, yielding R2= .85. However, since this solution is based on a relatively small sample (N=24) and correlated predictors, it is likely that this model overfits the data and that .85 is an overly optimistic estimate of the true population R2. Consistent with an overfit model, Table 1 shows that the OLS solution yields large standard errors and unrealistic negative coefficient estimates for the predictors CYLINDER, SPEED, and WIDTH.

OLS parameters

Table 1: Results from traditional OLS regression: CV-R2 = 0.63

Moreover, POWER is the only predictor that achieves statistical significance (p=.05) according to the traditional t-test.

CCR uses the cross-validated R2 (CV-R2) as its criterion for determining the proper amount of regularization, that is, the number of components K to retain in the regression model. Fig. 1 shows that substantial decay in CV-R2 occurs for K>2. Thus, a substantial amount of regularization is required: with K=2, CV-R2 = .75 for CCR compared to .63 for OLS regression.

 Cross-Validation component plot

Fig. 1. Cross-Validation Component (CV-R2) Plot showing deterioration for K>2

Also, in contrast to OLS regression, which yields some negative coefficient estimates, CCR yields more reasonable positive coefficients for all 6 predictors, as shown below.

Unstandardized coefficients

Standardized coefficients

Table 2. CCR unstandardized / standardized coefficients with K=2 components.
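
The two coefficient sets in Table 2 describe the same model on different scales. As a minimal sketch, assuming the usual textbook convention (this tutorial does not state which formula XLSTAT uses), a standardized coefficient is the unstandardized one multiplied by sd(x)/sd(y). The data below are synthetic, and `standardize_coefficients` is a hypothetical helper, not an XLSTAT function.

```python
import numpy as np

def standardize_coefficients(beta, X, y):
    """Rescale slopes to 'standard deviations of y per standard deviation of x'."""
    return beta * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Synthetic data (NOT the car data): three predictors on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(24, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([2.0, 0.3, 0.01]) + rng.normal(size=24)

# Unstandardized OLS slopes (the intercept is dropped after fitting).
design = np.column_stack([np.ones(len(y)), X])
beta = np.linalg.lstsq(design, y, rcond=None)[0][1:]

beta_std = standardize_coefficients(beta, X, y)
print(beta)      # slopes in the original units of each predictor
print(beta_std)  # unit-free slopes, comparable across predictors
```

Because the standardized values are unit-free, they are the ones to compare when judging the relative importance of predictors measured in different units.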

The first part of this tutorial shows how to use XLSTAT-CCR to obtain these results.  The second part (see ‘Activating the Step-down Algorithm’) shows how to activate the CCR step-down procedure to eliminate extraneous predictors and obtain even better results (CV-R2 = .77) as indicated in the following table.

 Step-down unstandardized coefficients

  Table 3. Results from CCR with step-down algorithm

 

Setting up a Correlated Component Regression

To activate the Correlated Component Regression dialog box, first start XLSTAT by clicking the XLSTAT start button in the Excel toolbar, then select the XLSTAT / Modeling data / Correlated Component Regression command in the Excel menu or click the corresponding button on the Modeling data toolbar.

Correlated Component Regression menu

Once you have clicked the button, the Correlated Component Regression dialog box is displayed with Method = CCR.LM (linear regression model) selected by default.

Correlated Component Regression General dialog box

Fig. 2: General Tab

In the Y/ Dependent variables field, use your mouse to select the variable PRICE (see the tutorial on Selecting data for more information on this topic).

The prices are the "Ys" of the model as we want to predict these prices as a linear function of the other car attributes.

In the X/ Predictors field, select the other 6 car attributes.

The name of the car models (MODEL) has also been selected as Observation labels.

To obtain the OLS regression solution, fix the number of components at 6, so it equals the number of predictors.  To accomplish this, in the Options tab set Number of components to ‘6’ and uncheck ‘Automatic’.

In the Options tab of the dialog box, make sure that the settings are as shown below.

Correlated Component Regression Options dialog box

Fig. 3: Options Tab

Computation starts when you click OK.

Interpreting CCR Model Output

Following the basic statistics output section, the coefficients (unstandardized and standardized) are presented. For example, Table 3A presents the unstandardized coefficients.  Comparing Table 3A to Table 1, we see that the results match the OLS regression coefficients. 

6 components coefficients

Table 3A. Unstandardized coefficient estimates obtained from the 6-component (saturated) CCR model
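
Why a 6-component model reproduces OLS can be illustrated in code. The sketch below is an assumption about how CCR components are built for a linear model (component k weights each predictor by its coefficient in a regression of Y on the earlier components plus that one predictor). This tutorial does not spell out the algorithm, so treat `ccr_lm` as a hypothetical illustration rather than XLSTAT's implementation. The data are synthetic, not the car data.

```python
import numpy as np

def ccr_lm(X, y, K):
    """Sketch of a K-component correlated component regression (linear model).

    Assumed construction, not taken from XLSTAT. Returns (intercept, beta)
    on the original predictor scale.
    """
    n, P = X.shape
    Xc = X - X.mean(axis=0)        # centre so no intercept is needed below
    yc = y - y.mean()
    loadings = np.zeros((P, K))    # loadings[g, k]: weight of predictor g in component k
    S = np.zeros((n, 0))           # component scores built so far
    for k in range(K):
        for g in range(P):
            # Regress y on the existing components plus predictor g alone.
            design = np.column_stack([S, Xc[:, g]])
            coef, *_ = np.linalg.lstsq(design, yc, rcond=None)
            loadings[g, k] = coef[-1]
        S = np.column_stack([S, Xc @ loadings[:, k]])
    b, *_ = np.linalg.lstsq(S, yc, rcond=None)   # one weight per component
    beta = loadings @ b                          # beta_g = sum_k b_k * loadings[g, k]
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta

def ols(X, y):
    """Ordinary least squares with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

# Synthetic stand-in: 24 cases, 6 mutually correlated predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 6)) + 0.8 * rng.normal(size=(24, 1))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.3, 0.0]) + rng.normal(size=24)

a6, beta6 = ccr_lm(X, y, K=6)
a_ols, beta_ols = ols(X, y)
print(np.allclose(beta6, beta_ols))  # saturated model vs. OLS
```

With K equal to the number of predictors, the components generically span the same space as the predictors themselves, so no regularization takes place. Choosing K smaller than P is what restricts the fit.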

These coefficients can be decomposed into parts associated with each of the 6 components using the component weights provided in Table 3B and the component coefficients (loadings) provided in Table 3C.

6 components weights

Table 3B. Unstandardized component weights 

6 components loadings

Table 3C. Unstandardized loadings

 

For example, the coefficient -1.94 for CYLINDER can be decomposed as follows:

-1.94 = .006*(92.774) + .124*(1.381) + .804*(-3.728) + .627*(-11.016) + .422*(15.190) + .167*(5.053)

Activating the Automatic and M-fold Cross-validation options

Re-open the CCR dialog box by selecting the Modeling data / Correlated Component Regression command in the Excel menu or by clicking the corresponding button on the Modeling data toolbar.

Since N is relatively small (N=24) and the correlation between the predictors is fairly high, this saturated regression model overfits these data. We will now show how to activate the M-fold cross-validation (CV) option and show that this model is overfit, and that eliminating CCR components 3-6 provides the proper amount of regularization to produce more reliable results. To allow CV to assess all possible degrees of regularization, we will estimate all 6 CCR models (K≤6). We do this by activating the Automatic option in the Options tab.

The number of folds M is generally taken to be between 5 and 10. We select M=6, which divides evenly into N=24 so that every fold contains exactly 4 cases (M=8 would also divide evenly). In the Validation tab, activate ‘Cross-validation’ and request 100 rounds of 6 folds. Requesting more than 1 round yields a standard error for the CV-R2.
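
The cross-validation settings above can be sketched in code. The sketch rests on several assumptions: CV-R2 is computed as 1 - SS_res/SS_tot from held-out predictions, the round-to-round standard deviation is used as the spread measure (the exact formulas XLSTAT uses are not given in this tutorial), plain OLS stands in for the model being validated, and the data are synthetic. `repeated_cv_r2` and `ols_fit` are hypothetical helper names.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares with an intercept; returns (intercept, slopes)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

def repeated_cv_r2(X, y, fit, n_rounds=100, n_folds=6, seed=0):
    """Mean and round-to-round standard deviation of CV-R2."""
    rng = np.random.default_rng(seed)
    n = len(y)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = np.empty(n_rounds)
    for r in range(n_rounds):
        # A fresh random partition of the cases for every round.
        folds = np.array_split(rng.permutation(n), n_folds)
        pred = np.empty(n)
        for held_out in folds:
            train = np.ones(n, dtype=bool)
            train[held_out] = False
            a, beta = fit(X[train], y[train])
            pred[held_out] = a + X[held_out] @ beta   # predict unseen cases only
        r2[r] = 1.0 - np.sum((y - pred) ** 2) / ss_tot
    return r2.mean(), r2.std(ddof=1)

# Fold counts between 5 and 10 that split N = 24 into equal-sized folds.
equal_fold_counts = [m for m in range(5, 11) if 24 % m == 0]

# Synthetic stand-in for the car data: 24 cases, 6 correlated predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 6)) + 0.8 * rng.normal(size=(24, 1))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.3, 0.0]) + rng.normal(size=24)

cv_mean, cv_spread = repeated_cv_r2(X, y, ols_fit)
a, beta = ols_fit(X, y)
train_r2 = 1.0 - np.sum((y - a - X @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
print(equal_fold_counts, round(train_r2, 3), round(cv_mean, 3), round(cv_spread, 3))
```

For OLS the held-out error can never be smaller than the training error, so the cross-validated value is the more honest estimate of how the model will perform on new cases.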

Correlated Component Regression Validation dialog box

Fig. 4: Validation Tab

Note that activating the ‘Automatic’ option also requests the Cross-Validation Component Plot shown earlier in Fig. 1 (the corresponding box is checked in the Charts tab).

Click OK to perform these analyses. The Goodness of Fit Statistics show that the resulting model has K=2 components. For this model, the CV-R2 increases to .750 with a standard error of only .014, a significant improvement over the OLS regression CV-R2 of .64.

Unstandardized coefficients

 Table 4A. Coefficients obtained from the 2-component model

2 components weights

 Table 4B. Component weights obtained from the 2-component model

2 components loadings

Table 4C. Loadings obtained from the 2-component model

From the Coefficients Output in Tables 4A, 4B and 4C we see how the coefficients are now constructed based on only 2 components. For example, the coefficient for CYLINDER can be decomposed as follows:

20.944 = .221*92.774 + .349*1.381
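
Both CYLINDER decompositions in this tutorial can be checked with a few lines of arithmetic. The sketch below only recombines the printed factor pairs. Which factor comes from the weights table and which from the loadings table is not assumed here, so the lists are named neutrally. Because the displayed factors are rounded to three decimals, the sums come out near -1.92 and 20.99 rather than exactly -1.94 and 20.944.

```python
# Factor pairs for CYLINDER, copied from the two decompositions in this tutorial.
second_factors = [92.774, 1.381, -3.728, -11.016, 15.190, 5.053]
first_factors_k6 = [0.006, 0.124, 0.804, 0.627, 0.422, 0.167]  # 6-component model
first_factors_k2 = [0.221, 0.349]                              # 2-component model

def recombine(first, second):
    """Sum of pairwise products; zip stops at the shorter list."""
    return sum(a * b for a, b in zip(first, second))

coef_k6 = recombine(first_factors_k6, second_factors)
coef_k2 = recombine(first_factors_k2, second_factors)
print(round(coef_k6, 3), round(coef_k2, 3))
```

Note that the second factors for the first two components (92.774 and 1.381) are identical in both decompositions. Only the first factors change when the model is cut back from 6 components to 2.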

 

Activating the Step-down Algorithm

Re-open the CCR dialog box by selecting the Modeling data / Correlated Component Regression command in the Excel menu or by clicking the corresponding button on the Modeling data toolbar.

To eliminate extraneous and weak predictors, we will now activate the step-down algorithm in the Options tab as shown below:

Correlated Component Regression Options dialog box 

Fig. 5: Options Tab

Activation of the step-down option automatically requests the step-down predictor selection plot in the Charts tab and the Predictor Count table from the Output tab.

Click on OK to estimate.
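
The idea behind a step-down procedure can be sketched as backward elimination. Two things here are assumptions, not statements from this tutorial: the elimination rule (drop the predictor with the smallest absolute standardized coefficient, then refit), and the use of plain OLS as a stand-in for the CCR fit. The data are synthetic and the predictor names `x0` to `x5` are placeholders.

```python
import numpy as np

def standardized_coefficients(X, y):
    """OLS slopes after z-scoring predictors and response (stand-in for CCR)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    t = (y - y.mean()) / y.std(ddof=1)
    coef, *_ = np.linalg.lstsq(Z, t, rcond=None)
    return coef

def step_down(X, y, names, n_keep):
    """Drop the weakest predictor one at a time until n_keep remain."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        coef = standardized_coefficients(X[:, keep], y)
        weakest = int(np.argmin(np.abs(coef)))   # position within `keep`
        keep.pop(weakest)                        # refit without it on the next pass
    return [names[i] for i in keep]

# Synthetic data: only x0 and x1 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * rng.normal(size=24)
names = [f"x{i}" for i in range(6)]

retained = step_down(X, y, names, n_keep=2)
print(retained)
```

In practice the number of predictors to keep is not fixed in advance. It is chosen by comparing cross-validated performance across predictor counts, which is what the step-down plot displays.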

The predictor selection plot suggests that inclusion of 3 predictors in the model is optimal.

 Correlated Component Regression Cross-validation Step-down Plot

Fig. 6: Cross-validation Step-down Plot

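The Cross-validation Predictor Count table suggests that POWER and WEIGHT are the most important predictors, being included in 600 and 584, respectively, of the 600 cross-validated regressions (100 rounds × 6 folds). With 3 predictors retained in each regression, the counts in the table sum to 1800.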

Cross validation count table

The final model has CV-R2 = .77 and includes the predictors POWER, SPEED and WEIGHT:

Goodness of fit statistics

 

Predictors retained in the model:

  • POWER
  • SPEED
  • WEIGHT


General Discussion

Key driver regression attempts to ascertain the importance of several key explanatory variables (predictors) X1, X2, … , XP that influence a dependent variable. For example, a typical dependent variable in key driver regression is “Customer Satisfaction”. Traditional OLS regression methods have difficulty with such derived importance tasks because the predictors usually have moderate to high correlation with each other. The resulting confounding makes parameter estimates unstable and thus unusable as measures of importance.

Correlated Component Regression (CCR) is designed to handle such problems and, as shown in Tutorial 2, it even works with high-dimensional data where there are more predictors than cases. Parameter estimates become more interpretable, and cross-validation is used to avoid overfitting, thus producing better out-of-sample predictions.