Obtaining Predictions from a 2-class Regression

Dataset for a two-class model

This tutorial illustrates a reanalysis of data analyzed by Tenenhaus et al. (2005): Tenenhaus, M., Pagès, J., Ambroisine, L. and Guinot, C. (2005). PLS methodology for studying relationships between hedonic judgments and product characteristics. Food Quality and Preference, 16(4), 315-325.

The data consist of liking ratings on each of 6 different orange juice (OJ) products by 96 judges. Each of the 6 juices is also described by 16 physico-chemical attributes. In addition, the data contain classification information for weighting the judges according to their (posterior membership) probability of being in each of two segments which have distinctly different OJ preferences. These posterior membership probabilities were obtained from a random intercept latent class (LC) regression analysis.

Goal of the Correlated Component Regressions in this example

When data consist of multiple records per case, traditional (1-class) regression methods violate the assumption of independent observations, since residuals from records associated with the same case will typically be correlated. This yields suboptimal prediction. In this tutorial we show how CCR can improve prediction of the liking ratings from the OJ attributes by allowing differing attribute effects for each of the 2 LC segments, which show different OJ preferences.

In particular, this tutorial illustrates the second step in a 2-step process. In step 1, a 2-class regression model is developed based solely on dummy variables associated with the OJ products. In step 2, CCR is used to predict ratings based on the 16 product descriptors (rather than the dummy variables) to determine those that are the most important in predicting OJ liking. We develop separate models for each LC segment, obtained in step 1, and then combine models for both segments to obtain a single best set of predicted ratings. Use of this 2-step, 2-class regression analysis provides substantial improvement over the traditional regression (cross-validated R-square increases from .28 to .48).

Setting up a Correlated Component Regression (CCR) model

To activate the Correlated Component Regression dialog box, start XLSTAT by clicking the XLSTAT start button in the Excel toolbar, and then select the XLSTAT / Modeling data / Correlated Component Regression command in the Excel menu or click the corresponding button on the Modeling data toolbar.

Once you have selected CCR, the Correlated Component Regression dialog box is displayed.

To set up the CCR runs, in the Y / Dependent variable(s) field, select Column D (rating) with the mouse. This column contains the ratings for each of the 6 juices given by the judges (6 rows for each of the 96 judges). The ratings are the "Ys" of the model, as we want to explain these ratings as a function of the juice attributes.

In the X / Predictors field, select columns I through Y, corresponding to the variable CFactor1 (column I) plus the 16 juice attributes. CFactor1, a random intercept obtained from the LC regression analysis based solely on the OJ ratings, is highly correlated with the variable Rating_mean (column E), representing each judge's mean rating across all 6 juices. Its inclusion as a predictor serves a function similar to ‘centering’.
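To make the ‘centering’ analogy concrete, the following minimal Python sketch computes each judge's mean rating and subtracts it from that judge's ratings. It is only an illustration of what centering does: the tutorial itself includes CFactor1 as a predictor rather than centering the ratings, and the function name and the ratings below are hypothetical.

```python
from collections import defaultdict

def center_by_judge(records):
    """records: list of (judge_id, rating) pairs.

    Returns (means, centered), where means maps each judge to the mean of
    that judge's ratings and centered holds (judge_id, rating - mean) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for judge_id, rating in records:
        totals[judge_id] += rating
        counts[judge_id] += 1
    means = {j: totals[j] / counts[j] for j in totals}
    centered = [(j, r - means[j]) for j, r in records]
    return means, centered

# Hypothetical ratings (not taken from the tutorial dataset).
records = [(1, 3), (1, 2), (1, 3), (2, 5), (2, 4), (2, 3)]
means, centered = center_by_judge(records)
```

After centering, a judge who rates everything low and a judge who rates everything high are put on a common scale, which is the role the random intercept plays in the model.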

The case ID variable (column B) is entered in the Observation labels field, so that all 6 records for each judge are grouped and assigned to the same fold during cross-validation.
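The idea of keeping all records of a case in the same fold can be sketched as follows. This is a minimal illustration under stated assumptions, not XLSTAT's actual fold-assignment procedure: the function name, the shuffling, and the round-robin assignment are all hypothetical choices.

```python
import random

def assign_group_folds(case_ids, n_folds=10, seed=0):
    """Assign each record to a fold so that all records sharing a case ID
    land in the same fold (hypothetical helper, round-robin over cases)."""
    unique_ids = sorted(set(case_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    fold_of_case = {cid: i % n_folds for i, cid in enumerate(unique_ids)}
    return [fold_of_case[cid] for cid in case_ids]

# Layout as in the tutorial: 96 judges with 6 records (juices) each.
case_ids = [judge for judge in range(1, 97) for _ in range(6)]
folds = assign_group_folds(case_ids)
```

Grouping by case matters because records from the same judge are correlated. If they were split across folds, the held-out records would not be independent of the training data and the cross-validated R-square would be too optimistic.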

Separate models will be developed for each segment. For Segment #1, we select the probability of being in that segment (Posterior1) as the weight variable (column G). (For theory on the use of posterior membership probabilities as weights see Magidson, 2005).
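As a rough illustration of how posterior probabilities act as weights, the sketch below fits a weighted least-squares line with a single predictor. It is not the CCR algorithm, and the data, variable names, and function name are hypothetical. It only shows that the same records can yield different attribute effects depending on which segment's weights are used.

```python
def weighted_simple_regression(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: two judges with opposite acidity preferences,
# three juices each. Not taken from the tutorial dataset.
acidity = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
rating = [4.0, 3.0, 2.0, 2.0, 3.0, 4.0]
posterior1 = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1]   # weight for Segment #1
posterior2 = [1.0 - p for p in posterior1]    # weight for Segment #2

_, slope_seg1 = weighted_simple_regression(acidity, rating, posterior1)
_, slope_seg2 = weighted_simple_regression(acidity, rating, posterior2)
```

With these made-up numbers the Segment #1 weights give a negative acidity slope and the Segment #2 weights give a positive one, mirroring the kind of contrast the two segment models are meant to capture.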

Figure 1. General Tab

To determine the number of components, in the Options tab of the dialog box, activate the ‘Automatic’ option and enter ‘17’ in the ‘Max components’ field. To determine the number of predictors, activate the step-down procedure.

Figure 2. Options Tab

Note that the Cross-validation option in the Validation tab is automatically activated with the default parameters (1 round of 10 folds).

Figure 3. Validation Tab

The computations start when you click OK.

Interpreting CCR results for Segment #1

The correlation matrix output shows that the correlation between rating and Acidity equals -.433, suggesting that Segment #1 judges tend to dislike highly acidic OJs. We will see later that, in contrast, Segment #2 judges tend to prefer OJs that have a high acid content (correlation = .252).

From the CV components table and associated plot we see that the maximum CV-R2 = .398 occurs with K = 5 components. Note also that the CV-R2 deteriorates rapidly after K = 9 components, indicating a substantial amount of collinearity for K > 9.
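The selection of K can be sketched in Python as follows. The sketch assumes one common definition of cross-validated R-square (1 minus the ratio of held-out squared error to total sum of squares). XLSTAT's exact formula is not stated in this tutorial, and the function names and CV-R2 values below are hypothetical placeholders.

```python
def cv_r2(actual, held_out_pred):
    """Cross-validated R-square, assuming the definition 1 - SSE/SST,
    where each prediction comes from a model fit without that record."""
    mean_y = sum(actual) / len(actual)
    sse = sum((y - p) ** 2 for y, p in zip(actual, held_out_pred))
    sst = sum((y - mean_y) ** 2 for y in actual)
    return 1 - sse / sst

def best_k(cv_r2_by_k):
    """Return the number of components with the highest CV-R2."""
    return max(cv_r2_by_k, key=cv_r2_by_k.get)

# Hypothetical CV-R2 values per number of components (placeholders only).
cv_r2_by_k = {1: 0.30, 2: 0.35, 3: 0.33}
chosen_k = best_k(cv_r2_by_k)
```

Because the criterion is computed on held-out records, adding components that only fit noise lowers it, which is why the curve can fall once collinearity becomes a problem.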

Figure 4. Cross-validation component plot (Segment #1)

From the CV step-down plot we see that the maximum CV-R2 = .402 occurs with P* = 4 predictors (and since P* < 5, K is reduced to 4).

Figure 5. Cross-validation step-down plot (Segment #1)

Table 1 shows that Acidity is an important predictor in the model. The negative standardized coefficient (-.325) supports the inference that Segment #1 judges tend to dislike OJs with high acid content.

Table 1. Standardized coefficients based on the 4-component model for Segment #1 

For comparison, we will obtain results for Segment #2 next.

Developing the Corresponding CCR Model for Segment #2

Re-open the CCR dialog box by selecting the Modeling data / Correlated Component Regression command in the Excel menu or by clicking the corresponding button on the Modeling data toolbar.

The previous model specifications are currently displayed. In the General tab, replace the current observation weights by the corresponding values (column H) associated with Segment #2 (Posterior2). To produce the Segment #2 model output on the same sheet as the Segment #1 model output, change the output option from ‘Sheet’ to ‘Range’ and select cell V1 in the ‘CCR.LM’ tab (the tab which contains the output from our previous model estimation).

Figure 6. General Tab

Click on OK to estimate.

The relevant output for Segment #2 is shown below.

Figure 7. Cross-validation component plot (Segment #2). CV-R2 = .409

Figure 8. Cross-validation step-down plot (Segment #2). CV-R2 = .411

Table 2 shows that Acidity is an important predictor for Segment #2 as well as Segment #1. However, in contrast to the model result for Segment #1, the standardized coefficient for Acidity is now positive. Segment #2 judges prefer juices with higher acidity (.214), lower sweetening power (-.169), and lower smell intensity (-.129).

Table 2. Standardized coefficients for Segment #2

Obtaining predictions from a 2-class model

Improved prediction over the 1-class model is due to the value of the additional information provided by the LC segmentation results. If we knew that a judge was from Segment #1 (i.e., preferred OJs that had lower acidity), we would use the Segment #1 model for prediction. Similarly, if we knew that a judge was from Segment #2 (i.e., preferred OJs that had higher acidity), we would use the Segment #2 model for prediction. While we do not know with certainty to which segment each judge belongs, we have the posterior membership probabilities to use as weights.

Our prediction from the 2-class CCR model is a weighted average of the 2 sets of predictions obtained from the 2 models. For example, our prediction for the rating for OJ#1 (fruvita fr.) given by judge #1 is obtained as a weighted average of the corresponding predictions from the 2 models, where the weights are the posterior membership probabilities:

Prediction = .98(3.441) + .02(2.373) = 3.42            

For judge #1, the probability of being in Segment #1 is about .98, and thus the probability of being in Segment #2 is about .02. The predicted rating from the Segment #1 model (3.441) is weighted more heavily for this judge than that from the Segment #2 model (2.373), resulting in a prediction of 3.42 based on the 2-class regression model.
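The weighted average above can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the posterior probability .98 is the rounded value quoted in the text, so the result agrees with the worksheet only to the precision shown.

```python
def two_class_prediction(p_seg1, pred_seg1, pred_seg2):
    """Combine the two segment-specific predictions, weighting each by the
    judge's posterior probability of belonging to that segment."""
    p_seg2 = 1.0 - p_seg1  # with two segments the probabilities sum to 1
    return p_seg1 * pred_seg1 + p_seg2 * pred_seg2

# Judge #1, OJ#1 (fruvita fr.): values quoted in the text above.
prediction = two_class_prediction(0.98, 3.441, 2.373)
print(round(prediction, 2))  # 3.42
```

A judge with a posterior probability near .5 would receive a prediction roughly midway between the two segment models, while a judge classified with near certainty gets essentially the prediction of a single segment model.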

For illustrative purposes, these and other calculations are provided in sheet ‘CCR.LM’ (highlighted in yellow). These yellow highlighted cells were added manually to the output provided by XLSTAT-CCR. For example, cell L237 provides the formula for computing the predicted value 3.42 from the corresponding Segment #1 and Segment #2 output.

Table 3A. Predictions and residuals output for model with Posterior1 weights (first 2 rows)

Table 3B. Predictions and residuals output for model with Posterior2 weights (first 2 rows)

Table 3C. Predictions and residuals computed for 2-class regression model (first 2 rows)

Row 1 in Tables 3A, 3B and 3C corresponds to OJ#1 (fruvita fr.). Since this juice has a lower acidity level, Segment #1 judges are predicted to rate it higher than Segment #2 judges (3.441 vs. 2.373).

Note that judge #1 (corresponding to Observation = 1) tends to rate the 6 juices somewhat lower than the average judge (e.g., for Observation = 1, CFactor1 = -.214 and rating mean = 2.67). As mentioned above, the predictions provided by this 2-class model are substantially better than those provided by a 1-class model which ignores the segments. A food product manager might use these results to customize separate OJ products for each segment, based on the attributes used in each model.

References

  • Popper, R., Kroll, J. and Magidson, J. (2004). Applications of latent class models to food product development: a case study. Sawtooth Software Proceedings, 2004.
  • Magidson, J. and Vermunt, J.K. (2006). Use of latent class regression models with a random intercept to remove overall response level effects in ratings data. In: A. Rizzi and M. Vichi (eds.), Proceedings in Computational Statistics, 351-360. Heidelberg: Springer.
  • Magidson, J. and Vermunt, J.K. (2005). An Extension of the CHAID Tree-based Segmentation Algorithm to Multiple Dependent Variables. In: C. Weihs and W. Gaul (eds.), Classification: The Ubiquitous Challenge, 176-183. Heidelberg: Springer.