Obtaining Predictions from a 2-class Regression

Dataset for a two-class model

This tutorial illustrates a reanalysis of data analyzed by Tenenhaus et al. (2005): Tenenhaus, M., Pagès, J., Ambroisine, L., and Guinot, C. (2005). PLS methodology for studying relationships between hedonic judgments and product characteristics. Food Quality and Preference, 16(4), pp. 315-325.

The data consist of liking ratings on each of 6 different orange juice (OJ) products by 96 judges. Each of the 6 juices is also described by 16 physico-chemical attributes. In addition, the data contain classification information for weighting the judges according to their posterior membership probability of being in each of two segments that have distinctly different OJ preferences. These posterior membership probabilities were obtained from a random intercept latent class (LC) regression analysis.

Goal of the Correlated Component Regressions in this example

When data consist of multiple records per case, traditional (1-class) regression methods violate the assumption of independent observations, because residuals from records associated with the same case are typically correlated. This yields suboptimal prediction. In this tutorial we show how CCR can improve prediction of the liking ratings from the OJ attributes by allowing differing attribute effects for each of the 2 LC segments, which show different OJ preferences.

In particular, this tutorial illustrates the second step in a 2-step process. In step 1, a 2-class regression model is developed based solely on dummy variables associated with the OJ products. In step 2, CCR is used to predict ratings based on the 16 product descriptors (rather than the dummy variables) to determine those that are the most important in predicting OJ liking. We develop separate models for each LC segment, obtained in step 1, and then combine models for both segments to obtain a single best set of predicted ratings. Use of this 2-step, 2-class regression analysis provides substantial improvement over the traditional regression (cross-validated R-square increases from .28 to .48).

Setting up a Correlated Component Regression (CCR) model

To activate the Correlated Component Regression dialog box, start XLSTAT by clicking on the XLSTAT start button in the Excel toolbar, and then select the XLSTAT / Modeling data / Correlated Component Regression command in the Excel menu, or click the corresponding button on the Modeling data toolbar.

Correlated Component Regression menu

Once you have selected CCR, the Correlated Component Regression dialog box is displayed.

To set up the CCR runs, select Column D (rating) with the mouse in the Y / Dependent variable(s) field. This column contains the ratings given by the judges for each of the 6 juices (6 rows for each of the 96 judges). The ratings are the "Ys" of the model, since we want to explain them as a function of the juice attributes.

In the X / Predictors field, select columns I through Y, corresponding to the variable CFactor1 (column I) plus the 16 juice attributes. CFactor1, a random intercept obtained from the LC regression analysis based solely on the OJ ratings, is highly correlated with the variable Rating_mean (column E), which represents each judge's mean rating across all 6 juices. Its inclusion as a predictor serves a function similar to 'centering'.

The case ID variable (column B) is entered in the Observation labels field, so that all 6 records for each judge are grouped and assigned to the same fold during cross-validation.
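The grouping described above can be illustrated outside XLSTAT. The following is a minimal sketch, not XLSTAT's internal procedure: the function name `assign_group_folds`, the random seed, and the round-robin assignment are assumptions made for illustration. It only shows the idea that fold membership is decided per judge, so that all 6 records of a judge land in the same fold.

```python
import random

def assign_group_folds(case_ids, n_folds=10, seed=0):
    """Return one fold index per record; records sharing a case ID share a fold."""
    unique_cases = sorted(set(case_ids))
    # Shuffle the cases (not the records) so whole cases are moved together.
    random.Random(seed).shuffle(unique_cases)
    # Deal the shuffled cases round-robin into the folds.
    fold_of_case = {case: i % n_folds for i, case in enumerate(unique_cases)}
    return [fold_of_case[case] for case in case_ids]

# 96 judges with 6 records each, mirroring the layout of the tutorial data.
judge_ids = [judge for judge in range(1, 97) for _ in range(6)]
folds = assign_group_folds(judge_ids)
```

Assigning folds per record instead would let ratings from the same judge appear in both the training and the held-out part of a fold, which tends to make the cross-validated fit look better than it is.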

Separate models will be developed for each segment. For Segment #1, we select the probability of being in that segment (Posterior1) as the weight variable (column G). (For theory on the use of posterior membership probabilities as weights see Magidson, 2005).
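To see how posterior probabilities act as observation weights, consider the toy sketch below. It is not CCR: it fits an ordinary weighted least-squares slope for a single predictor, and the data values, the function name `weighted_slope`, and the variable names are invented for illustration. The same records are fitted twice, once with each set of weights, and the slope changes sign depending on which segment dominates.

```python
def weighted_slope(x, y, w):
    """Slope of a weighted least-squares line of y on x with observation weights w."""
    w_sum = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / w_sum
    num = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return num / den

# Invented toy data: the first three records behave like one segment
# (rating falls as the attribute rises), the last three like the other.
attribute = [1, 2, 3, 1, 2, 3]
rating = [5, 4, 3, 1, 2, 3]
posterior1 = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
posterior2 = [1 - p for p in posterior1]

slope_segment1 = weighted_slope(attribute, rating, posterior1)  # negative
slope_segment2 = weighted_slope(attribute, rating, posterior2)  # positive
```

Every record contributes to both fits; the weights only control how strongly, which is why no judge has to be assigned to a segment with certainty.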

Figure 1. General Tab

To determine the number of components, in the Options tab of the dialog box, activate the ‘Automatic’ option and enter ‘17’ in the ‘Max components’ field. To determine the number of predictors, activate the step-down procedure.

Figure 2. Options Tab

Note that the Cross-validation option in the Validation tab is automatically activated with the default parameters (1 round of 10 folds).

Figure 3. Validation Tab

The computations start when you click OK.

Interpreting CCR results for Segment #1

From the correlation matrix output it can be seen that the correlation between rating and Acidity equals -.433, suggesting that Segment #1 judges tend to dislike OJs with high acidity. We will see later that, in contrast to Segment #1 judges, Segment #2 judges tend to prefer OJs that have a high acid content (correlation = .252).

From the CV components table and associated plot we see that the maximum CV-R2 = .398 occurs with K = 5 components. Note also that the CV-R2 deteriorates rapidly after K = 9 components, indicating a substantial amount of collinearity for K > 9.

Figure 4. Cross-validation component plot (Segment #1)
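For readers who want to reproduce a cross-validated R2 by hand, the sketch below shows one common definition: 1 minus the ratio of the (weighted) squared prediction error to the (weighted) total variation, computed from out-of-fold predictions. Whether XLSTAT uses exactly this formula is an assumption here; the function name `cv_r2` and its arguments are illustrative.

```python
def cv_r2(y_true, y_pred_out_of_fold, weights=None):
    """1 - SS_res / SS_tot, where each prediction comes from a model
    fitted without the fold that contains the record."""
    if weights is None:
        weights = [1.0] * len(y_true)
    w_sum = sum(weights)
    y_mean = sum(w * y for w, y in zip(weights, y_true)) / w_sum
    ss_res = sum(w * (y - p) ** 2
                 for w, y, p in zip(weights, y_true, y_pred_out_of_fold))
    ss_tot = sum(w * (y - y_mean) ** 2 for w, y in zip(weights, y_true))
    if ss_tot == 0:
        raise ValueError("y_true has no variation; R2 is undefined")
    return 1.0 - ss_res / ss_tot

# Perfect out-of-fold predictions give 1; predicting the mean gives 0.
perfect = cv_r2([1, 2, 3, 4], [1, 2, 3, 4])
mean_only = cv_r2([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5])
```

Because the predictions are made for held-out records, this quantity can fall as components are added, which is the pattern visible in the plot for large K.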

From the CV step-down plot we see that the maximum CV-R2 = .402 occurs with P* = 4 predictors. Since P* < 5, K is reduced to 4.

Figure 5. Cross-validation step-down plot (Segment #1)
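The selection logic described above can be sketched as follows. The values .398 (all 17 predictors) and .402 (4 predictors) come from the text; the other CV-R2 values in the dictionary are hypothetical placeholders, and the function name `select_predictor_count` is an assumption. The sketch picks the predictor count with the highest CV-R2 and then caps the number of components at that count.

```python
def select_predictor_count(cv_r2_by_p, k_components):
    """Return (P*, K): the predictor count with the highest CV-R2 and the
    number of components, capped so that K does not exceed P*."""
    best_p = max(cv_r2_by_p, key=cv_r2_by_p.get)
    return best_p, min(k_components, best_p)

cv_r2_by_p = {
    17: 0.398,  # from the text: all predictors, K = 5
    8: 0.400,   # hypothetical value, for illustration only
    4: 0.402,   # from the text: maximum of the step-down plot
    2: 0.380,   # hypothetical value, for illustration only
}
p_star, k = select_predictor_count(cv_r2_by_p, k_components=5)
```

The cap reflects that the components are built from the retained predictors, so a model cannot use more components than predictors.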

Table 1 shows that Acidity is an important predictor in the model. The negative standardized coefficient (-.325) supports the inference that Segment #1 judges tend to dislike OJs with high acid content.

Table 1. Standardized coefficients based on the 4-component model for Segment #1 

For comparison, we will obtain results for Segment #2 next.

Developing the Corresponding CCR Model for Segment #2

Re-open the CCR dialog box by selecting the Modeling data / Correlated Component Regression command in the Excel menu, or by clicking the corresponding button on the Modeling data toolbar.

The previous model specifications are currently displayed. In the General tab, replace the current observation weights by the corresponding values (column H) associated with Segment #2 (Posterior2). To produce the Segment #2 model output on the same sheet as the Segment #1 model output, change the output option from ‘Sheet’ to ‘Range’ and select cell V1 in the ‘CCR.LM’ tab (the tab which contains the output from our previous model estimation).

Figure 6. General Tab

Click on OK to estimate.

The relevant output for Segment #2 is shown below.

Figure 7. Cross-validation component plot (Segment #2). CV-R2 = .409

Figure 8. Cross-validation step-down plot (Segment #2). CV-R2 = .411

Table 2 shows that Acidity is an important predictor for Segment #2 as well as for Segment #1. However, in contrast to the model result for Segment #1, the standardized coefficient for Acidity is now positive. Table 2 shows that Segment #2 judges prefer juices with higher acidity (.214), low sweetening power (-.169), and low smell intensity (-.129).

Table 2. Standardized coefficients for Segment #2

Obtaining predictions from a 2-class model

Improved prediction over the 1-class model is due to the value of the additional information provided by the LC segmentation results. If we knew that a judge was from Segment #1 (i.e., preferred OJs that had lower acidity), we would use the Segment #1 model for prediction. Similarly, if we knew that a judge was from Segment #2 (i.e., preferred OJs that had higher acidity), we would use the Segment #2 model for prediction. While we do not know with certainty to which segment each judge belongs, we have the posterior membership probabilities to use as weights.

Our prediction from the 2-class CCR model is a weighted average of the 2 sets of predictions obtained from the 2 models. For example, our prediction for the rating for OJ#1 (fruvita fr.) given by judge #1 is obtained as a weighted average of the corresponding predictions from the 2 models, where the weights are the posterior membership probabilities:

Prediction = .98(3.441) + .02(2.373) = 3.42            

For judge #1, the probability of being in Segment #1 is about .98, and thus the probability of being in Segment #2 is about .02. The predicted rating from the Segment #1 model (3.441) is weighted more heavily for this judge than that from the Segment #2 model (2.373), resulting in a prediction of 3.42 based on the 2-class regression model.
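The weighted average above is easy to check in code. This is a minimal sketch using the numbers quoted for judge #1 and OJ#1; the function name `two_class_prediction` is illustrative and not part of XLSTAT.

```python
def two_class_prediction(posteriors, segment_predictions):
    """Posterior-weighted average of the per-segment predicted ratings."""
    if len(posteriors) != len(segment_predictions):
        raise ValueError("need one prediction per segment")
    if abs(sum(posteriors) - 1.0) > 1e-6:
        raise ValueError("posterior probabilities must sum to 1")
    return sum(p * y for p, y in zip(posteriors, segment_predictions))

# Judge #1, OJ#1: posteriors (.98, .02) and segment predictions (3.441, 2.373).
prediction = two_class_prediction([0.98, 0.02], [3.441, 2.373])
print(round(prediction, 2))  # 3.42
```

The same function applies to every judge-by-juice record: only the posterior probabilities change from judge to judge, and only the segment predictions change from record to record.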

For illustrative purposes, these and other calculations are provided in sheet ‘CCR.LM’ (highlighted in yellow). These yellow highlighted cells were added manually to the output provided by XLSTAT-CCR. For example, cell L237 provides the formula for computing the predicted value 3.42 from the corresponding Segment #1 and Segment #2 output.

Table 3A. Predictions and residuals output for model with Posterior1 weights (first 2 rows)

Table 3B. Predictions and residuals output for model with Posterior2 weights (first 2 rows)

Table 3C. Predictions and residuals computed for 2-class regression model (first 2 rows)

Row 1 in Tables 3A, 3B and 3C corresponds to OJ#1 (fruvita fr.). Since this juice has a lower acidity level, Segment #1 judges are predicted to rate it higher than Segment #2 judges (3.441 vs. 2.373).

Note that judge #1 (corresponding to Observation = 1), tends to rate the 6 juices somewhat lower than the average judge (e.g., for Observation = 1, CFactor1 = -.214 and rating mean = 2.67). As mentioned above, the predictions provided by this 2-class model are substantially better than those provided by a 1-class model which ignores the segments. A food product manager might use these results to customize separate OJ products for each segment, based on the attributes used in each model.

References

  • Popper, R., Kroll, J., and Magidson, J. (2004). Applications of latent class models to food product development: a case study. Sawtooth Software Proceedings, 2004.
  • Magidson, J., and Vermunt, J.K. (2006). Use of latent class regression models with a random intercept to remove overall response level effects in ratings data. In: A. Rizzi and M. Vichi (eds.), Proceedings in Computational Statistics, 351-360. Heidelberg: Springer.
  • Magidson, J., and Vermunt, J.K. (2005). An Extension of the CHAID Tree-based Segmentation Algorithm to Multiple Dependent Variables. In: C. Weihs and W. Gaul (eds.), Classification: The Ubiquitous Challenge, 176-183. Heidelberg: Springer.