Running a Partial Least Squares regression with XLSTAT

Dataset for Partial Least Squares regression XLS3.77 MB

Tutorial video
Partial Least Squares regression is part of: Download Trial version More details See users' feedback
  • PLS Partial Least Squares regression software

  • System configuration

    • Windows:
      • Versions: 9x/Me/NT/2000/XP/Vista/Win 7
      • Excel: 97 and later
      • Processor: 32 or 64 bits
      • Hard disk: 150 Mb
    • Mac OS X:
      • OS: OS X
      • Excel: X, 2004 and 2011
      • Hard disk: 150Mb.

Benefits

  • Easy and user-friendly
    Easy and user-friendly XLSTAT is flawlessly integrated with Microsoft Excel which is the most popular spreadsheet worldwide. This integration makes it one of the simplest available tools to work with as it utilizes the same philosophy as Microsoft Excel. The program is accessible in a dedicated XLSTAT tab. The analyses are grouped into functional menus. The dialog boxes are user-friendly and setting up an analysis is straightforward.
  • Data and results shared seamlessly
    Data and results shared seamlessly One of the greatest advantages of XLSTAT is the way you can share data and results seamlessly. As the results are stored in Microsoft Excel, anyone can access them. There is no need for the receiver to have an XLSTAT license or any additional viewer which makes your team-work easier and more affordable. In addition, results are easily integrable into other Microsoft Office software such as PowerPoint, so that you can create striking presentation in minutes.
  • Modular
    Modular XLSTAT is a modular product. XLSTAT-Pro is a core statistical module of XLSTAT which includes all the mainstream functionalities in statistics and multivariate analysis. More advanced features contained in add-on modules can be added for specific applications. This way you can adapt the software to your needs making the software more cost-efficient.
  • Didactic
    Didactic The results of XLSTAT are organized by analysis and are easy to navigate. Moreover useful information is provided along with the results to assist you in your interpretation.
  • Affordable
    Affordable XLSTAT is a complete and modular analytical solution that can suit any analytical business needs. It is very reasonably priced so that the return of your investment is almost immediate. Any XLSTAT license comes with top level support and assistance.
  • Accessible - Available in many languages
    Accessible - Available in many languages We have ensured XLSTAT is accessible to everyone by making the program available in many languages, including Chinese, English, French, German, Italian, Japanese, Polish, Portuguese and Spanish.
  • Automatable and customizable
    Automatable and customizable Most of the statistical functions available in XLSTAT can be called directly from the Visual Basic window of Microsoft Excel. They can be modified and integrated to more code to fit to the specificity of your domain. Adding tables and plots as well as modifying existing outputs becomes easy. Furthermore, XLSTAT includes some special tools on the dialog boxes to generate automatically the VBA code in order to reproduce your analysis using the VBA editor or to simply load pre-set settings. This effortless automation of routine analysis will be a huge time saver on your part.

Dataset for running a Partial Least Squares regression

This tutorial is based on data that have been extensively analyzed in [Tenenhaus, M., Pagès, J., Ambroisine L. and & Guinot, C. (2005); PLS methodology for studying relationships between hedonic judgements and product characteristics; Food Quality an Preference. 16, 4, pp 315-325].

The data used in this article correspond to 6 orange juices described by 16 physico-chemical descriptors and evaluated by 96 judges.

An Excel sheet containing both the data and the results for use in this tutorial can be downloaded by clicking here.

Goal of the Partial Least Squares regression in this example

Partial Least Squares regression is going to allow us to obtain a simultaneous map of the judges, the descriptors and the products, and then to analyze for some judges which descriptors are related to their preferences.

Setting up a Partial Least Squares regression

To activate the Partial Least Squares regression dialog box, start first XLSTAT, then select the XLSTAT / Modeling data / Partial Least Squares Regression command in the Excel menu or click the corresponding button on the Modeling data toolbar.

barpls.gif

Once you have clicked the button, the Partial Least Squares regression dialog box is displayed.

In the Dependent variable(s) field, select with the mouse the ratings of the 96 judges (see the tutorial on Selecting data for more information on this topic).

The ratings are the "Ys" of the model as we want to explain the ratings given by the judges.

In Quantitative variable(s) field, select the explanatory variables, that are in our case the physico-chemical descriptors.

The name of the orange juices have also been selected as Observation labels.

In the Options tab of the dialog box, make sure that Automatic is activated.

Last, in the Charts tab, the Colored labels option has been activated in order to make the reading of the charts easier. The Vectors option has been unchecked in order not to saturate the charts.

pls1.gif

The extremely fast computations start when you click on OK. The display of the results is stopped to allow you to select the axes for the maps.

pls11.gif

You only need to click on "Done" so that the charts are only displayed for the firts two axes.

Interpreting the results of a Partial Least Squares regression

The display of the results may take few seconds as there are many tables and charts because of the 96 dependent variables.

After the tables displaying the basic statistics and the correlations between all the selected variables (dependent variables are displayed in blue and quantitative explanatory variables in black), the results specific to the PLS regression are presented.

The first table and the corresponding bar chart allow to visualize the quality of the Partial Least Squares regression as a function of the number of components.

pls2.gif

The Q² cumulated index measures the global goodness of fit and the predictive quality of the 96 models.

XLSTAT-PLS has automatically selected 4 components. We see that Q² remains low even with 4 components (ideally it should be close to 1). This suggests that the quality of the fit varies a lot depending on the judge.

The cumulated R²Y and R²X cum that correspond to the correlations between the explanatory (X) and dependent (Y) variables with the components are very close to 1 with 4 components. This indicates that the 4 components generated by the Partial Least Squares regression summarize well both the Xs and the Ys.

The first correlations map allows to visualize on the first two components the correlations between the Xs and the components, and the Ys and the components.

pls3.gif

We can see that for some judges displayed at the center of the map, the correlations are low. By looking at the corresponding table, we see that for example, the J54 judge is only correlated with the fourth component, that is globally little correlated with the explanatory variables.

Regarding the explanatory variables we notice that the Vitamin C is not well represented on the first two dimensions. We can interpret this as the fact that this variable explains only litlle the preferences of the judges, which is not surprising as it does not have a strong effect on taste or other criteria that could easily influence the judges' preference. We notice the strong correlations between the fructose and the glucose, between the two pH, and the negative correlation between the pH and the acidity and titer. One should also notice how different the judges are: they are not concentrated on one part of the correlations circle, but well distributed all around it.

The map that displays the dependent variables on the c vectors, and the explanatory variables on the w* vectors allows to visualize the global relationship between the variables. The w* are related to the weights of the variables in the models.

pls4.gif

If one projects an explanatory variable on the vector of a dependent variable (the vectors are displayed only if there are less than 50 dependent variables) we have an idea of the influence of the explanatory variable in the modeling of the dependent variable.

The coordinates of the orange juices in the space of the t coordinates are available in a table and displayed on a map. We notice that the products are well distinguished.

pls5.gif

A new correlations map allows to superimpose the products on the previous correlations map. In the legend we replaced "Obs" by "Juices", by modidying the series name in the Excel toolbar, after we selected the series by clicking on one of the points. As almost always with XLSTAT, the charts are Excel charts and can easily be modified.

pls6.gif

Running a second Partial Least Squares regression on clustered data

In their article, Tenenhaus et al. interpret this chart in detail. They deduce from it the existence of 4 well identified clusters of judges. They advise re-running Partial Least Squares regression on each of these groups. This way they obtain better Q² and R². For the first group, the R²Y is 0.63 instead of the 0.53 we obtained with all the judges.

Two tables giving results for the u and u~ components are then displayed. A chart allows to visualize the observations (in our case the juices) in the space of the u~.

The following tables allow to visualize for each dependent variable the Q² et cumulated Q² indexes as a function of the number of components. We notice that for several variables, the maximum of the cumulated Q² is obtained for one or two components only (for example J5, J6, J7).

A series of tables with the R² for each entry variable with the t components is optionally displayed. The option is not activated by default, and the tables are not taken into account in this tutorial.

The following table displays the VIPs (Variable Importance for the Projection) for each explanatory variable, for an increasing number of components. This allows to quickly identify which are the explanatory variables that contribute the most to the models. For the model with one component we can see that the Vitamin C, the Sweetening power, the Odor intensity and the Taste intensity have a low infludence on the models.

pls7.gifpls71.gif

The next table displays the parameters (or coefficients) of the models corresponding to each dependent variable. The equations of the models are displayed below that table. The equations can be later reused for simulation or prediction purposes.

For each model, XLSTAT-PLS displays the goodness of fit coefficients, the standardized coefficients table, and the table of predictions and residuals. The analysis of the model corresponding to judge J1 allows to conclude that the model is well fitted (R’² equals 0.88). However the number of degrees of freedom is low, and might be facing an overfitting problem. This is also confirmed when we look at the standardized coefficients: for each coefficient the confidence intervals are wide and include 0. As we noticed that the cumulated Q’² corresponding to this model reaches its maximum value with 2 components, it is likely that a model with only two components would be better.

We have performed a new Partial Least Squares regression analysis, using just J1 as a dependent variable, and forcing the number of components (see the options tab) to 2. The results are displayed on the PLS2 sheet. We can now see that the quality of the results has been improved. The following chart correspond to the standardized coefficients of the new model.

pls8.gif

We can see here that the coefficients are significantly different from zero only for "Smell intensity" and "Odor typicity". The predictions and residuals table allows to verify that the rates given by the judge 1 are well reproduced by the model.

Last, the table with the DModX and DModY, and the corresponding charts that allow to quickly identify potential outliers, are displayed. In our case there aren't any outlier as all values are lower than DCritX or DCritY.

Watch this video to view a demonstration.