Running a Discriminant Analysis with XLSTAT

Dataset for Discriminant Analysis (DA) XLS184 KB

Tutorial video
Discriminant Analysis (DA) is part of: Download Trial version More details See users' feedback
  • Pro Core statistical software

  • System configuration

    • Windows:
      • Versions: 9x/Me/NT/2000/XP/Vista/Win 7/Win 8
      • Excel: 97 and later
      • Processor: 32 or 64 bits
      • Hard disk: 150 Mb
    • Mac OS X:
      • OS: OS X
      • Excel: X, 2004 and 2011
      • Hard disk: 150Mb.

Benefits

  • Easy and user-friendly
    Easy and user-friendly XLSTAT is flawlessly integrated with Microsoft Excel which is the most popular spreadsheet worldwide. This integration makes it one of the simplest available tools to work with as it utilizes the same philosophy as Microsoft Excel. The program is accessible in a dedicated XLSTAT tab. The analyses are grouped into functional menus. The dialog boxes are user-friendly and setting up an analysis is straightforward.
  • Data and results shared seamlessly
    Data and results shared seamlessly One of the greatest advantages of XLSTAT is the way you can share data and results seamlessly. As the results are stored in Microsoft Excel, anyone can access them. There is no need for the receiver to have an XLSTAT license or any additional viewer which makes your team-work easier and more affordable. In addition, results are easily integrable into other Microsoft Office software such as PowerPoint, so that you can create striking presentation in minutes.
  • Modular
    Modular XLSTAT is a modular product. XLSTAT-Pro is a core statistical module of XLSTAT which includes all the mainstream functionalities in statistics and multivariate analysis. More advanced features contained in add-on modules can be added for specific applications. This way you can adapt the software to your needs making the software more cost-efficient.
  • Didactic
    Didactic The results of XLSTAT are organized by analysis and are easy to navigate. Moreover useful information is provided along with the results to assist you in your interpretation.
  • Affordable
    Affordable XLSTAT is a complete and modular analytical solution that can suit any analytical business needs. It is very reasonably priced so that the return of your investment is almost immediate. Any XLSTAT license comes with top level support and assistance.
  • Accessible - Available in many languages
    Accessible - Available in many languages We have ensured XLSTAT is accessible to everyone by making the program available in many languages, including Chinese, English, French, German, Italian, Japanese, Polish, Portuguese and Spanish.
  • Automatable and customizable
    Automatable and customizable Most of the statistical functions available in XLSTAT can be called directly from the Visual Basic window of Microsoft Excel. They can be modified and integrated to more code to fit to the specificity of your domain. Adding tables and plots as well as modifying existing outputs becomes easy. Furthermore, XLSTAT includes some special tools on the dialog boxes to generate automatically the VBA code in order to reproduce your analysis using the VBA editor or to simply load pre-set settings. This effortless automation of routine analysis will be a huge time saver on your part.

Dataset for running a Discriminant Analysis

An Excel sheet containing both the data and the results for use in this tutorial can be downloaded by clicking here.

The data are from [Fisher M. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, 179 -188] and correspond to 150 Iris flowers, described by four variables (sepal length, sepal width, petal length, petal width) and their species. Three different species have been included in this study: setosa, versicolor and virginica.

Goal of this Discriminant Analysis

Our goal is to test if the four variables allow to discrimine the species, and to visualize the observations on a 2-dimensional map that shows as well as possible how separated the groups are.

iris_setosa.jpgiris_versicolor.jpgiris_virginica.jpg

Iris setosa, versicolor and virginica.

Setting up a Discriminant Analysis

After opening XLSTAT, select the XLSTAT / Analyzing data / Discriminant analysis command, or click on the corresponding button of the Analyzing data toolbar (see below).

barda.gif

Once you've clicked on the button, the Discriminant analysis dialog box appears. The qualitative dependent variable corresponds here to the "Species" variable. The quantitative Explanatory variables are the four descriptive variables.

da1.gif

We uncheck the Equality of covariance matrices option because, as we will see with the Box's test, assuming that the covariance matrices of the three species are equal would be wrong.

da1-2.gif

Many results are optionally displayed by XLSTAT. We can see below which options have been activated for this particular case.

da1-3.gif

In order to avoid adding too much information on the plots, we have unchecked the Labels option in the Charts tab.

da1-4.gif

The computations begin once you have clicked on OK. The results will then be displayed.

Interpreting the results of a Discriminant Analysis

The first results displayed are the various matrices used for the computations. The two Box's tests confirm that we need to reject the hypothesis that the covariance matrices are equal between the groups.

da2.gif

The Wilks' Lambda test allows to test if the vector of the means for the various groups are equal or not (you can understand it as a multidimensional version of the Fisher's LSD or the Tukey's HSD tests). We see that the difference between the means vectors of the groups is significant.

da3.gif

The next table shows the eigenvalues and the corresponding % of variance. We can see that 99% of the variance is represented with the first factor. There are only two factors: the maximum number of factors is equal to k-1, when n>p>k, where n is the number of observations, p the number of explanatory variables, and k the number of groups.

da4.gif

The following chart shows how the initial variables are correlated with the two factors (this chart corresponds to the factor loadings table). We can see that the factor F1 is correlated with Sepal length, Petal length, and Petal width and that F2 is correlated with Sepal width.

da5.gif

The following table displays the discriminant functions. When we assume the equality of the covariance matrices, the discriminant functions are linear. When the equality is not assumed, which is the case in this tutorial, the discriminant functions are quadratic. The rule based on these functions is that we allocate an observation to the group corresponding to the function that gives the greatest value. These functions can be used in predictive mode on new observations to allocate them to a group.

da6.gif

The next table lists for each observation the factor scores (the coordinates of the observations in the new space), the probability to belong to each group, and the squared Mahalanobis distances to the centroid of the group. Each observation is classified into the group for the which the probability of belonging is the greatest. The probabilities are posterior probabilities that take into account the prior probabilities through the Bayes formula. We notice that three observations (5,9,12) have been reclassified. There are several ways in which these results can be interpreted: either the person who made the measures made an error when recording the values, or the corresponding iris flowers have had a very unusual growth or the criteria used by the specialist to determine the species was not precise enough, or some information necessary to discrimine the flowers is not available here.

da7.gif

The following chart represents the observations on the factor axes. It allows to confirm that the species are very well discriminated on the factor axes extracted from the original explanatory variables.

da8.gif

The confusion matrix summarizes the reclassification of the observations, and allows to quickly see the % of well classified observations, which is the ratio of the number of observations that have been well classified over the total number of observations. It is here equal to 98%.

da9.gif

As the corresponding option has been activated in the "Outputs" tab of the dialog box, the predictions for the cross-validation are computed. Cross-validation allows to see what would be the prediction for a given observation if it is left out of the estimation sample. We can see here that only one more observation (Obs8) is miss-classified.

da10.gif

The confusion matrix of the cross-validation is displayed below.

da11.gif