Creating a CHAID classification tree with XLSTAT

Classification and regression trees is part of:
  • Pro Core statistical software

  • System configuration

    • Windows:
      • Versions: 9x/Me/NT/2000/XP/Vista/Win 7
      • Excel: 97 and later
      • Processor: 32 or 64 bits
      • Hard disk: 150 MB
    • Mac OS X:
      • OS: OS X
      • Excel: X, 2004 and 2011
      • Hard disk: 150 MB

Benefits

  • Easy and user-friendly
    Easy and user-friendly XLSTAT is flawlessly integrated with Microsoft Excel which is the most popular spreadsheet worldwide. This integration makes it one of the simplest available tools to work with as it utilizes the same philosophy as Microsoft Excel. The program is accessible in a dedicated XLSTAT tab. The analyses are grouped into functional menus. The dialog boxes are user-friendly and setting up an analysis is straightforward.
  • Data and results shared seamlessly
    Data and results shared seamlessly One of the greatest advantages of XLSTAT is the way you can share data and results seamlessly. As the results are stored in Microsoft Excel, anyone can access them. There is no need for the receiver to have an XLSTAT license or any additional viewer which makes your team-work easier and more affordable. In addition, results are easily integrable into other Microsoft Office software such as PowerPoint, so that you can create striking presentation in minutes.
  • Modular
    XLSTAT is a modular product. XLSTAT-Pro is the core statistical module of XLSTAT and includes all the mainstream functionalities in statistics and multivariate analysis. More advanced features contained in add-on modules can be added for specific applications. This way you can adapt the software to your needs, which makes it more cost-efficient.
  • Didactic
    The results of XLSTAT are organized by analysis and are easy to navigate. Moreover, useful information is provided along with the results to assist you in your interpretation.
  • Affordable
    XLSTAT is a complete and modular analytical solution that can suit any analytical business need. It is very reasonably priced, so that the return on your investment is almost immediate. Any XLSTAT license comes with top-level support and assistance.
  • Accessible - Available in many languages
    We have ensured XLSTAT is accessible to everyone by making the program available in many languages, including Chinese, English, French, German, Italian, Japanese, Polish, Portuguese and Spanish.
  • Automatable and customizable
    Most of the statistical functions available in XLSTAT can be called directly from the Visual Basic window of Microsoft Excel. They can be modified and integrated into larger programs to fit the specifics of your domain. Adding tables and plots as well as modifying existing outputs becomes easy. Furthermore, the XLSTAT dialog boxes include special tools that automatically generate the VBA code needed to reproduce your analysis from the VBA editor, or that simply load pre-set settings. Automating routine analyses in this way can save you a great deal of time.

Dataset for creating a CHAID classification tree

An Excel sheet containing both the data and the results for use in this tutorial can be downloaded by clicking here.

The data are from [Fisher R.A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, 179-188] and correspond to 150 Iris flowers, described by four variables (sepal length, sepal width, petal length, petal width) and their species. Three different species are included in this study: setosa, versicolor and virginica.

iris_setosa.jpg iris_versicolor.jpg iris_virginica.jpg

Iris setosa, versicolor and virginica.

Goal of this CHAID classification tree

Our goal is to test whether the four descriptive variables can efficiently predict the species of a flower and, if so, to identify rules that help classify the flowers on the basis of these four variables.

Note: the same case is treated in the tutorial on discriminant analysis.

Setting up the dialog box to generate a CHAID classification tree

After opening XLSTAT, select the XLSTAT / Modeling data / Classification and regression trees command, or click the corresponding button of the Modeling data toolbar (see below).

bartree.gif

Once you've clicked the button, the dialog box appears. The qualitative dependent variable corresponds here to the "Species" variable.

The quantitative Explanatory variables are the four descriptive variables.

We choose the CHAID algorithm and set the maximum tree depth to 3 to avoid obtaining an overly complex tree.

tree1.gif

In the Options tab, several technical options give finer control over the way the tree is built.

tree2.gif

In the Charts tab, we first select the Bar charts option to display the distribution of the species at each node.

As we will see later, the Pie charts option is also used in this tutorial.

tree3.gif

The computations begin once you have clicked on OK. The results will then be displayed.

Interpreting the results of a CHAID classification tree

Below the simple statistics for all the selected variables, XLSTAT displays information on the tree structure. For each node, this includes the p-value of the split, the number of objects at the node, the corresponding %, the parent and son nodes, the split variable, the value(s) or intervals of the latter, and the purity, which is the % of objects at the node that belong to the dominant category of the dependent variable.

tree4.gif

The next result displayed is the classification tree.

tree5.gif

This diagram shows the successive steps in which the CHAID algorithm identifies the variables that best split the categories of the dependent variable. We see that, using only the petal length, the algorithm has found a rule that perfectly separates the Iris flowers of the setosa species: if the petal length is between 10 and 24.5, then the species is setosa.
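To give an idea of how such a split can be scored, here is a minimal Python sketch of the Pearson chi-square statistic on which CHAID-type algorithms are based. It is not XLSTAT's implementation: the function name `chi_square` is hypothetical, and the category merging and p-value adjustment steps of CHAID are left out. The first table assumes 50 flowers per species, with setosa perfectly isolated by the petal length interval as described above.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table.

    Rows are the intervals of the candidate split variable,
    columns are the categories of the dependent variable.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Columns: setosa, versicolor, virginica.
# Row 1: flowers inside the petal length interval, row 2: all other flowers.
perfect_split = [[50, 0, 0], [0, 50, 50]]
# An uninformative split, shown for comparison (hypothetical counts).
useless_split = [[25, 25, 25], [25, 25, 25]]

print(chi_square(perfect_split))
print(chi_square(useless_split))
```

The larger the statistic for a given table size, the smaller the p-value of the split, so the first split would be strongly preferred over the second.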

The information available at each node is explained below.

tree51.gif

The algorithm stops when no additional rule can be found, or when one of the limits set by the user is reached (number of objects at a parent or son node, maximum tree depth, threshold p-value for splitting).

XLSTAT offers a second way to visualize classification trees. Instead of bar charts, it uses pie charts. The latter are easier to read when there are many nodes and many categories for the dependent variable. The inner circle of the pie corresponds to the relative frequencies of the categories to which the objects contained in the node belong. The outer ring shows the distribution of the categories at the parent node.
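The stopping logic can be sketched in Python as follows. This is only an illustration of the limits listed above, not XLSTAT's code: the function name, the parameter names and all default values except the maximum depth of 3 used in this tutorial are assumptions, and the root node is assumed to be at depth 0.

```python
def stop_reason(n_objects, depth, best_p_value, son_sizes,
                min_parent_size=10, min_son_size=5,
                max_depth=3, p_threshold=0.05):
    """Return why a node must not be split, or None if it may be split.

    All thresholds are hypothetical defaults, except max_depth=3,
    which is the value chosen in this tutorial.
    """
    if depth >= max_depth:
        return "maximum tree depth reached"
    if n_objects < min_parent_size:
        return "too few objects at the parent node"
    if best_p_value is None:
        return "no candidate split found"
    if best_p_value > p_threshold:
        return "p-value above the splitting threshold"
    if any(size < min_son_size for size in son_sizes):
        return "too few objects at a son node"
    return None


# Root node with 150 flowers and a highly significant candidate split.
print(stop_reason(150, 0, 1e-30, [50, 100]))
# Node already at the maximum depth.
print(stop_reason(54, 3, 1e-5, [48, 6]))
```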

tree6.gif

The following table presents the rules built by the algorithm in a less visual but more readable way: the rules are written in natural language. The purity gives the % of objects at the node that belong to the majority category. The number of objects that belong to this category is also displayed.

tree7.gif

In this way, we see that "If PETAL LENGTH is in the interval [30; 49.5[ and PETAL WIDTH is in the interval [10; 16.5[ then SPECIES is Versicolor in 100% of cases"; this rule is verified by 47 flowers.

The rules that correspond to the leaves of the tree (the terminal nodes) are used to compute predictions for each observation, with a probability that depends on the distribution of the categories at the leaf level. These results are displayed in the "Results by object" table.
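The notation [a; b[ denotes an interval that includes a and excludes b. As a minimal sketch, the rule quoted above can be written as a Python predicate; the function names are hypothetical and the values are assumed to be on the same scale as in the tutorial's dataset.

```python
def in_interval(value, low, high):
    """True if value lies in [low; high[, that is low <= value < high."""
    return low <= value < high


def is_versicolor_rule(petal_length, petal_width):
    """Rule quoted above: PETAL LENGTH in [30; 49.5[ and PETAL WIDTH in [10; 16.5[."""
    return in_interval(petal_length, 30, 49.5) and in_interval(petal_width, 10, 16.5)


print(is_versicolor_rule(45, 15))    # both values inside their intervals
print(is_versicolor_rule(49.5, 15))  # upper bound of the interval is excluded
```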
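As an illustration of how a leaf yields both a prediction and its probability, the following Python sketch turns the category counts of a leaf into relative frequencies and picks the most frequent category. The function name and the counts are hypothetical and are not taken from the tutorial's results.

```python
def leaf_prediction(leaf_counts):
    """Return (predicted category, probability of each category) for one leaf.

    leaf_counts maps each category to the number of training objects
    of that category that fall into the leaf.
    """
    total = sum(leaf_counts.values())
    probabilities = {category: n / total for category, n in leaf_counts.items()}
    predicted = max(probabilities, key=probabilities.get)
    return predicted, probabilities


# Hypothetical leaf containing 3 virginica flowers and 1 versicolor flower.
predicted, probabilities = leaf_prediction({"virginica": 3, "versicolor": 1})
print(predicted, probabilities)
```

Every observation that falls into this leaf would receive the same prediction and the same probabilities.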

tree8.gif

We see that 3 observations have been misclassified by the algorithm. This result is almost identical to that of the discriminant analysis, in which observations 5, 9 and 12 are misclassified.

The confusion matrix summarizes the reclassification of the observations and lets you quickly see the % of well-classified observations, that is, the ratio of the number of well-classified observations to the total number of observations. Here it is equal to 98%.
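The 98% figure follows from 147 well-classified observations out of 150. The Python sketch below computes this ratio from a confusion matrix. The matrix is illustrative only: its totals match the tutorial (150 flowers, 3 errors), but the cells in which the 3 errors are placed are an assumption.

```python
def percent_well_classified(matrix):
    """Percentage of observations on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Rows: observed species, columns: predicted species
# (setosa, versicolor, virginica). The position of the 3 errors is assumed.
confusion = [
    [50, 0, 0],
    [0, 47, 3],
    [0, 0, 50],
]

print(percent_well_classified(confusion))
```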

tree9.gif

The trees created by XLSTAT are partially dynamic. You can prune the tree at a given level for all branches, or you can prune only one given branch. To prune the tree, first click on a node. When the six grey dots appear around the node, right-click to display the contextual menu:

tree10.gif

If we decide to hide a subtree, the tree is re-created without the branches starting from the selected node. The outline of that node is displayed in red.

tree11.gif

The hidden subtree can of course be displayed again afterwards using the same contextual menu.
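The two pruning modes can be pictured with a small tree structure. The following Python sketch is purely illustrative and says nothing about how XLSTAT stores its trees: the `Node` class, its field names and the node names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    hidden_subtree: bool = False  # True when the branches below this node are hidden


def visible_nodes(node, depth=0, max_level=None):
    """Names of the nodes still displayed after pruning.

    max_level prunes all branches at a given level; hidden_subtree
    prunes only the branch that starts at one node.
    """
    names = [node.name]
    if node.hidden_subtree or (max_level is not None and depth >= max_level):
        return names
    for child in node.children:
        names.extend(visible_nodes(child, depth + 1, max_level))
    return names


root = Node("1", [Node("2"), Node("3", [Node("4"), Node("5")])])

root.children[1].hidden_subtree = True    # hide the subtree below node 3
print(visible_nodes(root))
root.children[1].hidden_subtree = False   # display the hidden subtree again
print(visible_nodes(root))
print(visible_nodes(root, max_level=1))   # prune every branch at level 1
```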