CLUSTATIS

Use CLUSTATIS to perform a cluster analysis of subjects on the basis of their perceptions of products. Available in Excel with the XLSTAT software.

Clustering tables by means of CLUSTATIS in Excel

Data made up of several blocks of variables are increasingly common. Sensory analysis is particularly affected, since many tasks produce this type of data, with each consumer/judge/subject providing a configuration/table (e.g. projective mapping/Napping, conventional profiling, free choice profiling). As perceptions often differ between subjects, a clustering of the subjects may be necessary. The CLUSTATIS method fits into this context. Moreover, this strategy makes it possible to set aside configurations that do not conform to any of the constructed classes; in sensory analysis, these correspond to atypical subjects.

What is CLUSTATIS?

Description

CLUSTATIS is a clustering method based on the scalar product matrix of each configuration, which makes it possible to handle configurations with different numbers of columns. The objective is to form classes of configurations that are as homogeneous as possible, each class being represented by a latent configuration (called the consensus) determined by STATIS. It is therefore natural to analyse each class by STATIS at the end, in order to identify the differences between the classes obtained. CLUSTATIS consists of a hierarchical algorithm that can be "consolidated" by a partitioning algorithm (i.e. the partitioning algorithm is initialized with the partition obtained by cutting the dendrogram). An interesting option is the creation of an additional class, "K+1", in order to set aside tables that do not conform to any class. A configuration is placed in this class if the similarities (RV coefficients) between it and the consensus of each class are all considered weak.
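To illustrate why scalar product matrices allow configurations with different numbers of columns to be compared, here is a minimal sketch in plain Python. The function names and the data are hypothetical, and column centering before the product is an assumption of this sketch; it is not XLSTAT's implementation.

```python
def center_columns(table):
    """Subtract each column's mean so that every variable has mean zero."""
    n = len(table)
    p = len(table[0])
    means = [sum(row[j] for row in table) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in table]


def scalar_product_matrix(table):
    """Return W = X X^T for the column-centered table X (n products x p variables)."""
    x = center_columns(table)
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in x] for ri in x]


# Hypothetical data: two subjects rate the same 3 products,
# one with 2 attributes and the other with 3 attributes.
subject_a = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
subject_b = [[2.0, 1.0, 0.0], [4.0, 1.0, 3.0], [6.0, 4.0, 3.0]]

w_a = scalar_product_matrix(subject_a)
w_b = scalar_product_matrix(subject_b)

# Both matrices are products x products, whatever the number of columns.
print(len(w_a), len(w_a[0]))  # 3 3
print(len(w_b), len(w_b[0]))  # 3 3
```

Because every matrix has one row and one column per product, the matrices can be compared directly even though the original tables have different widths.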

Structure of the data

Two cases exist:

1. The number p of variables is the same for all m configurations.

2. The number p of variables varies from one configuration to another.

For data entry, XLSTAT asks you to select a single table containing the m contiguous configurations, and to specify which of the two structures applies.

Scaling

If the variables within a configuration are not on the same scale, it is advisable to scale (reduce) the variables of each configuration. For example, scaling is not needed when all ratings lie between 0 and 20, but it is advised if some ratings lie between 0 and 10 and others between 0 and 20.
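As an illustration of what scaling does, the following sketch centers each column of a configuration and divides it by its standard deviation. The function name and the data are hypothetical, and the use of the population standard deviation is an assumption of this sketch; XLSTAT may use a different convention.

```python
import statistics


def scale_columns(table):
    """Center each column and divide it by its (population) standard deviation."""
    n_cols = len(table[0])
    columns = [[row[j] for row in table] for j in range(n_cols)]
    scaled_columns = []
    for col in columns:
        mean = statistics.fmean(col)
        sd = statistics.pstdev(col)  # assumption: population standard deviation
        if sd == 0:
            raise ValueError("a constant column cannot be scaled")
        scaled_columns.append([(v - mean) / sd for v in col])
    # Turn the list of columns back into a list of rows.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical ratings: first attribute on a 0-10 scale, second on a 0-20 scale.
ratings = [[2.0, 4.0], [5.0, 10.0], [8.0, 16.0]]
scaled = scale_columns(ratings)
for row in scaled:
    print([round(v, 3) for v in row])
```

In this example the second attribute is exactly twice the first, so after scaling the two columns coincide: the difference in rating scale no longer affects the analysis.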

Interpreting the results

For each class, the representation of the objects in the factor space makes it possible to interpret the proximities between objects visually, provided some precautions are taken. The projection of an object on a plane can be considered reliable if the object is far from the center of the graph.

Since the class "K+1" contains the tables that do not conform to any of the classes, its composition depends strongly on the number of classes.

Number of factors

Two methods are commonly used to determine how many factors must be retained for the interpretation of the results:

- Examine the decreasing curve of eigenvalues (scree plot). The number of factors to keep corresponds to the first turning point (elbow) on the curve.

- Use the cumulative percentage of variability represented by the factor axes, and keep only as many factors as needed to reach a chosen percentage.
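The second rule can be sketched as follows. The function names, the eigenvalues, and the 80% target are all hypothetical and serve only to illustrate the computation.

```python
from itertools import accumulate


def cumulative_variability(eigenvalues):
    """Cumulative percentage of variability carried by the first k factors."""
    total = sum(eigenvalues)
    return [100.0 * c / total for c in accumulate(eigenvalues)]


def factors_needed(eigenvalues, target_percent):
    """Smallest number of factors whose cumulative percentage reaches the target."""
    for k, pct in enumerate(cumulative_variability(eigenvalues), start=1):
        if pct >= target_percent:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues, sorted in decreasing order.
eigenvalues = [5.0, 3.0, 1.5, 0.5]

print(cumulative_variability(eigenvalues))  # [50.0, 80.0, 95.0, 100.0]
print(factors_needed(eigenvalues, 80))      # 2
```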

Graphic representations

The graphical representations of the objects in each class are only reliable if the sum of the variability percentages associated with the axes of the representation space is sufficiently high. If this percentage is high (for example 80%), the representation can be considered reliable. If the percentage is low, it is recommended to produce representations on several pairs of axes in order to validate the interpretation made on the first two factor axes.

Quality of a cluster analysis

To assess the quality of a hierarchical clustering, one can use the increase in within-class variance (the error of the CLUSTATIS criterion) caused by merging two classes. This increase is equal to the height of the dendrogram node at which the two classes of configurations are merged into a single class.

The homogeneity of each class and the global homogeneity are also very important indices (between 1/m and 1, m being the number of configurations) for judging the quality of the cluster analysis. It should be noted that the consolidation and the addition of a class "K+1" can increase the homogeneities.

Results of the CLUSTATIS analysis in XLSTAT

Descriptive statistics: The table of descriptive statistics shows the simple statistics for all the variables selected. This includes the number of observations (objects), the number of missing values, the number of non-missing values, the mean and the standard deviation (unbiased).

RV matrix: The matrix of RV coefficients between all configurations is displayed. The RV coefficient is an index of similarity between two configurations that lies between 0 and 1. The closer it is to 1, the stronger the similarity.
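For readers who want to see how such a matrix can be obtained, here is a minimal sketch of the usual RV formula, RV(W1, W2) = trace(W1 W2) / sqrt(trace(W1 W1) trace(W2 W2)), applied to scalar product matrices. The function names and the three small matrices are hypothetical, and the sketch does not claim to reproduce XLSTAT's exact computation.

```python
import math


def frobenius_inner(a, b):
    """Sum of elementwise products; equals trace(A B) when A and B are symmetric."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rv_coefficient(w1, w2):
    """RV coefficient between two symmetric scalar product matrices."""
    return frobenius_inner(w1, w2) / math.sqrt(
        frobenius_inner(w1, w1) * frobenius_inner(w2, w2)
    )


# Hypothetical scalar product matrices for three configurations of 3 products.
w1 = [[4.0, 0.0, -4.0], [0.0, 4.0, -4.0], [-4.0, -4.0, 8.0]]
w2 = [[2.0 * v for v in row] for row in w1]  # same configuration, rescaled
w3 = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

configs = [w1, w2, w3]
rv_matrix = [[rv_coefficient(a, b) for b in configs] for a in configs]
for row in rv_matrix:
    print([round(v, 3) for v in row])
```

The rescaled copy has an RV of 1 with the original, which shows that the coefficient is insensitive to a global change of scale, while the third configuration is only weakly similar to the other two.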

Node statistics: This table shows the data for the successive nodes in the dendrogram. The first node has an index which is the number of configurations increased by 1. Hence it is easy to see at any time if a configuration or group of configurations is clustered with another group of configurations in the dendrogram.

Levels bar chart: This chart displays the statistics for the dendrogram nodes, which correspond to the increase in the CLUSTATIS minimization criterion (equivalent to the increase in within-class variance) when two classes are merged.

Dendrograms: The full dendrogram displays the progressive clustering of configurations. If truncation has been requested, a broken line marks the level at which the truncation has been carried out. The truncated dendrogram shows the classes after truncation.

Composition of classes:

Results by configuration: This table shows the class assigned to each configuration, in the initial order of the configurations. If a consolidation is requested, the results are given before and after the consolidation. If you have checked "class K+1", some tables may have a missing value after consolidation. This means that they are not placed in any of the main classes (they are placed in class K+1).

Results by class: The results are given by class. Thus, a list of configurations is displayed for each class.

Number of configurations per class: The number of configurations in each class is indicated.

Analysis of the class k:

In this section, the analysis of each of the classes by the STATIS method is displayed.

Eigenvalues: The eigenvalues and corresponding chart (scree plot) are displayed.

Consensus coordinates: Consensus coordinates in the factors space are displayed, with the corresponding charts (depending on the number of factors chosen).

Consensus configuration: The consensus configuration is displayed. It corresponds to the weighted average of the scalar product matrices of the initial configurations (scaled globally and, optionally, by variable).

RV config/consensus: The RV coefficients between the configurations and the consensus are displayed, with the associated bar chart. Like the weights of STATIS, these coefficients make it possible to detect atypical configurations. The advantage of these coefficients is that they are between 0 and 1, so they are easier to interpret than the weights.

Weights: The weights calculated by STATIS are displayed, with the associated bar chart. The greater the weight, the more the configuration contributed to the consensus. Since STATIS gives more weight to the configurations that are globally closest to the others, a weight much lower than the others indicates that the configuration is atypical.
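The link between weights and atypical configurations can be illustrated with a small sketch. It assumes the common STATIS convention in which the weights are the components of the leading eigenvector of the RV matrix, rescaled to sum to 1; this convention, the function names, and the RV values are assumptions of the sketch, not a description of XLSTAT's internals.

```python
def leading_eigenvector(matrix, iterations=200):
    """Power iteration: approximate the eigenvector of the largest eigenvalue."""
    m = len(matrix)
    v = [1.0] * m
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def statis_weights(rv_matrix):
    """Assumed convention: leading eigenvector of the RV matrix, summing to 1."""
    v = leading_eigenvector(rv_matrix)
    total = sum(v)
    return [x / total for x in v]


# Hypothetical RV matrix: configurations 1 and 2 agree, configuration 3 does not.
rv = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]

weights = statis_weights(rv)
print([round(w, 3) for w in weights])
```

The third configuration, which has low RV coefficients with the other two, receives the smallest weight, which is how an atypical configuration shows up in the weights bar chart.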

Indices:

Homogeneities: The homogeneity of each class is displayed. It is a value between 1/m (m being the number of configurations of the class) and 1, which increases with the homogeneity of the configurations. In a second step, the global homogeneity, which is a weighted average of the homogeneity of each class, is displayed.
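The bounds 1/m and 1 can be checked on a small example. This sketch assumes that the homogeneity of a class is the largest eigenvalue of the class's RV matrix divided by m, and that the global homogeneity weights each class by its number of configurations; both points are assumptions of the sketch (the text above does not give the formulas), and the function names and data are hypothetical.

```python
def largest_eigenvalue(matrix, iterations=200):
    """Power iteration followed by a Rayleigh quotient v' M v."""
    m = len(matrix)
    v = [1.0 / m ** 0.5] * m
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
    return sum(a * b for a, b in zip(v, mv))


def homogeneity(rv_matrix):
    """Assumed definition: largest eigenvalue of the RV matrix divided by m."""
    return largest_eigenvalue(rv_matrix) / len(rv_matrix)


def global_homogeneity(class_rv_matrices):
    """Assumed weighting: average of class homogeneities weighted by class size."""
    sizes = [len(rv) for rv in class_rv_matrices]
    weighted = sum(s * homogeneity(rv) for s, rv in zip(sizes, class_rv_matrices))
    return weighted / sum(sizes)


# Hypothetical classes: two identical configurations (all RV = 1), and
# three unrelated configurations (all off-diagonal RV = 0).
identical = [[1.0, 1.0], [1.0, 1.0]]
unrelated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(round(homogeneity(identical), 3))                      # 1.0
print(round(homogeneity(unrelated), 3))                      # 0.333
print(round(global_homogeneity([identical, unrelated]), 3))  # 0.6
```

The two extreme classes reach the upper bound 1 and the lower bound 1/m respectively, which matches the range stated above.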

Global Error/Within-class Variance: The error of the CLUSTATIS criterion is displayed. It corresponds to the within-class variance.

RV between consensus: The matrix of the RV coefficients between the consensus configurations of the classes is displayed. This matrix shows how close the classes are to each other.