Gaussian mixture models
Gaussian Mixture Models (GMM) are a popular probabilistic clustering method. They are available in Excel using the XLSTAT statistical software.
What are Gaussian mixture models?
Mixture models were first mentioned by Pearson in 1894, but their development is mainly due to the EM (Expectation-Maximization) algorithm of Dempster et al. in 1977.
These models are commonly used for clustering. They provide a framework for assessing partitions of the data by considering that each component represents a cluster. Mixture models have two main advantages:
- They are a probabilistic method that yields a fuzzy classification of the observations. The probability of belonging to each cluster is calculated, and a classification is usually obtained by assigning each observation to the most likely cluster. These probabilities can also be used to interpret suspect classifications.
- Mixture modeling is very flexible.
The aim of mixture models is to structure a dataset into several clusters. XLSTAT proposes the use of a mixture of Gaussian distributions.
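To make the fuzzy classification concrete, here is a minimal Python sketch of how membership probabilities and the most likely cluster can be computed for a one-dimensional Gaussian mixture with known parameters. The function names and the parameter values are hypothetical and chosen for illustration only; this is not XLSTAT's implementation.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def membership_probabilities(x, proportions, means, stds):
    """Posterior probability that x belongs to each component (Bayes' rule)."""
    weighted = [p * normal_pdf(x, m, s) for p, m, s in zip(proportions, means, stds)]
    total = sum(weighted)  # mixture density at x
    return [w / total for w in weighted]

def map_cluster(probabilities):
    """Index of the most likely cluster (first one in case of a tie)."""
    return max(range(len(probabilities)), key=lambda k: probabilities[k])

# Hypothetical two-component mixture, parameters chosen for illustration only.
proportions = [0.5, 0.5]
means = [0.0, 4.0]
stds = [1.0, 1.0]

for x in (0.5, 2.0, 3.5):
    probs = membership_probabilities(x, proportions, means, stds)
    print(x, [round(p, 3) for p in probs], "-> cluster", map_cluster(probs))
```

An observation halfway between the two means receives a probability of 0.5 for each cluster, which is the kind of uncertain assignment the membership probabilities help to identify.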
Mixture models in XLSTAT
By controlling the covariance matrix according to the eigenvalue decomposition of Celeux et al., XLSTAT offers 14 different Gaussian mixture models. It is also possible to force the mixing proportions to be equal.
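For reference, the eigenvalue parameterization usually attributed to Celeux and Govaert writes the covariance matrix of cluster k as shown below. This is the standard form from the literature; it is assumed here, not confirmed, that XLSTAT uses the same notation.

```latex
\Sigma_k = \lambda_k \, D_k \, A_k \, D_k^{\top},
\qquad \lambda_k = |\Sigma_k|^{1/d},
\qquad |A_k| = 1
```

In this form, \lambda_k is the volume of the cluster, D_k is the orthogonal matrix of eigenvectors (orientation), A_k is the diagonal matrix of normalized eigenvalues (shape), and d is the number of variables. Letting each of the three terms be either equal or free across clusters, and adding the diagonal and spherical special cases, gives the family of 14 models.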
Inference algorithms used in XLSTAT for mixture models
XLSTAT offers the possibility to use three different inference algorithms to estimate the Gaussian parameters of the 14 models:
- EM: This is the standard algorithm used for inference in mixture models.
- SEM: This is a stochastic version of the EM algorithm. A stochastic step is added for assigning observations to clusters. This algorithm can lead to empty clusters and disrupt the parameter estimation.
- CEM: This is a classifying version of the EM algorithm. A classification step is added for assigning observations to clusters by the MAP (Maximum A Posteriori) rule. This algorithm can lead to empty clusters and disrupt the parameter estimation.
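The difference between the three algorithms can be sketched in a few lines of Python for a one-dimensional mixture. This is a simplified illustration under stated assumptions (fixed number of iterations, no convergence test, a hypothetical `fit_mixture` function and toy data), not the procedure XLSTAT actually runs.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_mixture(data, means, variances, proportions, variant="em", n_iter=50, seed=0):
    """Fit a 1-D Gaussian mixture; variant is 'em', 'sem' or 'cem'."""
    rng = random.Random(seed)
    k = len(means)
    means, variances, proportions = list(means), list(variances), list(proportions)
    for _ in range(n_iter):
        # E step: posterior membership probabilities for every observation.
        weights = []
        for x in data:
            w = [proportions[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(w)
            weights.append([v / total for v in w])
        if variant == "sem":
            # S step: draw one cluster per observation from its probabilities.
            labels = [rng.choices(range(k), weights=w)[0] for w in weights]
            weights = [[1.0 if j == z else 0.0 for j in range(k)] for z in labels]
        elif variant == "cem":
            # C step: assign each observation to its most likely cluster (MAP).
            labels = [max(range(k), key=lambda j: w[j]) for w in weights]
            weights = [[1.0 if j == z else 0.0 for j in range(k)] for z in labels]
        # M step: weighted updates of proportions, means and variances.
        for j in range(k):
            nj = sum(w[j] for w in weights)
            if nj == 0.0:
                raise ValueError("empty cluster: parameters cannot be updated")
            mu = sum(w[j] * x for w, x in zip(weights, data)) / nj
            var = sum(w[j] * (x - mu) ** 2 for w, x in zip(weights, data)) / nj
            proportions[j] = nj / len(data)
            means[j] = mu
            variances[j] = max(var, 1e-6)  # floor to avoid a degenerate component
    return proportions, means, variances

# Toy data with two well-separated groups (illustrative values only).
data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.1, 5.2, 4.8]
for variant in ("em", "sem", "cem"):
    print(variant, fit_mixture(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5], variant=variant))
```

Because SEM and CEM replace the soft weights with hard assignments, a cluster can end up with no observations at all. The sketch raises an error in that case, which illustrates why empty clusters disrupt the parameter estimation.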
Selecting the number of components in XLSTAT
In practice, the number of components is often unknown. XLSTAT offers four different criteria to estimate it:
- BIC: The Bayesian Information Criterion is a penalized likelihood-based criterion. This is the criterion commonly used in mixture models.
- AIC: The Akaike Information Criterion is a penalized likelihood-based criterion. This criterion tends to overestimate the number of components.
- ICL: The Integrated Complete Likelihood is a penalized likelihood-based criterion: it is the BIC further penalized by the entropy. This criterion favors models that provide well-separated clusters. Generally, the selected number of components is lower than the one selected by BIC.
- NEC: The Normalized Entropy Criterion looks for a model that provides well-separated clusters. The NEC is not defined for a model with one component. This criterion is used to select the number of components, not the covariance matrix.
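The four criteria above can be sketched from the log-likelihood, the number of free parameters, the number of observations and the membership probabilities. The formulas below are common textbook forms in a "lower is better" convention; sign and scaling conventions differ between software packages, so they are an assumption rather than XLSTAT's exact definitions. The function names and the numbers in the usage lines are hypothetical.

```python
import math

def entropy(memberships):
    """Entropy of the fuzzy classification: -sum_i sum_k t_ik * log(t_ik)."""
    return -sum(t * math.log(t) for row in memberships for t in row if t > 0.0)

def selection_criteria(loglik, n_params, n_obs, memberships, loglik_one_component=None):
    """Return BIC, AIC, ICL and NEC (assumed 'lower is better' convention)."""
    ent = entropy(memberships)
    bic = -2.0 * loglik + n_params * math.log(n_obs)
    aic = -2.0 * loglik + 2.0 * n_params
    icl = bic + 2.0 * ent  # BIC with an additional entropy penalty
    nec = None  # left undefined without a one-component reference fit
    if loglik_one_component is not None and loglik > loglik_one_component:
        nec = ent / (loglik - loglik_one_component)
    return {"BIC": bic, "AIC": aic, "ICL": icl, "NEC": nec}

# Hypothetical fit: 4 observations, 2 components, 5 free parameters.
memberships = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.3, 0.7]]
print(selection_criteria(loglik=-6.0, n_params=5, n_obs=4,
                         memberships=memberships, loglik_one_component=-9.0))
```

With perfectly separated clusters every membership probability is 0 or 1, the entropy is zero, and ICL coincides with BIC. The more the clusters overlap, the larger the entropy term, which is why ICL tends to select fewer components than BIC.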
Results of the mixture models in XLSTAT
XLSTAT offers the following results for mixture models:
- The values of the selection criterion for the selected set of models and for a number of components varying within a range defined by the user.
- Estimation of model parameters: mixing proportions, means and variances by cluster for the selected model.
- Some characteristics of the selected model: BIC, AIC, ICL, log-likelihood, NEC, entropy and DF (degrees of freedom).
- The probability of belonging to each cluster and the MAP classification.
In the one-dimensional case, XLSTAT offers two diagnostic plots:
- Plot of the empirical cumulative distribution function against the estimated one.
- Q-Q plot between the quantiles of the empirical distribution and those of the estimated mixture.
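The coordinates behind these two plots can be sketched as follows for a one-dimensional mixture. This is a minimal illustration: the function names, the plotting positions (i + 0.5) / n and the bisection search for quantiles are assumptions made for the example, not a description of how XLSTAT builds its charts.

```python
import math

def mixture_cdf(x, proportions, means, stds):
    """Cumulative distribution function of a univariate Gaussian mixture at x."""
    return sum(p * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
               for p, m, s in zip(proportions, means, stds))

def mixture_quantile(prob, proportions, means, stds, tol=1e-9):
    """Quantile of the mixture, found by bisection on its CDF."""
    lo = min(m - 10.0 * s for m, s in zip(means, stds))
    hi = max(m + 10.0 * s for m, s in zip(means, stds))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, proportions, means, stds) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def diagnostic_points(data, proportions, means, stds):
    """Coordinates for the CDF comparison plot and for the Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    # Plotting positions (i + 0.5) / n: one common convention among several.
    positions = [(i + 0.5) / n for i in range(n)]
    cdf_pairs = [(pos, mixture_cdf(x, proportions, means, stds))
                 for pos, x in zip(positions, xs)]
    qq_pairs = [(mixture_quantile(pos, proportions, means, stds), x)
                for pos, x in zip(positions, xs)]
    return cdf_pairs, qq_pairs

# Hypothetical data and fitted parameters, for illustration only.
data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.1, 5.2, 4.8]
cdf_pairs, qq_pairs = diagnostic_points(data, [0.5, 0.5], [1.0, 5.0], [0.2, 0.2])
for (emp, est), (q_est, q_emp) in zip(cdf_pairs, qq_pairs):
    print(round(emp, 2), round(est, 3), round(q_est, 3), q_emp)
```

In both plots, points close to the diagonal indicate that the estimated mixture reproduces the empirical distribution well.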