Latent Class cluster models
Latent class modeling is a powerful method for obtaining meaningful segments that differ with respect to response patterns associated with categorical or continuous variables or both (latent class cluster models), or differ with respect to regression coefficients where the dependent variable is continuous, categorical, or a frequency count (latent class regression models).
What is Latent Class Analysis?
Latent class analysis involves the construction of latent classes, which are unobserved (latent) subgroups or segments of cases. The latent classes are constructed based on the observed (manifest) responses of the cases on a set of indicator variables. Cases within the same latent class are homogeneous with respect to their responses on these indicators, while cases in different latent classes differ in their response patterns. Formally, latent classes are represented by K distinct categories of a nominal latent variable X. Because the latent variable is categorical, Latent Class modeling differs from more traditional latent variable approaches such as factor analysis, structural equation models, and random-effects regression models, which are based on continuous latent variables.
What is a Latent Class cluster model?
A Latent Class cluster model:
- Includes a nominal latent variable X with K categories, each category representing a cluster.
- Each cluster contains a homogeneous group of persons (cases) who share common interests, values, characteristics, and/or behavior (i.e., share common model parameters).
- These interests, values, characteristics, and/or behaviors constitute the observed variables (indicators) Y from which the latent clusters are derived.
XLSTAT-LG can automatically run the computations for several models that differ in their number of classes. It is also possible to adjust the Bayes constants, the sets of random starting values, and the iteration parameters for both the Expectation-Maximization and Newton-Raphson algorithms, which are used for model estimation.
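To give a feel for the Expectation-Maximization step mentioned above, here is a minimal sketch in plain Python for the simplest case of binary indicators. It is not XLSTAT-LG's implementation: the function name `em_latent_class`, the `floor` guard, and the small data set are all made up for illustration, there is no Newton-Raphson phase, and only a single random start is used.

```python
import random
from math import log

def em_latent_class(data, n_classes, n_iter=200, seed=0, floor=1e-6):
    """Plain EM for a latent class cluster model with binary (0/1) indicators.

    Returns (class_sizes, probs, trace), where probs[k][t] = P(Y_t = 1 | X = k)
    and trace holds the log-likelihood evaluated at the start of each iteration.
    """
    rng = random.Random(seed)
    n_items = len(data[0])
    sizes = [1.0 / n_classes] * n_classes
    # One random start; real software typically tries several starting sets.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]
    trace = []
    for _ in range(n_iter):
        # E-step: posterior membership probabilities for every case.
        posteriors, loglik = [], 0.0
        for row in data:
            joint = []
            for k in range(n_classes):
                p = sizes[k]
                for t, y in enumerate(row):
                    p *= probs[k][t] if y == 1 else 1.0 - probs[k][t]
                joint.append(p)
            total = sum(joint)
            loglik += log(total)
            posteriors.append([j / total for j in joint])
        trace.append(loglik)
        # M-step: re-estimate class sizes and conditional probabilities.
        for k in range(n_classes):
            weight = sum(post[k] for post in posteriors)
            sizes[k] = weight / len(data)
            for t in range(n_items):
                num = sum(post[k] * row[t] for post, row in zip(posteriors, data))
                # Crude guard against boundary estimates (illustrative only).
                probs[k][t] = min(max(num / weight, floor), 1.0 - floor)
    return sizes, probs, trace

# Made-up response patterns on four binary indicators, with their counts.
patterns = {
    (1, 1, 1, 1): 30, (1, 1, 1, 0): 10, (1, 1, 0, 1): 8, (0, 1, 1, 1): 7,
    (1, 0, 1, 1): 9, (0, 0, 0, 0): 25, (0, 0, 0, 1): 9, (0, 0, 1, 0): 8,
    (0, 1, 0, 0): 10, (1, 0, 0, 0): 7, (1, 1, 0, 0): 4, (0, 0, 1, 1): 3,
}
data = [list(p) for p, count in patterns.items() for _ in range(count)]
sizes, probs, trace = em_latent_class(data, n_classes=2)
```

The log-likelihood in `trace` should never decrease from one iteration to the next, which is the property that makes EM a stable first phase before a faster method takes over near the optimum.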
Advantages of Latent Class cluster models over more traditional clustering methods
Advantages of Latent Class cluster models over more traditional ad hoc cluster analysis methods include model selection criteria and probability-based classification. Posterior membership probabilities are estimated directly from the model parameters and used to assign cases to the modal class, that is, the class for which the posterior probability is highest.
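As an illustration of probability-based classification, the following sketch applies Bayes' rule to a small model with nominal indicators and then picks the modal class. The function names and all parameter values are hypothetical; they are not taken from any XLSTAT-LG output.

```python
from math import prod

def posterior_probabilities(class_sizes, cond_probs, response):
    """Return P(X = k | response) for every class k via Bayes' rule.

    class_sizes[k]      -- P(X = k)
    cond_probs[k][t][c] -- P(Y_t = c | X = k) for nominal indicator t
    response[t]         -- observed category of indicator t
    """
    joint = [
        size * prod(cond_probs[k][t][c] for t, c in enumerate(response))
        for k, size in enumerate(class_sizes)
    ]
    total = sum(joint)
    return [j / total for j in joint]

def modal_class(posteriors):
    """Index of the class with the highest posterior probability."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)

# Hypothetical 2-class model with two binary indicators (values made up).
class_sizes = [0.6, 0.4]
cond_probs = [
    [[0.9, 0.1], [0.8, 0.2]],  # class 0
    [[0.2, 0.8], [0.3, 0.7]],  # class 1
]
post = posterior_probabilities(class_sizes, cond_probs, response=[1, 1])
assigned = modal_class(post)
```

For the response pattern `[1, 1]`, the joint probabilities are 0.012 for class 0 and 0.224 for class 1, so the case is assigned to class 1 with a posterior probability of about 0.95.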
Furthermore, it is possible to include variables of different scales (continuous, ordinal or nominal) within the same model. These variables are called indicators.
A special feature of Latent Class cluster models is the ability to obtain an equation for calculating these posterior membership probabilities directly from the observed variables (indicators). This equation is called the scoring equation. It can be used to score new cases based on an LC cluster model estimated previously. That is, the equation can be used to classify new cases into their most likely latent class as a function of the observed variables. This feature is unique to LC models: it is not available with any other clustering technique.
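The sketch below shows why a scoring equation of multinomial logit form exists for nominal indicators: under local independence, the log of each class's joint probability is an intercept plus one additive term per indicator, and normalizing across classes gives the posterior. The parameter values and function names are hypothetical, and the coefficients XLSTAT-LG reports may use a different parameterization (for example, a different coding of the categories).

```python
from math import exp, log

def scoring_coefficients(class_sizes, cond_probs):
    """Turn model parameters into logit-style intercepts and category effects."""
    intercepts = [log(p) for p in class_sizes]
    effects = [[[log(p) for p in categories] for categories in indicators]
               for indicators in cond_probs]
    return intercepts, effects

def score(intercepts, effects, response):
    """Posterior class probabilities for a new case from the scoring equation."""
    eta = [
        b0 + sum(effects[k][t][c] for t, c in enumerate(response))
        for k, b0 in enumerate(intercepts)
    ]
    top = max(eta)  # subtract the maximum for numerical stability
    weights = [exp(e - top) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 2-class model with two binary indicators (values made up).
class_sizes = [0.6, 0.4]
cond_probs = [
    [[0.9, 0.1], [0.8, 0.2]],  # class 0
    [[0.2, 0.8], [0.3, 0.7]],  # class 1
]
intercepts, effects = scoring_coefficients(class_sizes, cond_probs)
scored = score(intercepts, effects, response=[1, 1])
new_case_class = max(range(len(scored)), key=scored.__getitem__)
```

Once the coefficients are stored, new cases can be classified without re-estimating the model, which is the practical point of the scoring equation.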
XLSTAT-LG provides one section per model (each model being represented by a specific number of classes):
Model Summary Statistics: number of cases used in model estimation, number of distinct parameters estimated, the seed, and the best seed, which can reproduce the current model more quickly when the number of starting sets is set to 0.
Estimation Summary: for each of the Expectation-Maximization and Newton-Raphson algorithms, XLSTAT reports the number of iterations used, the log-posterior value, the likelihood-ratio goodness-of-fit value, as well as the final convergence value.
- Likelihood-ratio goodness-of-fit value (L²) for the current model and the associated bootstrap p-value.
- X² and Cressie-Read. These are alternatives to L² that should yield a similar p-value according to large-sample theory if the specified model is valid and the data are not sparse.
- BIC, AIC, AIC3, CAIC and SABIC (based on L²). These statistics (information criteria) weigh fit against parsimony by adjusting L² to account for the number of parameters in the model. The lower the value, the better the model.
- Dissimilarity index: A descriptive measure indicating how much the observed and estimated cell frequencies differ from one another. It indicates the proportion of the sample that needs to be moved to another cell to get a perfect fit.
- Log-likelihood (LL), log-prior (associated with the Bayes constants), and log-posterior.
- BIC, AIC, AIC3, CAIC and SABIC (based on LL). These statistics (information criteria) weigh fit against parsimony by adjusting the LL to account for the number of parameters in the model. The lower the value, the better the model.
- Classification errors (based on modal assignment).
- Reduction of errors (Lambda), entropy R², standard R². These pseudo R-squared statistics indicate how well one can predict class memberships based on the observed variables (indicators and covariates). The closer these values are to 1 the better the predictions.
- Classification log-likelihood under the assumption that the true class membership is known.
- AWE (similar to BIC, but also takes into account classification performance).
- Modal table: Cross-tabulates modal class assignments.
- Proportional table: Cross-tabulates probabilistic class assignments.
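The LL-based information criteria in the list above can be sketched as follows, using the penalty terms commonly given for these statistics. The function name and the log-likelihood values in the example are made up, and the exact formulas used by XLSTAT-LG should be checked against its own documentation.

```python
from math import log

def information_criteria(log_likelihood, n_params, n_cases):
    """LL-based information criteria; lower values indicate a better trade-off."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2 * n_params,
        "AIC3": deviance + 3 * n_params,
        "BIC": deviance + log(n_cases) * n_params,
        "CAIC": deviance + (log(n_cases) + 1) * n_params,
        "SABIC": deviance + log((n_cases + 2) / 24) * n_params,
    }

# Hypothetical fits: number of classes -> (log-likelihood, number of parameters).
fits = {1: (-1250.0, 4), 2: (-1100.0, 9), 3: (-1090.0, 14), 4: (-1086.0, 19)}
n_cases = 500
bic_by_model = {
    k: information_criteria(ll, npar, n_cases)["BIC"]
    for k, (ll, npar) in fits.items()
}
best_k = min(bic_by_model, key=bic_by_model.get)
```

With these made-up numbers the 2-class model has the lowest BIC: the small gain in log-likelihood from a third or fourth class does not offset the extra parameters.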
Profile table, which includes:
- Number of clusters
- Indicators: The body of the table contains (marginal) conditional probabilities that show how the clusters are related to the Nominal or Ordinal indicator variables. These probabilities sum to 1 within each cluster (column). For indicators specified as Continuous, the body of the table contains means instead of probabilities. For indicators specified as Ordinal, means are displayed in addition to the conditional probabilities.
- Standard errors for the (marginal) conditional probabilities.
Probabilities and means that appear in the Profile output are displayed graphically in a Profile Plot.
Frequencies / Residuals:
Table of observed vs. estimated expected frequencies (and residuals). Note: Residuals with a magnitude greater than 2 are statistically significant. This output is not reported when the model includes one or more continuous indicators.
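The chi-squared goodness-of-fit statistics (L², X², Cressie-Read) and the dissimilarity index reported for each model are all functions of such a table of observed and estimated expected frequencies. The sketch below uses the textbook definitions, with the Cressie-Read power parameter set to 2/3; the function name and the frequencies are made up for illustration.

```python
from math import log

def fit_statistics(observed, expected, power=2.0 / 3.0):
    """Chi-squared fit statistics and dissimilarity index for matched cells.

    observed[i] and expected[i] are the observed and model-based frequencies
    of response pattern i; every expected frequency must be positive.
    """
    n_total = sum(observed)
    # Likelihood-ratio statistic; cells with zero observations contribute 0.
    l2 = 2.0 * sum(n * log(n / m) for n, m in zip(observed, expected) if n > 0)
    # Pearson statistic.
    x2 = sum((n - m) ** 2 / m for n, m in zip(observed, expected))
    # Cressie-Read power-divergence statistic.
    cr = (2.0 / (power * (power + 1.0))) * sum(
        n * ((n / m) ** power - 1.0) for n, m in zip(observed, expected) if n > 0
    )
    # Share of cases that would have to change cells for a perfect fit.
    dissimilarity = sum(abs(n - m) for n, m in zip(observed, expected)) / (2.0 * n_total)
    return {"L2": l2, "X2": x2, "CR": cr, "DI": dissimilarity}

# Made-up frequencies for four response patterns.
observed = [30, 20, 25, 25]
expected = [28.0, 22.0, 27.0, 23.0]
stats = fit_statistics(observed, expected)
```

In this example the three chi-squared statistics are all close to 0.65, as expected when the data are not sparse, and the dissimilarity index is 0.04, meaning that 4% of the cases would need to be moved to another cell to reproduce the expected frequencies exactly.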
Bivariate Residuals: a table containing the bivariate residuals (BVRs) for a model. Large BVRs suggest violation of the local independence assumption.
Scoring equation: regression coefficients associated with the multinomial logit model.
Classification: Outputs for each observation the posterior class memberships and the modal assignment based on the current model.
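Several of the classification statistics listed earlier can be derived from these per-observation posterior probabilities. The sketch below uses common definitions of the classification error, Lambda, entropy R², and the modal classification table. The function name and the posterior values are made up, class sizes are approximated by the average posteriors, and the formulas are assumptions that may not match XLSTAT-LG in every detail.

```python
from math import log

def classification_statistics(posteriors):
    """Summaries of classification quality from a list of posterior rows."""
    n = len(posteriors)
    k = len(posteriors[0])
    # Class sizes approximated by the average posterior per class (assumption).
    sizes = [sum(row[c] for row in posteriors) / n for c in range(k)]
    modal = [max(range(k), key=row.__getitem__) for row in posteriors]
    # Expected proportion of cases misclassified by modal assignment.
    error = sum(1.0 - max(row) for row in posteriors) / n
    # Reduction of errors relative to assigning everyone to the largest class.
    lam = 1.0 - error / (1.0 - max(sizes))

    def entropy(p):
        return -sum(q * log(q) for q in p if q > 0)

    mean_entropy = sum(entropy(row) for row in posteriors) / n
    entropy_r2 = 1.0 - mean_entropy / entropy(sizes)
    # Rows: probabilistic (latent) class; columns: modal assignment.
    modal_table = [
        [sum(row[a] for row, m in zip(posteriors, modal) if m == b) for b in range(k)]
        for a in range(k)
    ]
    return {
        "modal": modal,
        "error": error,
        "lambda": lam,
        "entropy_r2": entropy_r2,
        "modal_table": modal_table,
    }

# Made-up posterior probabilities for four observations and two classes.
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
summary = classification_statistics(posteriors)
```

Here the off-diagonal entries of the modal table add up to 0.7, so the classification error is 0.7 / 4 = 0.175, and Lambda is 1 - 0.175 / 0.475, or about 0.63.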
Vermunt, J.K. (2010). Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18, 450-469. http://members.home.nl/jeroenvermunt/lca_three_step.pdf
Vermunt, J.K., and Magidson, J. (2005). Latent GOLD 4.0 User's Guide. Belmont, MA: Statistical Innovations Inc. http://www.statisticalinnovations.com/technicalsupport/LGusersguide.pdf
Vermunt, J.K., and Magidson, J. (2013). Technical Guide for Latent GOLD 5.0: Basic, Advanced, and Syntax. Belmont, MA: Statistical Innovations Inc. http://www.statisticalinnovations.com/technicalsupport/LGtechnical.pdf
Vermunt, J.K., and Magidson, J. (2013). Latent GOLD 5.0 Upgrade Manual. Belmont, MA: Statistical Innovations Inc.