Latent Class regression models
Latent class modeling is a powerful method for obtaining meaningful segments that differ with respect to response patterns associated with categorical or continuous variables or both (latent class cluster models), or differ with respect to regression coefficients where the dependent variable is continuous, categorical, or a frequency count (latent class regression models).
What is Latent Class Analysis?
Latent class analysis (LCA) involves the construction of latent classes: unobserved (latent) subgroups or segments of cases. The latent classes are constructed from the observed (manifest) responses of the cases on a set of indicator variables. Cases within the same latent class are homogeneous with respect to their responses on these indicators, while cases in different latent classes differ in their response patterns. Formally, latent classes are represented by K distinct categories of a nominal latent variable X. Because this latent variable is categorical, latent class modeling differs from more traditional latent variable approaches such as factor analysis, structural equation modeling, and random-effects regression, which are based on continuous latent variables.
What is a Latent Class regression model?
A Latent Class regression model:
- Is used to predict a dependent variable as a function of predictor variables (regression model).
- Includes a K-category latent variable X to cluster cases (LC model).
- Treats each category as a homogeneous subpopulation (segment) with identical regression coefficients (LC regression model).
- Allows each case to contain multiple records (regression with repeated measurements).
- The appropriate model is estimated according to the scale type of the dependent variable:
- Continuous: Linear regression model (with normally distributed residuals).
- Nominal (with more than 2 levels): Multinomial logistic regression.
- Ordinal (with more than 2 ordered levels): Adjacent-category ordinal logistic regression model.
- Count: Log-linear Poisson regression.
- Binomial Count: Binomial logistic regression model.
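As a sketch of how such a model can be estimated, the following minimal example (in Python with NumPy; this is an illustration of the general technique, not XLSTAT-LG's actual implementation) fits a two-class latent class linear regression with the Expectation-Maximization algorithm: the E-step computes posterior class memberships, and the M-step performs weighted least squares within each class. The simulated data and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two latent classes with different regression coefficients.
n = 400
x = rng.normal(size=n)
z = rng.integers(0, 2, size=n)                      # true (unobserved) class
y = np.where(z == 0, 1.0 + 2.0 * x, -1.0 - 0.5 * x) + rng.normal(scale=0.3, size=n)

K = 2
pi = np.full(K, 1.0 / K)          # class proportions
b0 = rng.normal(size=K)           # intercepts (random starting values)
b1 = rng.normal(size=K)           # slopes (random starting values)
sigma = np.full(K, 1.0)           # residual standard deviations
X = np.column_stack([np.ones(n), x])

def normal_pdf(resid, s):
    return np.exp(-0.5 * (resid / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(200):
    # E-step: posterior probability that each case belongs to each class.
    dens = np.stack([pi[k] * normal_pdf(y - b0[k] - b1[k] * x, sigma[k])
                     for k in range(K)])
    post = dens / dens.sum(axis=0)
    # M-step: weighted least squares per class, plus updated class sizes.
    for k in range(K):
        w = post[k]
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        b0[k], b1[k] = beta
        resid = y - X @ beta
        sigma[k] = np.sqrt((w * resid ** 2).sum() / w.sum())
        pi[k] = w.mean()

print(sorted(b1))   # the slopes should be close to the true values -0.5 and 2.0
```

In practice the EM algorithm is run from several random starting sets (as XLSTAT-LG does) to reduce the risk of converging to a local maximum.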
XLSTAT-LG allows launching computations automatically on several models, each with a different number of classes. It is also possible to tune Bayes constants, the sets of random starting values, and the iteration parameters of both the Expectation-Maximization and Newton-Raphson algorithms, which are used for model estimation.
XLSTAT-LG provides one section per model (each model being represented by a specific number of classes):
Model Summary Statistics: number of cases used in model estimation, number of distinct parameters estimated, and the seed and best seed that can reproduce the current model more quickly by setting the number of starting sets to 0.
Estimation Summary: for each of the Expectation-Maximization and Newton-Raphson algorithms, XLSTAT reports the number of iterations used, the log-posterior value, the likelihood-ratio goodness-of-fit value, as well as the final convergence value.
- Likelihood-ratio goodness-of-fit value (L²) for the current model and the associated bootstrap p-value.
- X² and Cressie-Read statistics. These are alternatives to L² that should yield a similar p-value according to large-sample theory, provided the specified model is valid and the data are not sparse.
- BIC, AIC, AIC3, CAIC, and SABIC (based on L²). These information criteria weight fit and parsimony by adjusting the L² to account for the number of parameters in the model. The lower the value, the better the model.
- Dissimilarity index: a descriptive measure of how much the observed and estimated cell frequencies differ from one another. It gives the proportion of the sample that would need to be moved to another cell to obtain a perfect fit.
- Log-likelihood (LL), log-prior (associated with the Bayes constants), and log-posterior.
- BIC, AIC, AIC3, CAIC, and SABIC (based on LL). These information criteria weight fit and parsimony by adjusting the LL to account for the number of parameters in the model. The lower the value, the better the model.
- Classification errors (based on modal assignment).
- Reduction of errors (Lambda), entropy R², standard R². These pseudo-R² statistics indicate how well class membership can be predicted from the observed variables (indicators and covariates). The closer these values are to 1, the better the predictions.
- Classification log-likelihood under the assumption that the true class membership is known.
- AWE (similar to BIC, but also takes into account classification performance).
- Modal table: Cross-tabulates modal class assignments.
- Proportional table: Cross-tabulates probabilistic class assignments.
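Given the model's log-likelihood and the posterior class-membership matrix, several of the summary statistics above can be computed directly. The sketch below (Python/NumPy, using standard textbook definitions rather than XLSTAT-LG's exact output; the function name is illustrative) covers the LL-based information criteria, the entropy R², the classification error under modal assignment, and the reduction-of-errors (Lambda) statistic:

```python
import numpy as np

def fit_statistics(ll, n_params, post):
    """Summary statistics from the log-likelihood `ll` and the
    (n_cases x K) posterior class-membership matrix `post`.
    Standard definitions; XLSTAT-LG's output may differ in details."""
    n, K = post.shape
    stats = {
        "AIC":   -2 * ll + 2 * n_params,
        "AIC3":  -2 * ll + 3 * n_params,
        "BIC":   -2 * ll + n_params * np.log(n),
        "CAIC":  -2 * ll + n_params * (np.log(n) + 1),
        "SABIC": -2 * ll + n_params * np.log((n + 2) / 24),
    }
    # Entropy R²: 1 minus the total posterior entropy relative to its
    # maximum (uniform posteriors); 1 means perfectly separated classes.
    p = np.clip(post, 1e-12, 1.0)
    stats["EntropyR2"] = 1.0 - (-(p * np.log(p)).sum()) / (n * np.log(K))
    # Classification error under modal assignment: each case contributes
    # its probability of NOT belonging to its modally assigned class.
    err = 1.0 - post.max(axis=1).mean()
    stats["ClassificationError"] = err
    # Reduction of errors (Lambda) relative to assigning every case
    # to the largest estimated class.
    baseline = 1.0 - post.mean(axis=0).max()
    stats["Lambda"] = 1.0 - err / baseline if baseline > 0 else 1.0
    return stats
```

For example, a model whose posteriors are all 0 or 1 yields an entropy R² of 1 and a classification error of 0.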
Prediction statistics table:
The columns in this table correspond to:
- Baseline: the prediction error of the baseline model (also referred to as the null model).
- Model: the prediction error of the estimated model.
- R²: the proportional reduction of errors of the estimated model compared to the baseline model.
The rows in this table correspond to:
- Squared Error: Average prediction error based on squared error.
- Minus Log-likelihood: Average prediction error based on minus the log-likelihood.
- Absolute Error: Average prediction error based on absolute error.
- Prediction error: Average prediction error based on proportion of prediction errors (for categorical variables only).
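For a continuous dependent variable, the entries of this table can be reproduced as follows (a hypothetical Python/NumPy sketch; the function name is illustrative). The baseline model predicts the overall mean for every case, and R² is the proportional reduction of the corresponding error:

```python
import numpy as np

def prediction_statistics(y, y_hat):
    """Baseline vs. model prediction error and proportional reduction of
    errors (R²) for a continuous dependent variable. The baseline (null)
    model predicts the overall mean of y for every case."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    baseline_pred = np.full_like(y, y.mean())
    rows = {}
    for name, loss in [("SquaredError", lambda a, b: (a - b) ** 2),
                       ("AbsoluteError", lambda a, b: np.abs(a - b))]:
        base = loss(y, baseline_pred).mean()    # baseline prediction error
        model = loss(y, y_hat).mean()           # model prediction error
        rows[name] = {"Baseline": base, "Model": model,
                      "R2": 1.0 - model / base}
    return rows
```

With squared error this reduces to the familiar R² of linear regression; minus log-likelihood and the categorical prediction-error rows follow the same baseline-vs-model pattern with a different loss function.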
Prediction Table: For nominal and ordinal dependent variables, a prediction table that cross-classifies observed against estimated values is provided.
- R²: class-specific and overall R² values. The overall R² indicates how well the dependent variable is predicted overall by the model (the same measure as appears in the Prediction statistics table). For ordinal, continuous, and (binomial) count dependent variables, these are standard R² measures. For nominal dependent variables, they can be seen as weighted averages of separate R² measures, one for each category treated as a separate dichotomous response variable.
- Intercept: intercept of the linear regression equation.
- s.e.: standard errors of the parameters.
- z-value: z-test statistics corresponding to the parameter tests.
- Wald: Wald statistics assess the statistical significance of the set of parameter estimates associated with a given variable. For each variable, the Wald statistic tests the restriction that every parameter estimate in that set equals zero (for variables specified as Nominal, the set includes one parameter per category of the variable). For regression models, when more than one class has been estimated, two Wald statistics (Wald and Wald(=)) are provided in the table by default. For each set of parameter estimates, the Wald(=) statistic tests the restriction that each parameter in the subset associated with one class equals the corresponding parameter in the subsets associated with each of the other classes; that is, it tests the equality of the regression effects across classes.
- p-value: measures of significance for the estimates.
- Mean: means for the regression coefficients.
- Std.Dev: standard deviations for the regression coefficients.
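The z-tests and the joint Wald test can be sketched as follows, given a vector of estimates and its estimated covariance matrix (Python/NumPy; the function name is illustrative, and the chi-square p-value of the joint test is omitted to keep the example dependency-free):

```python
import numpy as np
from math import erfc, sqrt

def wald_test(beta, cov):
    """Joint Wald test that all parameters in `beta` equal zero, given
    their estimated covariance matrix `cov`. Returns the Wald statistic
    (chi-square distributed with df = len(beta) under the null) plus
    per-parameter z-values and two-sided p-values, as reported in the
    parameter output table."""
    beta = np.asarray(beta, float)
    cov = np.asarray(cov, float)
    # Wald statistic: beta' V^{-1} beta.
    wald = float(beta @ np.linalg.solve(cov, beta))
    se = np.sqrt(np.diag(cov))                 # standard errors
    z = beta / se                              # z-test statistics
    # Two-sided p-values from the standard normal distribution.
    p = np.array([erfc(abs(v) / sqrt(2)) for v in z])
    return wald, z, p
```

The Wald(=) variant applies the same quadratic form to differences between corresponding class-specific parameters rather than to the parameters themselves.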
Classification: Outputs for each observation the posterior class memberships and the modal assignment based on the current model.