PLS Path Modeling

Partial Least Squares Path Modeling is a powerful tool similar to Structural Equation Modeling (SEM). Run PLS-PM in Excel using the XLSTAT software.

What is PLS Path Modeling?

Partial Least Squares Path Modeling (PLS-PM) is a statistical approach for modeling complex multivariable relationships (structural equation models) among observed and latent variables. In recent years, this approach has become increasingly popular in several sciences (Esposito Vinzi et al., 2007). Structural equation models include a number of statistical methodologies for estimating a theoretical causal network of relationships linking complex latent concepts, each measured by a number of observable indicators.

The first presentation of the finalized PLS approach to path models with latent variables was published by Wold in 1979; the main references on the PLS algorithm are Wold (1982, 1985).

Herman Wold contrasted LISREL (Jöreskog, 1970) "hard modeling" (heavy distributional assumptions, several hundred cases necessary) with PLS "soft modeling" (very few distributional assumptions, few cases can suffice). These two approaches to Structural Equation Modeling were compared in Jöreskog and Wold (1982).

From the standpoint of structural equation modeling, PLS-PM is a component-based approach where the concept of causality is formulated in terms of linear conditional expectation. PLS-PM seeks optimal linear predictive relationships rather than causal mechanisms, thus favoring a prediction-oriented discovery process over the statistical testing of causal hypotheses. Two important review papers on the PLS approach to Structural Equation Modeling are Chin (1998, more application oriented) and Tenenhaus et al. (2005, more theory oriented).

Furthermore, PLS Path Modeling can be used to analyze multiple tables and is directly related to more classical data analysis methods used in this field. In fact, PLS-PM may also be viewed as a very flexible approach to multi-block (or multiple table) analysis by means of both the hierarchical PLS path model and the confirmatory PLS path model (Tenenhaus and Hanafi, 2007). This approach shows how the "data-driven" tradition of multiple table analysis can be merged with the "theory-driven" tradition of structural equation modeling, so that multi-block data can be analyzed in light of current knowledge on the conceptual relationships between tables.

The PLS Path Modeling algorithm

A PLS Path model is described by two models: (1) a measurement model relating the manifest variables to their own latent variable and (2) a structural model relating some endogenous latent variables to other latent variables. The measurement model is also called the outer model and the structural model the inner model.

1. Manifest variables standardization

There are four options for standardizing the manifest variables, depending on three conditions that may hold in the data:

Condition 1: The scales of the manifest variables are comparable. For instance, in the ECSI example the item values (between 0 and 100) are comparable. On the other hand, for instance, weight in tons and speed in km/h would not be comparable.
Condition 2: The means of the manifest variables are interpretable. For instance, if the difference between two manifest variables is not interpretable, the location parameters are meaningless.
Condition 3: The variances of the manifest variables reflect their importance.

If condition 1 does not hold, then the manifest variables have to be standardized (mean 0 and variance 1).

If condition 1 holds, it is useful to get the results based on the raw data. But the calculation of the model parameters depends upon the validity of the other conditions:

Conditions 2 and 3 do not hold: The manifest variables are standardized (mean 0, variance 1) for the parameter estimation phase. Then the manifest variables are rescaled to their original means and variances for the final expression of the weights and loadings.

Condition 2 holds, but not condition 3: The manifest variables are not centered, but are standardized to unitary variance for the parameter estimation phase. Then the manifest variables are rescaled to their original variances for the final expression of the weights and loadings (to be defined later).

Conditions 2 and 3 hold: Use the original variables.

Lohmöller (1989) introduced a standardization parameter to select one of these four options:

Variable scales are comparable	Means are interpretable	Variance is related to variable importance	Mean	Variance	Rescaling	METRIC
No			0	1	No	1
Yes	No	No	0	1	Yes	2
Yes	Yes	No	Original	1	Yes	3
Yes	Yes	Yes	Original	Original		4

With METRIC=1 being "Standardized, weights on standardized MV", METRIC=2 being "Standardized, weights on raw MV", METRIC=3 being "Reduced, weights on raw MV" and METRIC=4 being "Raw MV".
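The four options can be sketched as a small preprocessing function. This is only an illustration under stated assumptions: the function name `prepare_manifest_variables` and the example data are hypothetical, the sample variance (ddof=1) is an assumed convention, and the final rescaling of weights and loadings used with METRIC=2 and METRIC=3 is not shown.

```python
import numpy as np

def prepare_manifest_variables(X, metric):
    """Transform the columns of X for the parameter estimation phase.

    metric 1 or 2: centered and scaled to unit variance
    metric 3:      not centered, scaled to unit variance
    metric 4:      raw data, left unchanged
    """
    X = np.asarray(X, dtype=float)
    if metric not in (1, 2, 3, 4):
        raise ValueError("metric must be 1, 2, 3 or 4")
    if metric == 4:
        return X.copy()
    std = X.std(axis=0, ddof=1)  # assumption: sample standard deviation
    if metric == 3:
        return X / std
    return (X - X.mean(axis=0)) / std

# Hypothetical data: three items scored between 0 and 100.
X = np.array([[60.0, 70.0, 55.0],
              [80.0, 85.0, 75.0],
              [40.0, 50.0, 45.0],
              [90.0, 95.0, 80.0]])
Z = prepare_manifest_variables(X, metric=2)  # mean 0, variance 1
R = prepare_manifest_variables(X, metric=3)  # variance 1, original location kept
```

METRIC=1 and METRIC=2 lead to the same transformed data here; they differ only in whether the final weights are expressed on the standardized or on the raw manifest variables.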

2. The measurement model

A latent variable (LV) ξ is an unobservable variable (or construct) indirectly described by a block of observable variables x_h which are called manifest variables (MV) or indicators. There are three ways to relate the manifest variables to their latent variables, respectively called the reflective way, the formative one, and the MIMIC (Multiple effect Indicators for Multiple Causes) way.

2.1. The reflective way

2.1.1. Definition

In this model each manifest variable reflects its latent variable. Each manifest variable is related to its latent variable by a simple regression:

x_h = π_h0 + π_h ξ + ε_h,    (1)

where ξ has mean m and standard deviation 1. It is a reflective scheme: each manifest variable x_h reflects its latent variable ξ. The only hypothesis made on model (1) is called by H. Wold the predictor specification condition:

E(x_h | ξ) = π_h0 + π_h ξ.

This hypothesis implies that the residual ε_h has a zero mean and is uncorrelated with the latent variable ξ.
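The step from the predictor specification condition to these two properties of the residual can be written out explicitly. The following short derivation is a sketch that uses only the definitions above and the law of iterated expectations:

```latex
\begin{align*}
\varepsilon_h &= x_h - \pi_{h0} - \pi_h \xi \\
E(\varepsilon_h \mid \xi) &= E(x_h \mid \xi) - \pi_{h0} - \pi_h \xi = 0 \\
E(\varepsilon_h) &= E\big(E(\varepsilon_h \mid \xi)\big) = 0 \\
\operatorname{cov}(\varepsilon_h, \xi) &= E(\varepsilon_h \xi) - E(\varepsilon_h)\,E(\xi)
  = E\big(\xi \, E(\varepsilon_h \mid \xi)\big) = 0
\end{align*}
```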

2.1.2. Check for unidimensionality

In the reflective way, the block of manifest variables is unidimensional in the sense of factor analysis. With real data, this condition has to be checked. Three main tools are available to check the unidimensionality of a block: principal component analysis of each block of manifest variables, Cronbach's α and Dillon-Goldstein's r.

Principal component analysis of a block
A block is essentially unidimensional if the first eigenvalue of the correlation matrix of the block MVs is larger than 1 and the second one smaller than 1, or at least very far from the first one. The first principal component can be built in such a way that it is positively correlated with all (or at least a majority of) the MVs. MVs that are negatively correlated with the first principal component are problematic.
Cronbach's α
Cronbach's α can be used to check unidimensionality of a block of p variables x_h when they are all positively correlated. Cronbach has proposed the following procedure for standardized variables:
α = [p / (p – 1)] × Ʃ_{h≠h'} cor(x_h, x_h') / [p + Ʃ_{h≠h'} cor(x_h, x_h')]
Cronbach's α is also defined for the original (raw) variables as:
α = [p / (p – 1)] × Ʃ_{h≠h'} cov(x_h, x_h') / var(Ʃ_h x_h)
A block is considered unidimensional when Cronbach's α is larger than 0.7.
Dillon-Goldstein's r
The sign of the correlation between each MV x_h and its LV ξ is known by construction of the item and is supposed here to be positive. In equation (1) this hypothesis means that all the loadings π_h are positive. A block is unidimensional if all these loadings are large.
Dillon-Goldstein's r is defined by:
r = (Ʃ_{h=1..p} π_h)² Var(ξ) / [(Ʃ_{h=1..p} π_h)² Var(ξ) + Ʃ_{h=1..p} Var(ε_h)]
Let us now suppose that all the MVs x_h and the latent variable ξ are standardized. An approximation of the latent variable ξ is obtained by standardizing the first principal component t₁ of the block MVs. Then π_h is estimated by cor(x_h, t₁) and, using equation (1), Var(ε_h) is estimated by 1 – cor²(x_h, t₁). This gives an estimate of Dillon-Goldstein's r:
ȓ = (Ʃ_{h=1..p} cor(x_h, t₁))² / [(Ʃ_{h=1..p} cor(x_h, t₁))² + Ʃ_{h=1..p} (1 – cor²(x_h, t₁))]
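
The three checks above can be computed together for one block. The sketch below is illustrative only: the function name `unidimensionality_checks` and the simulated block are hypothetical, and Cronbach's α is computed with the formula for standardized variables.

```python
import numpy as np

def unidimensionality_checks(X):
    """Return (eigenvalues, Cronbach's alpha, Dillon-Goldstein's r) for one block."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    R = np.corrcoef(X, rowvar=False)       # correlation matrix of the block MVs
    values, vectors = np.linalg.eigh(R)    # eigenvalues in ascending order
    eigenvalues = values[::-1]             # largest first

    # Cronbach's alpha for standardized variables.
    off_diag = R.sum() - np.trace(R)       # sum over h != h' of cor(x_h, x_h')
    alpha = p / (p - 1) * off_diag / (p + off_diag)

    # First principal component t1 of the standardized block.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    t1 = Z @ vectors[:, -1]
    loadings = np.array([np.corrcoef(X[:, h], t1)[0, 1] for h in range(p)])
    if loadings.sum() < 0:                 # the sign of t1 is arbitrary
        loadings = -loadings
    explained = loadings.sum() ** 2
    r = explained / (explained + (1 - loadings ** 2).sum())
    return eigenvalues, alpha, r

# Hypothetical reflective block: three MVs driven by one latent variable.
rng = np.random.default_rng(0)
xi = rng.standard_normal(200)
X = xi[:, None] * np.array([0.9, 0.8, 0.7]) + 0.5 * rng.standard_normal((200, 3))
eigenvalues, alpha, r = unidimensionality_checks(X)
```

For a block simulated this way, the first eigenvalue should dominate and both α and r should exceed 0.7, the threshold given above for Cronbach's α.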

PLS Path Modeling is a mixture of a priori knowledge and data analysis. In the reflective way, the a priori knowledge concerns the unidimensionality of the block and the signs of the loadings. The data have to fit this model. If they do not, they can be modified by removing some manifest variables that are far from the model. Another solution is to change the model and use the formative way that will now be described.

2.2. The formative way

In the formative way, it is supposed that the latent variable ξ is generated by its own manifest variables. The latent variable ξ is a linear function of its manifest variables plus a residual term:

ξ = Ʃ_h w_h x_h + δ

In the formative model the block of manifest variables can be multidimensional. The predictor specification condition is supposed to hold as:

E(ξ | x₁, ..., x_{p_i}) = Ʃ_h w_h x_h

This hypothesis implies that the residual vector δ has a zero mean and is uncorrelated with the MVs x_h.

2.3. The MIMIC way

The MIMIC way is a mixture of the reflective and formative ways.

The measurement model for a block is the following:

x_h = π_h0 + π_h ξ + ε_h, for h = 1 to p₁

where the latent variable is defined by:

ξ = Ʃ_{h=p₁+1..p} w_h x_h + δ

The first p₁ manifest variables follow a reflective way and the last (p – p₁) a formative way. The predictor specification hypotheses still hold and lead to the same consequences as before on the residuals.

3. The structural model

The causality model leads to linear equations relating the latent variables to one another (the structural or inner model):

ξ_j = β_j0 + Ʃ_i β_ji ξ_i + ν_j

The predictor specification hypothesis is still applied.

A latent variable, which never appears as a dependent variable, is called an exogenous variable. Otherwise it is called an endogenous variable.
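
The inner model can be encoded as a binary path matrix, from which the exogenous and endogenous latent variables follow directly. The latent variable names, the matrix and the helper `classify_latent_variables` below are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical inner model with four latent variables.
# path[j, i] = 1 means that xi_i appears on the right-hand side
# of the structural equation for xi_j.
names = ["Image", "Expectation", "Satisfaction", "Loyalty"]
path = np.array([[0, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 1, 0]])

def classify_latent_variables(path, names):
    """Split latent variables into exogenous and endogenous ones."""
    is_dependent = np.asarray(path).sum(axis=1) > 0  # row j has at least one predictor
    exogenous = [n for n, d in zip(names, is_dependent) if not d]
    endogenous = [n for n, d in zip(names, is_dependent) if d]
    return exogenous, endogenous

exogenous, endogenous = classify_latent_variables(path, names)
```

Here only "Image" never appears as a dependent variable, so it is the single exogenous latent variable.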

4. The Estimation Algorithm

4.1. Latent variables Estimation

The latent variables ξ_j are estimated according to the following procedure.

4.1.1. Outer estimate y_j of the standardized latent variable (ξ_j – m_j)

The standardized latent variables (mean = 0 and standard deviation = 1) are estimated as linear combinations of their centered manifest variables:

y_j ∝ ± [Ʃ w_jh (x_jh - ẋ_jh)]

where ẋ_jh is the mean of x_jh, the symbol "∝" means that the left variable represents the standardized right variable, and the "±" sign shows the sign ambiguity. This ambiguity is resolved by choosing the sign that makes y_j positively correlated with a majority of the x_jh.

The standardized latent variable is finally written as:

y_j = Ʃ ŵ_jh (x_jh - ẋ_jh)

The coefficients w_jh and ŵ_jh are both called the outer weights.

The mean m_j is estimated by:

ṁ_j = Ʃ ŵ_jh ẋ_jh

and the latent variable ξ_j by

approx(ξ_j) = Ʃ ŵ_jh x_jh = y_j + ṁ_j

When all manifest variables are observed on the same measurement scale, it is convenient to express the latent variable estimates in the original scale (Fornell, 1992) as:

approx(ξ_j)* = Ʃ ŵ_jh x_jh / Ʃ ŵ_jh.

This expression is valid when all outer weights are positive. Finally, in most real applications, latent variable estimates are required on a 0-100 scale so that individual scores can be compared against a reference scale. For the i-th observed case, this is easily obtained by the following transformation:

approx(ξ_j)^0-100 = 100 * (approx(ξ_j)* - x_min) / (x_max - x_min)

where x_min and x_max are, respectively, the minimum and the maximum value of the measurement scale common to all manifest variables.
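
The chain of outer estimates in this subsection can be sketched for a single block. This is a minimal illustration, assuming a given weight vector and a common 1 to 10 measurement scale; the function name `outer_estimates`, the data and the weights are hypothetical.

```python
import numpy as np

def outer_estimates(X, w, x_min, x_max):
    """Outer estimates for one block, given arbitrary outer weights w."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    x_bar = X.mean(axis=0)
    raw = (X - x_bar) @ w
    # Resolve the sign ambiguity: y should be positively correlated
    # with a majority of the manifest variables.
    cors = np.array([np.corrcoef(X[:, h], raw)[0, 1] for h in range(X.shape[1])])
    if (cors > 0).sum() < (cors < 0).sum():
        w, raw = -w, -raw
    w_hat = w / raw.std(ddof=1)          # standardized outer weights
    y = (X - x_bar) @ w_hat              # mean 0, standard deviation 1
    m_hat = x_bar @ w_hat                # estimated mean of the latent variable
    xi_hat = X @ w_hat                   # equals y + m_hat
    xi_star = X @ w_hat / w_hat.sum()    # expressed in the original scale
    xi_0_100 = 100 * (xi_star - x_min) / (x_max - x_min)
    return y, m_hat, xi_hat, xi_star, xi_0_100

# Hypothetical block: five cases, three items on a 1 to 10 scale.
X = np.array([[7, 8, 6], [9, 9, 8], [4, 5, 5], [6, 7, 7], [8, 6, 9]], dtype=float)
y, m_hat, xi_hat, xi_star, xi_0_100 = outer_estimates(X, [0.4, 0.3, 0.3], 1, 10)
```

Because the normalized weights are positive and sum to one, each value of `xi_star` is a weighted average of the item scores and therefore stays within the measurement scale.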

4.1.2. Inner estimate z_j of the standardized latent variable (ξ_j – m_j)

The inner estimate z_j of the standardized latent variable (ξ_j – m_j) is defined by:

z_j ∝ Ʃ_{j': ξ_j' is connected with ξ_j} e_jj' y_j'

where ∝ indicates that z_j is the standardized version of the right-hand side, and the inner weights e_jj' are equal to the signs of the correlations between y_j and the y_j''s connected with y_j. Two latent variables are connected if there is a link between them: an arrow goes from one variable to the other in the arrow diagram describing the causality model. This choice of inner weights is called the centroid scheme.

Centroid scheme:
This choice shows a drawback in case the correlation is approximately zero as its sign may change for very small fluctuations. But it does not seem to be a problem in practical applications.
In the original algorithm, the inner estimate is the right-hand term and there is no standardization. We prefer to standardize because it does not change the final inner estimate of the latent variables and it simplifies the writing of some equations.
Two other schemes for choosing the inner weights exist: the factorial scheme and the path weighting (or structural) scheme. These two new schemes are defined as follows:
Factorial scheme:
The inner weights e_ji are equal to the correlation between y_i and y_j. This is an answer to the drawbacks of the centroid scheme described above.
Path weighting scheme (structural):
The latent variables connected to ξ_j are divided into two groups: the predecessors of ξ_j, which are latent variables explaining ξ_j, and the followers, which are latent variables explained by ξ_j.
For a predecessor ξ_j' of the latent variable ξ_j, the inner weight e_jj' is equal to the regression coefficient of y_j' in the multiple regression of y_j on all the y_j''s related to the predecessors of ξ_j. If ξ_j' is a follower of ξ_j, then the inner weight e_jj' is equal to the correlation between y_j' and y_j.

These two additional schemes do not significantly influence the results but are very important for theoretical reasons: they make it possible to relate PLS Path Modeling to the usual multiple table analysis methods.
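
The three inner weighting schemes can be compared on a small example. The sketch below is illustrative only: the function names, the path matrix convention (path[j, i] = 1 if ξ_i explains ξ_j) and the simulated scores are assumptions, and the outer estimates are taken to be already standardized.

```python
import numpy as np

def inner_weights(Y, path, scheme="centroid"):
    """Inner weight matrix E, with E[j, j'] = e_jj'. Y holds standardized outer estimates."""
    Y = np.asarray(Y, dtype=float)
    path = np.asarray(path)
    J = Y.shape[1]
    R = np.corrcoef(Y, rowvar=False)
    connected = (path + path.T) > 0
    if scheme == "centroid":
        return np.sign(R) * connected
    if scheme == "factorial":
        return R * connected
    if scheme == "path":
        E = np.zeros((J, J))
        for j in range(J):
            pred = np.flatnonzero(path[j])      # predecessors of xi_j
            foll = np.flatnonzero(path[:, j])   # followers of xi_j
            if pred.size:
                # Multiple regression of y_j on its predecessors (no intercept
                # is needed because the columns of Y are centered).
                coef, *_ = np.linalg.lstsq(Y[:, pred], Y[:, j], rcond=None)
                E[j, pred] = coef
            E[j, foll] = R[j, foll]
        return E
    raise ValueError("scheme must be 'centroid', 'factorial' or 'path'")

def inner_estimates(Y, E):
    """Standardized inner estimates z_j = sum over j' of e_jj' y_j'."""
    Z = Y @ E.T
    return (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)

# Hypothetical standardized outer estimates for three latent variables.
rng = np.random.default_rng(3)
y1 = rng.standard_normal(100)
y2 = 0.6 * y1 + 0.8 * rng.standard_normal(100)
y3 = 0.5 * y1 + 0.4 * y2 + 0.7 * rng.standard_normal(100)
Y = np.column_stack([y1, y2, y3])
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
path = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]])

R = np.corrcoef(Y, rowvar=False)
E_centroid = inner_weights(Y, path, "centroid")
E_factorial = inner_weights(Y, path, "factorial")
E_path = inner_weights(Y, path, "path")
Z = inner_estimates(Y, E_centroid)
```

With a single predecessor, the path weighting scheme and the factorial scheme give the same weight, since a simple regression coefficient between standardized variables equals their correlation.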

4.2. The PLS algorithm for estimating the weights

4.2.1. Estimation modes for the weights w_jh

There are three classical ways to estimate the weights w_jh: Mode A, Mode B and Mode C.

Mode A:

In mode A the weight w_jh is the regression coefficient of z_j in the simple regression of x_jh on the inner estimate z_j:

w_jh = cov(x_jh, z_j),

as z_j is standardized.

Mode B:

In mode B the vector w_j of weights w_jh is the regression coefficient vector in the multiple regression of z_j on the manifest centered variables (x_jh - ẋ_jh) related to the same latent variable ξ_j:

w_j = (X_j'X_j)⁻¹ X_j' z_j,

where X_j is the matrix with columns defined by the centered manifest variables x_jh - ẋ_jh related to the j-th latent variable ξ_j.

Mode A is appropriate for a block with a reflective measurement model and Mode B for a formative one. Mode A is often used for an endogenous latent variable and mode B for an exogenous one. Modes A and B can be used simultaneously when the measurement model is the MIMIC one. Mode A is used for the reflective part of the model and Mode B for the formative part.

In practical situations, mode B is not so easy to use because there is often strong multicollinearity inside each block. When this is the case, PLS regression may be used instead of OLS multiple regression. Note that mode A amounts to taking the first component from a PLS regression, while mode B takes all PLS regression components (and thus coincides with OLS multiple regression). Therefore, running a PLS regression and retaining a certain number of significant components can be seen as a new intermediate mode between mode A and mode B.
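
The link between PLS regression and modes A and B can be checked numerically. The sketch below uses a basic PLS1 (NIPALS) implementation written for this illustration; the function name `pls1_weights` and the simulated block are hypothetical, and this is not presented as the routine used by any particular software.

```python
import numpy as np

def pls1_weights(X, z, n_components):
    """Regression vector of z on the centered columns of X using PLS1 (NIPALS)."""
    Xk = X - X.mean(axis=0)
    zk = z - z.mean()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ zk                 # direction of maximal covariance with z
        w = w / np.linalg.norm(w)
        t = Xk @ w                    # component scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        W.append(w)
        P.append(p)
        q.append(zk @ t / tt)         # z loading
        Xk = Xk - np.outer(t, p)      # deflation
        zk = zk - q[-1] * t
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

# Hypothetical block of three correlated MVs and an inner estimate z.
rng = np.random.default_rng(2)
n = 80
X = rng.standard_normal(n)[:, None] + 0.7 * rng.standard_normal((n, 3))
z = X @ np.array([0.5, 0.3, 0.2]) + 0.5 * rng.standard_normal(n)
z = (z - z.mean()) / z.std(ddof=1)

Xc = X - X.mean(axis=0)
mode_a = Xc.T @ z / (n - 1)                      # cov(x_jh, z_j)
ols, *_ = np.linalg.lstsq(Xc, z, rcond=None)     # mode B
b1 = pls1_weights(X, z, 1)                       # one PLS component
b3 = pls1_weights(X, z, 3)                       # all PLS components
```

With one component the PLS weights are proportional to the mode A weights, and with as many components as manifest variables they coincide with the OLS solution of mode B. Retaining an intermediate number of components gives the intermediate mode described above.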

Mode C (centroid):

In mode C the weights are all equal in absolute value and reflect the signs of the correlations between the manifest variables and their latent variables:

w_jh = sign(cor(x_jh, z_j)).

These weights are then normalized so that the resulting latent variable has unitary variance. Mode C actually refers to a formative way of linking manifest variables to their latent variables and represents a specific case of Mode B whose comprehension is very intuitive to practitioners.
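
The three estimation modes can be written compactly for one block. This is a minimal sketch under stated assumptions: the function names `outer_weights` and `normalize_weights` and the simulated data are hypothetical, and covariances use the sample convention (division by n – 1).

```python
import numpy as np

def outer_weights(Xj, zj, mode="A"):
    """Outer weights of one block given the standardized inner estimate zj."""
    Xc = Xj - Xj.mean(axis=0)
    zc = zj - zj.mean()
    n = Xc.shape[0]
    if mode == "A":
        return Xc.T @ zc / (n - 1)                   # cov(x_jh, z_j)
    if mode == "B":
        w, *_ = np.linalg.lstsq(Xc, zc, rcond=None)  # multiple regression of z_j on X_j
        return w
    if mode == "C":
        return np.sign(Xc.T @ zc)                    # sign of cor(x_jh, z_j)
    raise ValueError("mode must be 'A', 'B' or 'C'")

def normalize_weights(Xj, w):
    """Rescale w so that the resulting latent variable has unit variance."""
    y = (Xj - Xj.mean(axis=0)) @ w
    return w / y.std(ddof=1)

# Hypothetical block and inner estimate.
rng = np.random.default_rng(1)
z = rng.standard_normal(120)
z = (z - z.mean()) / z.std(ddof=1)
X = z[:, None] * np.array([0.8, 0.6, 0.7]) + 0.5 * rng.standard_normal((120, 3))

w_a = outer_weights(X, z, "A")
w_b = outer_weights(X, z, "B")
w_c = normalize_weights(X, outer_weights(X, z, "C"))
```

Mode B is computed here with a least squares solver rather than by inverting X_j'X_j explicitly, which gives the same result for a full-rank block and is numerically safer.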

4.2.2. Estimating the weights

The starting step of the PLS algorithm consists in beginning with an arbitrary vector of weights w_jh. These weights are then standardized in order to obtain latent variables with unitary variance.

A good choice for the initial weight values is w_jh = sign(cor(x_jh, ξ_j)) or, more simply, w_jh = sign(cor(x_jh, ξ_j)) for h = 1 and 0 otherwise. Alternatively, the initial weights may be the elements of the first eigenvector from a PCA of each block.

Then the steps for the outer and the inner estimates, depending on the selected mode, are iterated until convergence (guaranteed only for the two-block case, but almost always reached in practice even with more than two blocks).

After the last step, final results are obtained for the outer weights ŵ_jh, the standardized latent variable y_j = Ʃ ŵ_jh (x_jh - ẋ_jh), the estimated mean ṁ_j = Ʃ ŵ_jh ẋ_jh of the latent variable ξ_j, and the final estimate approx(ξ_j) = Ʃ ŵ_jh x_jh = y_j + ṁ_j of ξ_j. The latter estimate can be rescaled.

The latent variable estimates are sensitive to the scaling of the manifest variables in Mode A, but not in mode B. In the latter case, the outer LV estimate is the projection of the inner LV estimate on the space generated by its manifest variables.
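
The iteration described in this section can be sketched end to end for one particular configuration. The code below assumes standardized manifest variables, Mode A for every block, the centroid scheme and initial weights equal to 1; the function names, the tolerance, the iteration cap and the simulated data are all hypothetical choices made for this illustration.

```python
import numpy as np

def standardize(a):
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

def plspm_mode_a_centroid(blocks, path, tol=1e-6, max_iter=300):
    """blocks: list of (n, p_j) arrays; path[j, i] = 1 if xi_i explains xi_j."""
    blocks = [standardize(np.asarray(X, dtype=float)) for X in blocks]
    path = np.asarray(path)
    n = blocks[0].shape[0]
    connected = ((path + path.T) > 0).astype(float)
    weights = [np.ones(X.shape[1]) for X in blocks]   # arbitrary starting weights
    for iteration in range(max_iter):
        # Outer estimates, standardized to unit variance.
        Y = np.column_stack([standardize(X @ w) for X, w in zip(blocks, weights)])
        # Inner estimates with the centroid scheme.
        E = np.sign(np.corrcoef(Y, rowvar=False)) * connected
        Z = standardize(Y @ E.T)
        # Mode A update: w_jh = cov(x_jh, z_j).
        new_weights = [X.T @ Z[:, j] / (n - 1) for j, X in enumerate(blocks)]
        change = max(np.abs(nw - w).max() for nw, w in zip(new_weights, weights))
        weights = new_weights
        if change < tol:
            break
    # Final outer weights, scaled so that each latent variable has unit variance.
    final = [w / (X @ w).std(ddof=1) for X, w in zip(blocks, weights)]
    scores = np.column_stack([X @ w for X, w in zip(blocks, final)])
    return final, scores, iteration + 1

# Hypothetical data: three latent variables, three MVs each.
rng = np.random.default_rng(42)
n = 150
xi1 = rng.standard_normal(n)
xi2 = 0.7 * xi1 + 0.5 * rng.standard_normal(n)
xi3 = 0.5 * xi1 + 0.5 * xi2 + 0.5 * rng.standard_normal(n)
loadings = np.array([0.9, 0.8, 0.7])
blocks = [xi[:, None] * loadings + 0.6 * rng.standard_normal((n, 3))
          for xi in (xi1, xi2, xi3)]
path = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]])

weights, scores, n_iter = plspm_mode_a_centroid(blocks, path)
```

The loop stops when the largest change in any outer weight falls below the tolerance. Convergence is not guaranteed in general for more than two blocks, which is why the sketch also caps the number of iterations.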

4.3. Estimation of the structural equations

The structural equations are estimated by individual OLS multiple regressions where the latent variables ξ_j are replaced by their estimates approx(ξ_j). As usual, the use of OLS multiple regressions may be disturbed by the presence of strong multicollinearity between the estimated latent variables. In such a case, PLS regression may be applied instead.
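
The OLS step can be sketched as one regression per endogenous latent variable. The function name `path_coefficients`, the path matrix convention (path[j, i] = 1 if ξ_i explains ξ_j) and the example scores are hypothetical; the example is built without noise so that the recovered coefficients are easy to check.

```python
import numpy as np

def path_coefficients(scores, path):
    """OLS estimates of beta_j0 and beta_ji for each endogenous latent variable."""
    scores = np.asarray(scores, dtype=float)
    path = np.asarray(path)
    n, J = scores.shape
    intercepts = np.zeros(J)
    B = np.zeros((J, J))
    r2 = np.full(J, np.nan)                 # undefined for exogenous variables
    for j in range(J):
        pred = np.flatnonzero(path[j])      # latent variables explaining xi_j
        if pred.size == 0:
            continue
        A = np.column_stack([np.ones(n), scores[:, pred]])
        coef, *_ = np.linalg.lstsq(A, scores[:, j], rcond=None)
        intercepts[j] = coef[0]
        B[j, pred] = coef[1:]
        residual = scores[:, j] - A @ coef
        r2[j] = 1 - residual.var() / scores[:, j].var()
    return intercepts, B, r2

# Hypothetical latent variable estimates: xi_3 depends on xi_1 and xi_2.
rng = np.random.default_rng(5)
s1 = rng.standard_normal(50)
s2 = rng.standard_normal(50)
s3 = 1.0 + 2.0 * s1 - 0.5 * s2
scores = np.column_stack([s1, s2, s3])
path = np.array([[0, 0, 0], [0, 0, 0], [1, 1, 0]])

intercepts, B, r2 = path_coefficients(scores, path)
```

When the estimated latent variables are strongly collinear, the least squares call could be replaced by a PLS regression, as noted above.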
