# PLS Path Modelling

Partial Least Squares Path Modeling is a powerful tool similar to Structural Equation Modeling (SEM). Run PLS-PM in Excel using the XLSTAT software.

## What is PLS Path Modeling?

Partial Least Squares Path Modeling (PLS-PM) is a statistical approach for modeling complex multivariable relationships (structural equation models) among observed and latent variables. In recent years, this approach has enjoyed increasing popularity in several sciences (Esposito Vinzi et al., 2007). Structural Equation Models include a number of statistical methodologies that allow the estimation of a causal theoretical network of relationships linking latent complex concepts, each measured by means of a number of observable indicators.

The first presentation of the finalized PLS approach to path models with latent variables was published by Wold in 1979; the main references on the PLS algorithm are Wold (1982 and 1985).

Herman Wold contrasted LISREL (Jöreskog, 1970) "hard modeling" (heavy distribution assumptions, several hundred cases necessary) with PLS "soft modeling" (very few distribution assumptions, few cases can suffice). These two approaches to Structural Equation Modeling were compared in Jöreskog and Wold (1982).

From the standpoint of structural equation modeling, PLS-PM is a component-based approach where the concept of causality is formulated in terms of linear conditional expectation. PLS-PM seeks optimal linear predictive relationships rather than causal mechanisms, thus privileging a prediction-relevance-oriented discovery process over the statistical testing of causal hypotheses. Two very important review papers on the PLS approach to Structural Equation Modeling are Chin (1998, more application oriented) and Tenenhaus et al. (2005, more theory oriented).

Furthermore, PLS Path Modeling can be used for analyzing multiple tables and is directly related to more classical data analysis methods used in this field. In fact, PLS-PM may also be viewed as a very flexible approach to multi-block (or multiple table) analysis by means of both the hierarchical PLS path model and the confirmatory PLS path model (Tenenhaus and Hanafi, 2007). This approach clearly shows how the "data-driven" tradition of multiple table analysis can be merged with the "theory-driven" tradition of structural equation modeling, so that multi-block data can be analyzed in light of current knowledge on the conceptual relationships between tables.

## The PLS Path Modeling algorithm

A PLS Path model is described by two models: (1) a measurement model relating the manifest variables to their own latent variable and (2) a structural model relating some endogenous latent variables to other latent variables. The measurement model is also called the outer model and the structural model the inner model.

#### 1. Manifest variables standardization

There are four options for the standardization of the manifest variables, depending on three conditions that may hold in the data:

- Condition 1: The scales of the manifest variables are comparable. For instance, in the ECSI example the item values (between 0 and 100) are comparable. On the other hand, for instance, weight in tons and speed in km/h would not be comparable.
- Condition 2: The means of the manifest variables are interpretable. For instance, if the difference between two manifest variables is not interpretable, the location parameters are meaningless.
- Condition 3: The variances of the manifest variables reflect their importance.

If condition 1 does not hold, then the manifest variables have to be standardized (mean 0 and variance 1).

If condition 1 holds, it is useful to get the results based on the raw data. But the calculation of the model parameters depends upon the validity of the other conditions:

- Conditions 2 and 3 do not hold: The manifest variables are standardized (mean 0, variance 1) for the parameter estimation phase. Then the manifest variables are rescaled to their original means and variances for the final expression of the weights and loadings (to be defined later).
- Condition 2 holds, but not condition 3: The manifest variables are not centered, but are standardized to unit variance for the parameter estimation phase. Then the manifest variables are rescaled to their original variances for the final expression of the weights and loadings.
- Conditions 2 and 3 hold: The original variables are used.

Lohmöller (1989) introduced a standardization parameter to select one of these four options:

Variable scales are comparable | Means are interpretable | Variance is related to variable importance | Mean | Variance | Rescaling | METRIC |
---|---|---|---|---|---|---|
No | | | 0 | 1 | No | 1 |
Yes | No | No | 0 | 1 | Yes | 2 |
Yes | Yes | No | Original | 1 | Yes | 3 |
Yes | Yes | Yes | Original | Original | | 4 |

With METRIC=1 being "Standardized, weights on standardized MV", METRIC=2 being "Standardized, weights on raw MV", METRIC=3 being "Reduced, weights on raw MV" and METRIC=4 being "Raw MV".
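As a rough illustration of the estimation-phase preprocessing implied by each METRIC value, here is a minimal Python sketch. The function name `preprocess_manifest`, the sample data and the use of the population standard deviation are assumptions made for this example; the later rescaling of weights and loadings for METRIC 2 and 3 is not shown.

```python
import numpy as np

def preprocess_manifest(X, metric):
    """Return the manifest variables as used in the estimation phase.

    metric follows the METRIC column of the table above:
      1 or 2 -> centered and scaled to unit variance
      3      -> not centered, scaled to unit variance
      4      -> raw data, unchanged
    """
    X = np.asarray(X, dtype=float)
    if metric in (1, 2):
        return (X - X.mean(axis=0)) / X.std(axis=0)
    if metric == 3:
        return X / X.std(axis=0)
    if metric == 4:
        return X.copy()
    raise ValueError("metric must be 1, 2, 3 or 4")

# Hypothetical block of three items scored between 0 and 100.
X = np.array([[60.0, 70.0, 65.0],
              [80.0, 85.0, 90.0],
              [40.0, 55.0, 50.0],
              [75.0, 60.0, 70.0]])
Z = preprocess_manifest(X, metric=3)
print(Z.std(axis=0))   # columns scaled to unit standard deviation, means kept
```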

#### 2. The measurement model

A latent variable (LV) ξ is an unobservable variable (or construct) indirectly described by a block of observable variables x_{h} which are called manifest variables (MV) or indicators. There are three ways to relate the manifest variables to their latent variables, respectively called the reflective way, the formative one, and the MIMIC (Multiple effect Indicators for Multiple Causes) way.

##### 2.1. The reflective way

**2.1.1. Definition**

In this model each manifest variable reflects its latent variable. Each manifest variable is related to its latent variable by a simple regression:

x_{h} = π_{h0} + π_{h}ξ + ε_{h},   (1)

where ξ has mean m and standard deviation 1. It is a reflective scheme: each manifest variable x_{h} reflects its latent variable ξ. The only hypothesis made on model (1) is what H. Wold called the predictor specification condition:

E(x_{h} | ξ) = π_{h0}+ π_{h}ξ.

This hypothesis implies that the residual ε_{h} has a zero mean and is uncorrelated with the latent variable ξ.
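The step from the predictor specification condition to these two properties of the residual can be spelled out as follows. This is a short derivation sketch that uses only the regression equation of the reflective model and the law of iterated expectations:

```latex
\begin{aligned}
E(\varepsilon_h \mid \xi) &= E(x_h \mid \xi) - \pi_{h0} - \pi_h \xi = 0, \\
E(\varepsilon_h) &= E\big(E(\varepsilon_h \mid \xi)\big) = 0, \\
\operatorname{cov}(\varepsilon_h, \xi) &= E(\xi\,\varepsilon_h) - E(\xi)\,E(\varepsilon_h)
  = E\big(\xi\,E(\varepsilon_h \mid \xi)\big) = 0.
\end{aligned}
```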

**2.1.2. Check for unidimensionality**

In the reflective way the block of manifest variables is unidimensional in the sense of factor analysis. With real data this condition has to be checked. Three main tools are available to check the unidimensionality of a block: principal component analysis of each block of manifest variables, Cronbach's α and Dillon-Goldstein's r.

- Principal component analysis of a block

A block is essentially unidimensional if the first eigenvalue of the correlation matrix of the block MVs is larger than 1 and the second one smaller than 1, or at least very far from the first one. The first principal component can be built in such a way that it is positively correlated with all (or at least a majority of) the MVs. An MV that is negatively correlated with the first principal component is a problem.

- Cronbach's α

Cronbach's α can be used to check the unidimensionality of a block of p variables x_{h} when they are all positively correlated. Cronbach proposed the following procedure for standardized variables:

α = p / (p-1) [Ʃ_{h≠h’}cor(x_{h}, x_{h’}) / (p + Ʃ_{h≠h’}cor(x_{h}, x_{h’}))]

Cronbach's α is also defined for original (raw) variables as:

α = p / (p-1) [Ʃ_{h≠h’}cov(x_{h}, x_{h’}) / var(Ʃ_{h}x_{h})]

A block is considered unidimensional when Cronbach's α is larger than 0.7.

- Dillon-Goldstein's r

The sign of the correlation between each MV x_{h} and its LV ξ is known by construction of the item and is supposed here to be positive. In equation (1) this hypothesis means that all the loadings π_{h} are positive. A block is unidimensional if all these loadings are large.

Dillon-Goldstein's r is defined by:

r = (Ʃ_{h=1..p}π_{h})² Var(ξ) / [(Ʃ_{h=1..p}π_{h})² Var(ξ) + Ʃ_{h=1..p}Var(ε_{h})]

Let's now suppose that all the MVs x_{h} and the latent variable ξ are standardized. An approximation of the latent variable ξ is obtained by standardization of the first principal component t_{1} of the block MVs. Then π_{h} is estimated by cor(x_{h}, t_{1}) and, using equation (1), Var(ε_{h}) is estimated by 1 – cor^{2}(x_{h}, t_{1}). So we get an estimate of Dillon-Goldstein's r:

ȓ = (Ʃ_{h=1..p}cor(x_{h}, t_{1}))² / [(Ʃ_{h=1..p}cor(x_{h}, t_{1}))² + Ʃ_{h=1..p}(1 – cor^{2}(x_{h}, t_{1}))]
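The three unidimensionality checks can be computed together for one block. The following Python sketch is only an illustration: the function name `block_unidimensionality` and the simulated data are assumptions, and the block is treated as standardized, so the standardized form of Cronbach's α is used.

```python
import numpy as np

def block_unidimensionality(X):
    """Eigenvalues, Cronbach's alpha and Dillon-Goldstein's r for one block."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)            # returned in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Cronbach's alpha for standardized variables.
    off_diag = R.sum() - p                          # sum of cor(x_h, x_h') over h != h'
    alpha = p / (p - 1) * off_diag / (p + off_diag)

    # Standardized first principal component t1 of the standardized block.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    t1 = Z @ eigvecs[:, 0]
    t1 = t1 / t1.std()
    # Loadings estimated by cor(x_h, t1); Var(eps_h) estimated by 1 - cor^2.
    load = np.array([np.corrcoef(Z[:, h], t1)[0, 1] for h in range(p)])
    r_hat = load.sum() ** 2 / (load.sum() ** 2 + np.sum(1 - load ** 2))
    return eigvals, alpha, r_hat

# Simulated block (assumed data): three noisy copies of one common factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=200)
X = np.column_stack([factor + 0.5 * rng.normal(size=200) for _ in range(3)])
eigvals, alpha, r_hat = block_unidimensionality(X)
print(eigvals, alpha, r_hat)
```

With data of this kind the first eigenvalue should dominate and α should exceed 0.7; a block that mixes several unrelated factors would fail these checks.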

PLS Path Modeling is a mixture of a priori knowledge and data analysis. In the reflective way, the a priori knowledge concerns the unidimensionality of the block and the signs of the loadings. The data have to fit this model. If they do not, they can be modified by removing some manifest variables that are far from the model. Another solution is to change the model and use the formative way that will now be described.

##### 2.2. The formative way

In the formative way, it is supposed that the latent variable ξ is generated by its own manifest variables. The latent variable ξ is a linear function of its manifest variables plus a residual term:

ξ = Ʃ_{h}w_{h}x_{h} + δ

In the formative model the block of manifest variables can be multidimensional. The predictor specification condition is supposed to hold as:

E(ξ | x_{1}, ..., x_{p}) = Ʃ_{h}w_{h}x_{h}

This hypothesis implies that the residual vector δ has a zero mean and is uncorrelated with the MVs x_{h}.

##### 2.3. The MIMIC way

The MIMIC way is a mixture of the reflective and formative ways.

The measurement model for a block is the following:

x_{h} = π_{h0}+ π_{h}ξ + ε_{h}, for h = 1 to p_{1}

where the latent variable is defined by:

ξ = Ʃ_{h=p_{1}+1..p} w_{h}x_{h} + δ

The first p_{1} manifest variables follow a reflective way and the last (p – p_{1}) a formative way. The predictor specification hypotheses still hold and lead to the same consequences as before on the residuals.

#### 3. The structural model

The causality model leads to linear equations relating the latent variables to one another (the structural or inner model):

ξ_{j} = β_{j0} + Ʃ_{i} β_{ji} ξ_{i} + v_{j}

The predictor specification hypothesis is still applied.

A latent variable that never appears as a dependent variable is called an exogenous variable. Otherwise it is called an endogenous variable.
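One convenient way to encode an inner model is a 0/1 path matrix. The sketch below assumes a hypothetical three-variable model; the latent variable names and the matrix layout (rows as dependent variables) are illustrative choices, not something prescribed by the method.

```python
import numpy as np

# Hypothetical inner model with three latent variables:
#   Image -> Satisfaction, Image -> Loyalty, Satisfaction -> Loyalty
names = ["Image", "Satisfaction", "Loyalty"]
# path[j, i] = 1 means xi_i appears on the right-hand side of the equation for xi_j.
path = np.array([[0, 0, 0],
                 [1, 0, 0],
                 [1, 1, 0]])

# A latent variable is exogenous when its row is empty (never a dependent variable).
exogenous = [n for n, row in zip(names, path) if row.sum() == 0]
endogenous = [n for n, row in zip(names, path) if row.sum() > 0]

# Two latent variables are connected when an arrow links them in either direction.
connected = (path + path.T) > 0
print(exogenous, endogenous)
```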

#### 4. The Estimation Algorithm

##### 4.1. Latent variables Estimation

The latent variables ξ_{j} are estimated according to the following procedure.

**4.1.1. Outer estimate y_{j} of the standardized latent variable (ξ_{j} – m_{j})**

The standardized latent variables (mean = 0 and standard deviation = 1) are estimated as linear combinations of their centered manifest variables:

y_{j} ∝ ± [Ʃ w_{jh} (x_{jh} - ẋ_{jh})]

where the symbol "∝" means that the left variable represents the standardized right variable and the "±" sign shows the sign ambiguity. This ambiguity is solved by choosing the sign that makes y_{j} positively correlated with a majority of the x_{jh}.

The standardized latent variable is finally written as:

y_{j} = Ʃ ŵ_{jh} (x_{jh} - ẋ_{jh})

The coefficients w_{jh} and ŵ_{jh} are both called the outer weights.

The mean m_{j} is estimated by:

ṁ_{j} = Ʃ ŵ_{jh} ẋ_{jh}

and the latent variable ξ_{j} by

approx(ξ_{j}) = Ʃ ŵ_{jh} x_{jh} = y_{j} + ṁ_{j}

When all manifest variables are observed on the same measurement scale, it is convenient to express the latent variable estimates on the original scale (Fornell, 1992) as:

approx(ξ_{j})* = Ʃ ŵ_{jh} x_{jh} / Ʃ ŵ_{jh}.

This expression is usable when all outer weights are positive. Finally, in most real applications, latent variable estimates are required on a 0-100 scale so as to have a reference scale for comparing individual scores. For the i-th observed case, this is easily obtained by the following transformation:

approx(ξ_{j})^{0-100} = 100 * (approx(ξ_{j})* - x_{min}) / (x_{max} - x_{min})

where x_{min} and x_{max} are, respectively, the minimum and the maximum value of the measurement scale common to all manifest variables.
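The outer estimate and its rescaling can be sketched in a few lines of Python. The helper names `outer_estimate` and `score_0_100`, the sample data on a 1-10 scale and the equal starting weights are assumptions chosen for illustration only.

```python
import numpy as np

def outer_estimate(X, w):
    """Standardized outer estimate y, rescaled weights w_hat and estimated mean."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)
    raw = (X - x_bar) @ w
    w_hat = w / raw.std()                  # rescale so that y has unit variance
    y = (X - x_bar) @ w_hat
    # Sign ambiguity: make y positively correlated with a majority of the MVs.
    cors = np.array([np.corrcoef(X[:, h], y)[0, 1] for h in range(X.shape[1])])
    if np.sum(cors > 0) < np.sum(cors < 0):
        w_hat, y = -w_hat, -y
    m_hat = x_bar @ w_hat                  # estimated mean of the latent variable
    return y, w_hat, m_hat

def score_0_100(X, w_hat, x_min, x_max):
    """Estimate on the original scale, then mapped to a 0-100 scale."""
    xi_star = (X @ w_hat) / w_hat.sum()    # assumes all outer weights are positive
    return 100 * (xi_star - x_min) / (x_max - x_min)

# Hypothetical block of three items on a 1-10 scale with arbitrary equal weights.
X = np.array([[7.0, 8.0, 6.0],
              [9.0, 9.0, 10.0],
              [4.0, 5.0, 3.0],
              [6.0, 7.0, 8.0]])
y, w_hat, m_hat = outer_estimate(X, np.array([1.0, 1.0, 1.0]))
print(score_0_100(X, w_hat, x_min=1, x_max=10))
```

With equal weights the score on the original scale reduces to the mean of the items for each case, which makes the 0-100 transformation easy to check by hand.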

**4.1.2. Inner estimate z_{j} of the standardized latent variable (ξ_{j} – m_{j})**

The inner estimate z_{j} of the standardized latent variable (ξ_{j} – m_{j}) is defined by:

z_{j} ∝ Ʃ_{j': ξ_{j'} is connected with ξ_{j}} e_{jj'} y_{j'}

where the inner weights e_{jj’} are equal to the signs of the correlations between y_{j} and the y_{j’}'s connected with y_{j}. Two latent variables are connected if there exists a link between them: an arrow goes from one variable to the other in the arrow diagram describing the causality model. This choice of inner weights is called the centroid scheme.

- Centroid scheme:

This choice shows a drawback in case the correlation is approximately zero as its sign may change for very small fluctuations. But it does not seem to be a problem in practical applications.

In the original algorithm, the inner estimate is the right term and there is no standardization. We prefer to standardize because it does not change anything for the final inner estimate of the latent variables and it simplifies the writing of some equations.

Two other schemes for choosing the inner weights exist: the factorial scheme and the path weighting (or structural) scheme. These two schemes are defined as follows:

- Factorial scheme:

The inner weights e_{ji} are equal to the correlation between y_{i} and y_{j}. This is an answer to the drawback of the centroid scheme described above.

- Path weighting scheme (structural):

The latent variables connected to ξ_{j} are divided into two groups: the predecessors of ξ_{j}, which are latent variables explaining ξ_{j}, and the followers, which are latent variables explained by ξ_{j}.

For a predecessor ξ_{j’} of the latent variable ξ_{j}, the inner weight e_{jj’} is equal to the regression coefficient of y_{j’} in the multiple regression of y_{j} on all the y_{j’}’s related to the predecessors of ξ_{j}. If ξ_{j’} is a follower of ξ_{j}, then the inner weight e_{jj’} is equal to the correlation between y_{j’} and y_{j}.

These new schemes do not significantly influence the results but are very important for theoretical reasons. In fact, they make it possible to relate PLS Path Modeling to the usual multiple table analysis methods.
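The three inner weighting schemes differ only in how the weight matrix is filled. A minimal Python sketch follows, assuming the outer estimates are already standardized and the inner model is coded as a 0/1 path matrix; the function name `inner_weights`, the matrix layout and the simulated data are all assumptions of this example.

```python
import numpy as np

def inner_weights(Y, path, scheme="centroid"):
    """Inner weight matrix E, with E[j, i] the weight of y_i in the inner estimate z_j.

    Y    : (n, J) standardized outer estimates, one column per latent variable
    path : (J, J) 0/1 matrix, path[j, i] = 1 if xi_i is a predecessor of xi_j
    """
    R = np.corrcoef(Y, rowvar=False)
    connected = (path + path.T) > 0
    if scheme == "centroid":                      # signs of the correlations
        return np.where(connected, np.sign(R), 0.0)
    if scheme == "factorial":                     # the correlations themselves
        return np.where(connected, R, 0.0)
    if scheme == "path":
        E = np.where(path.T > 0, R, 0.0)          # followers: correlations
        for j in range(path.shape[0]):
            pred = np.flatnonzero(path[j])
            if pred.size:                         # predecessors: regression coefficients
                E[j, pred] = np.linalg.lstsq(Y[:, pred], Y[:, j], rcond=None)[0]
        return E
    raise ValueError("scheme must be 'centroid', 'factorial' or 'path'")

# Simulated standardized scores for three connected latent variables (assumed data).
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = 0.6 * a + rng.normal(size=300)
c = 0.5 * a + 0.4 * b + rng.normal(size=300)
Y = np.column_stack([a, b, c])
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
path = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]])

z = Y @ inner_weights(Y, path, "centroid").T      # inner estimates before standardization
print(inner_weights(Y, path, "path"))
```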

##### 4.2. The PLS algorithm for estimating the weights

**4.2.1. Estimation modes for the weights w_{jh}**

There are three classical ways to estimate the weights w_{jh}: Mode A, Mode B and Mode C.

*Mode A:*

In mode A the weight w_{jh} is the regression coefficient of z_{j} in the simple regression of x_{jh} on the inner estimate z_{j}:

w_{jh} = cov(x_{jh}, z_{j}),

as z_{j} is standardized.

*Mode B:*

In mode B the vector w_{j} of weights w_{jh} is the regression coefficient vector in the multiple regression of z_{j} on the manifest centered variables (x_{jh} - ẋ_{jh}) related to the same latent variable ξ_{j}:

w_{j} = (X_{j}'X_{j})^{-1} X_{j}'z_{j},

where X_{j} is the matrix with columns defined by the centered manifest variables x_{jh} - ẋ_{jh} related to the j-th latent variable ξ_{j}.

Mode A is appropriate for a block with a reflective measurement model and Mode B for a formative one. Mode A is often used for an endogenous latent variable and mode B for an exogenous one. Modes A and B can be used simultaneously when the measurement model is the MIMIC one. Mode A is used for the reflective part of the model and Mode B for the formative part.

In practical situations, mode B is not so easy to use because there is often strong multicollinearity inside each block. When this is the case, PLS regression may be used instead of OLS multiple regression. As a matter of fact, mode A amounts to taking the first component from a PLS regression, while mode B takes all PLS regression components (and thus coincides with OLS multiple regression). Therefore, running a PLS regression and retaining a certain number of significant components may be regarded as a new intermediate mode between mode A and mode B.

*Mode C (centroid):*

In mode C the weights are all equal in absolute value and reflect the signs of the correlations between the manifest variables and their latent variables:

w_{jh} = sign(cor(x_{jh}, z_{j})).

These weights are then normalized so that the resulting latent variable has unitary variance. Mode C actually refers to a formative way of linking manifest variables to their latent variables and represents a specific case of Mode B whose comprehension is very intuitive to practitioners.
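For one block and a given inner estimate, the three modes can be compared side by side. This is a sketch under stated assumptions: the function name `outer_weights` and the simulated block are hypothetical, covariances use the population convention, and the final normalization to unit variance is left out.

```python
import numpy as np

def outer_weights(Xj, zj, mode="A"):
    """Outer weights of one block given its standardized inner estimate zj."""
    Xc = Xj - Xj.mean(axis=0)                  # centered manifest variables
    n = Xc.shape[0]
    if mode == "A":                            # w_jh = cov(x_jh, z_j)
        return Xc.T @ zj / n
    if mode == "B":                            # w_j = (X_j'X_j)^{-1} X_j'z_j
        return np.linalg.solve(Xc.T @ Xc, Xc.T @ zj)
    if mode == "C":                            # w_jh = sign(cor(x_jh, z_j))
        return np.sign(Xc.T @ zj)
    raise ValueError("mode must be 'A', 'B' or 'C'")

# Simulated block of two manifest variables driven by zj (assumed data).
rng = np.random.default_rng(2)
zj = rng.normal(size=100)
zj = (zj - zj.mean()) / zj.std()
Xj = np.column_stack([zj + rng.normal(size=100), 2 * zj + rng.normal(size=100)])
for mode in "ABC":
    print(mode, outer_weights(Xj, zj, mode))
```

Mode B solves the normal equations directly, so it will be unstable when the manifest variables of a block are strongly collinear.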

**4.2.2. Estimating the weights**

The PLS algorithm starts with an arbitrary vector of weights w_{jh}. These weights are then standardized in order to obtain latent variables with unit variance.

A good choice for the initial weight values is to take w_{jh} = sign(cor(x_{jh}, ξ_{j})) or, more simply, w_{jh} = sign(cor(x_{jh}, ξ_{j})) for h = 1 and 0 otherwise. Alternatively, they may be the elements of the first eigenvector from a PCA of each block.

Then the steps for the outer and the inner estimates, depending on the selected mode, are iterated until convergence (guaranteed only for the two-block case, but almost always reached in practice even with more than two blocks).

After the last step, final results are obtained for the outer weights ŵ_{jh}, the standardized latent variable y_{j} = Ʃ ŵ_{jh} (x_{jh} - ẋ_{jh}), the estimated mean ṁ_{j} = Ʃ ŵ_{jh} ẋ_{jh} of the latent variable ξ_{j}, and the final estimate approx(ξ_{j}) = Ʃ ŵ_{jh} x_{jh} = y_{j} + ṁ_{j} of ξ_{j}. The latter estimate can be rescaled.

The latent variable estimates are sensitive to the scaling of the manifest variables in Mode A, but not in mode B. In the latter case, the outer LV estimate is the projection of the inner LV estimate on the space generated by its manifest variables.
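The iteration can be sketched end to end for one particular configuration. The code below is a minimal illustration, not a reference implementation: it assumes standardized manifest variables, Mode A for every block, the centroid inner scheme, starting weights all equal to 1, and an inner model in which every latent variable is connected to at least one other. The function name `pls_pm` and the simulated data are hypothetical.

```python
import numpy as np

def standardize(v):
    return (v - v.mean(axis=0)) / v.std(axis=0)

def pls_pm(blocks, path, max_iter=300, tol=1e-8):
    """Mode A outer estimation with the centroid inner scheme.

    blocks : list of (n, p_j) arrays, one per latent variable
    path   : (J, J) 0/1 matrix, path[j, i] = 1 if xi_i is a predecessor of xi_j
    """
    blocks = [standardize(np.asarray(X, dtype=float)) for X in blocks]
    n = blocks[0].shape[0]
    connected = (path + path.T) > 0
    # Starting weights: 1 for every manifest variable (an arbitrary choice).
    weights = [np.ones(X.shape[1]) for X in blocks]
    for _ in range(max_iter):
        # Outer estimates y_j, standardized to unit variance.
        Y = np.column_stack([standardize(X @ w) for X, w in zip(blocks, weights)])
        # Inner estimates z_j with centroid weights e_jj' = sign(cor(y_j, y_j')).
        E = np.where(connected, np.sign(np.corrcoef(Y, rowvar=False)), 0.0)
        Z = standardize(Y @ E.T)
        # Mode A update: w_jh = cov(x_jh, z_j).
        new_weights = [X.T @ Z[:, j] / n for j, X in enumerate(blocks)]
        change = max(np.max(np.abs(a - b)) for a, b in zip(new_weights, weights))
        weights = new_weights
        if change < tol:
            break
    # Final rescaling so that each y_j = X_j w_hat_j has unit variance.
    weights = [w / (X @ w).std() for X, w in zip(blocks, weights)]
    Y = np.column_stack([X @ w for X, w in zip(blocks, weights)])
    return weights, Y

# Simulated data (assumed): three latent variables, three indicators each.
rng = np.random.default_rng(3)
n = 300
xi1 = rng.normal(size=n)
xi2 = 0.7 * xi1 + rng.normal(scale=0.7, size=n)
xi3 = 0.5 * xi1 + 0.5 * xi2 + rng.normal(scale=0.7, size=n)
blocks = [np.column_stack([xi + rng.normal(scale=0.6, size=n) for _ in range(3)])
          for xi in (xi1, xi2, xi3)]
path = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]])
weights, Y = pls_pm(blocks, path)
print(np.round(np.corrcoef(Y, rowvar=False), 2))
```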

##### 4.3. Estimation of the structural equations

The structural equations are estimated by individual OLS multiple regressions where the latent variables ξ_{j} are replaced by their estimates approx(ξ_{j}). As usual, the use of OLS multiple regressions may be disturbed by the presence of strong multicollinearity between the estimated latent variables. In such a case, PLS regression may be applied instead.
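The OLS step can be illustrated with a short Python sketch that regresses each endogenous latent variable score on the scores of its predecessors. The function name `path_coefficients`, the 0/1 path matrix layout and the simulated scores are assumptions for this example; the PLS regression alternative is not shown.

```python
import numpy as np

def path_coefficients(scores, path):
    """OLS regression of each endogenous latent variable score on its predecessors.

    scores : (n, J) latent variable estimates, one column per latent variable
    path   : (J, J) 0/1 matrix, path[j, i] = 1 if xi_i is a predecessor of xi_j
    Returns a dict {j: (intercept, coefficients, R2)} for the endogenous variables.
    """
    out = {}
    for j in range(path.shape[0]):
        pred = np.flatnonzero(path[j])
        if pred.size == 0:
            continue                                   # exogenous: no equation
        A = np.column_stack([np.ones(len(scores)), scores[:, pred]])
        beta, *_ = np.linalg.lstsq(A, scores[:, j], rcond=None)
        resid = scores[:, j] - A @ beta
        r2 = 1 - resid.var() / scores[:, j].var()
        out[j] = (beta[0], beta[1:], r2)
    return out

# Simulated latent variable scores (assumed data).
rng = np.random.default_rng(4)
s0 = rng.normal(size=500)
s1 = 0.6 * s0 + rng.normal(scale=0.8, size=500)
s2 = 0.3 * s0 + 0.5 * s1 + rng.normal(scale=0.7, size=500)
scores = np.column_stack([s0, s1, s2])
path = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]])
for j, (b0, b, r2) in path_coefficients(scores, path).items():
    print(j, round(b0, 3), np.round(b, 3), round(r2, 3))
```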
