Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is one of the most popular data mining statistical methods. Run your PCA in Excel using the XLSTAT statistical software.
Principal Component Analysis in Excel
Principal Component Analysis (PCA) is a powerful and popular multivariate analysis method that lets you investigate multidimensional datasets with quantitative variables. It is widely used in biostatistics, marketing, sociology, and many other fields.
XLSTAT provides a complete and flexible PCA feature to explore your data directly in Excel. XLSTAT proposes several standard and advanced options that will let you gain a deep insight into your data:
- Run your PCA on raw data or on dissimilarity matrices.
- Add supplementary variables or observations.
- Filter out variables or observations according to different criteria to optimize PCA map readability.
- Perform rotations such as VARIMAX.
- Customize your correlation circle, your observations plot or your biplots as standard Excel charts.
- Copy your PCA coordinates from the results report to use them in further analyses.
- Launch a pre-configured Agglomerative Hierarchical Clustering (AHC) using the dendrogram button below the factor scores table.
We also provide many free learning resources on the web, such as a tutorial on how to run PCA in XLSTAT as well as a guide to choose the best data mining or multivariate data analysis method according to your situation.
What is Principal Component Analysis?
Principal Component Analysis is one of the most frequently used multivariate data analysis methods for dimensionality reduction.
It is a projection method as it projects observations from a p-dimensional space with p variables to a k-dimensional space (where k < p) so as to conserve the maximum amount of information (information is measured here through the total variance of the dataset) from the initial dimensions. PCA dimensions are also called axes or Factors. If the information associated with the first 2 or 3 axes represents a sufficient percentage of the total variability of the scatter plot, the observations could be represented on a 2 or 3-dimensional chart, thus making interpretation much easier.
The Principal Component Analysis, a Data Mining tool
PCA can thus be considered as a Data Mining method as it allows to easily extract information from large datasets. There are several uses for it, including:
- The study and visualization of the correlations between variables to hopefully be able to limit the number of variables to be measured afterwards.
- Obtaining non-correlated factors which are linear combinations of the initial variables so as to use these factors in modeling methods such as linear regression, logistic regression or discriminant analysis.
- Visualizing observations in a 2- or 3-dimensional space in order to identify uniform or atypical groups of observations.
Options for Principal Component Analysis in Excel using the XLSTAT software
PCA on Pearson or Covariance?
PCA is used to calculate matrices to project the variables in a new space using a new matrix which shows the degree of similarity between the variables. It is common to use the Pearson correlation coefficient or the covariance as the index of similarity, Pearson correlation and covariance have the advantage of giving positive semi-defined matrices whose properties are used in PCA. However other indexes may be used.
XLSTAT offers several data treatments to be used on the input data prior to Principal Component Analysis computations:
- Pearson, the classic PCA, that automatically standardizes or normalizes the data prior to computations to avoid inflating the impact of variables with high variances on the result.
- Covariance, that works on unstandardized variances and covariances (variables with high variances will play stronger roles in the outputs.
- Spearman, fully equivalent to a classic PCA (based on Pearson correlation) performed on the matrix of ranks.
Traditionally, a correlation coefficient rather than the covariance is used as using a correlation coefficient removes the effect of scale: thus a variable which varies between 0 and 1 does not weigh more in the projection than a variable varying between 0 and 1000. However in certain areas, when the variables are supposed to be on an identical scale or we want the variance of the variables to influence factor building, covariance is used.
Where only a similarity matrix is available rather than a table of observations/variables, or where you want to use another similarity index, you can carry out a PCA starting from the similarity matrix (correlation or covariance).
PCA with supplementary variables and observations
XLSTAT lets you add variables (qualitative or quantitative) or observations to the PCA after it has been computed. Those variables or observations are called supplementary. This can be used in several contexts. Here are two examples:
- If the user wants to investigate roughly how a set of dependent variables relates to the others. The set of dependent variables should be used here as a set of supplementary variables and the others (i.e. independent variables) should be used to build the PCA.
- If the user simply wants to see how different categories of observations behave in the PCA space (Males vs Females for example). In this case, a qualitative supplementary variable (sex) may be used to color observations according to the sex they belong to. It is also possible to display the category centroids as well as confidence ellipses around categories.
PCA with rotations: Varimax and others
Rotations can be applied on the factors. Several methods are available including Varimax, Quartimax, Equamax, Parsimax, Quartimin and Oblimin and Promax.
Results for Principal Component Analysis in XLSTAT
The XLSTAT PCA feature provides results relative to variables and to observations.
What are the Correlation/Covariance matrices?
This table shows the data to be used afterwards in the calculations. The type of correlation depends on the option chosen in the General tab in the dialog box. For correlations, significant correlations are displayed in bold.
What is Bartlett's sphericity test in PCA?
The results of the Bartlett sphericity test are displayed. They are used to reject or not the hypothesis according to which the variables are not correlated.
XLSTAT also proposes the Kaiser-Meyer-Olkin (KMO) test.
What are Eigenvalues and inertia?
Eigenvalues are the amount of information (inertia) summarized in every dimension. The first dimension contains the highest amount of inertia, followed by the second, then the third, and so on. XLSTAT displays eigenvalues in a table and in a chart (scree plot). The number of eigenvalues is equal to the number of non-null eigenvalues.
What are contributions?
Contributions (also called absolute contributions) represent the extent to which each variable contributed to building the corresponding PCA axis. They help in the interpretation.
How to interpret Squared cosines for the variables?
Squared cosines reflect the representationquality of a variable on a PCA axis. As in other factor methods, squared cosine analysis is used to avoid interpretation errors due to projection effects. If the squared cosines of a variable associated to an axis is low, the position of the variable on this axis should not be interpreted.
What are factor scores?
Factor scores are the observations coordinates on the PCA dimensions. They are displayed in a table XLSTAT. If supplementary data have been selected, these are displayed at the end of the table.
As for the results related to variables, XLSTAT displays observations contributions (i.e. their contribution in building the PCA axes) as well as squared cosines (i.e. their representation quality on the different axes).
Results with rotations
Where a rotation has been requested, the results of the rotation are displayed with the rotation matrix first applied to the factor loadings. This is followed by the modified variability percentages associated with each of the axes involved in the rotation. The coordinates, contributions and cosines of the variables and observations after rotation are displayed in the following tables.
XLSTAT charts for Principal Component Analysis in Excel
The PCA correlation circle or variables chart
The correlation circle (or variables chart) shows the correlations between the components and the initial variables. Supplementary variables can also be displayed in the shape of vectors.
The PCA observations charts
The observations charts represent the observations in the PCA space.
The PCA biplots
The biplots represent the observations and variables simultaneously in the new space. Here as well the supplementary variables can be plotted in the form of vectors. There are different types of biplots:
- Correlation biplot
- Distance biplot
- Symmetric biplot
XLSTAT allows to choose the coefficient whose square root is to be multiplied by the coordinates of the variables. This coefficient lets you adjust the position of the variable points in the biplot in order to make it more readable. If set to other than 1, the length of the variable vectors can no longer be interpreted as standard deviation (correlation biplot) or contribution (distance biplot).
Tutorials on how to run PCA in Excel using the XLSTAT software
Several examples and applications are available on our webiste which will help you set up and interpret a PCA analysis according to your needs.
- An example on how to run a Principal Component Analysis (PCA).
- An example on how to run a Principal Component Analysis (PCA) with supplementary variables and individuals.
- An example on how to customize a Principal Component Analysis (PCA) chart for an easier interpretation.
- An example on how to use Principal Component Analysis and apply filters based on communalities (squared cosines).
analyze your data with xlstat