Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is one of the most popular data mining statistical methods. Run your PCA in Excel using the XLSTAT statistical software.
Principal Component Analysis in Excel
Principal Component Analysis (PCA) is a powerful and popular multivariate analysis method that lets you investigate multidimensional datasets with quantitative variables. It is widely used in biostatistics, marketing, sociology, and many other fields.
XLSTAT provides a complete and flexible PCA feature to explore your data directly in Excel. It offers several standard and advanced options that give you deep insight into your data. You can run your PCA on raw data or on dissimilarity matrices, add supplementary variables or observations, and filter out variables or observations according to different criteria to make the PCA maps easier to read. You can also perform rotations such as VARIMAX. Feel free to customize your correlation circle, your observations plot or your biplots as standard Excel charts, and copy your PCA coordinates from the results report to use them in further analyses.
What is Principal Component Analysis?
Principal Component Analysis is one of the most frequently used multivariate data analysis methods.
It is a projection method: it projects observations from a p-dimensional space (p variables) to a k-dimensional space (where k < p) so as to conserve as much information as possible from the initial dimensions, information being measured here by the total variance of the dataset. PCA dimensions are also called axes or factors. If the first 2 or 3 axes account for a sufficient percentage of the total variability, the observations can be represented on a 2- or 3-dimensional chart, which makes interpretation much easier.
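The projection described above can be sketched outside Excel with NumPy. This is a minimal illustration, not XLSTAT's implementation: the toy dataset, the variable names and the choice k = 2 are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: 100 observations, p = 4 correlated variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
mixing = np.array([[1.0, 0.8, 0.1, 0.0],
                   [0.0, 0.3, 1.0, 0.9]])
X = latent @ mixing + 0.2 * rng.normal(size=(100, 4))

# Standardize the variables, then diagonalize their correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)        # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # reorder from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                       # assumed number of axes to keep
scores = Z @ eigvecs[:, :k]                 # observations in the k-dimensional space
explained = eigvals / eigvals.sum()         # share of total variance per axis
print("Share of variance kept on", k, "axes:", explained[:k].sum())
```

If the printed share is high enough for your purpose, a 2-dimensional chart of `scores` is a reasonable summary of the original 4-dimensional data.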
The Principal Component Analysis, a Data Mining tool
PCA can thus be considered a Data Mining method, as it makes it easy to extract information from large datasets. There are several uses for it, including:
- Studying and visualizing the correlations between variables, with the aim of limiting the number of variables to be measured afterwards;
- Obtaining non-correlated factors which are linear combinations of the initial variables, so as to use these factors in modeling methods such as linear regression, logistic regression or discriminant analysis;
- Visualizing observations in a 2- or 3-dimensional space in order to identify uniform or atypical groups of observations.
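The second use listed above, feeding non-correlated factors into a model, can be sketched as follows. The data, the response `y` and the choice of two factors are hypothetical; the regression is an ordinary least-squares fit via NumPy, not an XLSTAT procedure.

```python
import numpy as np

# Hypothetical data: 3 correlated predictors and a response y.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
X = latent @ np.array([[1.0, 0.9, 0.2],
                       [0.1, 0.4, 1.0]]) + 0.1 * rng.normal(size=(200, 3))
y = 2.0 * latent[:, 0] - latent[:, 1] + 0.1 * rng.normal(size=200)

# PCA on the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
F = Z @ eigvecs                              # factor scores on all axes

# The factors are uncorrelated: their correlation matrix is the identity.
factor_corr = np.corrcoef(F, rowvar=False)

# Least-squares regression of y on an intercept and the first two factors.
design = np.column_stack([np.ones(len(y)), F[:, :2]])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print("Intercept and factor coefficients:", coef)
```

Because the factors are uncorrelated, the regression does not suffer from the collinearity present among the original predictors.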
Options for Principal Component Analysis in Excel using the XLSTAT software
Pearson or Covariance?
XLSTAT offers several data treatments to be used on the input data prior to Principal Component Analysis computations:
- Pearson, the classic PCA, which automatically standardizes the data prior to computations to avoid inflating the impact of variables with high variances on the result.
- Covariance, which works on unstandardized variances and covariances (variables with high variances will play stronger roles in the outputs).
- Polychoric, for ordinal data.
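The practical difference between the Pearson and Covariance options can be seen on a small, made-up example in which one variable is measured on a much larger scale than the others. The dataset and the helper `first_axis` are assumptions for illustration only.

```python
import numpy as np

# Three independent hypothetical variables; the third has a much larger scale.
rng = np.random.default_rng(2)
n = 300
X = np.column_stack([rng.normal(0, 1, n),
                     rng.normal(0, 1, n),
                     rng.normal(0, 100, n)])

def first_axis(matrix):
    """Return the variance share and direction of the first principal axis."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    top = np.argmax(eigvals)
    return eigvals[top] / eigvals.sum(), eigvecs[:, top]

cov_share, cov_axis = first_axis(np.cov(X, rowvar=False))       # Covariance PCA
cor_share, cor_axis = first_axis(np.corrcoef(X, rowvar=False))  # Pearson PCA

print("Covariance PCA, first axis share:", cov_share)
print("Pearson PCA, first axis share:", cor_share)
```

With the covariance matrix, the first axis is almost entirely the high-variance variable; with the correlation matrix, no single variable dominates.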
Supplementary variables and observations
XLSTAT lets you add variables (qualitative or quantitative) or observations to the PCA after it has been computed. Those variables or observations are called supplementary. This can be used in several contexts. Here are two examples:
- The user wants to investigate roughly how a set of dependent variables relates to the others. The dependent variables should then be entered as supplementary variables, and the others (i.e. the independent variables) should be used to build the PCA.
- The user simply wants to see how different categories of observations behave in the PCA space (males vs. females, for example). In this case, a qualitative supplementary variable (sex) may be used to color observations according to the category they belong to. It is also possible to display the category centroids as well as confidence ellipses around categories.
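A common way to handle supplementary elements, sketched here under the assumption of a Pearson-type PCA, is to build the axes from the active data only and then position the supplementary elements afterwards. The arrays `active`, `supplementary` and `supp_var` are hypothetical, and XLSTAT's exact computations are not described in this text.

```python
import numpy as np

rng = np.random.default_rng(3)
active = rng.normal(size=(50, 3)) @ np.array([[1.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.5],
                                              [0.0, 0.0, 1.0]])
supplementary = rng.normal(size=(5, 3))      # extra observations, not used to fit
supp_var = rng.normal(size=50)               # extra quantitative variable

# The axes are computed from the active observations only.
mean = active.mean(axis=0)
std = active.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(active, rowvar=False))
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

active_scores = ((active - mean) / std) @ eigvecs

# Supplementary observations: same centering, scaling and axes as the active data.
supp_scores = ((supplementary - mean) / std) @ eigvecs

# Supplementary quantitative variable: its correlation with each axis.
supp_var_coords = np.array([np.corrcoef(supp_var, active_scores[:, j])[0, 1]
                            for j in range(active_scores.shape[1])])
```

The supplementary elements receive coordinates in the PCA space but have no influence on the axes themselves.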
Rotations: Varimax and others
Rotations can be applied to the factors. Several methods are available, including Varimax, Quartimax, Equamax, Parsimax, Quartimin, Oblimin and Promax.
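As an illustration of what an orthogonal rotation does, here is a minimal sketch of the widely used SVD-based iteration for the raw Varimax criterion. It is not XLSTAT's code; the function name, the stopping rule and the example loadings are assumptions, and no Kaiser normalization is applied.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a loadings matrix to maximize the raw Varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target of the Varimax criterion.
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                    # closest orthogonal matrix
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break                            # no meaningful improvement left
        criterion = new_criterion
    return loadings @ rotation, rotation

def varimax_criterion(L):
    """Sum over axes of the variance of the squared loadings."""
    return (L**2).var(axis=0).sum()

# Hypothetical loadings: a simple structure blurred by a 30 degree rotation.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.85, 0.1],
                   [0.0, 0.9], [0.1, 0.8], [0.0, 0.85]])
angle = np.deg2rad(30)
mix = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle), np.cos(angle)]])
loadings = simple @ mix

rotated, rotation = varimax(loadings)
print(np.round(rotated, 2))
```

After rotation each variable loads mainly on one axis, which is what makes rotated factors easier to interpret, while the row sums of squared loadings (communalities) are unchanged.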
Results for Principal Component Analysis in XLSTAT
The XLSTAT PCA feature provides results relative to variables and to observations.
What are the Correlation/Covariance matrices?
This table shows the correlation or covariance matrix used afterwards in the calculations. The type of matrix depends on the option chosen in the General tab of the dialog box. For correlations, significant correlations are displayed in bold.
What is Bartlett's sphericity test in PCA?
The results of Bartlett's sphericity test are displayed. They are used to decide whether or not to reject the null hypothesis that the variables are not correlated.
XLSTAT also provides the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy.
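For readers who want to see what these two diagnostics compute, here is a sketch based on the usual textbook formulas: Bartlett's statistic is -(n - 1 - (2p + 5)/6) ln|R| with p(p - 1)/2 degrees of freedom, and the overall KMO index compares squared correlations with squared partial correlations. The function names and data are hypothetical, and XLSTAT's output may be presented differently.

```python
import numpy as np

def bartlett_sphericity(X):
    """Bartlett's sphericity statistic and its degrees of freedom."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)         # log of det(R), numerically stable
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    return statistic, dof

def kmo(X):
    """Overall Kaiser-Meyer-Olkin index."""
    R = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)          # partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)    # off-diagonal mask
    r2 = (R[off] ** 2).sum()
    a2 = (partial[off] ** 2).sum()
    return r2 / (r2 + a2)

rng = np.random.default_rng(4)
latent = rng.normal(size=(200, 2))
correlated = latent @ np.array([[1.0, 0.8, 0.1, 0.0],
                                [0.0, 0.3, 1.0, 0.9]]) + 0.2 * rng.normal(size=(200, 4))
independent = rng.normal(size=(200, 4))

stat_corr, dof = bartlett_sphericity(correlated)
stat_ind, _ = bartlett_sphericity(independent)
kmo_corr = kmo(correlated)
print("Bartlett statistic (correlated):", stat_corr, "df:", dof)
print("Bartlett statistic (independent):", stat_ind)
print("KMO (correlated):", kmo_corr)
```

The statistic is compared with a chi-square distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.chi2.sf(statistic, dof)` gives the p-value.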
What are Eigenvalues and inertia?
Each eigenvalue measures the amount of information (inertia) summarized by the corresponding dimension. The first dimension contains the highest amount of inertia, followed by the second, then the third, and so on. XLSTAT displays eigenvalues in a table and in a chart (scree plot). The number of eigenvalues displayed is equal to the number of non-null eigenvalues.
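The sketch below builds the kind of eigenvalue table described above (eigenvalue, percentage of inertia, cumulative percentage) and shows why some eigenvalues can be null: in this made-up dataset the third variable is an exact sum of the first two, so only two dimensions carry information. The tolerance `1e-10` is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.normal(size=80)
b = rng.normal(size=80)
X = np.column_stack([a, b, a + b])           # third column is redundant

eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]  # descending
percent = 100 * eigvals / eigvals.sum()      # share of inertia per axis
cumulative = np.cumsum(percent)

n_nonnull = int((eigvals > 1e-10).sum())     # eigenvalues that are not numerically zero
for i in range(n_nonnull):
    print(f"F{i + 1}: eigenvalue={eigvals[i]:.3f}  "
          f"inertia={percent[i]:.1f}%  cumulative={cumulative[i]:.1f}%")
```

Plotting the non-null eigenvalues against their rank gives the scree plot.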
What are contributions?
Contributions (also called absolute contributions) represent the extent to which each variable contributed to building the corresponding PCA axis. They help in the interpretation.
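One common definition, assumed here, expresses the contribution of a variable to an axis as its squared loading divided by the eigenvalue of that axis, in percent. The data and names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(6)
latent = rng.normal(size=(100, 2))
X = latent @ np.array([[1.0, 0.8, 0.1, 0.0],
                       [0.0, 0.3, 1.0, 0.9]]) + 0.2 * rng.normal(size=(100, 4))

eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs * np.sqrt(eigvals)        # variable coordinates on the axes
contributions = 100 * loadings**2 / eigvals  # rows: variables, columns: axes

print(np.round(contributions[:, :2], 1))     # contributions to the first two axes
```

Each column sums to 100, so a variable whose contribution is well above 100/p (here 25%) is one that largely defines the axis.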
How to interpret Squared cosines for the variables
Squared cosines reflect the representation quality of a variable on a PCA axis. As in other factor methods, squared cosine analysis is used to avoid interpretation errors due to projection effects. If the squared cosine of a variable on an axis is low, the position of the variable on this axis should not be interpreted.
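Under the usual definition, assumed here, the squared cosine of a variable on an axis is its squared coordinate on that axis divided by the sum of its squared coordinates over all axes. A minimal sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(7)
latent = rng.normal(size=(100, 2))
X = latent @ np.array([[1.0, 0.8, 0.1, 0.0],
                       [0.0, 0.3, 1.0, 0.9]]) + 0.2 * rng.normal(size=(100, 4))

eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs * np.sqrt(eigvals)        # variable coordinates on the axes
cos2 = loadings**2 / (loadings**2).sum(axis=1, keepdims=True)

# Quality of representation of each variable on the first two axes together.
quality_2d = cos2[:, :2].sum(axis=1)
print(np.round(quality_2d, 3))
```

For a PCA on the correlation matrix, each variable has unit length, so its squared cosine on an axis is simply its squared correlation with that axis. A variable with a low `quality_2d` is poorly represented on the first two axes and should be read with caution there.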
What are factor scores?
Factor scores are the coordinates of the observations on the PCA dimensions. They are displayed in a table in XLSTAT. If supplementary data have been selected, these are displayed at the end of the table.
As for the results related to variables, XLSTAT displays observations contributions (i.e. their contribution in building the PCA axes) as well as squared cosines (i.e. their representation quality on the different axes).
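The three observation-level results (factor scores, contributions and squared cosines) can be sketched as follows, assuming equal weights for all observations. The dataset is hypothetical and the formulas are the standard ones, not a description of XLSTAT's internals.

```python
import numpy as np

rng = np.random.default_rng(8)
latent = rng.normal(size=(60, 2))
X = latent @ np.array([[1.0, 0.8, 0.1, 0.0],
                       [0.0, 0.3, 1.0, 0.9]]) + 0.2 * rng.normal(size=(60, 4))
n = X.shape[0]

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

F = Z @ eigvecs                                    # factor scores
contributions = 100 * F**2 / (F**2).sum(axis=0)    # share of each axis due to each observation
cos2 = F**2 / (F**2).sum(axis=1, keepdims=True)    # representation quality per axis

# Observations contributing most to the first axis.
top = np.argsort(contributions[:, 0])[::-1][:3]
print("Largest contributors to F1:", top)
```

An observation with a very high contribution may be an atypical point that pulls the axis towards itself.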
Results with rotations
Where a rotation has been requested, the results of the rotation are displayed with the rotation matrix first applied to the factor loadings. This is followed by the modified variability percentages associated with each of the axes involved in the rotation. The coordinates, contributions and cosines of the variables and observations after rotation are displayed in the following tables.
XLSTAT charts for Principal Component Analysis in Excel
The correlation circle or variables chart
The correlation circle (or variables chart) shows the correlations between the components and the initial variables. Supplementary variables can also be displayed in the shape of vectors.
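For a PCA on the correlation matrix, the coordinates plotted on the correlation circle can be computed in two equivalent ways, as this sketch on made-up data checks: directly as correlations between variables and factor scores, or as eigenvectors scaled by the square root of the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(9)
latent = rng.normal(size=(100, 2))
X = latent @ np.array([[1.0, 0.8, 0.1, 0.0],
                       [0.0, 0.3, 1.0, 0.9]]) + 0.2 * rng.normal(size=(100, 4))
p = X.shape[1]

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
F = Z @ eigvecs

# Route 1: correlation of every variable with every component.
corr = np.array([[np.corrcoef(X[:, j], F[:, k])[0, 1] for k in range(p)]
                 for j in range(p)])

# Route 2: eigenvectors scaled by the square roots of the eigenvalues.
loadings = eigvecs * np.sqrt(eigvals)

# Coordinates on the (F1, F2) circle and distance of each variable from the origin.
circle_xy = corr[:, :2]
radius = np.sqrt((circle_xy**2).sum(axis=1))
print(np.round(radius, 3))
```

A variable whose point lies close to the unit circle is well represented on the two axes; one close to the origin is not.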
The Observations charts
The observations charts represent the observations in the PCA space.
The biplots represent the observations and variables simultaneously in the new space. Here as well the supplementary variables can be plotted in the form of vectors. There are different types of biplots:
- Correlation biplot
- Distance biplot
- Symmetric biplot
XLSTAT lets you choose the coefficient whose square root is multiplied by the coordinates of the variables. This coefficient lets you adjust the position of the variable points in the biplot in order to make it more readable. If it is set to a value other than 1, the length of the variable vectors can no longer be interpreted as standard deviation (correlation biplot) or contribution (distance biplot).
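The biplot types differ in how the singular values are shared between observation and variable coordinates. The sketch below follows the classical SVD-based construction, with an exponent `alpha` that is an assumption of this example: 1 for a distance biplot, 0 for a correlation biplot, 0.5 for a symmetric one. XLSTAT's exact scaling conventions and its adjustable coefficient are not specified here and may differ.

```python
import numpy as np

rng = np.random.default_rng(10)
latent = rng.normal(size=(40, 2))
X = latent @ np.array([[1.0, 0.8, 0.1, 0.0],
                       [0.0, 0.3, 1.0, 0.9]]) + 0.2 * rng.normal(size=(40, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Singular value decomposition of the standardized data: Z = U diag(s) Vt.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

def biplot_coordinates(alpha):
    """Split the singular values between observations (alpha) and variables (1 - alpha)."""
    obs = U * s**alpha
    var = Vt.T * s**(1 - alpha)
    return obs, var

errors = {}
for name, alpha in [("distance", 1.0), ("symmetric", 0.5), ("correlation", 0.0)]:
    obs, var = biplot_coordinates(alpha)
    # Whatever the split, the product of the two sets of coordinates rebuilds Z.
    errors[name] = np.abs(obs @ var.T - Z).max()
    print(name, "biplot, first observation on (F1, F2):", np.round(obs[0, :2], 3))
```

All three variants represent the same data; only the relative scaling of points and vectors changes, which is why vector lengths have a different interpretation in each type.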
Tutorials on how to run PCA in Excel using the XLSTAT software
This tutorial will help you run a Principal Component Analysis within Excel using the XLSTAT software.