Running a Principal Component Analysis (PCA) with XLSTAT

Dataset for Principal Component Analysis (PCA) XLS185 KB

Tutorial video
Principal Component Analysis (PCA) is part of: Download Trial version More details See users' feedback
  • Pro Core statistical software

  • System configuration

    • Windows:
      • Versions: 9x/Me/NT/2000/XP/Vista/Win 7
      • Excel: 97 and later
      • Processor: 32 or 64 bits
      • Hard disk: 150 Mb
    • Mac OS X:
      • OS: OS X
      • Excel: X, 2004 and 2011
      • Hard disk: 150Mb.

Benefits

  • Easy and user-friendly
    Easy and user-friendly XLSTAT is flawlessly integrated with Microsoft Excel which is the most popular spreadsheet worldwide. This integration makes it one of the simplest available tools to work with as it utilizes the same philosophy as Microsoft Excel. The program is accessible in a dedicated XLSTAT tab. The analyses are grouped into functional menus. The dialog boxes are user-friendly and setting up an analysis is straightforward.
  • Data and results shared seamlessly
    Data and results shared seamlessly One of the greatest advantages of XLSTAT is the way you can share data and results seamlessly. As the results are stored in Microsoft Excel, anyone can access them. There is no need for the receiver to have an XLSTAT license or any additional viewer which makes your team-work easier and more affordable. In addition, results are easily integrable into other Microsoft Office software such as PowerPoint, so that you can create striking presentation in minutes.
  • Modular
    Modular XLSTAT is a modular product. XLSTAT-Pro is a core statistical module of XLSTAT which includes all the mainstream functionalities in statistics and multivariate analysis. More advanced features contained in add-on modules can be added for specific applications. This way you can adapt the software to your needs making the software more cost-efficient.
  • Didactic
    Didactic The results of XLSTAT are organized by analysis and are easy to navigate. Moreover useful information is provided along with the results to assist you in your interpretation.
  • Affordable
    Affordable XLSTAT is a complete and modular analytical solution that can suit any analytical business needs. It is very reasonably priced so that the return of your investment is almost immediate. Any XLSTAT license comes with top level support and assistance.
  • Accessible - Available in many languages
    Accessible - Available in many languages We have ensured XLSTAT is accessible to everyone by making the program available in many languages, including Chinese, English, French, German, Italian, Japanese, Polish, Portuguese and Spanish.
  • Automatable and customizable
    Automatable and customizable Most of the statistical functions available in XLSTAT can be called directly from the Visual Basic window of Microsoft Excel. They can be modified and integrated to more code to fit to the specificity of your domain. Adding tables and plots as well as modifying existing outputs becomes easy. Furthermore, XLSTAT includes some special tools on the dialog boxes to generate automatically the VBA code in order to reproduce your analysis using the VBA editor or to simply load pre-set settings. This effortless automation of routine analysis will be a huge time saver on your part.

Dataset for running a Principal Component Analysis

An Excel sheet containing both the data and the results for use in this tutorial can be downloaded by clicking here.

The data are from the US Census Bureau and describe the changes in the population of 51 states between 2000 and 2001. The initial dataset has been transformed to rates per 1000 inhabitants, with the data for 2001 serving as the focus for the analysis.

Goal of this Principal Component Analysis

Our goal is to analyze the correlations between the variables and to find out if the changes in population in some states are very different from the ones in other states. This example is also used in our Hierarchical Clustering tutorial.

Principal Component Analysis

Principal Component Analysis is a very useful method to analyze numerical data structured in a M observations / N variables table. It allows to:

  • Quickly visualize and analyze correlations between the N variables,
  • Visualize and analyze the M observations (initially described by the N variables) on a low dimensional map, the optimal view for a variability criterion,
  • Build a set of P uncorrelated factors (P

The limits of Principal Component Analysis stem from the fact that it is a projection method, and sometimes the visualization can lead to false interpretations. There are however some tricks to avoid these pitfalls.

Setting up a Principal Component Analysis

Once XLSTAT-Pro is activated, select the XLSTAT / Analyzing data / Principal components analysis command, or click on the corresponding button of the Analyzing Data toolbar (see below).

Principal Component AnalysisPrincipal Component Analysis software

The Principal Component Analysis dialog box will appear.

Select the data on the Excel sheet.

Note: There are several ways of selecting data with XLSTAT - for further information, please check the section on selecting data.

In this example, the data start from the first row, so it is quicker and easier to use columns selection. This explains why the letters corresponding to the columns are displayed in the selection boxes.

The Data format chosen is Observations/variables because of the format of the input data.

The PCA type that will be used during the computations is the Pearson's correlation matrix, which corresponds to the classical correlation coefficient.

Principal Component Analysis software

In the Outputs tab, we choose to activate the option to display significant correlations in bold characters ("Test significancy").

Principal Component Analysis software

In the Charts tab, in order to display the labels on all charts, and to display all the observations (observations charts and biplots), the filtering option is unchecked. If there is a lot of data, displaying the labels might slow down the global display of the results. Displaying all the observations might make the of the results unreadable. In these cases, filtering the observations to display is recommended.

Principal Component Analysis

The computations begin once you have clicked on OK. You are asked to confirm the number of rows and columns.

Note: This message can be bypassed by un-selecting the "Ask for selections confirmation" in the XLSTAT options panel.

Then you should confirm the axes for which you want to display plots. In this example, the percentage of variability represented by the first two factors is not very high (67.72%); to avoid a misinterpretation of the results, we have decided to complement the results with a second chart on axes 1 and 3.

Principal Component Analysis menu PC1 and PC2’’’’Principal Component Analysis

Interpreting the results of a Principal Component Analysis

The first result to look at is the correlation matrix. We can see right away that the rates of people below and above 65 are negatively correlated (r = -1). Either of the two variables could have been removed without effect on the quality of the results. We can also see that the Net Domestic Migration has low correlation with the other variables, including the Net International migration. This means that U.S. nationals and non-nationals may be moving to a state for different sets of reasons.

Principal Component Analysis

The next table and the corresponding chart are related to a mathematical object, the eigenvalues, which reflect the quality of the projection from the N-dimensional initial table (N=7 in this example) to a lower number of dimensions. In this example, we can see that the first eigenvalue equals 3.567 and represents 51% of the total variability. This means that if we represent the data on only one axis, we will still be able to see % of the total variability of the data.

Each eigenvalue corresponds to a factor, and each factor to a one dimension. A factor is a linear combination of the initial variables, and all the factors are un-correlated (r=0). The eigenvalues and the corresponding factors are sorted by descending order of how much of the initial variability they represent (converted to %).

Principal Component Analysis softwarePrincipal Component Analysis software

Ideally, the first two or three eigenvalues will correspond to a high % of the variance, ensuring us that the maps based on the first two or three factors are a good quality projection of the initial multi-dimensional table. In this example, the first two factors allow us to represent 67.72% of the initial variability of the data. This is a good result, but we'll have to be careful when we interpret the maps as some information might be hidden in the next factors. We can see here that although we initially had 7 variables, the number of factors is 6. This is due to the two age variables, which are negatively correlated (-1). The number of "useful" dimensions has been automatically detected.

The first map is called the correlation circle (below on axes F1 and F2). It shows a projection of the initial variables in the factors space. When two variables are far from the center, then, if they are: Close to each other, they are significantly positively correlated (r close to 1); If they are orthogonal, they are not correlated (r close to 0); If they are on the opposite side of the center, then they are significantly negatively correlated (r close to -1).

When the variables are close to the center, it means that some information is carried on other axes, and that any interpretation might be hazardous. For example, we might be tempted to interpret a correlation between the variables Net Domestic migration and Net International Migration although, in fact, there is none. This can be confirmed either by looking at the correlation matrix or by looking at the correlation circle on axes F1 and F3.

Principal Component Analysis

The correlation circle is useful in interpreting the meaning of the axes. In this example, the horizontal axis is linked with age and population renewal, and the vertical axis with domestic migration. These trends will be helpful in interpreting the next map. To confirm that a variable is well linked with an axis, take a look at the squared cosines table: the greater the squared cosine, the greater the link with the corresponding axis. The closer the squared cosine of a given variable is to zero, the more careful you have to be when interpreting the results in terms of trends on the corresponding axis. Looking at this table we can see that the trends for international migration would be best viewed on a F2/F3 map.

Principal Component Analysis

The next chart can be the ultimate goal of the Principal Component Analysis (PCA). It enables you to look at the data on a two- dimensional map, and to identify trends. We can see that the demographics of Nevada and Florida are unique, as are the demographics of Utah and Alaska, two states that share common characteristics. Going back to the table, we can confirm that Utah and Alaska have a low population rate of people over age 65. Utah has the highest birth rate in the U.S., and Alaska ranks high as well.

Principal Component Analysis software

Watch this video to see how the settings were performed.

Click to view a 3D visualization on the first three axes generated by XLSTAT-3DPlot.

Note on the usage of Principal Component Analysis

Principal component analysis is often performed before a regression, to avoid using correlated variables, or before clustering the data, to have a better overview of the variables. The number of clusters might sometimes be a simple guess based on the maps. The above demographic data have also been used in the tutorial on hierarchical clustering. The ">65 pop" variable has been removed as its inclusion would double the weight of the age variables in the analysis.