Creating histograms and fitting a distribution with XLSTAT

Dataset for Histograms XLS1010 KB

Tutorial video
  • Pro Core statistical software

  • System configuration

    • Windows:
      • Versions: 9x/Me/NT/2000/XP/Vista/Win 7
      • Excel: 97 and later
      • Processor: 32 or 64 bits
      • Hard disk: 150 Mb
    • Mac OS X:
      • OS: OS X
      • Excel: X, 2004 and 2011
      • Hard disk: 150Mb.

Benefits

  • Easy and user-friendly
    Easy and user-friendly XLSTAT is flawlessly integrated with Microsoft Excel which is the most popular spreadsheet worldwide. This integration makes it one of the simplest available tools to work with as it utilizes the same philosophy as Microsoft Excel. The program is accessible in a dedicated XLSTAT tab. The analyses are grouped into functional menus. The dialog boxes are user-friendly and setting up an analysis is straightforward.
  • Data and results shared seamlessly
    Data and results shared seamlessly One of the greatest advantages of XLSTAT is the way you can share data and results seamlessly. As the results are stored in Microsoft Excel, anyone can access them. There is no need for the receiver to have an XLSTAT license or any additional viewer which makes your team-work easier and more affordable. In addition, results are easily integrable into other Microsoft Office software such as PowerPoint, so that you can create striking presentation in minutes.
  • Modular
    Modular XLSTAT is a modular product. XLSTAT-Pro is a core statistical module of XLSTAT which includes all the mainstream functionalities in statistics and multivariate analysis. More advanced features contained in add-on modules can be added for specific applications. This way you can adapt the software to your needs making the software more cost-efficient.
  • Didactic
    Didactic The results of XLSTAT are organized by analysis and are easy to navigate. Moreover useful information is provided along with the results to assist you in your interpretation.
  • Affordable
    Affordable XLSTAT is a complete and modular analytical solution that can suit any analytical business needs. It is very reasonably priced so that the return of your investment is almost immediate. Any XLSTAT license comes with top level support and assistance.
  • Accessible - Available in many languages
    Accessible - Available in many languages We have ensured XLSTAT is accessible to everyone by making the program available in many languages, including Chinese, English, French, German, Italian, Japanese, Polish, Portuguese and Spanish.
  • Automatable and customizable
    Automatable and customizable Most of the statistical functions available in XLSTAT can be called directly from the Visual Basic window of Microsoft Excel. They can be modified and integrated to more code to fit to the specificity of your domain. Adding tables and plots as well as modifying existing outputs becomes easy. Furthermore, XLSTAT includes some special tools on the dialog boxes to generate automatically the VBA code in order to reproduce your analysis using the VBA editor or to simply load pre-set settings. This effortless automation of routine analysis will be a huge time saver on your part.

What you will do in this tutorial

First by using the XLSTAT tool allowing to create histograms, and then by using the distribution fitting tool, we want to test if the sample (in a statistical sense) follows a negative binomial distribution or not. Usually, the negative binomial distribution represents well the aggregation/dispersion phenomenon of bacteria in water environments.

Data to create an histogram and fit a distribution

An Excel sheet with both the data and the results can be downloaded by clicking here.

The data correspond to an experiment where 200 samples of water from a river were cultured on medium with nutrients to determine the presence or absence of bacterial contamination with Escherichia coli. The number of colonies has been counted after 72 hours of incubation. In the Bact-Data column you will find the counts for the 200 samples.

Setting up the dialog box to create an histogram

After opening XLSTAT, select the XLSTAT / Describing data / Histograms command, or click on the corresponding button of the Describing data toolbar (see below).

bardesc1.gif

Once you've clicked on the button, the dialog box appears. Select the data on the Excel sheet.

The Data are in the B column. We activate the discrete option because the counts are discrete values. The Sample labels option is left activated because the first row of the data selection contains the name of the sample.

histo1.gifhisto2.gifhisto3.gif

The computations begin once you have clicked on the OK button. The results will then be displayed.

Interpreting a histogram

After some summary statistics, the histogram is displayed on sheet Histogram, followed by a table where the statistics of the histogram are available.

histo4.gif

On the histogram we can see that the most frequent value is 0, which represents over 20% of the data. That is, in more than one sample out of five, no bacteria has been found. We also notice that the frequency decreases quickly. In one sample, over 36 colonies have been counted.

The following video shows how to do it.

Creating a histogram specifying the bounds of the intervals

Because we want to test the fit between the negative binomial distribution function and the sample, (the Chi-square test requires that there is are least 5 data in a class), and because the uncertain precision of the counts of the bacteria, it seems necessary to group the counts into larger classes. For that reason, we created a list of bounds that seemed coherent with our problem: 0,1,2,3,4,5,10,15,20,40.

In order to verify if the frequencies of the new classes are greater than 5 and decrease regularly, we create a new histogram, specifying this time the bounds of the intervals.

To activate this tool, select the XLSTAT / Preparing data / Discretization command, or click on the corresponding button of the Discretization toolbar (see below).

histo5.gif

The computations begin once you have clicked on the OK button, and the new histogram appears (see in sheet "Histogram1").

histo6.gif

The following video shows you how to reproduce those results.

As we are satisfied by this result, we can now use the distribution fitting tool to test if the sample follows a negative binomial distribution.

Setting up the dialog box to fit a distribution

To activate this tool, select the XLSTAT / Modeling data / Distribution fitting command, or click on the corresponding button of the Modeling Data toolbar (see below).

bardfit1.gif

Once you've clicked on the button, the dialog box appears. Select the data on the Excel sheet. The Data are in the B column. We let XLSTAT estimate the parameters of the negative binomial distribution function.

XLSTAT offers two different formulations of the negative binomial distribution. The one that is adapted to our case is the second one.

histo7.gif

We activate the options for the Kolmogorov-Smirnov and the Goodness of Chi-square tests, which are necessary to test our assumption. For the Chi-square test, we use the bounds that we defined above.

histo8.gif

The following chart options have been selected.

histo9.gif

Interpreting the results of a distribution fitting analysis

The first result of interest for us is the value of the k and p parameters of the negative binomial distribution (fitted using the maximum likelihood method), and the estimates of the sample and theoretical mean, variance, skewness and kurtosis. The closer these statistics obtained from the data and from the parameters, the better the fit. Here, the fit is excellent. Note: the theoretical mean is given by kp, and the variance by kp(p+1).

histo10.gifhisto11.gif

The Chi-square goodness of fit test allows to test if the Chi-square distance between the empirical and theoretical distribution functions is above a critical value or not. A visual comparison between the observed and theoretical frequencies is available on the next figure.

histo12.gif

For classes 1, 6 and 7, there seems to be a slight difference. In spite of this small difference, the p-value computed for the test (0.767) is significantly higher than the significance level we have chosen (0.05). Therefore, the Chi-square test confirms our hypothesis that the data follow a negative binomial distribution.

As a conclusion, the presence of the bacteria of interest in the river in which the sample were collected, is follows a negative binomial distribution (k=0.839, p=5.763), with a mean of 4.8 and a variance of 32.7.

The following video shows you how to do the fitting of the distribution.