Data sampling

Use this tool to generate a subsample of observations from a set of univariate or multivariate data. It is available in Excel through the XLSTAT statistical add-on.

Use of data sampling

Sampling is a fundamental technique in data analysis and statistics. It consists of extracting a sample of N observations from a dataset.

Samples are generated to:

  1. Test a hypothesis on one sample, then test it on another;
  2. Obtain smaller tables which retain the properties of the original table.

To meet these different situations, several methods have been proposed.

XLSTAT data sampling options

XLSTAT offers the following methods for generating a sample of N observations from a table of M rows:

  • N first rows: The sample obtained is taken from the first N rows of the initial table. 
  • N last rows: The sample obtained is taken from the last N rows of the initial table. This method should only be used if it is certain that the values have not been sorted according to a particular criterion which could introduce bias into the analysis.
  • N every s starting at k: The sample is built by extracting N rows, one every s rows, starting at row k.
  • Random without replacement: Observations are chosen at random and may occur only once in the sample.
  • Random with replacement: Observations are chosen at random and may occur several times in the sample.
  • Systematic from random start: Starting from the j-th observation in the initial table, one observation is extracted every k observations to be used in the sample. j is chosen at random from among a number of possible values that depends on the size of the initial table and the size of the final sample. k is determined such that the observations extracted are as spaced out as possible.
  • Systematic centered: Observations are chosen systematically in the centers of N sequences of observations of length k.
  • Random stratified (1): Rows are chosen at random within N sequences of observations of equal length, where the length is determined by dividing the number of observations by the requested sample size.
  • Random stratified (2): Rows are chosen at random within N strata defined by the user. In each stratum, the number of sampled observations is proportional to the relative frequency of the stratum.
  • Random stratified (3): Rows are chosen at random within N strata defined by the user. In each stratum, the number of sampled observations is proportional to a relative frequency supplied by the user.
  • User defined: A variable indicates the frequency of each observation within the output sample.
  • Training and test sets: Data are split into two parts: a training set and a test set. The rows of each set are randomly drawn from the initial dataset. The size of the training set is defined as a number of rows.
  • Training and test sets (%): Data are split into two parts: a training set and a test set. The rows of each set are randomly drawn from the initial dataset. The size of the training set is defined as a percentage of the rows in the initial dataset.
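
Outside of XLSTAT, several of the methods listed above can be illustrated with a short Python sketch using only the standard library. This is not XLSTAT's implementation: the function names are hypothetical, the spacing k is assumed to be the integer part of M / N, and the random start j is assumed to be drawn among the first k rows. XLSTAT may choose these values differently.

```python
import random

def every_s_from_k(rows, n, s, k):
    """N every s starting at k: n rows, one every s rows, from 1-based row k."""
    return rows[k - 1::s][:n]

def systematic_random_start(rows, n, rng):
    """Systematic from random start (assumed: k = M // n, j drawn in [0, k))."""
    k = len(rows) // n
    j = rng.randrange(k)
    return [rows[j + i * k] for i in range(n)]

def systematic_centered(rows, n):
    """Systematic centered: the middle row of each of n sequences of length k."""
    k = len(rows) // n
    return [rows[i * k + k // 2] for i in range(n)]

def stratified_sequences(rows, n, rng):
    """Random stratified (1): one random row in each of n sequences of length k."""
    k = len(rows) // n
    return [rows[i * k + rng.randrange(k)] for i in range(n)]

def stratified_proportional(rows, strata, n, rng):
    """Random stratified (2): per-stratum quota proportional to stratum frequency.

    Quotas are rounded, so the total may differ slightly from n.
    """
    groups = {}
    for row, label in zip(rows, strata):
        groups.setdefault(label, []).append(row)
    sample = []
    for members in groups.values():
        quota = round(n * len(members) / len(rows))
        sample.extend(rng.sample(members, quota))
    return sample

def train_test_split(rows, n_train, rng):
    """Training and test sets: shuffle a copy, then cut at n_train rows."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

rng = random.Random(0)            # fixed seed so the draws are repeatable
rows = list(range(1, 101))        # a table of M = 100 rows, numbered 1..100

first_rows = rows[:10]                        # N first rows
last_rows = rows[-10:]                        # N last rows
without_repl = rng.sample(rows, 10)           # random without replacement
with_repl = rng.choices(rows, k=10)           # random with replacement
print(every_s_from_k(rows, 5, 10, 3))         # [3, 13, 23, 33, 43]
print(systematic_centered(rows, 5))           # [11, 31, 51, 71, 91]
train, test = train_test_split(rows, 70, rng) # 70 training rows, 30 test rows
```

The percentage variant of the training and test split reduces to the same function once the percentage is converted to a row count, for example `int(0.7 * len(rows))`.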