# Differential expression

Differential expression analysis is used in the omics field to help identify features that are affected by an explanatory variable. Available in Excel with XLSTAT.

## What is differential expression?

Differential expression analysis identifies features (genes, proteins, metabolites…) that are significantly affected by explanatory variables. For example, we might be interested in identifying proteins that are differentially expressed between healthy and diseased individuals. In this kind of study, datasets are often very large (high-throughput data). At this stage, we may talk about *omics* data analyses, in reference to analyses performed over the genome (gen*omics*), the transcriptome (transcript*omics*), the proteome (prote*omics*), the metabolome (metabol*omics*), etc.

To test whether features are differentially expressed, we often use traditional statistical tests. However, the size of the data may cause problems in terms of computation time as well as readability and statistical reliability of the results. These tests must therefore be slightly adapted to overcome these problems.

## Statistical tests

The statistical tests proposed in the differential expression tool in XLSTAT are traditional parametric or non-parametric tests (Student t-test, ANOVA, Mann-Whitney, Kruskal-Wallis). The selected test is run once per feature, so each feature receives its own p-value.
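As an illustration of running one test per feature, here is a minimal Python sketch of a two-sided Mann-Whitney test based on the normal approximation, without tie or continuity correction. It is not XLSTAT's implementation; the function names (`average_ranks`, `mann_whitney_p`) and the toy data are hypothetical, and exact p-values are usually preferred for samples this small.

```python
import math
from statistics import NormalDist

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def mann_whitney_p(group1, group2):
    """Two-sided p-value, normal approximation (assumption: no tie correction)."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Toy data (made up): one list of measurements per feature and per group.
healthy = {"protein_a": [1, 2, 3, 4, 5], "protein_b": [1, 3, 5, 7, 9]}
diseased = {"protein_a": [6, 7, 8, 9, 10], "protein_b": [2, 4, 6, 8, 10]}

# One test, hence one p-value, per feature.
p_values = {name: mann_whitney_p(healthy[name], diseased[name]) for name in healthy}
print(p_values)
```

In this toy example, `protein_a` is completely separated between the two groups and gets a small p-value, while `protein_b` is interleaved and does not.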

## Post-hoc corrections

The p-value represents the risk that we take of being wrong when stating that an effect is statistically significant. Running many tests increases the number of computed p-values, and with it the risk of detecting effects that do not exist in reality. With a significance level alpha of 5%, we would expect about 5 significant p-values by chance alone out of 100 computed p-values if no feature were truly affected. When working with high-throughput data, we often test the effect of an explanatory variable on the expression of thousands of genes, thus generating thousands of p-values. Consequently, p-values should be corrected (increased, i.e. penalized) as their number grows. XLSTAT proposes three common p-value correction methods:

**Benjamini-Hochberg**: with this procedure, p-values increase both with their number and with the proportion of non-significant p-values. It belongs to the FDR (False Discovery Rate) family of correction procedures. The Benjamini-Hochberg correction is not very conservative (not very severe). It is therefore suited to situations where we are looking for a large number of genes that are likely to be affected by the explanatory variables. It is widely used in differential expression studies.

The corrected p-value according to the Benjamini-Hochberg procedure is defined by:

p_{BenjaminiHochberg} = min( p * nbp / j , 1 )

where p is the original (uncorrected) p-value, nbp is the number of computed p-values in total and j is the rank of the original p-value when p-values are sorted in ascending order.
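The formula can be applied as in the following minimal Python sketch. The function name `benjamini_hochberg` and the example p-values are hypothetical; tied p-values simply receive consecutive ranks in input order, which is an assumption of this sketch.

```python
def benjamini_hochberg(p_values):
    """Apply p * nbp / j, capped at 1, where j is the ascending rank of p."""
    nbp = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(nbp), key=lambda i: p_values[i])
    corrected = [0.0] * nbp
    for rank, idx in enumerate(order, start=1):
        corrected[idx] = min(p_values[idx] * nbp / rank, 1.0)
    return corrected

raw = [0.5, 0.001, 0.03, 0.02]  # made-up p-values, in arbitrary order
print(benjamini_hochberg(raw))  # corrected values, in the same order as `raw`
```

Many implementations (R's `p.adjust`, for instance) additionally take a running minimum from the largest p-value downwards so that corrected values keep the same order as the raw ones. This sketch follows the formula exactly as written above and omits that step.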

**Benjamini-Yekutieli**: with this procedure, p-values increase both with their number and with the proportion of non-significant p-values. It belongs to the FDR (False Discovery Rate) family of correction procedures. Unlike the Benjamini-Hochberg approach, it takes into account a possible dependence between the tested features, which makes it more conservative than Benjamini-Hochberg. However, it is far less stringent than the Bonferroni approach described below.

The corrected p-value according to the Benjamini-Yekutieli procedure is defined by:

p_{BenjaminiYekutieli} = min( ( p * nbp * ∑_{i=1…nbp} 1/i ) / j , 1 )

where p is the original p-value, nbp is the number of computed p-values in total and j is the rank of the original p-value when p-values are sorted in ascending order.
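A minimal Python sketch of this formula follows; the function name `benjamini_yekutieli` and the example values are hypothetical. The only difference from the Benjamini-Hochberg formula is the harmonic sum, which grows with the number of p-values and acts as an extra penalty.

```python
def benjamini_yekutieli(p_values):
    """Apply p * nbp * (1 + 1/2 + ... + 1/nbp) / j, capped at 1."""
    nbp = len(p_values)
    # Harmonic sum over i = 1..nbp: the additional factor compared with BH.
    harmonic = sum(1.0 / i for i in range(1, nbp + 1))
    # Indices of the p-values sorted in ascending order, giving the rank j.
    order = sorted(range(nbp), key=lambda i: p_values[i])
    corrected = [0.0] * nbp
    for rank, idx in enumerate(order, start=1):
        corrected[idx] = min(p_values[idx] * nbp * harmonic / rank, 1.0)
    return corrected

raw = [0.001, 0.02, 0.03, 0.5]  # made-up p-values
print(benjamini_yekutieli(raw))
```

With four p-values the harmonic sum is 25/12 (about 2.08), so every corrected value is roughly twice what the Benjamini-Hochberg formula would give, before capping at 1.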

**Bonferroni**: p-values increase only with their number. This procedure is very conservative. It belongs to the FWER (Familywise Error Rate) family of correction procedures. It is rarely used in differential expression analyses, but it is useful when the goal of the study is to select a very small number of differentially expressed features.

The corrected p-value according to the Bonferroni procedure is defined by:

p_{Bonferroni} = min( p * nbp, 1 )

where p is the original p-value and nbp is the number of computed p-values in total.
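In Python, this correction could be sketched as below. The function name `bonferroni`, the example p-values and the 5% significance level are illustrative assumptions.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of p-values, capped at 1."""
    nbp = len(p_values)
    return [min(p * nbp, 1.0) for p in p_values]

raw = [0.001, 0.02, 0.03, 0.5]  # made-up p-values
corrected = bonferroni(raw)

alpha = 0.05  # assumed significance level
n_significant = sum(p < alpha for p in corrected)
print(corrected, n_significant)
```

Because the penalty does not depend on the rank, even the second smallest raw p-value (0.02) is pushed above 5% here, which illustrates how severe the procedure is.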

## Multiple pairwise comparisons

After one-way ANOVAs or Kruskal-Wallis tests, it is possible to perform multiple pairwise comparisons for each feature taken separately.

## Non-specific filtering

Before launching the analyses, it is useful to filter out features with very low variability across individuals. Non-specific filtering has two major advantages:

- It avoids spending computation time on features that are very unlikely to be differentially expressed.
- It limits post-hoc penalizations, as fewer p-values are computed.

Two methods are available in XLSTAT:

- The user specifies a variability threshold (interquartile range or standard deviation), and features with lower variability are eliminated prior to analyses.
- The user specifies a percentage of features with low variability (interquartile range or standard deviation) to be removed prior to analyses.
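
Both filtering methods can be sketched in Python as follows. This is a minimal illustration, not XLSTAT's implementation: the function names, the toy data and the quartile definition used for the interquartile range (the default of `statistics.quantiles`) are assumptions, and other software may compute quartiles differently.

```python
import statistics

def iqr(values):
    """Interquartile range, using the default quartile method of `statistics`."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def filter_by_threshold(features, threshold, spread=statistics.stdev):
    """Keep features whose variability is at least `threshold`."""
    return {name: v for name, v in features.items() if spread(v) >= threshold}

def filter_by_percentage(features, percent, spread=statistics.stdev):
    """Remove the `percent` % of features with the lowest variability."""
    ranked = sorted(features, key=lambda name: spread(features[name]))
    n_removed = int(len(ranked) * percent / 100)
    kept = set(ranked[n_removed:])
    return {name: v for name, v in features.items() if name in kept}

# Toy data (made up): one list of measurements per feature.
features = {
    "gene_a": [5.0, 5.1, 4.9, 5.0],
    "gene_b": [2.0, 8.0, 3.0, 9.0],
    "gene_c": [1.0, 1.0, 1.1, 1.0],
    "gene_d": [4.0, 6.0, 5.0, 7.0],
}

print(sorted(filter_by_threshold(features, 1.0)))              # standard deviation
print(sorted(filter_by_threshold(features, 1.0, spread=iqr)))  # interquartile range
print(sorted(filter_by_percentage(features, 25)))              # drop the flattest 25%
```

Only the features that survive the filter are then tested, so fewer p-values enter the correction step.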

## Biological effects and statistical effects: the volcano plot

A statistically significant effect is not necessarily interesting at the biological scale. An experiment involving very precise measurements with a high number of replicates may yield low p-values associated with very weak biological differences. It is thus recommended to keep an eye on biological effects and not to rely only on p-values. The **volcano plot** is a scatter chart that combines statistical effects on the y-axis and biological effects on the x-axis for a whole individuals/features matrix. The only constraint is that it can only be used to examine the difference between the two levels of a two-level qualitative explanatory variable.

The y-axis coordinates are -log10(p-value), which makes the chart easier to read: high values reflect the most significant effects, whereas low values correspond to less significant effects.

XLSTAT provides two ways of building the x-axis coordinates:

- Difference between the mean of the first level and the mean of the second, for each feature. Generally, we use this format when handling data on a transformed scale such as log or square root.
- Log2 of fold change between the two means: log2( mean1 / mean2 ). This format should preferably be used with untransformed data.
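
The coordinates of one feature on the volcano plot can be sketched in Python as follows. The function name `volcano_point` and its `use_log2_fold_change` parameter are hypothetical; the log2 fold change assumes strictly positive means, and the p-value must be greater than zero for the logarithm to be defined.

```python
import math

def volcano_point(group1, group2, p_value, use_log2_fold_change=True):
    """Return the (x, y) coordinates of one feature on a volcano plot."""
    mean1 = sum(group1) / len(group1)
    mean2 = sum(group2) / len(group2)
    if use_log2_fold_change:
        # Untransformed data: log2 of the ratio of the two means.
        x = math.log2(mean1 / mean2)
    else:
        # Data already on a transformed scale: plain difference of means.
        x = mean1 - mean2
    y = -math.log10(p_value)  # higher means more significant
    return x, y

# Made-up measurements for a single feature in the two levels.
x, y = volcano_point([8.0, 8.0], [2.0, 2.0], p_value=0.001)
print(x, y)
```

Here the first level is four times higher on average than the second, which gives an x coordinate of 2, and a p-value of 0.001 gives a y coordinate of about 3.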

## Differential expression results in XLSTAT

For each explanatory variable, XLSTAT provides the following results:

**X features with the lowest p-values table**: this table contains information about the X features with the lowest p-values. Features are sorted in ascending order of p-values. The p-values column contains the p-values modified according to the selected post-hoc correction method. The significant column indicates whether the corresponding p-value is significant at the selected significance level. If the multiple pairwise comparisons option has been activated, additional columns appear. Depending on the selected type of test, they contain the means (parametric tests) or medians (non-parametric tests) of the explanatory variable’s levels. Within each feature, levels are associated with letters summarizing the multiple pairwise comparisons: two levels sharing the same letter are not significantly different.

**Charts**: A histogram depicting the distribution of corrected p-values is followed by a volcano plot allowing the user to pinpoint features with the highest statistical and biological effects.

### References

**Benjamini Y. and Hochberg Y. (1995)**. Controlling the false discovery rate: a practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society, Series B*, **57**, 289–300.

**Benjamini Y. and Yekutieli D. (2001)**. The control of the false discovery rate in multiple hypothesis testing under dependency. *Annals of Statistics*, **29**, 1165–1188.

**Hahne F., Huber W., Gentleman R. and Falcon S. (2008)**. Bioconductor Case Studies. Springer.
