Real-life application with XLSTAT: Grapes for everyone?
Who doesn’t enjoy sipping a nice glass of wine with dinner or while sitting on a sun-drenched terrace? Or, perhaps you’d rather have some delicious Italian grapes with your cheese?
Grapes are the third most-consumed fruit in the world. They are enjoyed in many guises (as juice, in tarts or dried), not least because they are healthy.
But did you know that insects like them too?
Several species need grapevines not only to feed on (luckily, unlike us) but also to lay their eggs on. The problem is that, as these pests develop, they can cause damage such as mould on the grapes or other diseases. With enough insects and the right weather conditions, disease can develop rapidly and cause major damage to a vineyard.
Insects are cold-blooded creatures and need warmth. With climate change set to bring about rising temperatures over the coming decades, insect populations will likely grow and increase parasite pressure.
If you’re like me, you’re probably worried about the future of our vineyards. Will they survive the threat of these growing populations of “consumers” (both us and the insects)? Why don't we do a quick statistical analysis for some reassurance?
In this article, we’re going to play at being data scientists to evaluate whether there is a link between the number of moths in a vineyard and temperature. In terms of maths, we’re going to try to establish the link Y=f(temperature) where Y equals the number of moths.
A web search led me to the site Agrobio Périgord with a dataset for the number of grapevine moths captured from 2014 to 2017. These data show the presence of the pest over the entire vegetative period of the grapevines (April to September) over four consecutive years. Traps were set at several locations around the Dordogne region in France to measure the infestation over a large area.
The greater the number of moths captured, the greater the parasite pressure on the vineyards.
The figure above shows these data. We can see that in 2015 the number of moths trapped was greater than in 2014, 2016 and 2017, while in 2017 the number of moths trapped was much lower. Is this linked to temperature?
Using historical weather data available from another website, we can recreate the average temperature dynamics for the same geographical zone where the traps were located over the same time scale as that used for the number of moths captured (figure below).
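To relate the two sources, each trap count has to be paired with the temperature over the same period. The following is a minimal sketch of that alignment step in Python. It assumes the counts are recorded per 7-day trapping period and the weather data per day; both the period length and every value shown are invented for illustration and do not come from the Agrobio Périgord dataset.

```python
from datetime import date, timedelta

# Hypothetical inputs: moths counted per 7-day period (keyed by the first day
# of the period) and a made-up daily mean temperature series in degrees C.
trap_counts = {date(2015, 6, 1): 14, date(2015, 6, 8): 31}
daily_temperature = {
    date(2015, 6, 1) + timedelta(days=i): 18.0 + 0.5 * i for i in range(14)
}


def mean_temperature(start, days, temperatures):
    """Average the daily temperatures available in [start, start + days)."""
    values = [
        temperatures[start + timedelta(days=i)]
        for i in range(days)
        if start + timedelta(days=i) in temperatures
    ]
    return sum(values) / len(values) if values else None


# One (mean temperature, moth count) pair per trapping period, in date order.
paired = [
    (mean_temperature(start, 7, daily_temperature), count)
    for start, count in sorted(trap_counts.items())
]
print(paired)
```

Each pair then becomes one row of the table used for the regressions: temperature as the explanatory variable, moth count as the variable to explain.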
By eye, it’s hard to see a big difference between these four curves because the temperature increases gradually from spring through summer and falls again in late summer. However, there is a peak in 2015 in early summer.
Could this explain the increase in moths trapped that the first figure shows after this period?
Let’s try and answer that question.
Analysing the data: choosing the method and interpretation
What kind of analysis should we do?
How do we structure the data?
Asking the right questions and choosing the right methodological approach is never easy in data analysis.
Here, we have a quantitative variable (the number of trapped moths) that we want to explain using another quantitative variable (temperature). Our first analysis choice is logically to opt for a linear regression to describe our quantitative variable as easily as possible.
We can do this using the “linear regression” method in XLSTAT, which is very easy to use from within Excel. The resulting model explains only 23% of the variability in the trapped moth data (this is the R² value, explained in this tutorial). This is an unsatisfactory result: the closer R² is to 1, the better the model describes the data.
This means one of two things:
- Either the linear model is not suited to our problem,
- Or the temperature is not a sufficient variable to describe the number of moths trapped.
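For readers who want to see what sits behind the R² of a simple linear regression, here is a minimal Python sketch. It is not the XLSTAT procedure, and the numbers are made up for illustration (they are not the trap data), so the R² it prints will not match the 23% quoted above.

```python
# Illustrative values only, not the trap data used in the article.
temperature = [12.0, 14.5, 17.0, 19.5, 22.0, 24.5, 21.0, 18.0]  # mean degrees C
moths = [3, 10, 4, 22, 15, 30, 8, 12]                           # moths trapped


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(y, predicted):
    """Share of the variability of y that the predictions reproduce."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


intercept, slope = fit_line(temperature, moths)
predicted = [intercept + slope * t for t in temperature]
r2 = r_squared(moths, predicted)
print(f"moths ~ {intercept:.2f} + {slope:.2f} * temperature, R^2 = {r2:.2f}")
```

An R² of 0.23 means the fitted line leaves 77% of the variability in the counts unexplained, which is why the two explanations listed above are worth testing.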
To verify the first option, this time we will test the “nonlinear regression” in XLSTAT by choosing models suited to our problem from among the many functions available in the tool. Given our data and for parameter interpretation reasons, we will limit ourselves to second and third degree polynomial equations and to one- or two-step exponential equations.
The best R² obtained from these models is only 25.7% with a third degree polynomial equation. This is not satisfactory, so let’s look at the second option.
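The polynomial comparison can be sketched in plain Python as follows. This is only an illustration of fitting second and third degree polynomials by least squares and comparing their R²: the data are invented, the helper names are my own, and XLSTAT's nonlinear regression may use a different fitting algorithm. The exponential models are not covered here.

```python
def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_polynomial(x, y, degree):
    """Least-squares coefficients [c0, c1, ...] via the normal equations."""
    powers = range(degree + 1)
    xtx = [[sum(xi ** (i + j) for xi in x) for j in powers] for i in powers]
    xty = [sum(xi ** i * yi for xi, yi in zip(x, y)) for i in powers]
    return solve(xtx, xty)


def r_squared(x, y, coeffs):
    """R^2 of the polynomial with the given coefficients."""
    mean_y = sum(y) / len(y)
    predicted = [sum(c * xi ** k for k, c in enumerate(coeffs)) for xi in x]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Illustrative values only, not the trap data used in the article.
temperature = [12.0, 14.5, 17.0, 19.5, 22.0, 24.5, 21.0, 18.0]
moths = [3, 10, 4, 22, 15, 30, 8, 12]

# Centre the temperatures so the normal equations stay well conditioned.
mean_t = sum(temperature) / len(temperature)
centred = [t - mean_t for t in temperature]

r2 = {
    degree: r_squared(centred, moths, fit_polynomial(centred, moths, degree))
    for degree in (1, 2, 3)
}
for degree, value in r2.items():
    print(f"degree {degree}: R^2 = {value:.3f}")
```

Because each polynomial contains the lower-degree ones as special cases, R² can only stay level or rise as the degree increases. A small gain, such as the move from 23% to 25.7% reported above, is therefore weak evidence that the curve shape was the problem.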
We want to add a second covariable (or independent variable) to enhance our mathematical model. For example, let’s introduce the generation variable provided with the data, which identifies the moth generation captured during the season (G1 = 1st generation, G2 = 2nd generation, G3 = 3rd generation). This variable is qualitative and has three levels.
One statistical analysis that handles both qualitative and quantitative independent variables is the analysis of covariance (ANCOVA). The result is a linear model that, in our case, captures 51% of the data variability. If we restructure the data to split each generation into three periods (Gi_start = start of generation i, Gi_peak = peak of generation i, Gi_end = end of generation i, with i from 1 to 3), a new ANCOVA gives us an R² of 66%.
We can see here that adding a new covariable has helped improve our model, and even more when it is well structured.
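To make the idea concrete, here is a minimal Python sketch of an ANCOVA with one slope for temperature shared by all generations and a separate intercept per generation. This parallel-slopes form is an assumption on my part; the article does not say whether the XLSTAT model included interactions. The records are invented for illustration, and only the labels G1 to G3 come from the text.

```python
from collections import defaultdict

# Illustrative records only: (temperature in degrees C, generation, moths).
records = [
    (13.0, "G1", 5), (15.0, "G1", 9), (17.0, "G1", 8),
    (20.0, "G2", 20), (22.0, "G2", 26), (24.0, "G2", 25),
    (23.0, "G3", 12), (20.0, "G3", 10), (18.0, "G3", 6),
]


def fit_ancova(rows):
    """Least squares fit of y = intercept[group] + slope * x."""
    groups = defaultdict(list)
    for x, group, y in rows:
        groups[group].append((x, y))
    means = {
        g: (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
        for g, pts in groups.items()
    }
    # The common slope pools the within-group covariation of x and y.
    sxy = sxx = 0.0
    for g, pts in groups.items():
        mean_x, mean_y = means[g]
        sxy += sum((x - mean_x) * (y - mean_y) for x, y in pts)
        sxx += sum((x - mean_x) ** 2 for x, _ in pts)
    slope = sxy / sxx
    intercepts = {g: my - slope * mx for g, (mx, my) in means.items()}
    return slope, intercepts


def r_squared(rows, slope, intercepts):
    """Share of the variability of y that the fitted model reproduces."""
    ys = [y for _, _, y in rows]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercepts[g] + slope * x)) ** 2 for x, g, y in rows)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# With a single group the same function reduces to simple linear regression.
pooled = [(x, "all", y) for x, _, y in records]
r2_temperature_only = r_squared(pooled, *fit_ancova(pooled))
r2_ancova = r_squared(records, *fit_ancova(records))
print(f"temperature only: R^2 = {r2_temperature_only:.2f}")
print(f"temperature + generation: R^2 = {r2_ancova:.2f}")
```

Splitting each generation into start, peak and end periods, as described above, amounts to using nine group labels instead of three in the same model.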
We’ve gone a good way towards answering our starting question: temperature, combined with the moth generation, explains about two thirds of the variability in the number of moths trapped in a vineyard. However, our model could be further improved by adding other independent variables, such as the physiological condition of the plant or data on the dynamics of captured moths like those in Figures 2 and 3 of the source [5].
Through this case, we were able to see how to begin a statistical analysis of temporal data and adopt the study approach of a data scientist. You’ve also seen how you can get better results by restructuring the data. We could have performed other statistical analyses on these data, such as t-tests for independent samples or a time series analysis.
Now it’s your turn!