Extreme Gradient Boosting (XGBOOST)

What is XGBOOST?

XGBOOST, which stands for "Extreme Gradient Boosting", is a machine learning model employed for supervised learning problems, in which we use the training data to predict a target/response variable.

Choose this method to fit a classification or regression model to a sample described by qualitative and/or quantitative variables. The method handles large datasets with many variables efficiently.

Classification (qualitative response variable): the model predicts the class each observation belongs to, based on explanatory variables that can be quantitative and/or qualitative.
Regression (continuous response variable): the model predicts the value of a quantitative response variable, based on explanatory variables that can be quantitative and/or qualitative.

What is the principle of XGBOOST?

Machine learning models can be fitted to data individually, or combined with other models to form an ensemble. An ensemble is a combination of simple individual models that together create a more powerful one.

Machine learning boosting is a method that creates such an ensemble. It starts by fitting an initial model (in our case a regression or classification tree) to the data. A second model is then built to focus on accurately predicting the observations that the first model predicted poorly. The combination of these two models is expected to be better than either one alone. This boosting process is then repeated several times, with each successive model attempting to correct the shortcomings of the combined boosted ensemble that contains all previous models.

Gradient boosting

Gradient boosting is a type of machine learning boosting. It relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. The key idea is to set a weight for each observation that tells this next model how to reduce the error. These weights are calculated as follows:

At each boosting step and for each observation, a score is calculated based on the prediction error of the current model.

The name gradient boosting arises from the fact that each weight is set based on the gradient of the error with respect to the prediction. Each new model takes a step in the direction that reduces the prediction error, in the space of possible predictions for each observation.
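The boosting loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not XLSTAT's or the XGBoost library's implementation: it assumes a squared-error loss (for which the negative gradient with respect to the prediction is simply the residual), a single numeric feature with at least two distinct values, and a one-split "stump" as the weak learner. The names fit_stump, predict_stump, gradient_boost and learning_rate, and the toy data, are hypothetical and chosen for this example only.

```python
# Minimal gradient boosting sketch for regression (assumption: squared-error loss).
# The weak learner is a one-split "stump" on a single numeric feature.

def fit_stump(x, targets):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    # Every distinct value except the largest is a candidate split point,
    # so both sides of the split always contain at least one observation.
    for threshold in sorted(set(x))[:-1]:
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def gradient_boost(x, y, n_rounds=20, learning_rate=0.3):
    """Fit an additive ensemble of stumps; return the base value, stumps, MSE per round."""
    base = sum(y) / len(y)          # initial model: the mean of the response
    predictions = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - prediction)^2 is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Take a damped step in the direction that reduces the error.
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)
        mse_history.append(mse)
    return base, stumps, mse_history


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0]
base, stumps, mse_history = gradient_boost(x, y)
print(f"MSE after round 1: {mse_history[0]:.4f}")
print(f"MSE after round {len(mse_history)}: {mse_history[-1]:.4f}")
```

Each round fits the stump to the current residuals rather than to the raw response, which is what makes every new model correct the shortcomings of the ensemble built so far. The list mse_history records the training error after each round.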

What are the results in XLSTAT?

Descriptive statistics: Activate this option to display descriptive statistics for the selected variables.

Correlations: Activate this option to display the correlation matrix.

Predictions and residuals (regression only): Activate this option to display the predictions and residuals for all the observations.

Results by object (classification only): Activate this option to display, for each observation, the observed category, the predicted category, and the probabilities corresponding to the various categories of the dependent variable.

Statistics for each iteration: Activate this option to display the table showing how the evaluation metric evolves across iterations.

Confusion matrix (classification only): Activate this option to display the table showing the numbers of correctly and incorrectly classified observations for each of the categories.

Variable importance: Activate this option to display the variable (feature) importance measures. XLSTAT gives the importance measures below:
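To make the content of such a table concrete, here is a small sketch of how a confusion matrix can be tallied from observed and predicted categories. It is an illustration only: the function name confusion_matrix and the six observations are hypothetical, and the row/column orientation (rows for observed, columns for predicted) is an assumption rather than a statement about XLSTAT's layout.

```python
from collections import Counter


def confusion_matrix(observed, predicted):
    """Return (labels, matrix) with rows = observed and columns = predicted."""
    labels = sorted(set(observed) | set(predicted))
    counts = Counter(zip(observed, predicted))
    matrix = [[counts[(o, p)] for p in labels] for o in labels]
    return labels, matrix


# Hypothetical observations, for illustration only.
observed = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no", "yes", "yes"]

labels, matrix = confusion_matrix(observed, predicted)
# Correctly classified observations sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(observed)

print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(f"accuracy = {accuracy:.3f}")
```

Off-diagonal cells count the observations of one category that were assigned to another, which is where misclassifications show up.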

Frequency is the number of times a feature is used to split a node across all trees, expressed as a percentage of the total number of splits.
Gain is the relative contribution of a feature to the model. It is calculated as the ratio between the feature's total contribution and the total contribution of all the features in the model. The higher the Gain, the more important the feature is for prediction.
Cover is the proportion of observations related to a feature. When a feature is used to split a node that is just before a leaf, we say that the observations in this node are covered by the feature. For example, suppose that you have 100 observations, 4 features and 3 trees, and that feature 1 is used to decide the leaf node for 10, 5 and 2 observations in tree 1, tree 2 and tree 3 respectively. The cover count for this feature is 10 + 5 + 2 = 17 observations. The same count is computed for all 4 features, and the Cover of feature 1 is 17 expressed as a percentage of the sum of all features' cover counts.

Gain is the most relevant measure for interpreting the relative importance of each feature.
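The arithmetic behind the Cover example can be written out as a short sketch. Only the count for feature 1 (10 + 5 + 2 = 17) comes from the example in the text; the counts for the other three features are hypothetical values invented so that the percentages can be computed, and the helper as_percentages is a name introduced for this illustration.

```python
def as_percentages(totals):
    """Express each feature's total as a percentage of the sum over all features."""
    grand_total = sum(totals.values())
    return {name: 100.0 * value / grand_total for name, value in totals.items()}


cover_counts = {
    "feature1": 10 + 5 + 2,  # from the example in the text: 17 observations
    "feature2": 40,          # hypothetical count, for illustration only
    "feature3": 25,          # hypothetical count, for illustration only
    "feature4": 18,          # hypothetical count, for illustration only
}

cover_pct = as_percentages(cover_counts)
for name, pct in cover_pct.items():
    print(f"{name}: {pct:.1f}%")
```

The same normalization, a feature's total divided by the total over all features, is what the description of Gain refers to as a ratio, so the percentages for each measure sum to 100 across features.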

Statistics for each iteration: Activate this option to display the chart showing the evolution of the evaluation metric across each iteration.

Variable importance: Activate this option to display the chart showing the variable importance measures.

Regression charts: Activate this option to display the following charts:

Response variable versus standardized residuals.
Predictions versus standardized residuals.
Predictions versus response variable.
Bar chart of standardized residuals.

Confusion plot (classification only): Activate this option to display the confusion plot, which gives a summary visualization of the classification table. The numbers can be linked either to the width or to the area of the displayed squares.

ROC curve (classification only): Activate this option to display the ROC curve.

Lift curve (classification only): Activate this option to display the Lift curve.

Cumulative gain curve (classification only): Activate this option to display the cumulative gain curve.
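As a rough guide to reading the last two charts, the sketch below computes cumulative gain and lift points using the usual textbook definitions. It assumes a binary response coded 1 for the category of interest, observations ranked by decreasing predicted probability, and one point per observation; how XLSTAT bins the observations or handles ties is not described here, so those details are assumptions. The function name cumulative_gain and the data are hypothetical.

```python
def cumulative_gain(scores, labels):
    """Return (fraction_targeted, fraction_of_positives_captured) points."""
    # Rank observations from the highest to the lowest predicted probability.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    points, captured = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        captured += label
        points.append((rank / len(ranked), captured / total_positives))
    return points


# Hypothetical predicted probabilities and observed classes (1 = category of interest).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

gain_points = cumulative_gain(scores, labels)
# Lift compares the captured share with what random targeting would capture.
lift_points = [(fraction, gain / fraction) for fraction, gain in gain_points]

for (fraction, gain), (_, lift) in zip(gain_points, lift_points):
    print(f"top {fraction:.3f} of observations: gain = {gain:.2f}, lift = {lift:.2f}")
```

A lift above 1 for a given fraction means the model concentrates the category of interest among its highest-scored observations better than random selection would.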
