Use this method to perform a regression when you have more variables than observations or, more generally, when the number of variables is large.
Description of the LASSO Regression in XLSTAT
LASSO stands for Least Absolute Shrinkage and Selection Operator. The LASSO regression was proposed by Robert Tibshirani in 1996. It is an estimation method that constrains the size of the coefficients so that they do not explode, as they can with standard linear regression in a high-dimensional setting. The high-dimensional setting covers all situations where the number of variables is very large compared to the number of individuals.
LASSO regression is one of the methods that overcome the shortcomings of linear regression in a high-dimensional context, namely instability of the estimate and unreliability of the prediction. The main advantage of LASSO regression is its ability to perform variable selection: the penalty shrinks some coefficients to exactly zero, which removes the corresponding variables from the model. This can be valuable when there is a large number of variables.
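For reference, the LASSO estimate is commonly written as the solution of a penalized least squares problem. The formulation below is the standard textbook one; the 1/(2n) scaling of the squared error term is an assumption and may differ from the scaling XLSTAT uses internally.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \left\{ \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
```

Here n is the number of observations, p the number of variables, and λ ≥ 0 the regularization parameter. With λ = 0 the problem reduces to ordinary least squares; larger values of λ set more coefficients to zero.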
Options of the LASSO Regression in XLSTAT
Model parameters: this option allows you to choose the method used to define the regularization parameter λ.
- Cross-validation: Activate this option if you want to calculate the λ parameter by cross-validation. This option allows you to run a k-fold cross-validation to obtain the optimal λ regularization parameter and to quantify the quality of the classification or regression depending on it. Data is partitioned into k subsamples of equal size. A single subsample is retained as the validation data to test the model, and the remaining k-1 subsamples are used as training data. The process is repeated k times so that each subsample serves once as validation data.
- Enter manually: Activate this option if you want to specify the regularization parameter λ yourself.
Lambda: Activate this option if you want to calculate the parameter λ by cross validation. Otherwise, enter the value you want to assign to the parameter λ.
- Number of folds: Enter the number of folds to be constituted for the cross validation. Default value: 5.
- Number of values tested: Enter the number of λ values that will be tested during the cross validation. Default value: 100.
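The cross-validation setup described above can be sketched as follows. This is a minimal illustration, not XLSTAT's implementation: the function names `k_fold_splits` and `lambda_grid`, the shuffling seed, and the log-spaced grid with a `ratio` of 1e-3 are all assumptions made for the example. When the number of observations is not divisible by the number of folds, the fold sizes differ by at most one.

```python
import random


def k_fold_splits(n_obs, n_folds=5, seed=0):
    """Yield (train_indices, validation_indices) once per fold.

    Hypothetical helper: shuffling and the seed are assumptions.
    """
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)  # avoid folds tied to row order
    # Strided slicing gives folds whose sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation


def lambda_grid(lambda_max, n_values=100, ratio=1e-3):
    """Return n_values log-spaced candidates, from lambda_max down to lambda_max * ratio.

    The log spacing and the ratio are assumptions, not documented XLSTAT behavior.
    """
    if n_values < 2:
        return [lambda_max]
    return [lambda_max * ratio ** (i / (n_values - 1)) for i in range(n_values)]
```

In a full procedure, each candidate λ would be fitted on every training set and scored by its MSE on the matching validation set; the λ with the lowest average MSE would then be retained.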
Convergence: Enter the threshold for the change in the log-likelihood from one iteration to the next. Once the change falls below this value, the algorithm is considered to have converged. Default value: 0.000001.
Maximum time (in seconds): Enter the maximum time allowed for a coordinate descent. Past that time, if convergence has not been reached, the algorithm stops and returns the results obtained during the last iteration. Default value: 180 seconds.
Interactions / Level: Activate this option to include interactions in the model, then enter the maximum interaction level (value between 1 and 5).
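To show how the Convergence and Maximum time options interact with a coordinate descent, here is a minimal pure-Python sketch. It rests on several assumptions: the data are already centered (so no intercept is fitted), the objective uses a 1/(2n) scaling of the squared error, and convergence is checked on the change in that penalized objective rather than on a log-likelihood. The names `soft_threshold`, `lasso_objective` and `lasso_coordinate_descent` are hypothetical and do not refer to any XLSTAT function.

```python
import time


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_objective(X, y, beta, lam):
    """Penalized objective: RSS / (2n) + lam * sum(|beta_j|)."""
    n = len(y)
    rss = sum(
        (y[i] - sum(X[i][j] * beta[j] for j in range(len(beta)))) ** 2
        for i in range(n)
    )
    return rss / (2 * n) + lam * sum(abs(b) for b in beta)


def lasso_coordinate_descent(X, y, lam, tol=1e-6, max_seconds=180.0):
    """Cyclic coordinate descent on centered data (assumption: no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    previous = lasso_objective(X, y, beta, lam)
    start = time.monotonic()
    while time.monotonic() - start < max_seconds:  # "Maximum time" safeguard
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant (all-zero) column carries no information
            # Covariance of column j with the residual that ignores variable j.
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)
            ) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
        current = lasso_objective(X, y, beta, lam)
        if abs(previous - current) < tol:  # "Convergence" threshold
            break
        previous = current
    return beta
```

If the time limit is reached first, the function returns the coefficients from the last completed pass, which mirrors the behavior described for the Maximum time option.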
Results of the LASSO Regression in XLSTAT
Descriptive statistics: The table of descriptive statistics shows the simple statistics for all the variables selected. The number of missing values, the number of non-missing values, the mean and the standard deviation (unbiased) are displayed for the quantitative variables.
Correlation matrix: This table is displayed to give you a view of the correlations between the various variables selected.
Goodness of fit statistics: The statistics relating to the fitting of the regression model are shown in this table:
Observations: The number of observations used in the calculations. In the formulas shown below, n is the number of observations.
Sum of weights: The sum of the weights of the observations used in the calculations. In the formulas shown below, W is the sum of the weights.
DF: The number of degrees of freedom for the chosen model (corresponding to the error part).
R²: The determination coefficient for the model. This coefficient must be between 0 and 1.
The R² is interpreted as the proportion of the variability of the dependent variable explained by the model. The nearer R² is to 1, the better the model.
- MSE: The mean squared error (MSE).
- RMSE: The root mean square of the errors (RMSE) is the square root of the MSE.
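The statistics above can be computed along these lines. This is a hedged sketch: the function name `goodness_of_fit` is hypothetical, and the divisor used for the MSE is passed in explicitly as `df` because this page does not state how XLSTAT counts the degrees of freedom for a LASSO model.

```python
import math


def goodness_of_fit(y, y_hat, df, weights=None):
    """Return R², MSE and RMSE for observed values y and predictions y_hat.

    df is the error degrees of freedom (assumption: supplied by the caller).
    """
    n = len(y)
    w = weights if weights is not None else [1.0] * n
    total_weight = sum(w)  # W, the sum of the weights
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_weight
    sse = sum(wi * (yi - yh) ** 2 for wi, yi, yh in zip(w, y, y_hat))
    sst = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    mse = sse / df
    return {"r2": 1.0 - sse / sst, "mse": mse, "rmse": math.sqrt(mse)}
```

With unit weights, W equals n and the weighted mean reduces to the ordinary mean of the dependent variable.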
Model parameters: This table gives the value of each parameter after the model has been fitted.
Standardized coefficients: The table of standardized coefficients (also called beta coefficients) is used, if the matrix containing the explanatory variables has not been centered, to compare the relative weights of the variables. The higher the absolute value of a coefficient, the greater the weight of the corresponding variable.
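One common way to obtain standardized coefficients from raw ones is to rescale each coefficient by the ratio of the standard deviation of its variable to that of the dependent variable. The sketch below uses that convention with the unbiased standard deviation; whether XLSTAT applies exactly this rescaling is an assumption, and the function name `standardized_coefficients` is hypothetical.

```python
import statistics


def standardized_coefficients(columns, y, coefficients):
    """Rescale each raw coefficient by stdev(x_j) / stdev(y).

    columns maps a variable name to its list of values;
    coefficients maps the same names to raw regression coefficients.
    """
    s_y = statistics.stdev(y)  # unbiased (n - 1) standard deviation
    return {
        name: coefficients[name] * statistics.stdev(values) / s_y
        for name, values in columns.items()
    }
```

Because the rescaled coefficients are unit-free, they can be compared across variables measured on different scales.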
Predictions and residuals: This table shows, for each observation, the observed value of the dependent variable, the prediction of the model and the residuals.
Evolution of the MSE (Cross-validation): This table shows how the MSE and the number of active variables change with the regularization parameter λ.
Chart of variable importance: The importance measure for a given variable is the absolute value of its coefficient in the regression.
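The importance measure described above amounts to taking absolute values and sorting. A minimal sketch, with a hypothetical function name and made-up variable names in the usage note:

```python
def variable_importance(coefficients):
    """Return (name, |coefficient|) pairs, most important variable first."""
    importance = {name: abs(value) for name, value in coefficients.items()}
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)
```

For example, `variable_importance({"age": -0.8, "income": 0.3, "height": 0.0})` ranks `age` first and `height` last; a variable dropped by the LASSO has a coefficient of zero and therefore an importance of zero.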
Chart of MSE evolution (Cross-validation): This chart shows the MSE evolution according to the λ parameter.
Charts of predictions and residuals: These charts allow you to visualize the results mentioned above.