Use this method to perform a regression when you have more variables than observations or, more generally, when the number of variables is large. Ridge regression is available in Excel using the XLSTAT software.
Description of a Ridge regression
Ridge regression, a method derived from Tikhonov regularization, was proposed by Hoerl and Kennard in 1970. It is an estimation method that penalizes large coefficients and so keeps them from exploding, which standard linear regression cannot do in high-dimensional settings. The high-dimensional context covers all situations where the number of variables is very large compared to the number of individuals.
Ridge regression is one of the methods that overcome the shortcomings (instability of the estimate and unreliability of the prediction) of linear regression in a high-dimensional context. Ridge regression differs from LASSO regression in that it shows greater robustness when datasets with high multicollinearity are involved.
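As a rough illustration of why the penalty stabilizes the estimate, the sketch below computes the textbook closed-form ridge solution on simulated data with more variables than observations. It is a minimal NumPy example, not XLSTAT's implementation: the function name ridge_fit, the simulated data and the two λ values are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Textbook closed-form ridge estimate on centered data.

    Returns (intercept, coefs). Illustrative helper, not an XLSTAT function.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    p = X.shape[1]
    # Xc'Xc alone is singular when p > n; adding lam * I (lam > 0) makes it invertible.
    coefs = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coefs
    return intercept, coefs

rng = np.random.default_rng(0)
n, p = 20, 50  # more variables than observations
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

_, small = ridge_fit(X, y, lam=0.1)
_, large = ridge_fit(X, y, lam=100.0)

# A larger penalty shrinks the coefficient vector toward zero.
print(np.linalg.norm(small), np.linalg.norm(large))
```

With p > n the ordinary least-squares normal equations have no unique solution, whereas the penalized system above can be solved for any λ > 0.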
Setting up a Ridge regression in XLSTAT
Y / Dependent variables:
Quantitative: Select the response variable(s) you want to model. If several variables have been selected, XLSTAT carries out calculations for each of the variables separately. If a column header has been selected, check that the "Variable labels" option has been activated.
Response type: Select the type of response you have:
- Quantitative: If your response variable contains real values, choose this type to fit a regression model.
X / Explanatory variables:
Quantitative: Activate this option if you want to include one or more quantitative explanatory variables in the model. Then select the corresponding variables in the Excel worksheet. The selected data must be numerical. If the variable header has been selected, make sure the "Variable labels" option has been activated.
Qualitative: Activate this option if you want to include one or more qualitative explanatory variables in the model. Then select the corresponding variables in the Excel worksheet. The selected data may be of any type, but numerical data will automatically be considered as nominal. If the variable header has been selected, make sure the "Variable labels" option has been activated.
Options of a Ridge regression in XLSTAT
Model parameters: this option allows you to choose the method used to define the regularization parameter λ.
- Cross-validation: Activate this option if you want to calculate the λ parameter by cross-validation. This option runs a k-fold cross-validation to obtain the optimal regularization parameter λ and to quantify the quality of the classification or regression as a function of λ. The data is partitioned into k subsamples of equal size. Each subsample is retained in turn as the validation data to test the model, while the remaining k-1 subsamples are used as training data.
- Enter manually: Activate this option if you want to specify the regularization parameter λ yourself.
Lambda: Activate this option if you want to calculate the parameter λ by cross-validation. Otherwise, enter the value you want to assign to the parameter λ.
- Number of folds: Enter the number of folds to be constituted for the cross-validation. Default value: 5.
- Number of values tested: Enter the number of λ values that will be tested during the cross-validation. Default value: 100.
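The cross-validation options above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not XLSTAT's procedure: the helper names (ridge_coefs, cv_mse, select_lambda), the log-spaced grid of λ values and the simulated data are all hypothetical, and the way XLSTAT builds its grid is not described here. Only the defaults of 5 folds and 100 tested values are taken from the options.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Closed-form ridge coefficients for already-centered data."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5, seed=0):
    """Average validation MSE over k folds for one value of lambda."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Center with training statistics only, so the validation fold stays unseen.
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        beta = ridge_coefs(X[train] - x_mean, y[train] - y_mean, lam)
        pred = (X[val] - x_mean) @ beta + y_mean
        errors.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(errors))

def select_lambda(X, y, lambdas, k=5):
    """Return the lambda with the lowest cross-validated MSE, plus all scores."""
    scores = [cv_mse(X, y, lam, k) for lam in lambdas]
    return float(lambdas[int(np.argmin(scores))]), scores

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
y = X @ rng.normal(size=10) + rng.normal(size=40)

lambdas = np.logspace(-3, 3, 100)  # assumed grid: 100 values, log-spaced
best_lam, scores = select_lambda(X, y, lambdas, k=5)
print(best_lam)
```

The same fold assignment is reused for every λ (fixed seed), so differences in the scores reflect the penalty rather than the random partition.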
Convergence: Enter the threshold on the change in the log-likelihood from one iteration to the next. Once the change falls to this value or below, the algorithm is considered to have converged. Default value: 0.000001.
Maximum time (in seconds): Enter the maximum time allowed for a coordinate descent. Past that time, if convergence has not been reached, the algorithm stops and returns the results obtained during the last iteration. Default value: 180 seconds.
Interactions / Level: Activate this option to include interactions in the model, then enter the maximum interaction level (value between 1 and 5).
Validation: Activate this option if you want to use a sub-sample of the data to validate the model.
Validation set: Choose one of the following options to define how to obtain the observations used for the validation:
Random: The observations are randomly selected. The "Number of observations" N must then be specified.
N last rows: The N last observations are selected for the validation. The "Number of observations" N must then be specified.
N first rows: The N first observations are selected for the validation. The "Number of observations" N must then be specified.
Group variable: If you choose this option, you need to select a binary variable with only 0s and 1s. The 1s identify the observations to use for the validation.
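The four ways of defining the validation set can be pictured as building a boolean mask over the observations. The sketch below is only an illustration in NumPy: the function validation_mask and its mode names are hypothetical and do not correspond to XLSTAT identifiers.

```python
import numpy as np

def validation_mask(n, mode, n_val=None, group=None, seed=0):
    """Return a boolean array of length n; True marks a validation observation.

    mode is one of "random", "last", "first", "group" (illustrative names).
    """
    mask = np.zeros(n, dtype=bool)
    if mode == "random":
        # Draw n_val distinct row indices at random.
        chosen = np.random.default_rng(seed).choice(n, size=n_val, replace=False)
        mask[chosen] = True
    elif mode == "last":
        mask[n - n_val:] = True
    elif mode == "first":
        mask[:n_val] = True
    elif mode == "group":
        group = np.asarray(group)
        if group.shape != (n,) or not np.isin(group, [0, 1]).all():
            raise ValueError("group must contain n values, each 0 or 1")
        mask = group == 1  # the 1s identify validation observations
    else:
        raise ValueError(f"unknown mode: {mode}")
    return mask

# Example: hold out the last 3 of 10 observations.
mask = validation_mask(10, "last", n_val=3)
train_rows = np.flatnonzero(~mask)
validation_rows = np.flatnonzero(mask)
```

The model would then be fitted on the rows where the mask is False and evaluated on the rows where it is True.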
Prediction: Activate this option if you want to select data to use in prediction mode. If you activate this option, you need to make sure that the prediction dataset is structured like the estimation dataset: same variables with the same order in the selections. On the other hand, variable labels must not be selected: the first row of the selections listed below must correspond to data.
Quantitative: Activate this option to select the quantitative explanatory variables. The first row should include variable labels if the Variable labels option is activated on this page.
Qualitative: Activate this option to select the qualitative explanatory variables. The first row should include variable labels if the Variable labels option is activated on this page.
Results of a Ridge regression in XLSTAT
Descriptive statistics: The table of descriptive statistics shows the simple statistics for all the variables selected. The number of missing values, the number of non-missing values, the mean and the standard deviation (unbiased) are displayed for the quantitative variables.
Correlation matrix: This table is displayed to give you a view of the correlations between the various variables selected.
Goodness of fit statistics: The statistics relating to the fitting of the regression model are shown in this table:
Observations: The number of observations used in the calculations. In the formulas shown below, n is the number of observations.
Sum of weights: The sum of the weights of the observations used in the calculations. In the formulas shown below, W is the sum of the weights.
DF: The number of degrees of freedom for the chosen model (corresponding to the error part).
R²: The coefficient of determination for the model.
The R² is interpreted as the proportion of the variability of the dependent variable explained by the model. The nearer R² is to 1, the better the model.
- MSE: The mean squared error (MSE).
- RMSE: The root mean square of the errors (RMSE) is the square root of the MSE.
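For readers who want to reproduce these statistics by hand, here is a minimal unweighted sketch in NumPy. It uses the usual textbook definitions; the function fit_statistics is hypothetical, and since the denominator used for the MSE is not stated here, it is exposed as an optional df argument (the number of observations is used when df is omitted).

```python
import numpy as np

def fit_statistics(y, y_pred, df=None):
    """Unweighted R2, MSE and RMSE from observed and predicted values.

    df: denominator for the MSE; defaults to the number of observations
    (an assumption, pass the error degrees of freedom if preferred).
    """
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y - y_pred) ** 2)    # sum of squared residuals
    sst = np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
    denom = len(y) if df is None else df
    mse = sse / denom
    return {"R2": 1.0 - sse / sst, "MSE": mse, "RMSE": np.sqrt(mse)}

stats = fit_statistics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(stats)
```

In this small example the squared residuals sum to 0.5 and the total sum of squares is 5, so R² is 0.9 and the MSE over the 4 observations is 0.125.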
Model parameters: This table gives the value of each parameter after the model has been fitted.
Standardized coefficients: The table of standardized coefficients (also called beta coefficients) is used, if the matrix containing the explanatory variables has not been centered, to compare the relative weights of the variables. The higher the absolute value of a coefficient, the more important the weight of the corresponding variable.
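One common convention for standardized coefficients rescales each raw coefficient by the ratio of the standard deviation of its variable to that of the response. The sketch below assumes this convention and the unbiased standard deviation; whether XLSTAT computes its table exactly this way is an assumption, and the function name and data are illustrative.

```python
import numpy as np

def standardized_coefficients(X, y, coefs):
    """Rescale raw coefficients: beta_j * sd(x_j) / sd(y), with unbiased sd."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.asarray(coefs, dtype=float) * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Two variables on very different scales (illustrative values).
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 500.0]])
y = np.array([1.0, 2.0, 2.5, 4.0])
coefs = np.array([0.8, 0.002])

beta_std = standardized_coefficients(X, y, coefs)
print(beta_std)
```

Because the rescaling removes the units of each variable, multiplying a column by a constant (and dividing its raw coefficient by the same constant) leaves its standardized coefficient unchanged, which is what makes the values comparable across variables.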
Predictions and residuals: This table shows, for each observation, the observed value of the dependent variable, the model's prediction and the residuals.
Evolution of the MSE (Cross Validation): This table shows how the MSE and the number of active variables change with the regularization parameter λ.
Chart of variable importance: The importance measure for a given variable is the absolute value of its coefficient in the regression.
Chart of MSE evolution (Cross-validation): This chart shows the MSE evolution according to the λ parameter.
Charts of predictions and residuals: These charts allow you to visualize the results mentioned above.
Example of a Ridge regression in XLSTAT
A tutorial on how to use the Ridge regression is available on the Addinsoft website.