Terms selection

Use this method to perform a regression on a document-term matrix. Available in Excel using the XLSTAT software.

DESCRIPTION OF THE TERMS SELECTION

Terms selection uses the well-known Elastic net regression method and its logistic version. It can therefore model quantitative response variables as well as binomial (typically binary) variables and multinomial variables (qualitative variables with more than two categories).

Terms selection is designed specifically for text mining: the document-term matrix plays the role of the quantitative explanatory variables, and the sentiment vector is the response variable, giving either the sentiment ("positive", "negative", etc.) of each document or its rating (a quantitative indication of the opinion).

Elastic net regression is governed by two fundamental parameters: the mixing parameter α (between 0 and 1) and the regularization parameter λ > 0. XLSTAT can find the optimal λ values by cross-validation.
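
To make the roles of α and λ concrete, here is a minimal sketch of the elastic net penalty term. It assumes the commonly used parameterization in which α weights the L1 (lasso) part and 1 − α weights half the squared L2 (ridge) part; this page does not state the exact formula XLSTAT uses, and the function name is hypothetical.

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """Penalty added to the loss, under the ASSUMED parameterization
    lam * (alpha * ||beta||_1 + (1 - alpha) * 0.5 * ||beta||_2^2)."""
    beta = np.asarray(beta, dtype=float)
    l1 = np.abs(beta).sum()            # lasso part: drives coefficients to exactly zero
    l2 = 0.5 * np.square(beta).sum()   # ridge part: shrinks coefficients smoothly
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

coefficients = [1.0, -2.0]                                        # illustrative values only
lasso_only = elastic_net_penalty(coefficients, lam=0.5, alpha=1.0)
ridge_only = elastic_net_penalty(coefficients, lam=0.5, alpha=0.0)
```

Under this assumption, α = 1 gives a pure lasso penalty, α = 0 a pure ridge penalty, and a larger λ strengthens the regularization in both cases.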

OPTIONS OF THE TERMS SELECTION IN XLSTAT

Response variable: Select the response variable you want to model. If a column header has been selected, check that the "Column labels" option has been activated.

Response type: Select your type of response variable:

Gaussian: If your response variable is numerical, choose this type to fit a regression model.
Poisson: If your response variable is a count (non-negative integers), choose this type to fit a Poisson regression model.
Binomial: If your response variable is binary, choose this type to fit a logistic regression model.
Multinomial: If your response variable includes more than two categories, choose this type to fit a multinomial logistic regression model.

Term frequencies: Select in this field the term frequency matrix. One column corresponds to the frequencies of one term in each document. The selected data must be numerical. If the variable header has been selected, check that the "Column labels" option has been activated.
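
As an illustration of the expected layout (one row per document, one column per term), the following sketch builds a small term frequency matrix in plain Python. The whitespace tokenization and the function name are assumptions made for the example; they are not how XLSTAT prepares the data.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (terms, rows): rows[i][j] is the frequency of terms[j] in document i."""
    # Naive tokenization (lowercase + split on whitespace), assumed for illustration.
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)                    # a missing term counts as 0
        rows.append([counts[term] for term in terms])
    return terms, rows

terms, rows = document_term_matrix(["good good film", "bad film"])
```

Each column of `rows` then holds the frequencies of one term across all documents, which is the structure this field expects.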

Term frequencies (Prediction): Activate this option if you want to select data to use in prediction mode. If you activate this option, make sure that the prediction dataset is structured exactly like the estimation dataset: the same variables must be selected in the same order. If the variable header has been selected, check that the "Column labels" option has been activated.

Document labels (Prediction): Activate this option if document labels are available. Then select the corresponding data. If this option is not activated, the document labels are automatically generated by XLSTAT (PredDoc1, PredDoc2 …).

Select coefficients according to the: Select the coefficients according to the optimal λ of your choice.

Lambda minimum: Select this option to choose the coefficients according to the λ that gives minimum mean cross-validated error.
Lambda 1se: Select this option to choose the coefficients according to the λ that gives the most regularized model such that the cross-validated error is within one standard error of the minimum.
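
The two selection rules above can be sketched as follows, given the mean cross-validated error and its standard error for each candidate λ. The function and argument names are hypothetical, and the numbers are made up for illustration.

```python
import numpy as np

def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation results."""
    lambdas = np.asarray(lambdas, dtype=float)
    cv_mean = np.asarray(cv_mean, dtype=float)
    cv_se = np.asarray(cv_se, dtype=float)
    i_min = int(np.argmin(cv_mean))              # lambda with minimum mean CV error
    threshold = cv_mean[i_min] + cv_se[i_min]    # minimum error plus one standard error
    within = cv_mean <= threshold
    # The most regularized model is the one with the largest lambda within the threshold.
    return lambdas[i_min], lambdas[within].max()

lam_min, lam_1se = select_lambdas(
    lambdas=[1.0, 0.5, 0.1, 0.01],
    cv_mean=[2.0, 1.2, 1.0, 1.1],
    cv_se=[0.3, 0.25, 0.25, 0.3],
)
```

With these made-up values, the minimum error is reached at λ = 0.1, while λ = 0.5 is the largest λ whose error stays within one standard error of that minimum. Lambda 1se therefore generally yields a sparser model, at the cost of a slightly higher cross-validated error.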

Optimal lambda: Activate this option to display a table giving the optimal λ values and the corresponding degrees of freedom.

Coefficients: Activate this option to display the sorted coefficients of each term.

Odds ratio: Activate this option to display the odds ratio of each term in the same table as the coefficients.

Term frequencies: Activate this option to display the total term frequency of each term in the same table as the coefficients.

Display non-zero coefficients only: Activate this option to display only terms with an influence according to the model. Terms with zero coefficients, their odds ratio, and their frequency are then removed from the "Results by term" table.

Results by document: Activate this option to display the response variable and the prediction for each document and the probabilities for the classification.

Confusion matrix: Activate this option, only in the case of classification, to display the confusion matrix for the classification of the training dataset. The confusion matrix contains information about the observed and predicted classifications by the model. Performances can be evaluated using the confusion matrix. The diagonal contains correct predictions. The greater the sum of elements of the diagonal, the better the classifier.
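
As a small illustration of how such a matrix is read, the sketch below counts observed versus predicted classes and derives the share of correct predictions from the diagonal. The function name and the example labels are hypothetical.

```python
from collections import Counter

def confusion_matrix(observed, predicted):
    """Rows are observed classes, columns are predicted classes."""
    labels = sorted(set(observed) | set(predicted))
    counts = Counter(zip(observed, predicted))
    matrix = [[counts[(obs, pred)] for pred in labels] for obs in labels]
    correct = sum(matrix[i][i] for i in range(len(labels)))  # diagonal = correct predictions
    accuracy = correct / len(observed)
    return labels, matrix, accuracy

observed = ["Positive", "Negative", "Positive", "Negative", "Positive"]
predicted = ["Positive", "Negative", "Negative", "Negative", "Positive"]
labels, matrix, accuracy = confusion_matrix(observed, predicted)
```

Here four of the five documents fall on the diagonal, so 80% of the observations are well classified.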

Goodness of fit statistics: The statistics related to the fitting of the regression model are shown in this table.

Coefficients: Activate this option to display a bar chart showing the term coefficients.

Odds ratio: Activate this option to display a bar chart showing the term odds ratios.

Evolution of the deviance: Activate this option to display a chart showing the cross-validation curve with its upper and lower standard deviation curves, as a function of the λ values automatically generated or entered (see Options tab). The λ minimum is plotted in red while the λ 1se is plotted in blue. If the two λ are equal, only the λ minimum is plotted.

RESULTS OF THE TERMS SELECTION IN XLSTAT

Results regarding the terms: This table shows the influence of each term. The coefficient and the odds ratio indicate whether a term is important. The coefficient gives the strength and direction of the term's influence, whereas the odds ratio compares the odds of predicting the target class rather than the other class. For instance, if the target class is "Positive", the other class is "Negative", and the odds ratio for the term "good" is three, then the odds of being predicted as "Positive" are three times as high for a document that contains "good" as for a document that does not. The frequency column helps you check whether a coefficient is driven by a high term frequency. If no term has a non-zero coefficient, only the intercept is plotted on the coefficients and odds ratio charts. To obtain more terms with a non-zero coefficient, we suggest decreasing the value of α.
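
In a logistic model, the standard relationship between a coefficient and its odds ratio is that the odds ratio is the exponential of the coefficient. The sketch below assumes XLSTAT follows this convention, which this page does not state explicitly; the function name is hypothetical.

```python
import math

def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a term's frequency (assumed: exp of the coefficient)."""
    return math.exp(coefficient)

neutral = odds_ratio(0.0)           # a zero coefficient leaves the odds unchanged
tripled = odds_ratio(math.log(3))   # a coefficient of ln(3) multiplies the odds by about 3
```

Under this convention, a positive coefficient gives an odds ratio above 1 (the term pushes towards the target class), a negative coefficient gives an odds ratio below 1, and a term removed by the model has a coefficient of 0 and an odds ratio of 1.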

Results regarding the confusion matrix: The confusion matrix is built from the prior (observed) and posterior (predicted) classifications, and is displayed together with the overall percentage of well-classified observations.

Results regarding the goodness of fit statistics: The statistics related to the fitting of the regression model are shown in this table:

Observations: The number of observations used in the calculations.
DF: The number of degrees of freedom for the chosen model.
Deviance: Corresponds to the loss. For the Gaussian model it is the squared error, for the Poisson model it is the deviance, and for binomial or multinomial classification it is the misclassification error.
AIC: Akaike’s Information Criterion.
AICc: Corrected Akaike’s Information Criterion.
SBC: Schwarz’s Bayesian Criterion.

Results regarding the documents: This table gives a view of the document prediction. For the classification case, the probabilities for the target class are displayed for binomial classification and the probabilities for each class are displayed for multinomial classification. Note: The target class is the last in alphabetical order.
