Logistic regression for binary response data and polytomous variables (Logit, Probit)

Logistic regression for binary response data and polytomous variables (Logit, Probit) is part of:
  • Pro Core statistical software

  • System configuration

    • Windows:
      • Versions: 9x/Me/NT/2000/XP/Vista/Win 7/Win 8
      • Excel: 97 and later
      • Processor: 32 or 64 bits
      • Hard disk: 150 MB
    • Mac OS X:
      • OS: OS X
      • Excel: X, 2004 and 2011
      • Hard disk: 150 MB


  • Easy and user-friendly
    XLSTAT is flawlessly integrated with Microsoft Excel, the most popular spreadsheet worldwide. This integration makes it one of the simplest tools to work with, as it follows the same philosophy as Microsoft Excel. The program is accessible in a dedicated XLSTAT tab. The analyses are grouped into functional menus. The dialog boxes are user-friendly and setting up an analysis is straightforward.
  • Data and results shared seamlessly
    One of the greatest advantages of XLSTAT is the way you can share data and results seamlessly. As the results are stored in Microsoft Excel, anyone can access them. The recipient does not need an XLSTAT license or any additional viewer, which makes teamwork easier and more affordable. In addition, results can easily be integrated into other Microsoft Office software such as PowerPoint, so that you can create striking presentations in minutes.
  • Modular
    XLSTAT is a modular product. XLSTAT-Pro is the core statistical module of XLSTAT and includes all the mainstream functionalities in statistics and multivariate analysis. More advanced features contained in add-on modules can be added for specific applications. This way you can adapt the software to your needs, which makes it more cost-efficient.
  • Didactic
    The results of XLSTAT are organized by analysis and are easy to navigate. Moreover, useful information is provided along with the results to assist you in your interpretation.
  • Affordable
    XLSTAT is a complete and modular analytical solution that can suit any business analytics need. It is very reasonably priced so that the return on your investment is almost immediate. Any XLSTAT license comes with top-level support and assistance.
  • Accessible - Available in many languages
    We have ensured XLSTAT is accessible to everyone by making the program available in many languages, including Chinese, English, French, German, Italian, Japanese, Polish, Portuguese and Spanish.
  • Automatable and customizable
    Most of the statistical functions available in XLSTAT can be called directly from the Visual Basic window of Microsoft Excel. They can be modified and integrated into your own code to fit the specifics of your domain. Adding tables and plots as well as modifying existing outputs becomes easy. Furthermore, the XLSTAT dialog boxes include special tools that automatically generate the VBA code needed to reproduce your analysis from the VBA editor, or that simply load pre-set settings. This automation of routine analyses can save a great deal of time.

Logistic regression principles

Logistic regression is a widely used method because it can model binary variables, sums of binary variables, or polytomous variables (variables with more than two categories). It is frequently used in medicine and epidemiology (whether a patient will get well or not), in sociology (survey analysis), in quantitative marketing (whether or not products are purchased following an action) and in finance for modeling risks (scoring).

The principle of the logistic regression model is to link the occurrence or non-occurrence of an event to explanatory variables.

Models for logistic regression

Logistic and linear regression belong to the same family of models called GLM (Generalized Linear Models): in both cases, an event is linked to a linear combination of explanatory variables.

For linear regression, the dependent variable follows a normal distribution N(µ, σ) where µ is a linear function of the explanatory variables. For logistic regression, the dependent variable, also called the response variable, follows a Bernoulli distribution with parameter p (p is the mean probability that an event will occur) when the experiment is performed once, or a Binomial (n, p) distribution if the experiment is repeated n times (for example the same dose tried on n insects). The probability parameter p is here a function of a linear combination of explanatory variables.

The most common functions used to link probability p to the explanatory variables are the logistic function (we refer to the Logit model) and the standard normal distribution function (the Probit model). Both these functions are perfectly symmetric and sigmoid. XLSTAT provides two other functions, which are asymmetric: the complementary Log-log function is closer to the upper asymptote, whereas the Gompertz function is closer to the x-axis.

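The analytical expressions of the models are as follows:

Logit: p = exp(βX) / (1 + exp(βX))
Probit: p = Φ(βX), where Φ is the standard normal distribution function
Complementary Log-log: p = 1 − exp(−exp(βX))
Gompertz: p = exp(−exp(−βX))

where βX represents the linear combination of variables (including constants).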
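The four link functions named above can be compared numerically. The following is a minimal sketch that uses their standard textbook forms, which is an assumption rather than a statement about how XLSTAT codes them internally; the function names are illustrative only.

```python
import math

def logit_prob(eta):
    """Logistic function: p = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def probit_prob(eta):
    """Standard normal distribution function evaluated at eta."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def cloglog_prob(eta):
    """Complementary log-log: p = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))

def gompertz_prob(eta):
    """Gompertz: p = exp(-exp(-eta))."""
    return math.exp(-math.exp(-eta))

LINKS = {
    "logit": logit_prob,
    "probit": probit_prob,
    "cloglog": cloglog_prob,
    "gompertz": gompertz_prob,
}

# eta stands for the linear combination (beta X); the values are arbitrary.
for eta in (-2.0, 0.0, 2.0):
    row = ", ".join(f"{name}={func(eta):.3f}" for name, func in LINKS.items())
    print(f"eta={eta:+.1f}: {row}")
```

At eta = 0 the two symmetric functions return 0.5, while the complementary log-log value lies above 0.5 and the Gompertz value below it, which is the asymmetry described in the text.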

The knowledge of the distribution of the event being studied gives the likelihood of the sample. To estimate the β parameters of the model (the coefficients of the linear function), we try to maximize the likelihood function.

Contrary to linear regression, an exact analytical solution does not exist, so an iterative algorithm has to be used. XLSTAT uses a Newton-Raphson algorithm. The user can change the maximum number of iterations and the convergence threshold if desired.
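The iterative estimation can be illustrated with a short Newton-Raphson sketch for the Logit model. This is not XLSTAT's code: the function name, the stopping rule (largest coefficient change below `tol`) and the small dose/response data set are all assumptions made for illustration, with `max_iter` and `tol` playing the role of the maximum number of iterations and the convergence threshold.

```python
import numpy as np

def fit_logit_newton(X, y, max_iter=100, tol=1e-8):
    """Maximize the Bernoulli log-likelihood by Newton-Raphson.

    X: (n, k) design matrix whose first column is the constant.
    y: (n,) array of 0/1 responses.
    Returns the coefficients and the number of iterations performed.
    """
    beta = np.zeros(X.shape[1])
    for iteration in range(1, max_iter + 1):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities
        gradient = X.T @ (y - p)                   # first derivative of log-likelihood
        hessian = -(X.T * (p * (1.0 - p))) @ X     # second derivative (negative definite)
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                         # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return beta, iteration

# Hypothetical data: the responses overlap across doses, so the maximum exists.
dose = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
response = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(dose), dose])

beta, n_iter = fit_logit_newton(X, response)
print("coefficients:", beta, "iterations:", n_iter)
```

At convergence the gradient is numerically zero, which is the condition that characterizes the maximum of the likelihood.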

Separation problem

In the example below, the treatment variable separates the positive and negative cases perfectly.

              Treatment 1   Treatment 2
Response +    121           0
Response -    0             85

In such cases, one or more parameters are indeterminate: the lower the convergence threshold, the higher their estimated variance, which prevents a confidence interval from being given for the parameter. To resolve this problem and obtain a stable solution, Firth (1993) proposed the use of a penalized likelihood function. XLSTAT offers this solution as an option and uses the results provided by Heinze (2002). If the standard deviation of one of the parameters is very high compared with the estimate of the parameter, it is recommended to rerun the calculations with the "Firth" option activated.
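The effect of the penalized likelihood can be sketched on the separated table above, reading its second row as the negative responses (121 positive responses under treatment 1, 85 negative responses under treatment 2). The code below is an illustrative implementation of Firth's modified score equations, not XLSTAT's own routine; the function name and stopping rule are assumptions.

```python
import numpy as np

def fit_logit_firth(X, y, max_iter=200, tol=1e-10):
    """Logit fit using Firth's penalized likelihood (modified score)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        info = (X.T * w) @ X                       # Fisher information X'WX
        info_inv = np.linalg.inv(info)
        # Diagonal of the hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2).
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Firth's modified score: the usual score plus the penalty term.
        score = X.T @ (y - p + h * (0.5 - p))
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Completely separated data: every treatment 1 case is positive,
# every treatment 2 case is negative.
treatment1 = np.concatenate([np.ones(121), np.zeros(85)])
y = treatment1.copy()
X = np.column_stack([np.ones_like(treatment1), treatment1])

beta_firth = fit_logit_firth(X, y)
print("penalized estimates (constant, treatment 1):", beta_firth)
```

Without the penalty the treatment coefficient would grow without bound as the iterations proceed. With it, the estimates stay finite; for a two-group table like this one they coincide with the log-odds obtained after adding 0.5 to each cell.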

The multinomial logit model

The multinomial logit model, which corresponds to the case where the dependent variable has more than two categories, has a different parameterization from the logit model. It focuses on the probability of choosing one of the J categories given some explanatory variables.

The analytical expression of the model is as follows: Log[p(y =j | xi) / p(y =1 | xi)] = αj + βjXi

where category 1 is called the reference or control category. All estimated parameters have to be interpreted relative to this reference category. The probability of choosing category j is: p(y =j | xi) = exp(αj + βjXi) / [1 + Σk=2..J exp(αk + βkXi)]

For the reference category, we have: p(y =1 | xi) = 1 / [1 + Σk=2..J exp(αk + βkXi)]

The model is estimated using a maximum likelihood method; the log-likelihood is as follows: l(α,β) = Σi=1..nΣj=1..J yij log(p(y=j|xi))
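The formulas above translate directly into a few lines of code. The sketch below computes the category probabilities and the log-likelihood for given parameters, with the reference category stored in the first column; the function names and the numeric values of α, β and the data are hypothetical.

```python
import numpy as np

def multinomial_probabilities(X, alpha, beta):
    """Category probabilities for a multinomial logit model.

    X: (n, k) explanatory variables; alpha: (J-1,); beta: (J-1, k).
    Returns an (n, J) array whose first column is the reference category.
    """
    eta = alpha + X @ beta.T                            # alpha_j + beta_j x_i
    eta = np.column_stack([np.zeros(len(X)), eta])      # reference category: 0
    eta = eta - eta.max(axis=1, keepdims=True)          # numerical stability only
    expo = np.exp(eta)
    return expo / expo.sum(axis=1, keepdims=True)

def log_likelihood(X, Y, alpha, beta):
    """Sum over i and j of y_ij * log p(y = j | x_i); Y is (n, J) one-hot."""
    return float(np.sum(Y * np.log(multinomial_probabilities(X, alpha, beta))))

# Hypothetical example: n = 4 observations, k = 2 variables, J = 3 categories.
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 0.0], [2.0, 2.0]])
alpha = np.array([0.2, -0.4])
beta = np.array([[1.0, -0.5], [0.3, 0.8]])
Y = np.eye(3)[[0, 1, 2, 1]]                             # observed categories

P = multinomial_probabilities(X, alpha, beta)
print(P.sum(axis=1))                                    # each row sums to 1
print(log_likelihood(X, Y, alpha, beta))
```

Dividing any column by the first one and taking the logarithm recovers αj + βjXi, which is the defining equation of the model.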

To estimate the β parameters of the model (the coefficients of the linear function), we try to maximize the likelihood function. Contrary to linear regression, an exact analytical solution does not exist. XLSTAT uses the Newton-Raphson algorithm to iteratively find a solution.

Some results that are displayed for the logistic regression are not applicable in the multinomial case.

Confidence intervals for Logistic regression

Confidence intervals for the parameters are calculated as for linear regression, assuming that the parameter estimates are normally distributed. XLSTAT also offers the "profile likelihood" method, a more reliable alternative because it does not require that assumption.
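The difference between the two kinds of interval can be seen on the simplest possible model, a constant-only Logit model fitted to hypothetical data (7 events out of 20 trials). This is a sketch under that assumption, not XLSTAT's procedure: with a single parameter the profile likelihood is just the likelihood itself, whereas with several parameters the remaining ones would have to be re-estimated at each trial value.

```python
import math

CHI2_95_1DF = 3.841458820694124   # 95% quantile of the chi-square law, 1 df
Z_975 = 1.959963984540054         # 97.5% quantile of the standard normal law

def log_lik(beta, successes, n):
    """Binomial log-likelihood (up to a constant) when logit(p) = beta."""
    return successes * beta - n * math.log1p(math.exp(beta))

def wald_interval(successes, n):
    """Interval based on the normal approximation of the estimate."""
    p_hat = successes / n
    beta_hat = math.log(p_hat / (1.0 - p_hat))
    std_err = 1.0 / math.sqrt(n * p_hat * (1.0 - p_hat))
    return beta_hat - Z_975 * std_err, beta_hat + Z_975 * std_err

def profile_interval(successes, n, search_width=20.0, tol=1e-10):
    """Values of beta whose log-likelihood is within CHI2/2 of the maximum."""
    p_hat = successes / n
    beta_hat = math.log(p_hat / (1.0 - p_hat))
    cutoff = log_lik(beta_hat, successes, n) - CHI2_95_1DF / 2.0

    def bound(direction):
        # Bisection between the estimate (inside) and a far point (outside).
        inside, outside = beta_hat, beta_hat + direction * search_width
        while abs(outside - inside) > tol:
            middle = 0.5 * (inside + outside)
            if log_lik(middle, successes, n) > cutoff:
                inside = middle
            else:
                outside = middle
        return 0.5 * (inside + outside)

    return bound(-1.0), bound(+1.0)

print("Wald:   ", wald_interval(7, 20))
print("Profile:", profile_interval(7, 20))
```

The Wald interval is symmetric around the estimate by construction, while the profile likelihood interval follows the actual shape of the likelihood and is generally asymmetric.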

XLSTAT results for Logistic regression

XLSTAT can display the classification table (also called the confusion matrix) used to calculate the percentage of well-classified observations for a given cutoff point. Typically, for a cutoff value of 0.5, an observation whose predicted probability is less than 0.5 is assigned to class 0; otherwise it is assigned to class 1.
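The classification rule described above can be sketched as follows. The observed classes and predicted probabilities are made-up values, and the function name is illustrative only.

```python
def classification_table(y_true, probabilities, cutoff=0.5):
    """Count (observed, predicted) pairs and the percentage well classified."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for observed, prob in zip(y_true, probabilities):
        predicted = 0 if prob < cutoff else 1      # below the cutoff: class 0
        table[(observed, predicted)] += 1
    correct = table[(0, 0)] + table[(1, 1)]
    return table, 100.0 * correct / len(y_true)

# Hypothetical observed classes and model probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probabilities = [0.10, 0.40, 0.35, 0.80, 0.45, 0.60, 0.70, 0.90]

table, percent_correct = classification_table(y_true, probabilities)
for (observed, predicted), count in sorted(table.items()):
    print(f"observed={observed} predicted={predicted}: {count}")
print(f"well classified: {percent_correct:.1f}%")
```

Changing `cutoff` moves observations between the predicted classes, so the percentage of well-classified observations depends on the cutoff chosen.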

The ROC curve can also be displayed. The ROC curve (Receiver Operating Characteristics) displays the performance of a model and enables a comparison to be made with other models. The terms used come from signal detection theory.

The proportion of well-classified positive events is called the sensitivity. The specificity is the proportion of well-classified negative events.
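Sensitivity and specificity can be computed for every possible cutoff, which gives the points of the ROC curve. The sketch below uses made-up classes and probabilities and a trapezoidal rule for the area under the curve; the function names are illustrative and this is not a description of XLSTAT's implementation.

```python
def roc_points(y_true, probabilities):
    """Return (1 - specificity, sensitivity) pairs, one per distinct cutoff."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]                          # cutoff above every probability
    for cutoff in sorted(set(probabilities), reverse=True):
        true_pos = sum(1 for y, p in zip(y_true, probabilities) if p >= cutoff and y == 1)
        false_pos = sum(1 for y, p in zip(y_true, probabilities) if p >= cutoff and y == 0)
        sensitivity = true_pos / positives
        specificity = 1.0 - false_pos / negatives
        points.append((1.0 - specificity, sensitivity))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC points, which are ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical observed classes and model probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probabilities = [0.10, 0.40, 0.35, 0.80, 0.45, 0.60, 0.70, 0.90]

points = roc_points(y_true, probabilities)
print(points)
print("AUC:", area_under_curve(points))
```

A model that ranks every positive event above every negative one has an area of 1, while a model that ranks at random has an area close to 0.5, which is what makes the curve useful for comparing models.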

Results for logistic regression in XLSTAT