Naive Bayes classifier

The Naive Bayes classifier is a popular supervised machine learning algorithm that assumes independence among predictors. It is available in Excel using XLSTAT.

What is the Naive Bayes classifier?

The Naive Bayes classifier is a supervised machine learning algorithm that classifies a set of observations according to rules determined by the algorithm itself. The classifier must first be trained on a training dataset that shows which class is expected for a given set of inputs. During the training phase, the algorithm derives classification rules from this training dataset; during the prediction phase, it applies those rules to classify the observations of the prediction dataset. The classes of the training observations must be known and provided, hence the supervised nature of the technique.
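The two phases can be illustrated with a minimal Python sketch for qualitative inputs. This is not XLSTAT's implementation: the function names `train` and `predict` and the toy dataset are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Training phase: count classes and feature values within each class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts

def predict(model, row):
    """Prediction phase: return the class with the largest P(y) * prod_j P(x_j | y)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(y)
        for j, value in enumerate(row):
            score *= value_counts[(label, j)][value] / n  # empirical P(x_j | y)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training dataset: (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
        ("rainy", "no"), ("sunny", "no")]
labels = ["yes", "no", "no", "yes", "yes"]

model = train(rows, labels)
print(predict(model, ("sunny", "no")))
```

The training step only stores counts, which is one reason the method is fast: prediction reduces to a handful of multiplications per class.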

Historically, the Naive Bayes classifier has been used in document classification and spam filtering. Today it is a well-known classifier with applications in numerous areas. It requires only a limited amount of training data to estimate the necessary parameters, and it can be extremely fast compared to some other techniques. Finally, in spite of its strong simplifying assumption of independence between variables (see description below), the Naive Bayes classifier performs quite well in many real-world situations, which makes it an algorithm of choice among supervised machine learning methods.

At the root of the Naive Bayes classifier is Bayes' theorem, combined with the naive assumption that every pair of variables (features) is conditionally independent given the class.
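In symbols, the idea can be written as follows. The notation is introduced here for illustration and is not taken from the XLSTAT documentation: y denotes a class, x_1, ..., x_n the values of the n variables for one observation, and ŷ the predicted class.

```latex
% Bayes' theorem applied to the class y given the observed features
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}

% Naive assumption: features are independent given the class
P(x_1, \dots, x_n \mid y) = \prod_{j=1}^{n} P(x_j \mid y)

% The denominator does not depend on y, so the predicted class is
\hat{y} = \arg\max_{y} \; P(y) \prod_{j=1}^{n} P(x_j \mid y)
```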

Naive Bayes classifier options in XLSTAT

Distribution of the quantitative variables

Same parametric/empirical distribution for all quantitative variables: this option allows you to choose the same parametric or empirical distribution for all quantitative variables.
Select a specific distribution for each quantitative variable: this option allows you to select a specific parametric distribution for each quantitative variable, or to treat it as following an empirical distribution. The parametric distribution can be selected from the following set: normal, log-normal, gamma, exponential, logistic, Poisson, binomial, Bernoulli, uniform.

Qualitative variables are implicitly treated as following independent empirical distributions. The parameters of the selected parametric distributions are estimated using the method of moments.
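As an illustration of the method of moments, the sketch below matches the sample mean and variance to the theoretical moments of a normal and a gamma distribution. The function names are hypothetical, and the use of the 1/n (population) variance is an assumption; the exact estimator used by XLSTAT is not stated here.

```python
import statistics

def normal_moments(sample):
    """Method-of-moments estimates for a normal distribution: (mean, std)."""
    mu = statistics.fmean(sample)
    sigma = statistics.pstdev(sample)  # assumption: 1/n (population) variance
    return mu, sigma

def gamma_moments(sample):
    """Method-of-moments estimates for a gamma distribution: (shape, scale).

    Solves mean = shape * scale and variance = shape * scale**2.
    """
    mean = statistics.fmean(sample)
    var = statistics.pvariance(sample)
    return mean * mean / var, var / mean

# Hypothetical values of one quantitative variable within one class
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu, sigma = normal_moments(sample)    # mean 5.0, standard deviation 2.0
shape, scale = gamma_moments(sample)  # shape 6.25, scale 0.8
print(mu, sigma, shape, scale)
```

In a naive Bayes setting these estimates would be computed separately for each class and each quantitative variable.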

Breaking ties

Prediction using the naive Bayes approach can end in a tie, where several classes have the same probability P(y) for a given observation. The following options are available to break such ties:

Random breaker: chooses a random class in the set of classes having the same P(y).
Smallest Index: chooses the first class encountered in the set of classes having the same P(y).
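The two options can be sketched in Python as follows. The function name `break_tie`, the method strings, and the tolerance `tol` used to decide that two probabilities are equal are all hypothetical choices made for this illustration.

```python
import random

def break_tie(probabilities, method="smallest_index", rng=None, tol=1e-12):
    """Return the index of the predicted class, resolving ties if needed.

    probabilities: class probabilities, ordered by class index.
    """
    best = max(probabilities)
    tied = [i for i, p in enumerate(probabilities) if abs(p - best) <= tol]
    if len(tied) == 1 or method == "smallest_index":
        return tied[0]  # first class encountered among the tied ones
    if method == "random":
        return (rng or random).choice(tied)  # uniform draw among tied classes
    raise ValueError(f"unknown method: {method}")

print(break_tie([0.4, 0.4, 0.2]))  # classes 0 and 1 tie; smallest index wins
print(break_tie([0.2, 0.4, 0.4], method="random", rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the random breaker reproducible, which is convenient when results must be repeated.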

Laplace smoothing parameter

Laplace smoothing prevents estimated probabilities from being exactly zero or one.
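A common form of Laplace (additive) smoothing adds a constant to every count before dividing. The sketch below assumes this standard formula; the function and parameter names are hypothetical, and XLSTAT's exact formula may differ.

```python
def smoothed_probability(count, class_total, n_values, alpha=1.0):
    """Estimate P(x = value | class) with additive (Laplace) smoothing.

    count:       occurrences of the value within the class
    class_total: number of training observations in the class
    n_values:    number of distinct values the variable can take
    alpha:       smoothing parameter (alpha = 0 gives the raw frequency)
    """
    return (count + alpha) / (class_total + alpha * n_values)

# A value never seen in a class of 2 observations, for a binary variable
print(smoothed_probability(0, 2, 2))             # 0.25 instead of 0
# A value seen in every observation of that class
print(smoothed_probability(2, 2, 2))             # 0.75 instead of 1
# Without smoothing the estimate collapses to zero
print(smoothed_probability(0, 2, 2, alpha=0.0))  # 0.0
```

Because the naive Bayes score is a product of per-variable probabilities, a single zero would eliminate a class regardless of the other variables, which is what smoothing avoids.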

Naive Bayes classifier results in XLSTAT

Results corresponding to the parameters involved in the classification process

The type of probability distribution used is reported.

Qualitative variables are implicitly considered to follow an empirical distribution.

The nature of the a priori distribution of the classes (uniform, not uniform) is also reported.

Results regarding the classifier

To evaluate and score the naive Bayes classifier, a confusion matrix computed using the leave-one-out method is displayed, along with an accuracy index.
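The leave-one-out procedure can be sketched as follows: each observation is held out in turn, the model is fitted on the remaining ones, and the held-out observation is classified. The one-variable Gaussian naive Bayes model, the function names, and the data are hypothetical and chosen only to keep the example short.

```python
import math
from collections import defaultdict

def fit(xs, ys):
    """Fit a one-variable Gaussian naive Bayes model: class -> (prior, mean, std)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    model = {}
    for y, vals in groups.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        # Floor the standard deviation to avoid division by zero
        model[y] = (len(vals) / len(xs), mean, max(math.sqrt(var), 1e-6))
    return model

def predict(model, x):
    """Return the class with the highest log prior plus Gaussian log density."""
    def log_score(y):
        prior, mean, std = model[y]
        return math.log(prior) - math.log(std) - 0.5 * ((x - mean) / std) ** 2
    return max(model, key=log_score)

def leave_one_out_confusion(xs, ys):
    """Return (confusion matrix as matrix[actual][predicted], accuracy)."""
    classes = sorted(set(ys))
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        matrix[ys[i]][predict(model, xs[i])] += 1
    correct = sum(matrix[c][c] for c in classes)
    return matrix, correct / len(xs)

xs = [1.0, 1.2, 0.8, 1.1, 3.0, 3.2, 2.9, 3.1]
ys = ["a", "a", "a", "a", "b", "b", "b", "b"]
matrix, accuracy = leave_one_out_confusion(xs, ys)
print(matrix, accuracy)
```

The diagonal of the matrix counts correctly classified observations, and the accuracy is the sum of the diagonal divided by the number of observations.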

Results regarding the validation method

The error rate of the naive Bayes model obtained using K-fold cross-validation is reported, together with the number of folds.

The cross-validation results enable the selection of adequate model parameters.
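A minimal sketch of a K-fold cross-validated error rate is given below. It assumes contiguous, unshuffled folds and a small one-variable categorical naive Bayes model with add-one smoothing; these choices, the function names, and the data are hypothetical and are not XLSTAT's procedure.

```python
from collections import Counter

def fit(xs, ys):
    """One-variable categorical naive Bayes: class counts, pair counts, value count."""
    return Counter(ys), Counter(zip(ys, xs)), len(set(xs))

def predict(model, x):
    """Return the class maximising P(y) * P(x | y), with add-one smoothing."""
    class_counts, pair_counts, n_values = model
    total = sum(class_counts.values())
    def score(y):
        n = class_counts[y]
        return (n / total) * (pair_counts[(y, x)] + 1) / (n + n_values)
    return max(sorted(class_counts), key=score)

def k_fold_error_rate(xs, ys, k):
    """Pooled misclassification rate over k contiguous held-out folds."""
    n = len(xs)
    errors = 0
    for fold in range(k):
        test_idx = set(range(fold * n // k, (fold + 1) * n // k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        model = fit(train_x, train_y)
        errors += sum(predict(model, xs[i]) != ys[i] for i in test_idx)
    return errors / n

xs = ["red", "red", "blue", "blue", "red", "blue", "red", "blue"]
ys = ["spam", "spam", "ham", "ham", "spam", "ham", "spam", "ham"]
print(k_fold_error_rate(xs, ys, k=4))
```

Comparing this error rate across candidate settings (for example, different smoothing parameters or distributions) is one way the cross-validation output can guide parameter selection.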

Results corresponding to the predicted classes

The predicted classes obtained using the naive Bayes classifier are displayed. In addition to the predicted classes, the a posteriori probabilities used to predict each observation are also reported.
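To show how a posteriori probabilities relate to the predicted class, the sketch below normalises per-class scores so that they sum to one. Working with logarithms and subtracting the maximum is a common numerical precaution assumed here, not a documented XLSTAT detail; the function name and the input values are hypothetical.

```python
import math

def posteriors(log_joint):
    """Convert per-class scores log P(y) + sum_j log P(x_j | y) into P(y | x)."""
    top = max(log_joint.values())
    # Subtracting the maximum avoids underflow when the scores are very negative
    weights = {y: math.exp(s - top) for y, s in log_joint.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

# Hypothetical unnormalised scores for one observation and two classes
log_joint = {"a": math.log(0.03), "b": math.log(0.01)}

probs = posteriors(log_joint)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

Here class "a" receives a posterior probability of about 0.75 and class "b" about 0.25, so "a" is the predicted class.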

