What is the Naive Bayes classification method?

Naive Bayes (NB) is a classification method based on Bayes' theorem, a probability rule that updates probabilities based on evidence. Given observed input features $x = (x_1, x_2, \dots, x_n)$ and a target class variable $Y$, NB computes the conditional class probability $P(Y = y_i | X_1 = x_1, X_2 = x_2, \dots, X_n = x_n)$ for every class value $y_i$, that is, the probability that the class variable takes on the value $y_i$ given the observed values of the input variables. It then outputs the class value with the highest probability.

As such, NB is a parametric classification method, where the parameters are the conditional probabilities between the class value and the input variables, together with the structure of the adopted model. NB’s underlying assumption is that the input features $X$ used for classification have a direct probability relationship with the target value but not with one another. This means any correlation among the input variables’ observed values can be completely explained by their relationship with the class/target.

Example

For example, consider the task of predicting the SkillLevel of a player in the Dotalicious dataset. Applying an NB classifier to this task imposes an implicit assumption that all input features, such as the number of deaths and kills, are conditionally independent given the knowledge of the SkillLevel. In other words, if we know a player’s SkillLevel, we don’t need the number of deaths to infer the number of kills and vice versa; that is, once the SkillLevel is known, neither of these features provides additional information about the other.

Relation when the SkillLevel value isn’t known

On the other hand, if we don’t know the value of SkillLevel, these two variables may be correlated due to their indirect relationship through SkillLevel acting as a proxy. Therefore, we need to confirm that our variables conform with this assumption (see the sketch below); if not, we may need to do some processing on the data to derive a set of features that does.
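A quick way to probe this assumption is to compare how strongly two features correlate overall versus within each class. The snippet below is a minimal sketch assuming a pandas DataFrame with hypothetical column names (Kills, Deaths, SkillLevel) standing in for the Dotalicious data.

```python
import pandas as pd

def conditional_correlation_check(df: pd.DataFrame) -> None:
    """Compare overall vs. within-class correlation of two features.

    The column names (Kills, Deaths, SkillLevel) are hypothetical
    placeholders; adjust them to the actual dataset at hand.
    """
    overall = df["Kills"].corr(df["Deaths"])
    print(f"Overall correlation: {overall:.3f}")

    # If the conditional-independence assumption roughly holds, the
    # correlation within each SkillLevel group should be much weaker
    # than the overall correlation.
    for level, group in df.groupby("SkillLevel"):
        within = group["Kills"].corr(group["Deaths"])
        print(f"SkillLevel={level}: within-class correlation {within:.3f}")
```

If the within-class correlations remain strong, the features likely violate the assumption, and transformations such as dropping or combining redundant features are worth considering.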

Comparison with other classification techniques

Compared to other classification techniques, NB is extremely fast and not susceptible to the curse of dimensionality, a notorious challenge when dealing with big data.

Curse of dimensionality

In the context of classification, the curse of dimensionality refers to the situation in which the volume of the data space expands very quickly, typically at an exponential or higher rate, as the number of input features increases, making the available data sparse. As a result, a learning algorithm susceptible to this curse either requires a prohibitively large amount of computing resources (in terms of space, speed, or time) or cannot yield a good model due to the sparsity of the available data.
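To get a feel for this growth, consider discretizing each feature into a fixed number of bins: the number of cells in the joint feature space, and hence the amount of data needed to populate it, grows exponentially with the number of features. The tiny calculation below uses an arbitrary choice of 10 bins per feature.

```python
# Number of cells in a discretized feature space with 10 bins per feature.
# The cell count grows exponentially with the number of features, so any
# fixed-size dataset quickly becomes sparse relative to the space it covers.
bins_per_feature = 10
for n_features in (2, 5, 10, 20):
    print(f"{n_features} features -> {bins_per_feature ** n_features:,} cells")
```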

NB isn’t susceptible to this curse because the assumption of independence between features simplifies the relationship among features, breaking the otherwise complex label estimation down into single-dimensional probability terms. This can also be problematic, however, as some argue that such an assumption may be too strong. Moreover, since NB estimates the probability terms from the data, in cases of imbalanced or sparse data (that is, when we have very few or no data points for some feature values), there can be many probabilities at or close to zero, which affects the overall class estimation.
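The impact of such near-zero estimates is easy to see: because the class score is a product of per-feature terms, a single zero likelihood wipes out the entire score for that class. A minimal numeric sketch with made-up probabilities:

```python
import numpy as np

# Made-up per-feature likelihoods P(x_i | y) for one candidate class;
# the third feature value was never observed together with this class.
likelihoods = np.array([0.30, 0.45, 0.0, 0.25])
prior = 0.5

score = prior * np.prod(likelihoods)
print(score)  # 0.0 -- the single unseen value eliminates the class entirely

# In practice, this is commonly mitigated by smoothing the estimated
# probabilities (e.g., adding a small count to every cell before normalizing).
```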

However, in practice, despite the naivete of the assumption, NB has been shown to still be a very strong baseline classifier that is worth a try, at least in the early stages of the data analysis process.

Method

Before delving into the technical details of NB, we’re encouraged to brush up on probability basics, especially Bayes’ theorem.

NB formulates the classification problem as a problem of finding the most probable class label given the input features as evidence, that is, solving the optimization equation:

y^* = \arg\max_y P(Y=y|x_1,x_2,\dots,x_n)

For any class value $y$, the probability can be computed as follows:

P(Y=y|x_1,x_2,\dots,x_n) = \frac{P(x_1,x_2,\dots,x_n|Y=y)\,P(Y=y)}{P(x_1,x_2,\dots,x_n)}

= \frac{P(y)\prod_{i=1}^{n} P(x_i|y)}{P(x_1,x_2,\dots,x_n)}

For the sake of simplifying notation, we write $P(y)$ and $P(x_i|y)$ in place of $P(Y=y)$ and $P(x_i|Y=y)$, respectively. The last equality follows from the chain rule and the independence assumption. Specifically, first, we apply the chain rule:

P(x_1,x_2,x_3,\dots,x_n|y) = P(x_1|x_2,x_3,\dots,x_n,y)\,P(x_2|x_3,\dots,x_n,y)\cdots P(x_{n-1}|x_n,y)\,P(x_n|y)

Under the independence assumption, that is, $x_1, x_2, x_3, \dots, x_n$ are conditionally independent given $y$, it follows that:

P(x_1|x_2,x_3,\dots,x_n,y) = P(x_1|y)

P(x_2|x_3,x_4,\dots,x_n,y) = P(x_2|y)

\vdots

P(x_{n-1}|x_n,y) = P(x_{n-1}|y)

Substituting these terms back into the chain rule yields:

P(x_1,x_2,\dots,x_n|y) = \prod_{i=1}^{n} P(x_i|y)

Since $P(x_1, x_2, \dots, x_n)$ is constant given the input $(x_1, x_2, \dots, x_n)$, solving the equation for $y^*$ above is equivalent to solving:

y^* = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i|y)
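Once the prior $P(y)$ and the likelihood tables $P(x_i|y)$ are available, this decision rule is straightforward to evaluate. The sketch below is a minimal illustration for discrete features, assuming hypothetical, pre-estimated probability tables; it is not the only way to organize an NB implementation.

```python
def nb_predict(x, priors, likelihoods):
    """Return the class maximizing P(y) * prod_i P(x_i | y).

    x           -- tuple of discrete feature values
    priors      -- dict mapping class -> P(y)
    likelihoods -- dict mapping class -> list (one entry per feature) of
                   dicts mapping a feature value -> P(x_i | y)
    """
    best_class, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= likelihoods[y][i].get(value, 0.0)  # unseen values get probability 0
        if score > best_score:
            best_class, best_score = y, score
    return best_class


# Hypothetical two-feature, two-class example (all numbers made up).
priors = {"low": 0.6, "high": 0.4}
likelihoods = {
    "low":  [{"few": 0.7, "many": 0.3}, {"few": 0.8, "many": 0.2}],
    "high": [{"few": 0.2, "many": 0.8}, {"few": 0.4, "many": 0.6}],
}
print(nb_predict(("many", "many"), priors, likelihoods))  # -> "high"
```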

While $P(y)$ can be estimated from the training data, that is, by counting the occurrence frequency of records labeled with the class $y$, there are many ways to estimate the conditional probability $P(x_i|y)$, also called the likelihood, of each input variable $x_i$, each of which yields a different variant of NB.
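For discrete features, the same counting idea gives a simple (unsmoothed) estimate of both the prior and the likelihoods. The sketch below is one possible arrangement, producing tables in the same layout as the prediction sketch above; real implementations typically add smoothing.

```python
from collections import Counter, defaultdict

def nb_fit(records, labels):
    """Estimate P(y) and P(x_i | y) by frequency counting.

    records -- list of tuples of discrete feature values
    labels  -- list of class labels aligned with records
    """
    n = len(labels)
    class_counts = Counter(labels)
    priors = {y: c / n for y, c in class_counts.items()}

    n_features = len(records[0])
    # counts[y][i][v] = number of class-y records whose feature i equals v
    counts = {y: [defaultdict(int) for _ in range(n_features)] for y in class_counts}
    for x, y in zip(records, labels):
        for i, v in enumerate(x):
            counts[y][i][v] += 1

    likelihoods = {
        y: [{v: c / class_counts[y] for v, c in counts[y][i].items()}
            for i in range(n_features)]
        for y in class_counts
    }
    return priors, likelihoods
```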

Likelihood estimation

The most popular approach is to assume some parametric form for $P(x_i|y)$ and then learn the parameters defining the distribution from the data. For instance, Bernoulli NB, Gaussian NB, and categorical NB are three NB classifiers that, respectively, assume $P(x_i|y)$ takes the form of a Bernoulli, Gaussian, or categorical distribution. The table below summarizes the applicability of these distribution forms and their parameters. This list is certainly not exhaustive; depending on the nature of the measures for the variables (for example, continuous or discrete, binary or multivalued), other distributions can be used.
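These variants are available off the shelf, for example in scikit-learn's naive_bayes module. The snippet below is a minimal usage sketch on synthetic data (the dataset, feature encodings, and default hyperparameters are assumptions for illustration only).

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)  # binary class labels

# Gaussian NB: continuous features, one Gaussian per feature and class.
X_cont = rng.normal(loc=y[:, None], scale=1.0, size=(200, 3))
print(GaussianNB().fit(X_cont, y).score(X_cont, y))

# Bernoulli NB: binary features.
X_bin = (rng.random((200, 3)) < 0.3 + 0.4 * y[:, None]).astype(int)
print(BernoulliNB().fit(X_bin, y).score(X_bin, y))

# Categorical NB: discrete, multivalued features encoded as non-negative integers.
X_cat = rng.integers(0, 4, size=(200, 3))
print(CategoricalNB().fit(X_cat, y).score(X_cat, y))
```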
