
Naïve Bayes explained

13 min read
Mar 04, 2024
Contents
Overview
Bayes’ theorem
Naïve Bayes
How Naïve Bayes works
The Naïve Bayes algorithm
Example
Pros and cons of using Naïve Bayes
Conclusion and next steps


Overview#

You might be familiar with the growing excitement around machine learning and its various applications. Amidst this frenzy, one algorithm stands out for its simplicity and effectiveness in classification tasks: the Naïve Bayes classifier. Its versatility makes it applicable in numerous real-world scenarios, including the following:

  • Spam email detection: Based on the presence of specific words or phrases, Naïve Bayes is used to separate spam emails from valid ones.

  • Sentiment analysis: This is a technique used in natural language processing applications, such as customer reviews or social media posts, to classify text into positive, negative, or neutral attitudes.

  • Medical diagnosis: Based on test findings and patient symptoms, Naïve Bayes is used to forecast the likelihood that a specific disease is present.

  • Recommendation systems: Naïve Bayes is used in recommendation engines to predict user preferences and make relevant product or service recommendations based on user activity.

  • Document classification: News articles, academic papers, and legal documents are among the specified categories that can be classified using Naïve Bayes.

As seen from the examples above, the Naïve Bayes algorithm can work well with textual data as well as structured (or tabular) data. In this blog, our focus will be on structured data. We will see how Naïve Bayes works, along with exploring some of its advantages and disadvantages.

Bayes’ theorem#

Before delving further, let’s first take a look at Bayes’ theorem, upon which the Naïve Bayes algorithm is based.

Bayes’ theorem is a key idea in probability theory and statistics that explains how to update the probability of a hypothesis (or event) in light of fresh information or data.

Bayes’ theorem mathematically expresses this relationship between event A and event B as follows:

P(A|B) = P(B|A) * P(A) / P(B)

Here:

  • P(A|B), known as the posterior probability, denotes the conditional probability of event A occurring, given that event B has already occurred.

Note: Conditional probability is defined as the probability of an event (A) occurring, given that another event (B) has already occurred.

  • P(B|A), also called the likelihood, is the conditional probability of event B happening, given that event A has already occurred.

  • P(A), also known as the prior probability, is the probability of event A occurring based on prior knowledge.

  • P(B), known as the evidence probability, is the probability of event B occurring.

Let’s explain Bayes’ theorem using a coin toss example.

Assume you have two coins: one is a fair coin (A), and the other is a biased coin (B) with a higher probability of landing on tails. You randomly pick one of the coins and toss it. Now, you want to find the probability that the coin is the biased one (B), given that it landed on tails.

Let’s define it as follows:

  • A: The event that coin A is chosen.

  • B: The event that coin B is chosen.

  • T: The event that the coin lands on tails.

We’ll assume the following probabilities:

  • P(A) = 0.5 (the prior probability of choosing coin A).

  • P(B) = 0.5 (the prior probability of choosing coin B).

  • P(T|A) = 0.5 (the conditional probability of getting tails with coin A).

  • P(T|B) = 0.7 (the conditional probability of getting tails with coin B).

Using Bayes’ theorem, we derive the following equation:

P(B|T) = P(T|B) * P(B) / P(T)

Note: We can calculate P(T) using the law of total probability: P(T) = P(T|B) * P(B) + P(T|A) * P(A) = 0.7 * 0.5 + 0.5 * 0.5 = 0.6.

We have all the values that can be plugged into the above equation. Therefore,

P(B|T) = (0.7 * 0.5) / 0.6 ≈ 0.58

In other words, given that the toss landed on tails, there is roughly a 58% chance that the biased coin was picked.
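For readers who want to verify the arithmetic, here is a minimal Python sketch of the calculation above (the variable names are our own):

```python
# Bayes' theorem for the coin example: P(B | T) = P(T | B) * P(B) / P(T)
p_a, p_b = 0.5, 0.5      # priors: probability of picking coin A or coin B
p_t_given_a = 0.5        # P(tails | fair coin A)
p_t_given_b = 0.7        # P(tails | biased coin B)

# Law of total probability: P(T) = P(T | B) * P(B) + P(T | A) * P(A)
p_t = p_t_given_b * p_b + p_t_given_a * p_a   # 0.6

p_b_given_t = p_t_given_b * p_b / p_t         # 0.35 / 0.6 ≈ 0.583
print(f"P(biased coin | tails) = {p_b_given_t:.3f}")
```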

Now that we have a brief understanding of Bayes’ theorem, let’s take a look at the Naïve Bayes algorithm—a probabilistic algorithm that is based on the assumption that features are independent of one another. The assumption of independence among features means that the occurrence or value of one feature does not affect or depend on the occurrence or value of another feature. However, this assumption might not hold true in real-world applications.

Naïve Bayes classifier

How Naïve Bayes works#

Assume we have k attributes, A_1 through A_k, each with a set of distinct values. The class is C and can take multiple distinct values. Let’s further suppose that a test example d with observed attribute values a_1 through a_k is provided, where a_i is the value of attribute A_i for i = 1, ..., k. In essence, classification involves calculating the posterior probability of each class. The prediction is the class c_j such that

P(C = c_j | A_1 = a_1, ..., A_k = a_k)

is maximum.

By applying Bayes’ theorem, we get:

P(C = c_j | A_1 = a_1, ..., A_k = a_k) = P(A_1 = a_1, ..., A_k = a_k | C = c_j) * P(C = c_j) / P(A_1 = a_1, ..., A_k = a_k)

The class prior probability, P(C = c_j), is easily determined. It is the probability of a class without taking any attributes into account, and it is computed as the fraction of examples in the dataset that belong to that class.

The denominator, P(A_1 = a_1, A_2 = a_2, ..., A_k = a_k), can be ignored because it is the same for every class.

Therefore, we only need to calculate:

P(A_1 = a_1, ..., A_k = a_k | C = c_j) * P(C = c_j)

This can be written as:

P(C = c_j) * P(A_1 = a_1 | C = c_j) * P(A_2 = a_2 | C = c_j) * ... * P(A_k = a_k | C = c_j)

Note: The factorization above is only possible because of the Naïve Bayes assumption that all features are independent of each other given the class.

Now, how do we calculate P(A_i = a_i | C = c_j)? Simple! We calculate it as follows:

P(A_i = a_i | C = c_j) = (number of training examples with A_i = a_i and class c_j) / (number of training examples with class c_j)

This can be easily computed by counting occurrences in the dataset.

Finally, we have all the details. Therefore, given a test example d, we determine the most likely class by computing:

c = argmax over c_j of P(C = c_j) * P(A_1 = a_1 | C = c_j) * ... * P(A_k = a_k | C = c_j)

The Naïve Bayes algorithm#

Let’s now take a look at the algorithm step by step, followed by a working example. A short from-scratch code sketch of these steps appears after the list.

  1. Calculate prior probabilities: Calculate the prior probabilities of each class in the given dataset.

  2. Calculate likelihoods: For each feature and each class, calculate the likelihood of observing a specific feature value given the class. This involves counting the occurrences of feature values for each class in the training data.

  3. Calculate posterior probabilities: For a new, unseen data point, calculate the posterior probabilities for each class using Bayes’ theorem. The posterior probability of a class given the features is proportional to the prior probability of the class and the product of the likelihoods for each feature.

  4. Make a prediction: Select the class with the highest posterior probability as the predicted class for the new data point. This is the class that the algorithm believes is most likely given the observed features.

Naïve Bayes in steps
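For concreteness, here is a minimal from-scratch Python sketch of these four steps for categorical features. It is only an illustrative sketch (the function and variable names are our own, not from any particular library), not a production implementation; it uses plain counting exactly as described above.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Steps 1 and 2: estimate class priors and per-feature likelihoods by counting."""
    total = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / total for c in class_counts}

    # likelihoods[c][i][v] = P(attribute i takes value v | class c)
    likelihoods = {c: defaultdict(Counter) for c in class_counts}
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            likelihoods[label][i][value] += 1
    for c, feature_counts in likelihoods.items():
        for i, value_counts in feature_counts.items():
            feature_counts[i] = {v: n / class_counts[c] for v, n in value_counts.items()}
    return priors, likelihoods

def predict(priors, likelihoods, row):
    """Steps 3 and 4: score each class with prior * product of likelihoods, pick the best."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(row):
            score *= likelihoods[c][i].get(value, 0.0)  # unseen value -> 0 (see smoothing later)
        scores[c] = score
    return max(scores, key=scores.get)
```

The example in the next section performs the same calculations by hand on a small medical dataset.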

Example#

Let’s assume that we have the following dataset.

| Fever | Fatigue | Cough | Disease     |
|-------|---------|-------|-------------|
| Yes   | No      | No    | Influenza   |
| No    | Yes     | No    | Common cold |
| No    | No      | Yes   | Influenza   |
| Yes   | No      | Yes   | Other       |
| No    | No      | No    | Influenza   |
| Yes   | Yes     | Yes   | Common cold |
| Yes   | No      | Yes   | Influenza   |
| No    | Yes     | No    | Other       |
| Yes   | Yes     | No    | Influenza   |
| No    | No      | Yes   | Common cold |

Step 1: Calculate prior probabilities

Based on the given data, we have the following prior probabilities:

  • P(Disease = Influenza) = 5/10

  • P(Disease = Common cold) = 3/10

  • P(Disease = Other) = 2/10

Step 2: Calculate likelihoods

For each feature, we calculate the likelihood of observing “Yes” or “No” for each class as follows (a short code check after this list reproduces a couple of these counts from the table):

  • P(Fever = Yes | Disease = Influenza) = 3/5

  • P(Fever = Yes | Disease = Common cold) = 1/3

  • P(Fever = Yes | Disease = Other) = 1/2

  • P(Fever = No | Disease = Influenza) = 2/5

  • P(Fever = No | Disease = Common cold) = 2/3

  • P(Fever = No | Disease = Other) = 1/2

  • P(Fatigue = Yes | Disease = Influenza) = 1/5

  • P(Fatigue = Yes | Disease = Common cold) = 2/3

  • P(Fatigue = Yes | Disease = Other) = 1/2

  • P(Fatigue = No | Disease = Influenza) = 4/5

  • P(Fatigue = No | Disease = Common cold) = 1/3

  • P(Fatigue = No | Disease = Other) = 1/2

  • P(Cough = Yes | Disease = Influenza) = 2/5

  • P(Cough = Yes | Disease = Common cold) = 2/3

  • P(Cough = Yes | Disease = Other) = 1/2

  • P(Cough = No | Disease = Influenza) = 3/5

  • P(Cough = No | Disease = Common cold) = 1/3

  • P(Cough = No | Disease = Other) = 1/2
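These likelihoods are straightforward to reproduce in code. Here is a minimal sketch (the tuples mirror the table above; the variable names are our own):

```python
from collections import Counter

# (Fever, Fatigue, Cough, Disease) rows, copied from the table above
data = [
    ("Yes", "No",  "No",  "Influenza"),   ("No",  "Yes", "No",  "Common cold"),
    ("No",  "No",  "Yes", "Influenza"),   ("Yes", "No",  "Yes", "Other"),
    ("No",  "No",  "No",  "Influenza"),   ("Yes", "Yes", "Yes", "Common cold"),
    ("Yes", "No",  "Yes", "Influenza"),   ("No",  "Yes", "No",  "Other"),
    ("Yes", "Yes", "No",  "Influenza"),   ("No",  "No",  "Yes", "Common cold"),
]

disease_counts = Counter(row[3] for row in data)

# P(Fever = Yes | Disease = Influenza) = 3/5
fever_yes_flu = sum(1 for row in data if row[0] == "Yes" and row[3] == "Influenza")
print(fever_yes_flu / disease_counts["Influenza"])      # 0.6

# P(Cough = Yes | Disease = Common cold) = 2/3
cough_yes_cold = sum(1 for row in data if row[2] == "Yes" and row[3] == "Common cold")
print(cough_yes_cold / disease_counts["Common cold"])   # 0.666...
```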

Step 3: Calculate posterior probabilities

Assume we have a new data point: d = (Fever = Yes, Fatigue = No, Cough = Yes).

We need to calculate the following three probabilities:

  • P(Disease = Influenza | Fever = Yes, Fatigue = No, Cough = Yes)

  • P(Disease = Common cold | Fever = Yes, Fatigue = No, Cough = Yes)

  • P(Disease = Other | Fever = Yes, Fatigue = No, Cough = Yes)

Here is the (unnormalized) posterior score for Disease = Influenza:

P(Disease = Influenza | Fever = Yes, Fatigue = No, Cough = Yes) ∝ P(Disease = Influenza) * P(Fever = Yes | Disease = Influenza) * P(Fatigue = No | Disease = Influenza) * P(Cough = Yes | Disease = Influenza)

= 5/10 * 3/5 * 4/5 * 2/5 = 0.096

Here is the (unnormalized) posterior score for Disease = Common cold:

P(Disease = Common cold | Fever = Yes, Fatigue = No, Cough = Yes) ∝ P(Disease = Common cold) * P(Fever = Yes | Disease = Common cold) * P(Fatigue = No | Disease = Common cold) * P(Cough = Yes | Disease = Common cold)

= 3/10 * 1/3 * 1/3 * 2/3 ≈ 0.022

Here is the (unnormalized) posterior score for Disease = Other:

P(Disease = Other | Fever = Yes, Fatigue = No, Cough = Yes) ∝ P(Disease = Other) * P(Fever = Yes | Disease = Other) * P(Fatigue = No | Disease = Other) * P(Cough = Yes | Disease = Other)

= 2/10 * 1/2 * 1/2 * 1/2 = 0.025

(These scores are proportional to the true posteriors because the common denominator is ignored, as noted earlier.)

Step 4: Make a prediction

Because the posterior score for Disease = Influenza is the highest, we assign the new data point d to the influenza class.
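As a quick sanity check, the three unnormalized scores above can be reproduced with a few lines of Python:

```python
# Unnormalized posterior scores for d = (Fever=Yes, Fatigue=No, Cough=Yes)
scores = {
    "Influenza":   5/10 * 3/5 * 4/5 * 2/5,   # 0.096
    "Common cold": 3/10 * 1/3 * 1/3 * 2/3,   # ≈ 0.022
    "Other":       2/10 * 1/2 * 1/2 * 1/2,   # 0.025
}
print(max(scores, key=scores.get))           # Influenza
```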

Pros and cons of using Naïve Bayes#

Let’s now take a look at a few advantages and disadvantages of using the Naïve Bayes algorithm.

Advantages

  • The algorithm is easy to implement.
  • Naïve Bayes is very efficient and generally gives good results in many applications.
  • Naïve Bayes can be applied to both small and large datasets and is not significantly impacted by the number of features.
  • It tends to be less prone to overfitting, especially when the dataset is small.

Disadvantages

  • Naïve Bayes assumes all attributes are categorical, so numerical attributes need to be discretized first.
  • A specific attribute value may never occur with a class in the training set, which makes the estimated probability for that combination 0 and wipes out the entire product. To avoid this, we introduce a smoothing factor, usually a small constant value (see the sketch after this list).
  • The assumption that the features are independent of one another is not valid in most cases, so accuracy can be very low when the assumption is seriously violated.
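To illustrate the smoothing fix mentioned above, here is a minimal sketch of add-one (Laplace) smoothing, one common choice of smoothing factor; the function name and arguments are our own:

```python
def smoothed_likelihood(value_and_class_count, class_count, num_values, alpha=1.0):
    """Estimate P(A_i = a_i | C = c_j) with add-alpha (Laplace) smoothing.

    value_and_class_count: training examples of class c_j where attribute A_i equals a_i
    class_count:           training examples of class c_j
    num_values:            number of distinct values attribute A_i can take
    alpha:                 smoothing constant (alpha=1 is classic Laplace smoothing)
    """
    return (value_and_class_count + alpha) / (class_count + alpha * num_values)

# An attribute value never seen with a class no longer forces the whole product to 0:
print(smoothed_likelihood(0, 5, 2))   # 1/7 ≈ 0.143 instead of 0.0
```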

Conclusion and next steps#

This blog has provided a quick introduction to the Naïve Bayes algorithm. We started with a brief introduction to Bayes’ theorem, mentioned some use cases, walked through a worked example step by step, and explored the advantages and disadvantages of using Naïve Bayes for classification.

However, your journey does not end here! To create models that are more reliable and accurate, you might want to experiment with various approaches and frameworks. We recommend that you look into the following courses offered by Educative:

A Practical Guide to Machine Learning with Python


This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and there are overviews of specific algorithms dispersed throughout the course.


Machine Learning with Python Libraries


Machine learning helps software applications generate more accurate predictions. It is a type of artificial intelligence used worldwide and offers high-paying careers. This path provides a hands-on guide to multiple Python libraries that play an important role in machine learning. It also teaches you about neural networks, PyTorch tensors, PyCaret, and GANs. By the end of this module, you’ll have hands-on experience using Python libraries to automate your applications.


Mastering Machine Learning Theory and Practice


The machine learning field is rapidly advancing today due to the availability of large datasets and the ability to process big data efficiently. Moreover, several new techniques have produced groundbreaking results for standard machine learning problems. This course provides a detailed description of different machine learning algorithms and techniques, including regression, deep learning, reinforcement learning, Bayes nets, support vector machines (SVMs), and decision trees. The course also offers sufficient mathematical details for a deeper understanding of how different techniques work. An overview of the Python programming language and the fundamental theoretical aspects of ML, including probability theory and optimization, is also included. The course contains several practical coding exercises as well. By the end of the course, you will have a deep understanding of different machine-learning methods and the ability to choose the right method for different applications.


Written By:
Kamran Lodhi