## Introduction

The *Naive Bayes classifier* is a machine learning method which, for all its simplicity, is remarkably successful in many real-world applications. I am currently taking a machine learning module as part of my data science college course and this week’s practical work involved a classification problem using the Naive Bayes method. I thought that I’d write a two-part piece about the project and the concepts of probability in order to consolidate my own understanding as well as to share my method with others who may find it useful. If the reader wishes to get straight into the Bayes classification problem, please see Part 2 here.

The dataset used in the classification problem is titled “Breast cancer data” and is sourced from Matjaz Zwitter and Milan Soklic from the Institute of Oncology, University Medical Center in Ljubljana, Slovenia (formerly Yugoslavia). It is an interesting dataset which contains categorical data on two classes with nine attributes. The dataset can be found here. The two outcome classes are “no-recurrence events” and “recurrence events”, which refer to whether or not breast cancer returned in a patient who had previously been diagnosed with the disease and treated for it.

Our goal was to use these data to build a Naive Bayes classification model which could predict with good accuracy whether a patient diagnosed with breast cancer is likely to experience a recurrence of the disease based on the attributes. First, I will briefly discuss the basic concepts of probability before moving on to the classification problem in Part 2.

## Fundamentals of Probability

Probability boils down to working out proportions and performing mathematical operations on them such as addition, multiplication and division. The fact that Naive Bayes classification has been so successful given this simple foundation is truly remarkable. There are a number of rules and concepts involved in probability calculations which one should be aware of in order to understand how the Naive Bayes classifier works. We will discuss these now but first I want to explain independent events, dependent events and mutual exclusiveness.

*Independent events* are those which, upon occurrence, do not affect the probability of another event occurring. An example would be Brazil winning the football World Cup and a hurricane forming over the Atlantic Ocean in September. *Dependent* events are those which, upon occurrence, affect the probability of another event occurring (i.e. they are linked in some way). An example would be getting a cold and going to work the following day. The probability of you going to work can be affected by the occurrence of a cold as you don’t feel very well and would like to not infect your colleagues. When two or more events are *mutually exclusive*, they cannot occur simultaneously. What is the probability of rolling a 6 and a 3 on a single dice roll? Zero. It is impossible. These two outcomes are mutually exclusive.

To calculate the probability of a single event occurring, sometimes called *the observational probability*, you take the number of times the event occurred and divide it by the total number of trials which could have led to that event. For example, let’s say I sprint 100 metres fifty times over the course of a month. I want to know the probability of my time being below 10 seconds based on the previous sprints. If I ran sub-10 second 100 metres on 27 occasions, then the observational probability of me running a sub-10 second 100 metres is simply:

P(<10 second 100 m) = (27 / 50) = 0.54 (54%)

Therefore, I can expect to sprint 100 metres in under 10 seconds a little over half of the time based on records of the fifty completed sprints. Please note that the P() wrapper is used to denote the probability of some outcome and is used throughout these posts.
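The proportion above is easy to reproduce in code. Here is a minimal Python sketch (Python is used purely for illustration; the function name `observed_probability` is my own and not part of any library) which divides the number of occurrences by the number of trials:

```python
from fractions import Fraction


def observed_probability(event_count: int, trial_count: int) -> Fraction:
    """Proportion of trials in which the event occurred."""
    if trial_count <= 0:
        raise ValueError("trial_count must be positive")
    if not 0 <= event_count <= trial_count:
        raise ValueError("event_count must lie between 0 and trial_count")
    return Fraction(event_count, trial_count)


# Figures from the sprint example: 27 sub-10-second runs out of 50 sprints.
p_sub_10 = observed_probability(27, 50)
print(p_sub_10, float(p_sub_10))  # 27/50 0.54
```

Using `Fraction` keeps the result exact, which makes it easier to compare against hand calculations before converting to a decimal.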

### Multiplication Rules for AND Events

When calculating the probability of two or more events occurring simultaneously, we first consider whether the events are independent or dependent. If they are independent we can use the *simple multiplication rule*:

P(outcome 1 **AND** outcome 2) = P(outcome 1) * P(outcome 2)

If I were to calculate the probability of Brazil winning the football World Cup and a hurricane forming over the Atlantic Ocean in September, I would use this simple multiplication rule. The two events are independent as the occurrence of one does not affect the other’s chance of occurring. If Brazil have a probability of winning the World Cup of 41% and the probability of a hurricane over the Atlantic Ocean in September is 91%, then we calculate the probability of both occurring:

P(Brazil **AND** Hurricane) = P(Brazil) * P(Hurricane)

= (0.41) * (0.91)

= 0.3731 (37%)
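As a quick check of the arithmetic, the simple multiplication rule can be sketched in Python as follows. The helper name `p_and_independent` is hypothetical, and the result is only meaningful if the events really are independent:

```python
import math


def p_and_independent(*probabilities: float) -> float:
    """Simple multiplication rule: valid only when the events are independent."""
    return math.prod(probabilities)


# Probabilities taken from the example in the text.
p_brazil = 0.41
p_hurricane = 0.91

p_both = p_and_independent(p_brazil, p_hurricane)
print(round(p_both, 4))  # 0.3731
```

The function accepts any number of probabilities, so the same rule extends to three or more independent events.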

If two or more events are dependent, however, we must use the *general multiplication rule*. This formula is always valid, in both cases of independent and dependent events, as we will see.

P(outcome 1 **AND** outcome 2) = P(outcome 1) * P(outcome 2 | outcome 1)

P(outcome 2 | outcome 1) refers to the *conditional probability* of outcome 2 occurring given outcome 1 has already occurred. One can immediately see how this formula incorporates dependence between the events. If the events were independent, then the conditional probability is irrelevant as one outcome does not influence the chance of the other and P(outcome 2 | outcome 1) is simply P(outcome 2). The formula just becomes the simple multiplication rule already described.
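To see the difference between the two cases concretely, here is a small Python sketch that enumerates every outcome of rolling two fair dice. The two-dice setup and the event names are my own illustration rather than part of the project; the point is only that P(outcome 2 | outcome 1) equals P(outcome 2) when the events are independent and differs from it when they are dependent.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
outcomes = list(product(range(1, 7), repeat=2))


def p(event) -> Fraction:
    """Probability of an event given as a predicate over outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def p_given(event, condition) -> Fraction:
    """P(event | condition): restrict to outcomes where the condition holds."""
    restricted = [o for o in outcomes if condition(o)]
    return Fraction(sum(1 for o in restricted if event(o)), len(restricted))


def first_is_six(o):
    return o[0] == 6


def second_is_three(o):
    return o[1] == 3


def total_at_least_ten(o):
    return o[0] + o[1] >= 10


# Independent: knowing the first die does not change the second.
print(p(second_is_three), p_given(second_is_three, first_is_six))  # 1/6 1/6

# Dependent: a 6 on the first die makes a high total more likely.
print(p(total_at_least_ten), p_given(total_at_least_ten, first_is_six))  # 1/6 1/2
```

In both cases the general rule P(outcome 1) * P(outcome 2 | outcome 1) matches the probability of the joint event counted directly.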

We can apply the general multiplication rule to a deck of cards example. What would be the probability of drawing an Ace of any suit and then a King of any suit from a 52-card deck in just two draws, without returning the first card to the deck? There are 4 Kings and 4 Aces in a standard deck, and drawing an Ace first affects our chances of drawing a King because only 51 cards remain for the second draw. We can use the general multiplication formula as follows:

P(Ace **AND** King) = P(Ace) * P(King | Ace)

= (4 / 52) * (4 / 51)

= 0.006033183 (0.6%)
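The card calculation can be verified by brute force. The Python sketch below builds a 52-card deck, lists every ordered pair of distinct cards and counts the pairs where an Ace comes first and a King second. It reads the formula as "Ace on the first draw, King on the second, without replacement", which is my interpretation of the example.

```python
from fractions import Fraction
from itertools import permutations

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Every ordered pair of distinct cards: 52 * 51 = 2652 equally likely draws.
ordered_draws = list(permutations(deck, 2))

# Count the draws with an Ace first and a King second.
ace_then_king = sum(
    1 for first, second in ordered_draws if first[0] == "A" and second[0] == "K"
)

p_enumerated = Fraction(ace_then_king, len(ordered_draws))
p_formula = Fraction(4, 52) * Fraction(4, 51)

print(p_enumerated, round(float(p_enumerated), 6))  # 4/663 0.006033
print(p_enumerated == p_formula)  # True
```

If the order of the two cards does not matter, King-then-Ace also counts and the probability doubles to 8/663, roughly 1.2%.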

If two events are mutually exclusive, they are disjoint and cannot occur simultaneously, so the probability of both occurring is simply zero. In terms of the general multiplication rule, P(outcome 2 | outcome 1) is zero. The dice roll example describes such a scenario. The more useful question in such a case is the probability of either outcome occurring. This brings us to *the simple addition rule*.

### Addition Rules for OR Events

When calculating the probability of either one event or the other occurring, we use *the addition rule*. When the outcomes are mutually exclusive, we use the *simple addition* formula:

P(outcome 1 **OR** outcome 2) = P(outcome 1) + P(outcome 2)

Applied to the dice roll example, what is the probability of rolling a 6 or a 3? Both outcomes cannot occur simultaneously. The probability of rolling a 6 is (1 / 6) and the same can be said for rolling a 3. Therefore,

P(6 **OR** 3) = (1 / 6) + (1 / 6) = 0.33 (33%)
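The dice example is small enough to enumerate in full. This Python sketch (the helper `p` is my own, defined only for this illustration) counts the faces of a fair six-sided die that satisfy each event and confirms that the two probabilities simply add because the events never overlap:

```python
from fractions import Fraction

faces = range(1, 7)  # a fair six-sided die


def p(event) -> Fraction:
    """Probability of an event given as a predicate over die faces."""
    favourable = sum(1 for face in faces if event(face))
    return Fraction(favourable, len(faces))


p_six = p(lambda face: face == 6)
p_three = p(lambda face: face == 3)
p_six_and_three = p(lambda face: face == 6 and face == 3)
p_six_or_three = p(lambda face: face in (3, 6))

print(p_six_and_three)  # 0
print(p_six_or_three)   # 1/3
print(p_six + p_three == p_six_or_three)  # True
```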

However, if the events are not mutually exclusive and can occur simultaneously, we must use the following *general addition* formula, which is valid whether or not the events are mutually exclusive. When they are mutually exclusive, the final term is zero and the formula reduces to the simple addition rule.

P(outcome 1 **OR** outcome 2) = P(outcome 1) + P(outcome 2) – P(outcome 1 **AND** outcome 2)

Applied to the World Cup and hurricane example, this would mean calculating the probabilities of both outcomes as well as the probability of both occurring simultaneously.

P(Brazil **OR** Hurricane) = P(Brazil) + P(Hurricane) – P(Brazil **AND** Hurricane)

We know that the outcomes are independent, the occurrence of one does not affect the probability of the other occurring, and we can therefore use the *simple multiplication rule* for two simultaneous events to calculate P(Brazil **AND** Hurricane).

P(Brazil **AND** Hurricane) = (0.41) * (0.91)

= 0.3731 (37%)

Finally, we can plug this probability into the main formula to get our answer:

P(Brazil **OR** Hurricane) = (0.91) + (0.41) – (0.3731) = 0.9469 (95%)

This makes sense, right? Brazil are indeed an awesome football team. However, what drives this probability up is the fact that September is one of the months in the Atlantic hurricane season and you are almost guaranteed to see a hurricane event.
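The general addition rule can be sketched in Python and cross-checked with the complement: the probability of at least one event occurring is one minus the probability that neither occurs. The function name `p_or` is hypothetical, and the joint probability is computed with the simple multiplication rule on the assumption that the two events are independent, as stated above.

```python
def p_or(p_a: float, p_b: float, p_a_and_b: float) -> float:
    """General addition rule: subtract the overlap so it is not counted twice."""
    return p_a + p_b - p_a_and_b


# Probabilities taken from the example in the text.
p_brazil = 0.41
p_hurricane = 0.91

# Independence assumed, so the joint probability is a simple product.
p_both = p_brazil * p_hurricane
p_either = p_or(p_brazil, p_hurricane, p_both)

# Cross-check: 1 - P(neither event occurs).
p_neither = (1 - p_brazil) * (1 - p_hurricane)

print(round(p_either, 4), round(1 - p_neither, 4))  # 0.9469 0.9469
```

Passing zero as the third argument gives the simple addition rule for mutually exclusive events.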

## Summary

That was a brief introduction to probability and the rules associated with it. I hope it is clear and easy to follow. The important rules to remember are the general forms of the multiplication rule and the addition rule as they are valid in all cases. Using them in the above examples in place of the simple rules still yields the same results. As such, it may be more useful to memorise these general forms. Here they are once more:

P(outcome 1 **AND** outcome 2) = P(outcome 1) * P(outcome 2 | outcome 1)

P(outcome 1 **OR** outcome 2) = P(outcome 1) + P(outcome 2) – P(outcome 1 **AND** outcome 2)

I have chosen to use “outcome 1” and “outcome 2” instead of the usual “A” and “B” format because I believe it is easier to understand Naive Bayes when replacing the letters with meaningful words. Here is how the general rules are typically written:

P(A and B) = P(A) * P(B | A)

P(A or B) = P(A) + P(B) – P(A and B)

In Part 2 I describe how to implement the Naive Bayes classifier in R and explain how it works based on the fundamentals of probability outlined here.

*Featured image By Dennis Hill from The OC, So. Cal. – misc 24, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=54704175*


I am sure this is a fine article, but next time you may want to reconsider using a “restricted” data set.

Hello there. You are right and I apologise for the oversight. As this dataset was already available to me, it slipped my mind that I needed to provide it to the reader. Please find the dataset here: https://github.com/UserSeanWalsh/Projects/blob/master/datasets-uci-breast-cancer4.xlsx

No problem. I was reluctant to point it out and I have been having a heck of a time doing Bayesian analysis in R. Your post really helped. Did you ever think about using the caret package with the confusionMatrix function? It has a really nice interface.

I’m glad you did point it out and I’m also glad that the post helped. It certainly helped my own understanding of the process. I’m reading a book by the caret creator Max Kuhn at the moment “Applied Predictive Modelling” which uses the caret package in its computing sections. I’m looking forward to getting to know it better. I’ll certainly check out the confusion matrix function. Cheers!
