Naive Bayes Classification in R (Part 2)

Following on from Part 1 of this two-part post, I would now like to explain how the Naive Bayes classifier works before applying it to a classification problem involving breast cancer data. The dataset is sourced from Matjaz Zwitter and Milan Soklic from the Institute of Oncology, University Medical Center in Ljubljana, Slovenia (formerly Yugoslavia) and the attributes are as follows:

age: a series of ranges from 20-29 to 70-79

menopause: whether a patient was pre- or post-menopausal upon diagnosis

tumor.size: the largest diameter (mm) of the excised tumor

inv.nodes: the number of axillary lymph nodes which contained metastatic breast cancer

node.caps: whether metastatic cancer was contained by the lymph node capsule

deg.malign: the histological grade of the tumor (1-3 with 3 = highly abnormal cells)

breast: which breast the cancer occurred in

breast.quad: the region of the breast the cancer occurred in (four quadrants, with the nipple as the central point)

irradiat: whether the patient underwent radiation therapy

Some preprocessing of these data was required as there were a few NAs (9 in total), for which I imputed predicted values using separate Naive Bayes classifiers. The objective here is to use these attributes to predict, with relatively high accuracy, whether a recurrence of breast cancer is likely in patients who were previously diagnosed with and treated for the disease. We can pursue this objective using the Naive Bayes classification method.
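As an aside, here is a minimal sketch of that imputation step, assuming the raw data are already loaded into a data frame called breast_cancer and that node.caps is one of the columns containing NAs (the naiveBayes() function from the e1071 package is introduced properly below):

library(e1071)

# rows where the attribute is missing
missing_rows <- is.na(breast_cancer$node.caps)

# train a classifier for node.caps on the complete rows, using all other
# attributes except the class label as predictors
impute_model <- naiveBayes(node.caps ~ . - class,
                           data = breast_cancer[!missing_rows, ])

# fill the NAs with the predicted levels
breast_cancer$node.caps[missing_rows] <-
  predict(impute_model, newdata = breast_cancer[missing_rows, ])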

Naive Bayes Classification

Below is Bayes’ Theorem:

P(A | B) = P(A) * P(B | A) / P(B)

which can be derived from the general multiplication formula for AND events:

P(A and B) = P(A) * P(B | A)

Rearranging gives:

P(B | A) = P(A and B) / P(A)

and since P(A and B) = P(B and A) = P(B) * P(A | B), substituting for the numerator gives:

P(B | A) = P(B) * P(A | B) / P(A)

which is Bayes’ Theorem with the roles of A and B swapped.

If I replace the letters with meaningful words, as I have been doing throughout, the formula as used by the Naive Bayes classifier becomes:

P(outcome | evidence) = P(outcome) * P(evidence | outcome) / P(evidence)

It is with this formula that the Naive Bayes classifier calculates conditional probabilities for a class outcome given prior information, or evidence (our attributes in this case). It is termed “naive” because we assume independence between attributes when in reality they may be dependent in some way. In the breast cancer dataset we will be working with, some attributes are clearly dependent, such as age and menopause status, while others may or may not be, such as histological grade and tumor size.

This assumption allows us to calculate the probability of the evidence by multiplying the individual probabilities of each piece of evidence using the simple multiplication rule for independent AND events. Another point to note is that this naivety means the resulting probabilities are not exact when the attributes really are dependent, but they are a good approximation and adequate for the purposes of classification, where we only need to rank the candidate classes. Indeed, the Naive Bayes classifier has proven to be highly effective and is commonly deployed in email spam filters.
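As a toy illustration of that multiplication rule (the numbers below are invented for the example, not taken from any model), scoring a single patient with two pieces of evidence reduces to multiplying a prior by the per-attribute conditional probabilities:

# invented example values for one patient with age = "40-49", deg.malign = 3
p_recur   <- 0.30   # prior: P(recurrence)
p_age_rec <- 0.25   # P(age = "40-49" | recurrence)
p_deg_rec <- 0.45   # P(deg.malign = 3 | recurrence)

# numerator of the Naive Bayes formula under the independence assumption
unnorm_recur <- p_recur * p_age_rec * p_deg_rec

# the same product is computed for no-recurrence; dividing each product by
# their sum recovers proper posteriors, since P(evidence) only normalizes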

Calculating Conditional Probabilities

Conditional probabilities are fundamental to the working of the Naive Bayes formula, and tables of conditional probabilities must be created in order to obtain values for the algorithm to use. The R package e1071 contains a very nice function for creating a Naive Bayes model. Read in the dataset sourced via the hyperlink at the start of this post, or see the comments below for GitHub access. Note that some cleaning was carried out in this example, but the original will work fine as long as strings are set to factors.

library(e1071)

# strings must be read in as factors for naiveBayes() to tabulate them
breast_cancer <- read.csv("breast_cancer_df.csv", stringsAsFactors = TRUE)

# train on all attributes, with class (recurrence status) as the outcome
model <- naiveBayes(class ~ ., data = breast_cancer)

class(model)    # "naiveBayes"
summary(model)  # the components of the model object
print(model)    # a-priori probabilities and conditional probability tables

The model has class “naiveBayes”, and the summary tells us that the model provides a-priori probabilities of no-recurrence and recurrence events as well as conditional probability tables across all attributes. To examine the conditional probability tables, just print the model.
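These pieces can also be pulled straight out of the model object; a quick sketch, assuming the apriori and tables components of the e1071 naiveBayes object (apriori holds class counts, tables holds one matrix per attribute):

# scale the stored class counts to get the a-priori probabilities
model$apriori / sum(model$apriori)

# conditional probability table for a single attribute, e.g. age
model$tables$age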

One of our tasks for this assignment was to create code which would give us the same conditional probabilities as those output by the naiveBayes() function. I went about this in the following way:

# cross-tabulate each attribute (columns 1-9) against the class (column 10)
tbl_list <- sapply(breast_cancer[-10], table, breast_cancer[, 10])

# transpose each table so that the class levels form the rows
tbl_list <- lapply(tbl_list, t)

# divide every row by its sum to convert counts into conditional probabilities
cond_probs <- sapply(tbl_list, function(tbl) {
  apply(tbl, 1, function(counts) counts / sum(counts))
})

# apply() returns its row-wise results as columns, so transpose back to
# match the structure of the naiveBayes() output
cond_probs <- lapply(cond_probs, t)

print(cond_probs)

The first line of code uses the sapply function to loop over all attribute variables in the dataset and create tables against the class attribute. I then used the lapply function to transpose all tables in the list so the rows represented the class attribute.

To calculate conditional probabilities for each element in the tables, I used sapply, lapply and anonymous functions. I had to transpose the output in order to get the same structure as the naiveBayes model output. Finally, I printed out my calculated conditional probabilities and compared them with the naiveBayes output to validate the calculations.
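That comparison can also be done programmatically, one attribute at a time; a minimal sketch (check.attributes = FALSE tells all.equal to ignore the differing dimension names):

# should return TRUE: the hand-rolled table matches the model's table for age
all.equal(cond_probs$age, model$tables$age, check.attributes = FALSE)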

Applying the Naive Bayes Classifier

So I’ve explained (hopefully reasonably well) how the Naive Bayes classifier works based on the fundamental rules of probability. Now it’s time to apply the model to the data. This is easily done in R by using the predict() function.

preds <- predict(model, newdata = breast_cancer)

You will see that I have trained the model using the entire dataset and then made predictions on that same dataset. In our assignment we were asked to train the model and then test it on the full dataset, treating it as an unlabeled test set. This is unconventional, as the training set and test set are then identical, but I believe the assignment was intended just to test our understanding of how the method works. In practice, one would use a training set for the model to learn from and a separate test set to assess model accuracy.
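For completeness, here is a minimal sketch of that conventional approach (the 70/30 split and the seed are arbitrary choices of mine, not part of the assignment):

# hold out roughly 30% of the rows as an unseen test set
set.seed(42)                                   # arbitrary seed, for reproducibility
n         <- nrow(breast_cancer)
train_idx <- sample(n, size = round(0.7 * n))  # 70% of rows for training

train <- breast_cancer[train_idx, ]
test  <- breast_cancer[-train_idx, ]

split_model <- naiveBayes(class ~ ., data = train)
split_preds <- predict(split_model, newdata = test)
table(split_preds, test$class)                 # confusion matrix on unseen data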

If one outcome class is more abundant in the dataset, as is the case with the breast cancer data (no-recurrence: 201, recurrence: 85), the data are unbalanced. This is acceptable for a generative Naive Bayes model: the a-priori class probabilities should reflect how often each outcome actually occurs, so manipulating the data to reduce the skew would distort them. There is also the decision of whether to apply Laplace smoothing to the model. Laplace smoothing, in effect, adds imaginary observations to the dataset in order to avoid conditional probabilities of exactly zero, which arise whenever an attribute level never co-occurs with a class in the training data.
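In e1071 this is a one-argument change, since naiveBayes() accepts a laplace argument (laplace = 1 adds one imaginary observation per attribute level):

# refit the model with add-one (Laplace) smoothing
model_laplace <- naiveBayes(class ~ ., data = breast_cancer, laplace = 1)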

Applying the model to the data gives the following confusion matrix, from which a model accuracy of roughly 75% can be calculated:

conf_matrix <- table(preds, breast_cancer$class)  # rows: predicted, columns: actual
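The accuracy is then the proportion of predictions that land on the diagonal of that table:

print(conf_matrix)
sum(diag(conf_matrix)) / sum(conf_matrix)  # overall accuracy, roughly 0.75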

This post has only scratched the surface of classification methods in machine learning, but it has been useful revision for me and perhaps it may help others new to the Naive Bayes classifier. Please feel free to comment and correct any errors that may be present.

Featured image by Dennis Hill from The OC, So. Cal. – misc 24, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=54704175


15 thoughts on “Naive Bayes Classification in R (Part 2)”

      1. Thanks. Please take a look at the added sentence as it needs a line feed to enhance readability.
        I found that the dataset is protected, so I emailed UCI to get the dataset disclosed.

  1. It’s also a good idea to set laplace = 1 in the naiveBayes() model building function to avoid some absolute zero conditional probabilities. I didn’t want to go into too much detail about Laplace smoothing in the post itself.

  2. Hi,
    When I tried to access the data set from the link (actually a sub-link of the link), I got the following message:
    “`
    This is a restricted data set. To access this dataset, please send a
    request to ml-repository ‘at’ ics ‘dot’ uci ‘dot’ edu

    Include in the request the name of this data set and the name of your
    educational institution.
    “`
    I’m happy to send the request email. However, I’m not associated with any educational institution. How can I get the data set for practicing?
    Thanks.

  3. There is an R package “naivebayes” which offers a much faster predict() function than the one from e1071. It also works correctly with features of class “logical”.

  4. Read, liked and understood Parts 1 & 2.
    Thanks!

    2 Qs:
    Q1) Can you include a .CSV version of the BreastCancer file
    in your GitHub link?
    I use Linux (not Windows),
    so I don’t have Excel to read your .xlsx file….

    Q2) You mention that all string variables
    “have to be converted to R factors”.
    Can you include the R code to do that easily on this data set?

    Q BONUS!) What is the R code you used
    to “clean up” the breast cancer data set?
    (It would be interesting
    to include the complete/commented R code in the article – Part 2),
    ….just my opinion. 🙂

    Thanks again!
    Ray
    SF

    1. @Ray Thank you for the kind feedback. I’m glad you liked it. I also use Linux, and the .xlsx file opens in LibreOffice for me. I have added the cleaned data as a .csv file in the GitHub link now in any case, so I hope that helps. When you read in a .csv file using read.csv(), just add the argument: stringsAsFactors = T

      The dataset wasn’t too messy and it was just some NAs that had to be dealt with. I imputed predicted values using naive Bayes classification for those variables based on the other attributes. It turns out that naive Bayes predictions were the same as the mode.

  5. […] A very useful machine learning method which, for its simplicity, is incredibly successful in many real world applications is the Naive Bayes classifier. I am currently taking a machine learning module as part of my data science college course and this week’s practical work involved a classification problem using the Naive Bayes method. I thought that I’d write a two-part piece about the project and the concepts of probability in order to consolidate my own understanding as well to share my method with others who may find it useful. If the reader wishes to get straight into the Bayes classification problem, please see Part 2 here. […]

  6. Thanks again for this useful post. Now you may be thinking of a “part 3”; if so, I would like to see you apply another algorithm (for instance logistic regression) and compare the performance of the two. I know how to do this with numerical response variables but have never seen how to do it when the response variables are categorical.
