Building a Spam Filter with a Multinomial Naive Bayes Model¶

Our goal is to create an algorithm that will tell whether a new message is a spam or not.
We will use de dataset built by Tiago A. Almeida and José Maria Gomez Hidalgo, which classifies 5,572 SMS as spams or non-spams.
The dataset can be downloaded here.

Thanks to Almeida and Hidalgo work, we can capitalize on a human classification of spam and non-spam messages and train our filtering function, before testing it.

import pandas as pd
import numpy as np

data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

data.shape

(5572, 2)

data.head()

data['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

'spam' means that a message is categorized as spam, while 'ham' indicates a regular non-spam message.
The dataset regroups 5572 text messages information, and we can see that 13.4% are spam.

Our end goal is to create a function that tells us whether a new message is categorized as a spam or not. Apart from coding the function, we expect to use a multinomial naive Bayes model in order to evaluate:

p(spam|message)
p(non-spam|message)
Eventually, we need to use as much data as possible to increase the accuracy of our model, and we also want to keep some data from the original dataset in order to test our model on 'new' messages.
Thus, we will randomize the dataset and separate it between a 'training' dataset (80% of the original dataset) and a 'testing' dataset (20% of the original dataset).

#Randomizing the dataset
data_random = data.sample(frac=1, random_state=1)

#Creating the train and test datasets
train = data_random.sample(frac=0.8, random_state=1)
test_index = set(data_random.index) - set(train.index)
test = data_random.loc[test_index]

#Resetting indexes
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

train.shape

(4458, 2)

test.shape

(1114, 2)

train['Label'].value_counts(normalize=True)*100

ham     86.675639
spam    13.324361
Name: Label, dtype: float64

test['Label'].value_counts(normalize=True)*100

ham     86.265709
spam    13.734291
Name: Label, dtype: float64

We can verify that the percentages of ham and spam messages are quite similar to the original data.

Data wrangling¶

The messages contents are located in the 'SMS' column, and in order to build our model, we want to be able to count the occurence of each word in each message. Doing so will enable us to evaluate the probability of finding each unique word among spam messages and non-spam messages (p(word_i|spam) and p(word_i|non-spam)), which we will need for our purpose.

In terms of Pandas manipulation, we expect to get a train dataframe with columns indicating the word count for each possible word in the entire messages vocabulary.

#Removing all special characters (replacing special characters, whitespaces included, with whitespaces)
#Setting all letters to lowercase
train['SMS'] = train['SMS'].str.replace('\W', ' ').str.lower()

train.head()

Creating a vocabulary list¶

Such list will contain all the words present in the 4458 messages (unique occurences) from the train dataset

vocabulary = []
train['SMS'] = train['SMS'].str.split()

for sms in train['SMS']:
    for word in sms:
        vocabulary.append(word)

#Keeping only one occurence of each word
vocabulary = set(vocabulary)

#Recreating the list
vocabulary = list(vocabulary)

len(vocabulary)

7712

Creating a dictionnary with words count¶

Such dictionnary is composed of words as keys and word count per sms as a list of word count.

#Creating a dictionnary counting each word occurence per sms
word_counts_per_sms = {unique_word: [0]*len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

#Creating the new dataframe based on the dictionnary we just created
word_counts = pd.DataFrame(word_counts_per_sms)

#Concatenating train_data with train to get 'Label' and 'SMS' columns
train_clean = pd.concat([train, word_counts], axis=1)

train_clean.head()

Building a multinomial naive Bayes model¶

As mentionned at the beginning of the project, we want to evaluate P(spam | message) and P(nonspam | message), corresponding to the probability that a new message is a spam or a regular message, given its content.

We know that:
$$ P(spam\ |\ message) = \frac{P(spam \cap message)}{P(message)} $$ $$ \iff P(spam\ |\ message) = \frac{P(spam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ spam)}{P(message)} $$
considering that $ message = \displaystyle\sum_{i=1}^n word_i $

As we also consider $$ P(nonspam\ |\ message) = \frac{P(nonspam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ nonspam)}{P(message)}, $$
we can evaluate only the numerators from both equations in order to compare P(spam | message) and P(nonspam | message)

According to Bayes, we get:
$$ P(spam\ |\ message)\ \ \propto \ \ P(spam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ spam) $$ $$ and $$ $$ P(nonspam\ |\ message)\ \ \propto \ \ P(nonspam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ nonspam) $$

Thus, we need to evaluate P(spam), P(nonspam), P(word | spam) and P(word | nonspam) for each word in the entire dataset vocabulary.

For a given word: $$ P(word\ |\ spam) = \frac{N_{word\ |\ spam} + \alpha}{N_{spam} + \alpha * N_{vocabulary}}, $$

with $\alpha$ as the Laplace smoothing factor, which we will set to 1, $N_{word\ |\ spam}$ as the wordcount of word in all spam messages, $N_{spam}$ the total number of words in spam messages and $N_{vocabulary}$ the total number of words (unique occurences) in the entire vocabulary.

The formula for non-spam messages is straightforward: $$ P(word\ |\ nonspam) = \frac{N_{word\ |\ nonspam} + \alpha}{N_{nonspam} + \alpha * N_{vocabulary}} $$

As we set alpha to 1, we get: $$ P(word\ |\ spam) = \frac{N_{word\ |\ spam} + 1}{N_{spam} + N_{vocabulary}} $$

$$ and $$

$$ P(word\ |\ nonspam) = \frac{N_{word\ |\ nonspam} + 1}{N_{nonspam} + N_{vocabulary}} $$

p_spam = train_clean['Label'].value_counts(normalize=True)['spam']
p_spam

0.13324360699865412

p_ham = train_clean['Label'].value_counts(normalize=True)['ham']
p_ham

0.8667563930013459

#Calculating N_spam, N_ham and N_vocabulary
train_spam = train_clean[train_clean['Label']=='spam']
words_per_sms_spam = train_spam['SMS'].apply(len)
n_spam = words_per_sms_spam.sum()

train_ham = train_clean[train_clean['Label']=='ham']
words_per_sms_ham = train_ham['SMS'].apply(len)
n_ham = words_per_sms_ham.sum()

n_vocabulary = len(vocabulary)

print('Spam messages vocabulary length equals {}'.format(n_spam))
print('Ham messages vocabulary length equals {}'.format(n_ham))
print('Total vocabulary length (unique words) equals {}'.format(n_vocabulary))

Spam messages vocabulary length equals 15142
Ham messages vocabulary length equals 57140
Total vocabulary length (unique words) equals 7712

#Initiating the alpha variable (Laplace smoothing) used in the naive Bayes theorem equation
alpha = 1

#Creating two dictionnaries with words as keys and probabilities as values
#Those two dictionnaries will look like this:
#dict_spam = {word_i: p(word_i|spam)}
#dict_ham = {word_j: p(word_j|ham)}

dict_spam = {word: 0 for word in vocabulary}
dict_ham = {word: 0 for word in vocabulary}

for word in vocabulary:
    n_word_given_spam = train_clean.loc[train_clean['Label']=='spam', word].sum()
    p_word_given_spam = (n_word_given_spam + alpha)/(n_spam + alpha*n_vocabulary)
    dict_spam[word] = p_word_given_spam
    n_word_given_ham = train_clean.loc[train_clean['Label']=='ham', word].sum()
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham + alpha*n_vocabulary)
    dict_ham[word] = p_word_given_ham

Creating the function that classifies messages¶

import re

def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in dict_spam:
            p_spam_given_message *= dict_spam[word]
        if word in dict_ham:
            p_ham_given_message *= dict_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

#Testing the function with two 'obvious' messages
expected_spam = 'WINNER!! This is the secret code to unlock the money: C3421.'
expected_ham = 'Sounds good, Tom, then see u there'

classify(expected_spam)
print('\n')
classify(expected_ham)

P(Spam|message): 1.273700039190484e-25
P(Ham|message): 2.6479653658243408e-27
Label: Spam


P(Spam|message): 1.0782622513413257e-25
P(Ham|message): 4.248744927807854e-21
Label: Ham

Testing the algorithm on the test dataset¶

def classify_test(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in dict_spam:
            p_spam_given_message *= dict_spam[word]
        if word in dict_ham:
            p_ham_given_message *= dict_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

#Creating a new column 'predicted' in the test dataframe
test['predicted'] = test['SMS'].apply(classify_test)
test.head()

#Measuring the accuracy of the spam filter
test['correct_prediction'] = test['Label']==test['predicted']
accuracy = test['correct_prediction'].sum()/test.shape[0]
print('The accuracy of this spam filter reaches {:.1f}%'.format(accuracy*100))

The accuracy of this spam filter reaches 98.7%

Conclusion¶

We get an accuracy of 98.7%, which is quite good for a first step towards spam filtering!

	Label	SMS	...
0	ham	[good, night, my, dear, sleepwell, amp, take, ...	...
1	ham	[sen, told, that, he, is, going, to, join, his...	...
2	ham	[thank, you, baby, i, cant, wait, to, taste, t...	...
3	ham	[when, can, ü, come, out]	...
4	ham	[no, thank, you, you, ve, been, wonderful]	...

	Label	SMS
0	ham	Go until jurong point, crazy.. Available only ...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives aro...

	Label	SMS
0	ham	good night my dear sleepwell amp take care
1	ham	sen told that he is going to join his uncle fi...
2	ham	thank you baby i cant wait to taste the real ...
3	ham	when can ü come out
4	ham	no thank you you ve been wonderful

	Label	SMS	predicted
0	spam	GSOH? Good with SPAM the ladies?U could b a ma...	spam
1	ham	Y so late but i need to go n get da laptop...	ham
2	spam	U have a secret admirer who is looking 2 make ...	spam
3	ham	Jus finish my lunch on my way home lor... I to...	ham
4	spam	SIX chances to win CASH! From 100 to 20,000 po...	spam