Building a Spam Filter with a Multinomial Naive Bayes Model

Our goal is to create an algorithm that will tell whether a new message is a spam or not.
We will use de dataset built by Tiago A. Almeida and José Maria Gomez Hidalgo, which classifies 5,572 SMS as spams or non-spams.
The dataset can be downloaded here.

Thanks to Almeida and Hidalgo work, we can capitalize on a human classification of spam and non-spam messages and train our filtering function, before testing it.

In [1]:
import pandas as pd
import numpy as np
In [2]:
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
In [3]:
data.shape
Out[3]:
(5572, 2)
In [4]:
data.head()
Out[4]:
Label SMS
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
In [5]:
data['Label'].value_counts(normalize=True)*100
Out[5]:
ham     86.593683
spam    13.406317
Name: Label, dtype: float64

'spam' means that a message is categorized as spam, while 'ham' indicates a regular non-spam message.
The dataset regroups 5572 text messages information, and we can see that 13.4% are spam.

Our end goal is to create a function that tells us whether a new message is categorized as a spam or not. Apart from coding the function, we expect to use a multinomial naive Bayes model in order to evaluate:

  • p(spam|message)
  • p(non-spam|message)
    Eventually, we need to use as much data as possible to increase the accuracy of our model, and we also want to keep some data from the original dataset in order to test our model on 'new' messages.
    Thus, we will randomize the dataset and separate it between a 'training' dataset (80% of the original dataset) and a 'testing' dataset (20% of the original dataset).
In [6]:
#Randomizing the dataset
data_random = data.sample(frac=1, random_state=1)

#Creating the train and test datasets
train = data_random.sample(frac=0.8, random_state=1)
test_index = set(data_random.index) - set(train.index)
test = data_random.loc[test_index]
In [7]:
#Resetting indexes
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
In [8]:
train.shape
Out[8]:
(4458, 2)
In [9]:
test.shape
Out[9]:
(1114, 2)
In [10]:
train['Label'].value_counts(normalize=True)*100
Out[10]:
ham     86.675639
spam    13.324361
Name: Label, dtype: float64
In [11]:
test['Label'].value_counts(normalize=True)*100
Out[11]:
ham     86.265709
spam    13.734291
Name: Label, dtype: float64

We can verify that the percentages of ham and spam messages are quite similar to the original data.

Data wrangling

The messages contents are located in the 'SMS' column, and in order to build our model, we want to be able to count the occurence of each word in each message. Doing so will enable us to evaluate the probability of finding each unique word among spam messages and non-spam messages (p(word_i|spam) and p(word_i|non-spam)), which we will need for our purpose.

In terms of Pandas manipulation, we expect to get a train dataframe with columns indicating the word count for each possible word in the entire messages vocabulary.

In [12]:
#Removing all special characters (replacing special characters, whitespaces included, with whitespaces)
#Setting all letters to lowercase
train['SMS'] = train['SMS'].str.replace('\W', ' ').str.lower()
In [13]:
train.head()
Out[13]:
Label SMS
0 ham good night my dear sleepwell amp take care
1 ham sen told that he is going to join his uncle fi...
2 ham thank you baby i cant wait to taste the real ...
3 ham when can ü come out
4 ham no thank you you ve been wonderful

Creating a vocabulary list

Such list will contain all the words present in the 4458 messages (unique occurences) from the train dataset

In [14]:
vocabulary = []
train['SMS'] = train['SMS'].str.split()

for sms in train['SMS']:
    for word in sms:
        vocabulary.append(word)

#Keeping only one occurence of each word
vocabulary = set(vocabulary)

#Recreating the list
vocabulary = list(vocabulary)
In [15]:
len(vocabulary)
Out[15]:
7712

Creating a dictionnary with words count

Such dictionnary is composed of words as keys and word count per sms as a list of word count.

In [16]:
#Creating a dictionnary counting each word occurence per sms
word_counts_per_sms = {unique_word: [0]*len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

#Creating the new dataframe based on the dictionnary we just created
word_counts = pd.DataFrame(word_counts_per_sms)

#Concatenating train_data with train to get 'Label' and 'SMS' columns
train_clean = pd.concat([train, word_counts], axis=1)
In [17]:
train_clean.head()
Out[17]:
Label SMS unclaimed dick gamestar someplace 169 flies angels nowhere ... buff files 80488 matthew a21 noise heard independence evn gives
0 ham [good, night, my, dear, sleepwell, amp, take, ... 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 ham [sen, told, that, he, is, going, to, join, his... 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 ham [thank, you, baby, i, cant, wait, to, taste, t... 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 ham [when, can, ü, come, out] 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 ham [no, thank, you, you, ve, been, wonderful] 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 7714 columns

Building a multinomial naive Bayes model

As mentionned at the beginning of the project, we want to evaluate P(spam | message) and P(nonspam | message), corresponding to the probability that a new message is a spam or a regular message, given its content.

We know that:
$$ P(spam\ |\ message) = \frac{P(spam \cap message)}{P(message)} $$ $$ \iff P(spam\ |\ message) = \frac{P(spam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ spam)}{P(message)} $$
considering that $ message = \displaystyle\sum_{i=1}^n word_i $

As we also consider $$ P(nonspam\ |\ message) = \frac{P(nonspam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ nonspam)}{P(message)}, $$
we can evaluate only the numerators from both equations in order to compare P(spam | message) and P(nonspam | message)

According to Bayes, we get:
$$ P(spam\ |\ message)\ \ \propto \ \ P(spam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ spam) $$ $$ and $$ $$ P(nonspam\ |\ message)\ \ \propto \ \ P(nonspam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ nonspam) $$

Thus, we need to evaluate P(spam), P(nonspam), P(word | spam) and P(word | nonspam) for each word in the entire dataset vocabulary.

For a given word: $$ P(word\ |\ spam) = \frac{N_{word\ |\ spam} + \alpha}{N_{spam} + \alpha * N_{vocabulary}}, $$

with $\alpha$ as the Laplace smoothing factor, which we will set to 1, $N_{word\ |\ spam}$ as the wordcount of word in all spam messages, $N_{spam}$ the total number of words in spam messages and $N_{vocabulary}$ the total number of words (unique occurences) in the entire vocabulary.

The formula for non-spam messages is straightforward: $$ P(word\ |\ nonspam) = \frac{N_{word\ |\ nonspam} + \alpha}{N_{nonspam} + \alpha * N_{vocabulary}} $$

As we set alpha to 1, we get: $$ P(word\ |\ spam) = \frac{N_{word\ |\ spam} + 1}{N_{spam} + N_{vocabulary}} $$

$$ and $$

$$ P(word\ |\ nonspam) = \frac{N_{word\ |\ nonspam} + 1}{N_{nonspam} + N_{vocabulary}} $$

In [18]:
p_spam = train_clean['Label'].value_counts(normalize=True)['spam']
p_spam
Out[18]:
0.13324360699865412
In [19]:
p_ham = train_clean['Label'].value_counts(normalize=True)['ham']
p_ham
Out[19]:
0.8667563930013459
In [20]:
#Calculating N_spam, N_ham and N_vocabulary
train_spam = train_clean[train_clean['Label']=='spam']
words_per_sms_spam = train_spam['SMS'].apply(len)
n_spam = words_per_sms_spam.sum()

train_ham = train_clean[train_clean['Label']=='ham']
words_per_sms_ham = train_ham['SMS'].apply(len)
n_ham = words_per_sms_ham.sum()

n_vocabulary = len(vocabulary)

print('Spam messages vocabulary length equals {}'.format(n_spam))
print('Ham messages vocabulary length equals {}'.format(n_ham))
print('Total vocabulary length (unique words) equals {}'.format(n_vocabulary))
Spam messages vocabulary length equals 15142
Ham messages vocabulary length equals 57140
Total vocabulary length (unique words) equals 7712
In [21]:
#Initiating the alpha variable (Laplace smoothing) used in the naive Bayes theorem equation
alpha = 1
In [22]:
#Creating two dictionnaries with words as keys and probabilities as values
#Those two dictionnaries will look like this:
#dict_spam = {word_i: p(word_i|spam)}
#dict_ham = {word_j: p(word_j|ham)}

dict_spam = {word: 0 for word in vocabulary}
dict_ham = {word: 0 for word in vocabulary}

for word in vocabulary:
    n_word_given_spam = train_clean.loc[train_clean['Label']=='spam', word].sum()
    p_word_given_spam = (n_word_given_spam + alpha)/(n_spam + alpha*n_vocabulary)
    dict_spam[word] = p_word_given_spam
    n_word_given_ham = train_clean.loc[train_clean['Label']=='ham', word].sum()
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham + alpha*n_vocabulary)
    dict_ham[word] = p_word_given_ham

Creating the function that classifies messages

In [23]:
import re

def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in dict_spam:
            p_spam_given_message *= dict_spam[word]
        if word in dict_ham:
            p_ham_given_message *= dict_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')
In [24]:
#Testing the function with two 'obvious' messages
expected_spam = 'WINNER!! This is the secret code to unlock the money: C3421.'
expected_ham = 'Sounds good, Tom, then see u there'

classify(expected_spam)
print('\n')
classify(expected_ham)
P(Spam|message): 1.273700039190484e-25
P(Ham|message): 2.6479653658243408e-27
Label: Spam


P(Spam|message): 1.0782622513413257e-25
P(Ham|message): 4.248744927807854e-21
Label: Ham

Testing the algorithm on the test dataset

In [25]:
def classify_test(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in dict_spam:
            p_spam_given_message *= dict_spam[word]
        if word in dict_ham:
            p_ham_given_message *= dict_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
In [26]:
#Creating a new column 'predicted' in the test dataframe
test['predicted'] = test['SMS'].apply(classify_test)
test.head()
Out[26]:
Label SMS predicted
0 spam GSOH? Good with SPAM the ladies?U could b a ma... spam
1 ham Y so late but i need to go n get da laptop... ham
2 spam U have a secret admirer who is looking 2 make ... spam
3 ham Jus finish my lunch on my way home lor... I to... ham
4 spam SIX chances to win CASH! From 100 to 20,000 po... spam
In [27]:
#Measuring the accuracy of the spam filter
test['correct_prediction'] = test['Label']==test['predicted']
accuracy = test['correct_prediction'].sum()/test.shape[0]
print('The accuracy of this spam filter reaches {:.1f}%'.format(accuracy*100))
The accuracy of this spam filter reaches 98.7%

Conclusion

We get an accuracy of 98.7%, which is quite good for a first step towards spam filtering!

In [ ]: