Our goal is to create an algorithm that tells whether a new message is spam or not.
We will use the dataset built by Tiago A. Almeida and José María Gómez Hidalgo, which classifies 5,572 SMS messages as spam or non-spam.
The dataset can be downloaded here.
Thanks to Almeida and Gómez Hidalgo's work, we can capitalize on a human classification of spam and non-spam messages to train our filtering function before testing it.
import pandas as pd
import numpy as np
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
data.shape
data.head()
data['Label'].value_counts(normalize=True)*100
'spam' means that a message is categorized as spam, while 'ham' indicates a regular, non-spam message.
The dataset contains 5,572 text messages, and we can see that 13.4% of them are spam.
Our end goal is to create a function that tells us whether a new message is categorized as spam or not. Apart from coding the function, we will use a multinomial naive Bayes model to evaluate P(spam | message) and P(ham | message).
#Randomizing the dataset
data_random = data.sample(frac=1, random_state=1)
#Creating the train and test datasets
train = data_random.sample(frac=0.8, random_state=1)
test_index = set(data_random.index) - set(train.index)
test = data_random.loc[test_index]
#Resetting indexes
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
train.shape
test.shape
train['Label'].value_counts(normalize=True)*100
test['Label'].value_counts(normalize=True)*100
We can verify that the percentages of ham and spam messages are quite similar to those of the original dataset.
The message contents are located in the 'SMS' column. In order to build our model, we want to count the occurrence of each word in each message. Doing so will enable us to evaluate the probability of finding each unique word among spam messages and non-spam messages (P(word_i|spam) and P(word_i|non-spam)), which we will need for our purpose.
In terms of pandas manipulation, we expect to get a train DataFrame with one column per word in the entire message vocabulary, indicating that word's count in each message.
#Removing all special characters (replacing any non-word character with a whitespace)
#Setting all letters to lowercase
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()
train.head()
This list will contain all the unique words present in the 4,458 messages of the train dataset.
vocabulary = []
train['SMS'] = train['SMS'].str.split()
for sms in train['SMS']:
    for word in sms:
        vocabulary.append(word)
#Keeping only one occurrence of each word
vocabulary = set(vocabulary)
#Recreating the list
vocabulary = list(vocabulary)
len(vocabulary)
This dictionary has words as keys, and each value is a list of that word's count per SMS.
#Creating a dictionary counting each word occurrence per SMS
word_counts_per_sms = {unique_word: [0]*len(train['SMS']) for unique_word in vocabulary}
for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
#Creating the new dataframe based on the dictionary we just created
word_counts = pd.DataFrame(word_counts_per_sms)
#Concatenating word_counts with train to keep the 'Label' and 'SMS' columns
train_clean = pd.concat([train, word_counts], axis=1)
train_clean.head()
As mentioned at the beginning of the project, we want to evaluate P(spam | message) and P(nonspam | message), corresponding to the probability that a new message is spam or a regular message, given its content.
We know that:
$$ P(spam\ |\ message) = \frac{P(spam \cap message)}{P(message)} $$
$$ \iff P(spam\ |\ message) = \frac{P(spam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ spam)}{P(message)} $$
under the naive Bayes assumption that words occur independently of each other, writing $ message = (word_1, \ldots, word_n) $
As we also consider $$ P(nonspam\ |\ message) = \frac{P(nonspam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ nonspam)}{P(message)}, $$
we can evaluate only the numerators from both equations in order to compare P(spam | message) and P(nonspam | message)
According to Bayes, we get:
$$ P(spam\ |\ message)\ \ \propto \ \ P(spam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ spam) $$
$$ and $$
$$ P(nonspam\ |\ message)\ \ \propto \ \ P(nonspam) * \displaystyle\prod_{i=1}^n P(word_i\ |\ nonspam) $$
Thus, we need to evaluate P(spam), P(nonspam), P(word | spam) and P(word | nonspam) for each word in the entire dataset vocabulary.
For a given word: $$ P(word\ |\ spam) = \frac{N_{word\ |\ spam} + \alpha}{N_{spam} + \alpha * N_{vocabulary}}, $$
with $\alpha$ as the Laplace smoothing factor, which we will set to 1, $N_{word\ |\ spam}$ as the wordcount of word in all spam messages, $N_{spam}$ the total number of words in spam messages and $N_{vocabulary}$ the total number of words (unique occurences) in the entire vocabulary.
The formula for non-spam messages is straightforward: $$ P(word\ |\ nonspam) = \frac{N_{word\ |\ nonspam} + \alpha}{N_{nonspam} + \alpha * N_{vocabulary}} $$
As we set alpha to 1, we get: $$ P(word\ |\ spam) = \frac{N_{word\ |\ spam} + 1}{N_{spam} + N_{vocabulary}} $$
$$ and $$
$$ P(word\ |\ nonspam) = \frac{N_{word\ |\ nonspam} + 1}{N_{nonspam} + N_{vocabulary}} $$
p_spam = train_clean['Label'].value_counts(normalize=True)['spam']
p_spam
p_ham = train_clean['Label'].value_counts(normalize=True)['ham']
p_ham
#Calculating N_spam, N_ham and N_vocabulary
train_spam = train_clean[train_clean['Label']=='spam']
words_per_sms_spam = train_spam['SMS'].apply(len)
n_spam = words_per_sms_spam.sum()
train_ham = train_clean[train_clean['Label']=='ham']
words_per_sms_ham = train_ham['SMS'].apply(len)
n_ham = words_per_sms_ham.sum()
n_vocabulary = len(vocabulary)
print('Total number of words in spam messages equals {}'.format(n_spam))
print('Total number of words in ham messages equals {}'.format(n_ham))
print('Total vocabulary length (unique words) equals {}'.format(n_vocabulary))
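To make the smoothing formula concrete before applying it to the real counts, here is a toy calculation where every number is made up for illustration (the `toy_` names are hypothetical and deliberately distinct from the real variables):

```python
#Toy illustration of the Laplace smoothing formula (all counts are made up)
toy_alpha = 1
toy_n_word_given_spam = 30   #hypothetical count of one word across spam messages
toy_n_spam = 15000           #hypothetical total number of words in spam messages
toy_n_vocab = 7000           #hypothetical vocabulary size

toy_p_word = (toy_n_word_given_spam + toy_alpha) / (toy_n_spam + toy_alpha*toy_n_vocab)
print(toy_p_word)

#A word never seen in spam still gets a small non-zero probability,
#so a single unseen word cannot zero out the whole product
toy_p_unseen = (0 + toy_alpha) / (toy_n_spam + toy_alpha*toy_n_vocab)
print(toy_p_unseen)
```

This is the whole point of the smoothing: without alpha, any word absent from the spam training messages would force P(spam | message) to 0.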
#Initiating the alpha variable (Laplace smoothing) used in the naive Bayes equation
alpha = 1
#Creating two dictionaries with words as keys and probabilities as values
#Those two dictionaries will look like this:
#dict_spam = {word_i: p(word_i|spam)}
#dict_ham = {word_j: p(word_j|ham)}
dict_spam = {word: 0 for word in vocabulary}
dict_ham = {word: 0 for word in vocabulary}
for word in vocabulary:
    n_word_given_spam = train_clean.loc[train_clean['Label']=='spam', word].sum()
    p_word_given_spam = (n_word_given_spam + alpha)/(n_spam + alpha*n_vocabulary)
    dict_spam[word] = p_word_given_spam
    n_word_given_ham = train_clean.loc[train_clean['Label']=='ham', word].sum()
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham + alpha*n_vocabulary)
    dict_ham[word] = p_word_given_ham
import re
def classify(message):
    #Cleaning the new message the same way as the training data
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in dict_spam:
            p_spam_given_message *= dict_spam[word]
        if word in dict_ham:
            p_ham_given_message *= dict_ham[word]
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
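One practical caveat that the product form hides: multiplying many probabilities below 1 can underflow to 0.0 for very long messages. A common remedy, not used in this project, is to compare sums of log-probabilities instead. A minimal self-contained sketch, with made-up toy probabilities standing in for dict_spam, dict_ham, p_spam and p_ham:

```python
import math

#Toy per-word probabilities (made up for illustration)
toy_dict_spam = {'free': 0.02, 'prize': 0.015, 'call': 0.01}
toy_dict_ham = {'free': 0.002, 'prize': 0.001, 'call': 0.02}
toy_p_spam, toy_p_ham = 0.135, 0.865

def classify_log(words):
    #Sums of logs replace products of probabilities, avoiding underflow
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        if word in toy_dict_spam:
            log_spam += math.log(toy_dict_spam[word])
        if word in toy_dict_ham:
            log_ham += math.log(toy_dict_ham[word])
    return 'spam' if log_spam > log_ham else 'ham'

print(classify_log(['free', 'prize', 'call']))
```

Since log is monotonically increasing, comparing the log-sums gives exactly the same classification as comparing the products.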
#Testing the function with two 'obvious' messages
expected_spam = 'WINNER!! This is the secret code to unlock the money: C3421.'
expected_ham = 'Sounds good, Tom, then see u there'
classify(expected_spam)
print('\n')
classify(expected_ham)
def classify_test(message):
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in dict_spam:
            p_spam_given_message *= dict_spam[word]
        if word in dict_ham:
            p_ham_given_message *= dict_ham[word]
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
#Creating a new column 'predicted' in the test dataframe
test['predicted'] = test['SMS'].apply(classify_test)
test.head()
#Measuring the accuracy of the spam filter
test['correct_prediction'] = test['Label']==test['predicted']
accuracy = test['correct_prediction'].sum()/test.shape[0]
print('The accuracy of this spam filter reaches {:.1f}%'.format(accuracy*100))
We get an accuracy of 98.7%, which is quite good for a first step towards spam filtering!
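Beyond a single accuracy number, it is worth looking at how the errors split between false positives (ham flagged as spam) and false negatives (spam let through), since the two mistakes have very different costs for a user. A minimal sketch with pd.crosstab, shown here on a toy DataFrame standing in for our test set (the toy labels are made up):

```python
import pandas as pd

#Toy stand-in for the test DataFrame ('Label' = truth, 'predicted' = model output)
toy_test = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham', 'spam'],
    'predicted': ['ham', 'ham', 'spam', 'ham',  'ham', 'spam'],
})

#Rows: true label, columns: predicted label
confusion = pd.crosstab(toy_test['Label'], toy_test['predicted'])
print(confusion)

#False negatives: spam messages the filter let through
false_negatives = toy_test[(toy_test['Label'] == 'spam') &
                           (toy_test['predicted'] == 'ham')]
print(len(false_negatives))
```

Running the same crosstab on our real test DataFrame would show which of the roughly 1.3% of misclassified messages are missed spam versus wrongly flagged ham.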