Nowadays, to register or log in to almost any website you have to provide your email address, and sometimes your phone number as well. These details are used to verify the user, but there is a chance they can be misused for promotions, fake messages and so on. For example, if you enter your bank details, phone number and email address to buy a product from a sketchy-looking website, a few days later you will probably receive an email from halfway around the world claiming that you have won 100 million dollars. Most of us know that such a message is fake and that it should end up in spam. This trick just doesn’t work anymore (I hope!).
We humans can sometimes be reckless. We enter our email address and phone number into almost every website that asks for them, and we expect our email and phone providers to make sure that no spam ends up in our inboxes. So, instead of being careful about where we enter our details, we have decided to build algorithms that automatically read a message and decide whether or not it is spam. If it is spam, the message is removed from the inbox and never shown to us.
You could skip past the spam messages yourself, but there is a good chance you would skip past important messages from authentic senders as well. It has been estimated that around 100 billion spam emails are sent out daily, so there is a good chance that at least 10 spam emails could end up in your inbox every day. Searching through this pile of spam for the important messages is tedious, and there is a good chance you would skim over them. Our lives would therefore be a lot easier if an algorithm could correctly classify an email as spam and never show it to us.
[Figure: first 5 samples in the dataset]
Now, let’s build our own spam classifier with just a few lines of code. The dataset is a CSV file and can be downloaded from this link. It has a column of messages and a target column which indicates whether each message is spam or not. Now, let’s move on to the code.
We read the CSV file using the pandas library, extract the texts and labels from the respective columns and store them in lists. The target variable is a string, where ‘ham’ means the text is not spam and ‘spam’ means it is. The lists are converted to numpy arrays, as numpy helps with vector computations.
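A minimal sketch of this step (the file name spam.csv and the column names 'text' and 'target' are assumptions; adjust them to whatever the downloaded file actually uses):

import numpy as np
import pandas as pd

# Load the dataset (assumed file and column names)
df = pd.read_csv('spam.csv')

texts = df['text'].tolist()
# Map the string labels to integers: 'ham' -> 0, 'spam' -> 1
labels = [1 if label == 'spam' else 0 for label in df['target']]

# numpy arrays make the later shuffling and slicing easy
texts = np.array(texts)
labels = np.array(labels)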
We shuffle the data and split it into training and testing samples: 90% of the data is used for training and the remaining 10% for testing the model. The training and testing splits are stored as numpy arrays.
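Something like the following works, using numpy's permutation to shuffle before slicing (scikit-learn's train_test_split would do the same job):

# Shuffle the indices, then split 90/10
indices = np.random.permutation(len(texts))
split = int(0.9 * len(texts))

train_texts, test_texts = texts[indices[:split]], texts[indices[split:]]
train_labels, test_labels = labels[indices[:split]], labels[indices[split:]]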
We now prepare the text data so that it can be fed into our model. This is the most crucial step, as we convert the raw texts into numerical vectors that our machine learning model can learn from. There are two classes used here, CountVectorizer and TfidfTransformer (the TfidfVectorizer used in the second example below is equivalent to a CountVectorizer followed by a TfidfTransformer). Let us look at them in order.
text = ["The quick brown fox jumped over the lazy dog."]
vectorizer = CountVectorizer()vectorizer.fit(text)print(vectorizer.vocabulary_)# Output: {'dog': 1, 'fox': 2, 'over': 5, 'brown': 0, 'quick': 6, 'the': 7, 'lazy': 4, 'jumped': 3}
vector = vectorizer.transform(text)print(vector.toarray())# Output: [[1 1 1 1 1 1 1 2]]
From the above example, you can see that each unique word is assigned an index, and the count of each word is used to represent the text. In the given text, only the word “the” occurs twice, and it has index 7. Hence, the value at position 7 is 2 and the rest of the values are 1.
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

vectorizer = TfidfVectorizer()
vectorizer.fit(text)
print(vectorizer.vocabulary_)
# Output: {'fox': 2, 'lazy': 4, 'dog': 1, 'quick': 6, 'the': 7, 'over': 5, 'brown': 0, 'jumped': 3}

print(vectorizer.idf_)
# Output: [1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718 1.69314718 1.]

vector = vectorizer.transform([text[0]])
print(vector.toarray())
# Output: [[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646 0.36388646 0.42983441]]
From the above example you can observe that the words common to the different texts have been downscaled. This lets our machine learning model concentrate on the rarer, low-frequency words, which contribute most to whether a message is spam or not.
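Putting the two classes together for our own data might look like this (a sketch that reuses the train_texts and test_texts arrays from the earlier steps; note that the vectorizers are fitted on the training texts only, and the test texts are merely transformed):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Word counts: fit the vocabulary on the training texts only
count_vectorizer = CountVectorizer()
train_counts = count_vectorizer.fit_transform(train_texts)
test_counts = count_vectorizer.transform(test_texts)

# Reweight the counts with TF-IDF
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)
test_tfidf = tfidf_transformer.transform(test_counts)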
We now use a gradient boosting model called XGBoost, fit it on our training data and measure the prediction accuracy on the test data.
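A sketch of this final step, assuming the xgboost package is installed and reusing the TF-IDF matrices and labels from the previous snippets:

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Train the gradient boosting classifier on the TF-IDF features
model = XGBClassifier()
model.fit(train_tfidf, train_labels)

# Evaluate on the held-out 10%
predictions = model.predict(test_tfidf)
print('Accuracy:', accuracy_score(test_labels, predictions))

XGBoost accepts the sparse matrices produced by the vectorizers directly, so there is no need to convert them to dense arrays.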
Thanks to machine learning, we can continue to be reckless and still not find any spam in our inbox ;)