In our daily life, we get lots of emails. Some emails are useful and some are not. An unsolicited email sent in bulk is a spam email. We generally do not want spam emails, so spam classifiers move them to the spam folder before they appear in our inbox. According to Statista, around 29% of the emails sent in 2019 were spam emails. It has been studied that spam emails impede economic growth and cause a loss of billions of dollars of GDP. Rao and Reiley claim the economic loss would be over $1 trillion if firms were not investing in anti-spam technology.

These statistics are sufficient to underscore the importance of spam filters. As machine learning and deep learning have progressed, spam filters have adopted them to protect customers, and they have been successful to a large extent. From saving email reading time to protecting customers from fraud, deceit, and phishing, spam filters have done excellent work in preventing losses and increasing efficiency.

Email Classification Using Naive Bayes Classifiers

Today, let’s scratch the surface of spam email classification using one of the simplest techniques, called naive Bayes classification. Naive Bayes classifiers are based on Bayes’ theorem, a theorem that gives the probability of an event based on prior knowledge of conditions related to the event. It can be used to build a naive but good enough spam classifier, and we will see it in action using a Python machine learning library, Sklearn.

At first, let’s import the relevant libraries, sub-packages, modules, and classes.

    import matplotlib.pyplot as plt
    import nltk
    import numpy as np
    import pandas as pd
    import seaborn as sns

In addition, let's import some methods, functions, and classes from Scikit-learn (Sklearn), one of the most widely used libraries in data science.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, precision_score, recall_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.utils.multiclass import unique_labels

Now, let’s download the email dataset (around 5500 rows) from the dataset URL, which I got from AIDevNepal’s GitHub repository. The dataset contains non-spam emails and spam emails. Also, let’s convert the labels to numerical values, 1 for spam and 0 for non-spam.

    data = pd.read_csv('https://raw.githubusercontent.com/AiDevNepal/ai-saturdays-workshop-8/master/data/spam.csv')
    data['target'] = np.where(data['target'] == 'spam', 1, 0)

Shall we peek into the data?

    data.head(10)

Before training, let’s divide the dataset into training and testing sets. By default, Sklearn's train_test_split holds out 25% of the data for testing, so the split is 75:25.

    X_train, X_test, Y_train, Y_test = train_test_split(data['text'],
                                                        data['target'],
                                                        random_state=0)

Our raw dataset consists of email messages. We cannot feed such raw text to machine learning algorithms. Machine learning algorithms train models by doing computation, and that computation requires numerical values. So, let’s extract features from the raw dataset for training. To do that, we transform all the email messages into vectorized form using the CountVectorizer class. Here, we take unigrams and bigrams, and fit the vectorizer on the training examples.

    vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(X_train)
    X_train_vectorized = vectorizer.transform(X_train)  # extract features

Now, we create a multinomial naive Bayes model using the Sklearn API and train it with the features we just created. Naive Bayes is a performant machine learning algorithm on small datasets. It often generalizes well from a small number of training examples, a setting where complex models like neural networks tend to struggle.

    model = MultinomialNB(alpha=0.1)
    model.fit(X_train_vectorized, Y_train)

Let’s test the model by making predictions on the testing set. We transform the raw test data using the vectorizer we previously created.

    predictions = model.predict(vectorizer.transform(X_test))
    print("Accuracy:", 100 * sum(predictions == Y_test) / len(predictions), '%')

The accuracy of our model on the testing data is a whopping 98.99%. WOW!!!

Now, let's test our model with real-life emails and see what it predicts.

    model.predict(vectorizer.transform(
        [
            "Thank you, ABC. Can you also share your LinkedIn profile? As you are a good at programming at pyhthon, would be willing to see your personal/college projects.",
            "Hi y’all, We have a Job Openings in the positions of software engineer, IT officer at ABC Company.Kindly, send us your resume and the cover letter as soon as possible if you think you are an eligible candidate and meet the criteria.",
            "Dear ABC, Congratulations! You have been selected as a SOftware Developer at XYZ Company. We were really happy to see your enthusiasm for this vision and mission. We are impressed with your background and we think you would make an excellent addition to the team.",
        ])
    )

Are you eager to see what our model predicts? Okay, here it is. The model's prediction for all three emails is 0. And as we previously defined, 0 means non-spam. That’s right! I just tested with emails I received from my employers, colleagues, and friends.

Okay, what about the spam emails in my spam folder? Let’s test them.

    model.predict(vectorizer.transform(
        [
            "congratulations, you became today's lucky winner",
            "1-month unlimited calls offer Activate now",
            "Ram wants your phone number",
        ])
    )

Nailed it! The model predicts every one of these examples as spam.

You are a savior ❤

As we saw, the classifier turned out to be a savior for me in the end; otherwise, I would have been a victim of some fraudulent activities or phishing attempts. Now, how about testing your own emails and seeing how this naive algorithm performs?
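If you are curious what a multinomial naive Bayes model computes under the hood, the following is a minimal from-scratch sketch in plain Python. It is not Sklearn's implementation: the toy messages are made up, the function names train_naive_bayes and predict are hypothetical, and it uses unigrams only (the article's vectorizer also uses bigrams). It applies Bayes' theorem with additive smoothing, the same role the alpha parameter plays in MultinomialNB.

```python
import math
from collections import Counter

# Toy, made-up training messages (assumption: not taken from the article's dataset).
# Label 1 means spam and 0 means non-spam, as in the article.
TRAIN = [
    ("win a free prize now", 1),
    ("free calls offer activate now", 1),
    ("meeting agenda for monday", 0),
    ("please share your resume", 0),
]


def train_naive_bayes(examples, alpha=0.1):
    """Return log priors, smoothed log likelihoods, and the vocabulary."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())

    vocab = set(word_counts[0]) | set(word_counts[1])
    total = sum(class_counts.values())

    # log P(class): fraction of training messages in each class
    log_prior = {c: math.log(class_counts[c] / total) for c in class_counts}

    # log P(word | class) with additive smoothing controlled by alpha
    log_likelihood = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(text, log_prior, log_likelihood, vocab):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.lower().split() if w in vocab
        )
    return max(scores, key=scores.get)


log_prior, log_likelihood, vocab = train_naive_bayes(TRAIN)
print(predict("free prize offer", log_prior, log_likelihood, vocab))          # prints 1
print(predict("share the meeting agenda", log_prior, log_likelihood, vocab))  # prints 0
```

Working in log space avoids multiplying many tiny probabilities together, and the smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.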


Building Spam Classification Using The Naive Bayes Algorithm
