How I Saved My Newsletter From Bots With Machine Learning

by Maneshwar (@lovestaco), April 15th, 2024

Too Long; Didn't Read

In this article, the author shares their experience of combating bot signups in their newsletter by delving into machine learning. They detail the process of data preparation, model building using BERT, training, and achieving a 100% accurate bot detector after overcoming initial challenges.


My newsletter was overrun by bots! I decided to try a machine learning solution. It was my first ML experiment and I learned a lot. Want to know how I built a bot detector and gained some ML skills along the way? Read on.

The bot invasion

I have a free newsletter that encourages you to read daily.

There are 100+ subscribers, and recently a lot of bots have signed up too.

Bots are signing up to market their own product, newsletters, etc. They usually have a link in the name field and the message that they want to convey.


Example:

Email | Name
[email protected] | 🔶 Withdrawing 32 911 Dollars. Gо tо withdrаwаl >>> https://forms.yandex.com/cloud/65e6228102848f1a71edd8c9?hs=0cebe66d8b7ba4d5f0159e88dd472e8b& 🔶


These spammy signups aren't just annoying, they're a real headache! I was tired of manually blocking bot emails and worrying about how they might hurt my email reputation. I knew I had plenty of options for filtering out bot signups with traditional methods such as CAPTCHA, double opt-in, regex patterns, or honeypot fields in the form.


At the same time, I felt I was falling behind on newer technologies, especially machine learning. I wanted to get started but had no idea where to begin. Then one of my mentors, Shrijith, suggested I try solving the bot signup problem with ML.


I felt this was the right experiment I could begin with to learn ML.

And so, I am here with my first machine learning experiment!

What to expect from the model

Picture this: You've built a website with a newsletter signup form. You want to make sure your subscribers are real people, not automated bots. So, you implement a bot detection system. But what does it mean when someone tells you their system is "95% accurate"?


Let me break it down:

Catching true bots

Imagine 100 signups are actually bots.

A 95% sensitive system should correctly identify 95 of them as bots.

5 bots might slip through the cracks and be mistaken for humans (false negatives), which is okay and not a big deal.

Not mistaking humans

Now, imagine 100 signups are from real humans.

A 95% specific system should accurately recognize 95 of these as humans.

However, 5 people could be mistakenly labeled as bots (false positives). This is much worse: a real person gets ignored, which means a lost potential business lead (and, more generally, an injustice to that person).

The formulas

Sensitivity = True Bots Detected / (True Bots Detected + Bots Missed)
The system's ability to find true bots.

Specificity = True Humans Detected / (True Humans Detected + Humans Mistaken for Bots)
The system's ability to avoid mislabeling real people.

Accuracy = (True Bots Detected + True Humans Detected) / (Total Signups)
Overall correctness, but it can be misleading if your dataset has way more of one type (bots or humans).

If all three are 1.0 then congrats you have the perfect model.
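Here is a minimal sketch of the three formulas in plain Python, using the 100-bots and 100-humans example from above. The function and parameter names are my own choices for illustration.

```python
def sensitivity(true_bots: int, bots_missed: int) -> float:
    """Share of actual bots that were flagged as bots."""
    return true_bots / (true_bots + bots_missed)


def specificity(true_humans: int, humans_flagged: int) -> float:
    """Share of actual humans that were recognized as humans."""
    return true_humans / (true_humans + humans_flagged)


def accuracy(true_bots: int, true_humans: int, total: int) -> float:
    """Share of all signups that were classified correctly."""
    return (true_bots + true_humans) / total


# 100 real bots (95 caught, 5 missed) and 100 real humans (95 recognized, 5 flagged).
print(sensitivity(95, 5))     # 0.95
print(specificity(95, 5))     # 0.95
print(accuracy(95, 95, 200))  # 0.95
```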

One big mental mistake

I used to underestimate the power of data when training machine learning models. I assumed that algorithms would simply "figure it out" no matter what I fed them.




With a small dataset of 103 signups (only 12 bots!), I threw it at Decision Trees, Logistic Regression, and Random Forest models.

I got an initial accuracy of 77%, but that was a classic overfitting trap. My models were just memorizing the training data, useless for real-world scenarios.


Frustrated, I jumped to transformers, thinking the solution lay in fancy algorithms. I got a slight boost to 87.4%, which was a relief but still left much to be desired.


To hit that 90% target, I needed to debug. Using a confusion matrix, I finally saw the light: it was the data, not the models, holding me back.


I used SMOTE to balance my dataset so that it had equal numbers of bot and human signups (90 humans and 90 bots), and my accuracy shot up to 94%!
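To see why raw accuracy is a poor guide on the original unbalanced data, consider a do-nothing "model" that labels every signup as human. The sketch below uses the 91 human and 12 bot counts from this article. The baseline itself is a hypothetical illustration, not something I trained.

```python
humans, bots = 91, 12  # class counts from the original 103 signups
total = humans + bots

# Labeling everything as human gets every human right and every bot wrong.
baseline_accuracy = humans / total
baseline_sensitivity = 0 / bots

print(f"accuracy:    {baseline_accuracy:.1%}")     # 88.3%
print(f"sensitivity: {baseline_sensitivity:.1%}")  # 0.0%
```

A score in the high 80s on this data therefore says very little by itself, which is why looking at sensitivity and specificity separately matters.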



Long story short: How I got to the 100% accurate bot detector

Note: my full dataset is 180 rows (after balancing).

1. Preparation

  1. Imports for models and packages
  2. Extracting data from my newsletter database to CSV.

2. Creating the Dataset

  1. BERT cannot take the raw database rows directly.
  2. A tokenizer has to break the text into tokens (units BERT can work with). I created a class (NewsletterCollectionDataset) to handle this.

3. Splitting data and loading

  1. I split the data into three sets
    • training (to teach the model) 144 rows,
    • validation (to check progress during training) 18 rows, and
    • testing (for final evaluation) 18 rows.
  2. Then a function(create_data_loader) turns each of those data splits into 'DataLoaders' which the model can easily train on.

4. Building the model

  1. BotClassifier is a class where my bot-detection model is defined.
  2. It's based on BERT but adds some extra layers:
    • bert: Loads the pre-trained BERT model.
    • drop: A technique called 'dropout' to help prevent overfitting (the model memorizing too much about the training data).
    • out: A final output layer to turn BERT's output into the prediction (bot or human).
  3. Setting up the Model:
    • Get the model ready to run.
    • Specify an optimizer (AdamW).
    • Learning rate scheduler for how the model's learning changes over time.

5. Training the model

  1. Setting the model to training mode.
  2. Looping through the data and updating the model's knowledge(backward propagation) using the optimizer.

6. The main function

  1. A function(start_training) where a loop is present.
  2. This loop runs for a fixed number of epochs (training cycles). In each epoch:
    • The model trains on the training data.
    • The model is evaluated on the validation data.
    • The best-performing model is saved.

7. Final Evaluation

  1. A function(evaluate_model) to get the truest sense of how well the model has learned to generalize to unseen data.
  2. After training was done, I evaluated the model one more time on the held-out testing set (test_data_loader).
  3. A function(test_with_single_data) to test out a signup on the model.

Now I will try to explain the above stages as simply as possible.


How did I create the Dataset?

I have name and email fields in the newsletter signup without verification. I manually blacklisted all the bots in the email service Listmonk.


So the raw data looks like:

Status | Email | Name
Available | [email protected] | athreya c
Blocklisted | [email protected] | 🔶 Withdrawing 32 911 Dollars. Gо tо withdrаwаl >>> https://forms.yandex.com/cloud/65e6228102848f1a71edd8c9?hs=0cebe66d8b7ba4d5f0159e88dd472e8b& 🔶


This was good enough for me to experiment. I converted the data above into a simple two-column format (name_email and bot) so that I could train on it easily.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/usrername/repo/dataset.csv')[['name_email', 'bot']]
df.head(2)


What are the numbers for training and testing?

I had 103 signup emails. 91 were human and 12 were bot.

I generated data with SMOTE in such a way that I ended up with 90 bots and 90 humans.

Finally, I used 144 signup data entries for training the model, 18 for testing, and 18 for validating.
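To give a feel for what balancing does, here is a deliberately simpler stand-in: random oversampling, which duplicates minority-class rows until the classes match. This is not SMOTE (SMOTE synthesizes new points by interpolating between numeric feature vectors), and the rows and the helper name are hypothetical. It also lands on 91/91 rather than the 90/90 I ended up with.

```python
import random


def oversample_minority(rows, label_key="bot", seed=42):
    """Duplicate random minority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # k is 0 for the majority class, so nothing is added there.
        extra = rng.choices(group, k=target - len(group))
        balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical rows mirroring the class counts: 91 humans (0) and 12 bots (1).
rows = [{"name_email": f"human {i}", "bot": 0} for i in range(91)]
rows += [{"name_email": f"bot {i}", "bot": 1} for i in range(12)]

balanced = oversample_minority(rows)
counts = {label: sum(r["bot"] == label for r in balanced) for label in (0, 1)}
print(counts)  # {0: 91, 1: 91}
```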

Data preparation

We use Pandas, Torch, and Sklearn. Sklearn's train_test_split handles splitting the data into training, validation, and testing sets.

from sklearn.model_selection import train_test_split as tts
  
INITIAL_TEST_SIZE = 0.2
RANDOM_SEED = 42
VALIDATION_SIZE = 0.5

# Splits the dataset into a training set (for model training) and a testing set (for evaluating its performance).

df_train, df_test = tts(df,
                      test_size=INITIAL_TEST_SIZE,
                      random_state=RANDOM_SEED
                      )

# Further splits the testing set into a validation set (for tuning model parameters) and a final testing set.

df_val, df_test = tts(df_test,
                    test_size=VALIDATION_SIZE,
                    random_state=RANDOM_SEED,
                    )
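As a quick sanity check, the split sizes can be worked out by hand. This sketch assumes the 180-row balanced dataset and relies on train_test_split rounding the test portion up and giving the remainder to the other part.

```python
import math

TOTAL_ROWS = 180
INITIAL_TEST_SIZE = 0.2
VALIDATION_SIZE = 0.5

# First split: 20% is held out, the rest is used for training.
held_out = math.ceil(TOTAL_ROWS * INITIAL_TEST_SIZE)
train_rows = TOTAL_ROWS - held_out

# Second split: the held-out rows are divided between validation and testing.
test_rows = math.ceil(held_out * VALIDATION_SIZE)
val_rows = held_out - test_rows

print(train_rows, val_rows, test_rows)  # 144 18 18
```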


Custom Dataset

NewsletterCollectionDataset Class

This class defines a dataset that can be used with PyTorch models. It takes care of preprocessing the raw name email data using a BERT tokenizer and converting it into suitable input for a machine-learning model.

# Provide tools for creating custom datasets and loading data in batches for machine learning. 
import torch
from torch.utils.data import Dataset

class NewsletterCollectionDataset(Dataset):
    """
    Args:
        bots: Labels for each sample (0 or 1).
        name_emails: List of name email text samples.
        tokenizer: BERT tokenizer for preprocessing.
        max_len: Maximum sequence length.
    """

    def __init__(self, bots, name_emails, tokenizer, max_len):
        self.name_emails = name_emails
        self.bots = bots
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.name_emails)


This is the heart of the class.


Here's what happens:

  • Grabs a name email signup and its bot/human label.
  • Uses the BERT tokenizer to turn the text into numbers the model understands.
  • Bundles everything neatly with labels ready for PyTorch.


    def __getitem__(self, i):
        name_email = str(self.name_emails[i])
        bot = self.bots[i]
        encoding = self.tokenizer.encode_plus(
            name_email,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,
            return_token_type_ids=False,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_attention_mask=True,
            return_tensors='pt'
        )
        return {
            'name_email': name_email,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'bot': torch.tensor(bot, dtype=torch.long)
        }
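To make input_ids and attention_mask less abstract, here is a toy sketch of the padding step only. It does not use the real tokenizer, and the helper name and example ids are hypothetical (in the BERT vocabulary, 101 and 102 are the [CLS] and [SEP] markers and 0 is the padding id). A real tokenizer also keeps the closing special token when it truncates, which this simplification ignores.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token ids to max_len and build the matching attention mask."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)          # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# Illustrative ids for a short signup string.
input_ids, attention_mask = pad_and_mask([101, 7592, 2088, 102], max_len=8)
print(input_ids)       # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, so every sequence in a batch can share the same length.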


Data Loaders

create_data_loader Function

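Creates DataLoader objects, which handle loading data in batches for the training, validation, and testing sets. Note that DataLoader does not shuffle by default; pass shuffle=True for the training loader if you want shuffling.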

from torch.utils.data import DataLoader
from transformers import BertTokenizer

def create_data_loader(df, tokenizer, max_len, batch_size):
    """
    Args:
        df (pandas.DataFrame): The DataFrame containing email name data and 'bot' labels.
        tokenizer: The BERT tokenizer for text preprocessing.
        max_len (int): The maximum length for tokenized sequences.
        batch_size (int): Number of samples per batch.

    Returns:
        torch.utils.data.DataLoader: A DataLoader instance for iterating over the dataset.
    """

    ds = NewsletterCollectionDataset(
        bots=df['bot'].to_numpy(),
        name_emails=df['name_email'].to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )

    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=4
    )


Creating model data for training, validation, and testing using the data loaders.

# Loads the BERT tokenizer for text preprocessing.
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
TOKENIZER = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
    
# Maximum sequence length for tokenization.
MAX_LEN=512

# Batch size for training.
BATCH_SIZE=16

train_data_loader = create_data_loader(df_train, TOKENIZER, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, TOKENIZER, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, TOKENIZER, MAX_LEN, BATCH_SIZE)


The Model: BERT Plus a Bit More

My core model (BotClassifier) isn't crazy complex. Think of it like this:

BERT Does the Heavy Lifting: I feed BERT those name email signups and it turns them into meaningful representations.

import torch.nn as nn
from transformers import BertModel

class BotClassifier(nn.Module):
    """
    Args:
        n_classes (int): The number of output classes (e.g., 2 for bot vs. human).
    """

    def __init__(self, n_classes):
        super(BotClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)


Dropout: Little Bit of Randomness
Dropout randomly zeroes out some activations during training, making the model less prone to overfitting.

The Output Layer: "Bot" or "Not"?
A simple linear layer takes BERT's output and makes the final prediction.

The forward method defines the forward pass through the bot classification model.


    def forward(self, input_ids, attention_mask):
        """
        Args:
            input_ids (torch.Tensor): Tokenized input sequences.
            attention_mask (torch.Tensor): Attention mask indicating real vs. padded tokens.

        Returns:
            torch.Tensor: The model's output logits (unnormalized class scores).
        """

        pooled_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )[1]
        output = self.drop(pooled_output)
        return self.out(output)

# Check for CUDA (GPU) availability; otherwise defaults to CPU.
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = BotClassifier(n_classes=2)
model = model.to(DEVICE)


What did the training involve?

The train function is where I teach this model to spot the bots.

import numpy as np
def train(
    model,
    loss_fn,
    optimizer,
    scheduler,
    device,
    data_loader,
    n_examples
):
    """
    Args:
        model (nn.Module): The PyTorch model to train.
        loss_fn (nn.Module): The loss function for calculating error.
        optimizer (torch.optim.Optimizer): The optimizer used for updating model parameters.
        scheduler: A learning rate scheduler to adjust learning rate during training.
        device (torch.device): The device where the model and data should be loaded ('cpu' or 'cuda')
        data_loader (torch.utils.data.DataLoader): A DataLoader providing batches of training data.
        n_examples (int): The total number of training examples in the dataset.

    Returns:
        tuple: A tuple containing:
            * train_acc (float): Training accuracy for the epoch.
            * train_loss (float): Average training loss for the epoch.
    """

    model = model.train()  # Sets the model to training mode

    losses = []
    correct_predictions = 0


For each batch of data, it:

  • Feeds data to the model.
    for d in data_loader:
        # Data preparation
        input_ids = d['input_ids'].to(device)
        attention_mask = d['attention_mask'].to(device)
        targets = d['bot'].to(device)

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)


  • Calculates how wrong the model was (that's the loss).
        # Loss calculation
        loss = loss_fn(outputs, targets)

        # Accuracy calculation
        _, preds = torch.max(outputs, dim=1)
        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())


  • Tweaks the model to be better next time (backpropagation and the optimizer).
  • Learning rate magic: The scheduler adjusts the learning rate, so the model learns quickly at first and then fine-tunes itself.
        # Back propagation
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient clipping

        # Optimization
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    train_acc = correct_predictions.double() / n_examples
    train_loss = np.mean(losses)

    return train_acc, train_loss
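The scheduler itself is not shown here, so the following is only a sketch of one common choice: a linear schedule that decays the learning rate to zero over all training steps (the shape that, for example, get_linear_schedule_with_warmup in transformers produces). The base learning rate of 2e-5 and the zero warmup are assumptions. The batch size, epoch count, and row count are the ones used in this article.

```python
import math


def linear_decay_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BATCH_SIZE, EPOCHS, TRAIN_ROWS = 16, 5, 144  # values from the article
BASE_LR = 2e-5                               # assumed, not stated in the article

steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
print(total_steps)  # 45

# Learning rate at the start, the middle, and the end of training.
for step in (0, total_steps // 2, total_steps):
    print(step, linear_decay_lr(step, total_steps, BASE_LR))
```

Because scheduler.step() is called once per batch, the schedule advances 9 times per epoch with these numbers.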

from collections import defaultdict

# Assumes loss_fn, optimizer, scheduler, and evaluate_model are already defined.
history = defaultdict(list)
EPOCHS = 5

def start_training():
    best_accuracy = 0

    for epoch in range(EPOCHS):
        print(f'Epoch {epoch + 1}/{EPOCHS}')
        print('-' * 10)


This is where the core learning happens for one epoch. Accuracy and loss (how wrong the model is) are calculated on your training data.

        train_acc, train_loss = train(
            model,
            loss_fn,
            optimizer,
            scheduler,
            DEVICE,
            train_data_loader,
            len(df_train)
        )
        print(f'Train loss {train_loss} accuracy {train_acc}')


The evaluate_model function tests how well the model is doing on a validation dataset it hasn't trained on. This helps spot overfitting.

        val_acc, val_loss = evaluate_model(
            model,
            loss_fn,
            DEVICE,
            val_data_loader,
            len(df_val)
        )
        print(f'Validation loss {val_loss} accuracy {val_acc}\n')


If the model beats its previous best performance on the validation set, it's saved.

        history['train_acc'].append(train_acc)
        history['train_loss'].append(train_loss)
        history['val_acc'].append(val_acc)
        history['val_loss'].append(val_loss)
        if val_acc > best_accuracy:
            torch.save(model.state_dict(), 'best_detector_model.bin')
            best_accuracy = val_acc

start_training()


Is it working?

Testing the model with a signup

Single Signups: The test_with_single_data Function demonstrates how to use the model on one signup at a time.

Prepping the Input: Just like during training, we use our trusty BERT tokenizer (TOKENIZER) to turn a new signup into the right format.

def test_with_single_data(data_to_test):
    """Tests a single signup to determine if it's likely from a bot or human.

    Args:
        data_to_test (str): The name and email data from a newsletter signup.

    Prints:
        The input signup data along with the model's prediction (bot or human).
    """

    # Tokenize and prepare input data for the model
    encoding = TOKENIZER.encode_plus(
        data_to_test,
        add_special_tokens=True,
        max_length=MAX_LEN,
        truncation=True,
        return_token_type_ids=False,
        padding='max_length',  # replaces the deprecated pad_to_max_length=True
        return_attention_mask=True,
        return_tensors="pt",
    )
    input_ids = encoding["input_ids"].to(DEVICE)
    attention_mask = encoding["attention_mask"].to(DEVICE)


To the Model!: The model spits out a prediction, and we turn its numbers into a probability using torch.nn.functional.softmax.

    # Set model to evaluation mode and run prediction
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        prob = torch.nn.functional.softmax(outputs, dim=1)

    # Get the class prediction (0 = human, 1 = bot)
    prediction = torch.argmax(prob, dim=1).item()


Bot or Not? Based on that probability, we decide whether it's likely a bot or a real human signup.

    # Print the input data and the prediction result
    print(f"Input Name Email: {data_to_test}")
    if prediction == 1:
        print("The signup is likely from a bot.  \n")
    else:
        print("The signup is likely from a human. \n")
        
email = "[email protected]"
name = "Rishi C "

email2 = "[email protected]"
name2 = "🔶Lama2.  G t 12     "

test_with_single_data(name+email)
test_with_single_data(name2+email2)
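For intuition on the softmax and argmax steps, here is a plain-Python sketch of what happens to the two output scores. The logits are made-up numbers, not real model output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]                  # made-up scores for [human, bot]
probs = softmax(logits)
prediction = probs.index(max(probs))  # plays the role of torch.argmax

print([round(p, 3) for p in probs])           # [0.029, 0.971]
print("bot" if prediction == 1 else "human")  # bot
```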



The method I used for debugging to go from 87% to 94%

When I wanted to gain more accuracy, I didn't exactly know what was going wrong.

So when I implemented and understood the Confusion Matrix, it displayed one False Positive.


Let’s take a look at what a confusion matrix is:



The confusion matrix is a simple and powerful tool that provides a clear picture of how well the classification happens.


Sklearn provides a function called confusion_matrix to compute it; the heatmap below visualizes the result.

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# y_test and y_pred are tensors holding the true and predicted labels for the test set.
cm = confusion_matrix(y_test.numpy(), y_pred.numpy())
custom_colors = ['#f0a9b1', '#a9f0b9']
sns.heatmap(cm, annot=True, cmap=custom_colors, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()


For plotting the confusion matrix, I used Matplotlib and the Seaborn library in Python.


Think of it like a truth table for your model.


It lays everything out:

  • True Negative (Top left): 6 - The model correctly identified 6 human signups.
  • False Positive (Top right): 0 - The model incorrectly identified 0 human signups as bots.
  • False Negative (Bottom Left): 1 - The model incorrectly identified 1 bot signup as human.
  • True Positive (Bottom right): 11 - The model correctly identified 11 bot signups.
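The four cells can also be tallied by hand and turned back into the metrics from earlier. In this sketch the label lists are reconstructed to match the counts above (6 humans, 12 bots, one bot missed); they are not the actual test rows, and the helper name is my own.

```python
def confusion_counts(y_true, y_pred):
    """Tally (tn, fp, fn, tp) for binary labels where 1 = bot and 0 = human."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp


# Reconstructed labels: 6 humans, then 12 bots of which the first is missed.
y_true = [0] * 6 + [1] * 12
y_pred = [0] * 6 + [0] * 1 + [1] * 11

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp)                     # 6 0 1 11
print(round(tp / (tp + fn), 3))           # sensitivity: 0.917
print(round(tn / (tn + fp), 3))           # specificity: 1.0
print(round((tp + tn) / len(y_true), 3))  # accuracy: 0.944
```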


Coming back to the original problem, I had one False Positive. That meant the model was wrongly flagging a real person as a bot! After a quick look at my data with my show_misclasified() function, I realized I had mislabeled data during my balancing act. A single human mislabeled as a bot was causing the dip.


One fix, one retrain, and done – 94% accuracy!

Conclusion


My bot detector achieved a 91.6% success rate catching bots, with a perfect score (100%) identifying real subscribers. Not bad, since accidentally blocking a real person (false positive) is a much bigger concern than missing a sneaky bot.


This is a good start, but I'm always looking to improve. I'll be gathering more data and experimenting to see if I can boost the accuracy even further.


Want to stay updated on my progress? Subscribe to our journal for next week's content on fine-tuning Stable Diffusion!