My newsletter was overrun by bots! I decided to try a machine learning solution. It was my first ML experiment and I learned a lot. Want to know how I built a bot detector and gained some ML skills along the way? Read on.
I have a free newsletter that encourages you to read daily.
There are 100+ subscribers, and recently a lot of bots have signed up too.
Bots sign up to market their own products, newsletters, and so on. They usually stuff a link and the message they want to convey into the name field.
Example:
| Email | Name |
|---|---|
| | 🔶 Withdrawing 32 911 Dollars. Gо tо withdrаwаl >>> https://forms.yandex.com/cloud/65e6228102848f1a71edd8c9?hs=0cebe66d8b7ba4d5f0159e88dd472e8b& 🔶 |
These spammy signups aren't just annoying, they're a real headache! I was tired of manually blocking bot emails and worrying about how they might hurt my email reputation. I know I could filter out bot signups with traditional methods such as CAPTCHA, double opt-in, regex patterns, or honeypot fields in the form.
At the same time, I had a feeling that I was falling behind in adopting newer technologies, especially in the Machine Learning field. I wanted to get started but didn't have a clue about where to begin. Then one of my mentors, Shrijith, suggested that I try building an ML solution to the bot signup problem.
I felt this was the right experiment I could begin with to learn ML.
And so, I am here with my first machine learning experiment!
Picture this: You've built a website with a newsletter signup form. You want to make sure your subscribers are real people, not automated bots. So, you implement a bot detection system. But what does it mean when someone tells you their system is "95% accurate"?
Let me break it down:
Imagine 100 signups are actually bots.
A 95% sensitive system should correctly identify 95 of them as bots.
The other 5 bots slip through the cracks and are mistaken for humans (false negatives). That is tolerable and not a big deal.
Now, imagine 100 signups are from real humans.
A 95% specific system should accurately recognize 95 of these as humans.
However, 5 people would be mistakenly labeled as bots (false positives). This is much worse: a real person gets ignored, which means a lost business lead (and, more generally, an injustice).
Sensitivity = True Bots Detected / (True Bots Detected + Bots Missed): the system's ability to find true bots.
Specificity = True Humans Detected / (True Humans Detected + Humans Mistaken for Bots): the system's ability to avoid mislabeling real people.
Accuracy = (True Bots Detected + True Humans Detected) / (Total Signups): overall correctness, but it can be misleading if your dataset has way more of one type (bots or humans).
If all three are 1.0, then congrats, you have the perfect model.
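To make the three formulas concrete, here is a minimal sketch in plain Python. The function name `bot_detector_metrics` and its argument names are made up for illustration; the first call uses the 100-bots / 100-humans example from above, and the second shows why accuracy alone can mislead on an imbalanced set such as 12 bots and 91 humans.

```python
def bot_detector_metrics(bots_caught, bots_missed, humans_passed, humans_flagged):
    """Return (sensitivity, specificity, accuracy) from the four outcome counts."""
    sensitivity = bots_caught / (bots_caught + bots_missed)
    specificity = humans_passed / (humans_passed + humans_flagged)
    total = bots_caught + bots_missed + humans_passed + humans_flagged
    accuracy = (bots_caught + humans_passed) / total
    return sensitivity, specificity, accuracy


# The example from the text: 100 bots (95 caught) and 100 humans (95 recognized).
print(bot_detector_metrics(95, 5, 95, 5))  # (0.95, 0.95, 0.95)

# A lazy "model" that calls everyone human, on 12 bots and 91 humans.
# It never finds a bot, yet its accuracy still looks respectable.
sens, spec, acc = bot_detector_metrics(0, 12, 91, 0)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

The second call reports a sensitivity of zero next to an accuracy of roughly 0.88, which is the kind of gap that makes it worth tracking all three numbers.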
I used to underestimate the power of data when training machine learning models. I assumed that algorithms would simply "figure it out" no matter what I fed them.
With a small dataset of 103 signups (only 12 bots!), I threw it at Decision Trees, Logistic Regression, and Random Forest models.
I got an initial accuracy of 77%, but that was a classic overfitting trap. My models were just memorizing the training data, useless for real-world scenarios.
Frustrated, I jumped to transformers, thinking the solution lay in fancy algorithms. I got a slight boost to 87.4%, which was a relief but still left much to be desired.
To hit that 90% target, I needed to debug. Using a confusion matrix, I finally saw the light: it was the data, not the models, holding me back.
I used SMOTE to balance my dataset so that it had equal numbers of bot and human signups (90 humans and 90 bots), and my accuracy shot up to 94%!
Note: the balanced dataset has 180 rows in total.
The code is organized into these stages:
- NewsletterCollectionDataset: a class that prepares the raw signup data for the model.
- create_data_loader: a function that turns each of the data splits into 'DataLoaders' which the model can easily train on.
- BotClassifier: a class where my bot-detection model is defined.
- start_training: a function where the training loop is present.
- evaluate_model: a function that runs on the test data (test_data_loader) to get the truest sense of how well the model has learned to generalize to unseen data.
- test_with_single_data: a function to test out a single signup on the model.
Now I will try to explain the above stages as simply as possible.
I have name and email fields in the newsletter signup without verification. I manually blacklisted all the bots in the email service Listmonk.
So the raw data look like:
| Status | Email | Name |
|---|---|---|
| Available | | athreya c |
| Blocklisted | | 🔶 Withdrawing 32 911 Dollars. Gо tо withdrаwаl >>> https://forms.yandex.com/cloud/65e6228102848f1a71edd8c9?hs=0cebe66d8b7ba4d5f0159e88dd472e8b& 🔶 |
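This was good enough for me to experiment. I converted the above data into a simple two-column format (the combined name and email text, plus a bot label) so that I could train on it easily.
import pandas as pd

# Load only the combined name+email text and the bot label (0 = human, 1 = bot).
df = pd.read_csv('https://raw.githubusercontent.com/usrername/repo/dataset.csv')[['name_email', 'bot']]
df.head(2)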
I had 103 signup emails. 91 were human and 12 were bot.
I generated data with SMOTE in such a way that I had 90 bots and 90 humans.
Finally, I used 144 signup data entries for training the model, 18 for testing, and 18 for validating.
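The core idea behind SMOTE is to create new minority-class points by interpolating between an existing minority sample and one of its nearest minority neighbours. Below is a hand-rolled sketch of that idea in plain Python, not the implementation used for this experiment: the function `smote_like_oversample`, the 2-D feature vectors, and the choice of `k=3` are all assumptions for illustration, and the article does not say how the signup text was turned into numeric features. In practice a library such as imbalanced-learn is the usual route.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up 2-D feature vectors standing in for the 12 bot signups.
bots = [[float(i), float(i % 3)] for i in range(12)]
new_bots = smote_like_oversample(bots, n_new=78)
print(len(bots) + len(new_bots))  # 90
```

Every synthetic point lies on a line segment between two real bot samples, so the new rows stay inside the region the real bots already occupy.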
I use the Pandas, Torch, and Sklearn packages; Sklearn's train_test_split utility handles splitting the data into training, validation, and testing sets.
from sklearn.model_selection import train_test_split as tts
INITIAL_TEST_SIZE = 0.2
RANDOM_SEED = 42
VALIDATION_SIZE = 0.5
# Splits the dataset into a training set (for model training) and a testing set (for evaluating its performance).
df_train, df_test = tts(
    df,
    test_size=INITIAL_TEST_SIZE,
    random_state=RANDOM_SEED
)
# Further splits the testing set into a validation set (for tuning model parameters) and a final testing set.
df_val, df_test = tts(
    df_test,
    test_size=VALIDATION_SIZE,
    random_state=RANDOM_SEED,
)
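As a quick sanity check on the two-step split, the arithmetic can be sketched in plain Python. The helper `split_sizes` is hypothetical and simply rounds; for the 180-row dataset the fractions come out to whole numbers, so it matches the 144 / 18 / 18 figures quoted above, but scikit-learn's own rounding may differ when the fractions do not divide evenly.

```python
def split_sizes(n_rows, initial_test_size=0.2, validation_size=0.5):
    """Return (train, validation, test) row counts for the two-step split."""
    holdout = round(n_rows * initial_test_size)  # rows set aside in the first split
    n_train = n_rows - holdout
    n_test = round(holdout * validation_size)    # second split: share kept for testing
    n_val = holdout - n_test
    return n_train, n_val, n_test


print(split_sizes(180))  # (144, 18, 18)
```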
NewsletterCollectionDataset Class
This class defines a dataset that can be used with PyTorch models. It takes care of preprocessing the raw name email data using a BERT tokenizer and converting it into suitable input for a machine-learning model.
# Provide tools for creating custom datasets and loading data in batches for machine learning.
import torch
from torch.utils.data import Dataset

class NewsletterCollectionDataset(Dataset):
    """
    Args:
        bots: Labels for each sample (0 or 1).
        name_emails: List of name email text samples.
        tokenizer: BERT tokenizer for preprocessing.
        max_len: Maximum sequence length.
    """
    def __init__(self, bots, name_emails, tokenizer, max_len):
        self.name_emails = name_emails
        self.bots = bots
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.name_emails)
This is the heart of the class.
Here's what happens:
    def __getitem__(self, i):
        # Fetch the i-th signup text and its label.
        name_email = str(self.name_emails[i])
        bot = self.bots[i]
        # Tokenize, truncate, and pad the text to a fixed length.
        encoding = self.tokenizer.encode_plus(
            name_email,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,
            return_token_type_ids=False,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_attention_mask=True,
            return_tensors='pt'
        )
        # Flatten drops the extra batch dimension added by return_tensors='pt'.
        return {
            'name_email': name_email,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'bot': torch.tensor(bot, dtype=torch.long)
        }
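To show what the padding and attention mask in `__getitem__` amount to, here is a toy stand-in written in plain Python. It is not BERT's tokenizer: `encode_toy` is a made-up function that splits on whitespace instead of using WordPiece, the special-token ids (1, 2, 0) are placeholders, and the email address is invented for the example.

```python
def encode_toy(text, max_len, cls_id=1, sep_id=2, pad_id=0):
    """Toy stand-in for a tokenizer call: one id per whitespace token."""
    # Deterministic fake ids, offset so they never collide with the special ids.
    token_ids = [3 + sum(map(ord, tok)) % 1000 for tok in text.split()]
    # Truncate so the start and end markers still fit within max_len.
    token_ids = token_ids[: max_len - 2]
    input_ids = [cls_id] + token_ids + [sep_id]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    # Pad both lists up to max_len; padded positions get mask 0.
    padding = max_len - len(input_ids)
    input_ids += [pad_id] * padding
    attention_mask += [0] * padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


enc = encode_toy("Rishi C rishi@example.com", max_len=8)
print(enc["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every encoded signup ends up with the same length, and the mask tells the model which positions carry real tokens and which are just filler.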
create_data_loader Function
Creates DataLoader objects, which handle loading data in batches for the training, validation, and testing sets. Note that the code below leaves the DataLoader's shuffle option at its default (off).
from torch.utils.data import DataLoader
from transformers import BertTokenizer

def create_data_loader(df, tokenizer, max_len, batch_size):
    """
    Args:
        df (pandas.DataFrame): The DataFrame containing email name data and 'bot' labels.
        tokenizer: The BERT tokenizer for text preprocessing.
        max_len (int): The maximum length for tokenized sequences.
        batch_size (int): Number of samples per batch.
    Returns:
        torch.utils.data.DataLoader: A DataLoader instance for iterating over the dataset.
    """
    ds = NewsletterCollectionDataset(
        bots=df['bot'].to_numpy(),
        name_emails=df['name_email'].to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=4
    )
Creating model data for training, validation, and testing using the data loaders.
# Loads the BERT tokenizer for text preprocessing.
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
TOKENIZER = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
# Maximum sequence length for tokenization.
MAX_LEN=512
# Batch size for training.
BATCH_SIZE=16
train_data_loader = create_data_loader(df_train, TOKENIZER, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, TOKENIZER, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, TOKENIZER, MAX_LEN, BATCH_SIZE)
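With a batch size of 16, it helps to know how many batches each loader produces per pass. The sketch below is plain Python with a hypothetical helper, `batch_sizes`; it assumes the loader keeps the final partial batch, which is the DataLoader default when `drop_last` is not set.

```python
def batch_sizes(n_rows, batch_size):
    """Sizes of the batches produced for n_rows samples, keeping the last partial batch."""
    full, remainder = divmod(n_rows, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


print(len(batch_sizes(144, 16)))  # 9 batches for the training set
print(batch_sizes(18, 16))        # [16, 2] for the validation or test set
```

So one epoch over the 144 training rows means nine optimizer steps, while the 18-row validation and test sets each yield one full batch plus a batch of two.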
My core model (BotClassifier) isn't crazy complex. Think of it like this:
BERT Does the Heavy Lifting: I feed BERT those name email signups and it turns them into meaningful representations.
import torch.nn as nn
from transformers import BertModel

class BotClassifier(nn.Module):
    """
    Args:
        n_classes (int): The number of output classes (e.g., 2 for bot vs. human).
    """
    def __init__(self, n_classes):
        super(BotClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
Dropout: A Little Bit of Randomness. Dropout randomly zeroes out some of the values coming out of BERT during training, making the model less prone to overfitting.
The Output Layer: "Bot" or "Not"? A simple linear layer takes BERT's output and makes the final prediction.
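For intuition about what the dropout layer does to BERT's output before the linear layer sees it, here is a plain-Python sketch. The function `dropout` below is an illustration, not PyTorch's implementation; it follows the same convention as `nn.Dropout`, where surviving values are scaled by 1 / (1 - p) during training and nothing changes in evaluation mode. The input values are made up.

```python
import random


def dropout(values, p=0.3, training=True, seed=None):
    """Zero each value with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: pass values through unchanged
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


features = [0.5, -1.2, 0.8, 2.0, -0.3]
print(dropout(features, p=0.3, seed=0))          # some entries zeroed, the rest scaled up
print(dropout(features, p=0.3, training=False))  # unchanged at prediction time
```

The rescaling keeps the expected size of the values the same in training and evaluation, so the output layer does not see a shift when dropout is switched off.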
Defines the forward pass through the bot classification model.
    def forward(self, input_ids, attention_mask):
        """
        Args:
            input_ids (torch.Tensor): Tokenized input sequences.
            attention_mask (torch.Tensor): Attention mask indicating real vs. padded tokens.
        Returns:
            torch.Tensor: The model's output logits (unnormalized class scores).
        """
        # Index 1 of BERT's output is the pooled representation of the whole sequence.
        pooled_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )[1]
        output = self.drop(pooled_output)
        return self.out(output)
# Check for CUDA (GPU) availability; otherwise defaults to CPU.
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BotClassifier(n_classes=2)
model = model.to(DEVICE)
The train function is where I teach this model to spot the bots.
import numpy as np
def train(
    model,
    loss_fn,
    optimizer,
    scheduler,
    device,
    data_loader,
    n_examples
):
    """
    Args:
        model (nn.Module): The PyTorch model to train.
        loss_fn (nn.Module): The loss function for calculating error.
        optimizer (torch.optim.Optimizer): The optimizer used for updating model parameters.
        scheduler: A learning rate scheduler to adjust learning rate during training.
        device (torch.device): The device where the model and data should be loaded ('cpu' or 'cuda')
        data_loader (torch.utils.data.DataLoader): A DataLoader providing batches of training data.
        n_examples (int): The total number of training examples in the dataset.
    Returns:
        tuple: A tuple containing:
            * train_acc (float): Training accuracy for the epoch.
            * train_loss (float): Average training loss for the epoch.
    """
    model = model.train()  # Sets the model to training mode
    losses = []
    correct_predictions = 0
For each batch of data, it:
    for d in data_loader:
        # Data preparation
        input_ids = d['input_ids'].to(device)
        attention_mask = d['attention_mask'].to(device)
        targets = d['bot'].to(device)
        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Loss calculation
        loss = loss_fn(outputs, targets)
        # Accuracy calculation
        _, preds = torch.max(outputs, dim=1)
        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())
        # Back propagation
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient clipping
        # Optimization
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    train_acc = correct_predictions.double() / n_examples
    train_loss = np.mean(losses)
    return train_acc, train_loss
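The gradient clipping step in the loop above caps the overall size of the gradients before the optimizer uses them. Here is a plain-Python sketch of the idea on a flat list of numbers; `clip_by_global_norm` is a hypothetical helper, and the small `eps` term mirrors the usual guard against division by zero. Real gradients are tensors spread across many parameters, but the scaling rule is the same.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale gradients down so their overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever shrink: the coefficient is capped at 1.0.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    return [g * clip_coef for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # roughly [0.6, 0.8]: same direction, norm close to 1
```

Gradients that are already small pass through untouched, so clipping only kicks in for the occasional oversized update that could destabilize fine-tuning.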
from collections import defaultdict
history = defaultdict(list)
EPOCHS=5
def start_training():
    best_accuracy = 0
    for epoch in range(EPOCHS):
        print(f'Epoch {epoch + 1}/{EPOCHS}')
        print('-' * 10)
This is where the core learning happens for one epoch. Accuracy and loss (how wrong the model is) are calculated on your training data.
        train_acc, train_loss = train(
            model,
            loss_fn,
            optimizer,
            scheduler,
            DEVICE,
            train_data_loader,
            len(df_train)
        )
        print(f'Train loss {train_loss} accuracy {train_acc}')
The evaluate_model function tests how well the model is doing on a validation dataset it hasn't seen before.
This helps prevent overfitting.
        val_acc, val_loss = evaluate_model(
            model,
            loss_fn,
            DEVICE,
            val_data_loader,
            len(df_val)
        )
        print(f'Validation loss {val_loss} accuracy {val_acc}\n')
If the model beats its previous best performance on the validation set, it's saved.
        history['train_acc'].append(train_acc)
        history['train_loss'].append(train_loss)
        history['val_acc'].append(val_acc)
        history['val_loss'].append(val_loss)
        if val_acc > best_accuracy:
            torch.save(model.state_dict(), 'best_detector_model.bin')
            best_accuracy = val_acc

start_training()
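The training code calls evaluate_model, but its body isn't listed in this post. Below is one way it could look, reconstructed as an assumption from how train is written and how evaluate_model is called: the same bookkeeping, but in evaluation mode and without any weight updates. Likewise, loss_fn, optimizer, and scheduler are not shown here; a cross-entropy loss such as nn.CrossEntropyLoss is a common choice for a two-class setup like this one, though that is a guess rather than something the article states.

```python
import torch
import torch.nn as nn


def evaluate_model(model, loss_fn, device, data_loader, n_examples):
    """Hypothetical sketch: return (accuracy, mean loss) without updating weights."""
    model = model.eval()  # evaluation mode switches dropout off
    losses = []
    correct_predictions = 0
    with torch.no_grad():  # gradients are not needed when only measuring
        for d in data_loader:
            input_ids = d['input_ids'].to(device)
            attention_mask = d['attention_mask'].to(device)
            targets = d['bot'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs, targets)
            _, preds = torch.max(outputs, dim=1)
            correct_predictions += torch.sum(preds == targets).item()
            losses.append(loss.item())
    accuracy = correct_predictions / n_examples
    mean_loss = sum(losses) / len(losses)
    return accuracy, mean_loss
```

It takes the same kind of data loader as train, so it can be pointed at either the validation loader or the test loader.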
Single Signups: The test_with_single_data function demonstrates how to use the model on one signup at a time.
Prepping the Input: Just like during training, we use our trusty BERT tokenizer (TOKENIZER) to turn a new signup into the right format.
def test_with_single_data(data_to_test):
    """Tests a single signup to determine if it's likely from a bot or human.
    Args:
        data_to_test (str): The name and email data from a newsletter signup.
    Prints:
        The input signup data along with the model's prediction (bot or human).
    """
    # Tokenize and prepare input data for the model
    encoding = TOKENIZER.encode_plus(
        data_to_test,
        add_special_tokens=True,
        max_length=MAX_LEN,
        truncation=True,
        return_token_type_ids=False,
        padding="max_length",  # replaces the deprecated pad_to_max_length=True
        return_attention_mask=True,
        return_tensors="pt",
    )
    input_ids = encoding["input_ids"].to(DEVICE)
    attention_mask = encoding["attention_mask"].to(DEVICE)
To the Model!: The model spits out a prediction, and we turn its numbers into a probability using torch.nn.functional.softmax.
    # Set model to evaluation mode and run prediction
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        prob = torch.nn.functional.softmax(outputs, dim=1)
    # Get the class prediction (0 = human, 1 = bot)
    prediction = torch.argmax(prob, dim=1).item()
Bot or Not? Based on that probability, we decide whether it's likely a bot or a real human signup.
    # Print the input data and the prediction result
    print(f"Input Name Email: {data_to_test}")
    if prediction == 1:
        print("The signup is likely from a bot. \n")
    else:
        print("The signup is likely from a human. \n")
email = "[email protected]"
name = "Rishi C "
email2 = "[email protected]"
name2 = "🔶Lama2. G t 12 "
test_with_single_data(name+email)
test_with_single_data(name2+email2)
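The softmax-then-argmax step inside test_with_single_data can be traced by hand. Here is a plain-Python sketch of it; `predict_label` is a made-up helper and the logits are invented numbers, not outputs from the trained model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels=("human", "bot")):
    """Pick the most probable class (index 0 = human, 1 = bot)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


label, confidence = predict_label([-1.2, 2.3])  # made-up logits
print(label, round(confidence, 3))
```

Because softmax preserves the ordering of the scores, taking the argmax of the probabilities picks the same class as taking the argmax of the raw logits; the probabilities are mainly useful when you want a confidence value or a custom threshold.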
When I wanted to gain more accuracy, I didn't know exactly what was going wrong.
Once I implemented and understood the Confusion Matrix, it revealed one False Positive.
Let’s take a look at what a confusion matrix is:
The confusion matrix is a simple and powerful tool that provides a clear picture of how well the classification happens.
Sklearn provides a function called confusion_matrix to compute it, and Seaborn's heatmap then visualizes the result.
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# y_test and y_pred are tensors holding the true and predicted labels.
cm = confusion_matrix(y_test.numpy(), y_pred.numpy())
custom_colors = ['#f0a9b1', '#a9f0b9']
sns.heatmap(cm, annot=True, cmap=custom_colors, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
For plotting the confusion matrix, I used Matplotlib and the Seaborn library in Python.
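Think of it like a truth table for your model.
It lays everything out: bots correctly flagged (true positives), humans correctly let through (true negatives), humans wrongly flagged as bots (false positives), and bots that slipped through (false negatives).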
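The four cells of a confusion matrix are just counts, so they can be tallied without any library. Below is a minimal sketch; `confusion_counts` is a made-up helper and the ten labels are toy data, not my actual test set (1 = bot, 0 = human).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four confusion-matrix cells, treating `positive` as the bot label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Toy data: 4 bots and 6 humans, with one bot missed and one human wrongly flagged.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # {'TP': 3, 'TN': 5, 'FP': 1, 'FN': 1}
```

For 0/1 labels, Sklearn's confusion_matrix arranges the same counts with true labels as rows and predictions as columns, so the top-right cell is the false positives and the bottom-left cell is the false negatives.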
Coming back to the original problem, I had one False Positive. That meant the model was wrongly flagging a real person as a bot! After a quick look at my data with my show_misclasified() function, I realized I had mislabeled data during my balancing act. A single human mislabeled as a bot was causing the dip.
One fix, one retrain, and done – 94% accuracy!
My bot detector achieved a 91.6% success rate catching bots, with a perfect score (100%) identifying real subscribers. Not bad, since accidentally blocking a real person (false positive) is a much bigger concern than missing a sneaky bot.
This is a good start, but I'm always looking to improve. I'll be gathering more data and experimenting to see if I can boost the accuracy even further.
Want to stay updated on my progress? Subscribe to our journal for next week's content on fine-tuning Stable Diffusion!