
Binary Classification: Understanding Activation and Loss Functions with a PyTorch Example

by Dmitrii Matveichev August 15th, 2023

Too Long; Didn't Read

To build a binary classification neural network you need to use the sigmoid activation function on its final layer together with binary cross-entropy loss. The final layer size should be 1. Such a neural network will output a probability p that the input belongs to class 1 and 1-p that the input belongs to class 0.


The basic principles required to solve classification tasks with neural networks are used as building blocks in more complicated deep learning problems such as object detection and instance segmentation. Thus, it is important to understand the reasoning behind choosing particular activation and loss functions. This post answers the question "What activation and loss functions do you need to use to solve a binary classification task?" with an example PyTorch implementation that you can run yourself in Google Colab. In the following articles, I'll extend the classification problem to multi-class and multi-label classification and show that you need very few modifications to your code to switch between the three classification problems.

1 Why is it important to understand the activation function and loss used for binary classification?

Traditionally, binary classification models use sigmoid activation and binary cross-entropy (BCE) loss. These two functions are also widely used in more complicated neural networks, such as CNN object detection models and recurrent neural networks. The YOLOX object detection model, for example, uses sigmoid activation and BCE in two of its branches, as you can see in the figure below.

Recurrent neural networks with gated units, such as LSTM, use sigmoid to help the recurrent NN decide whether to update or forget the data.

If you know the logic behind applying sigmoid activation and BCE loss you are one step closer to understanding and building more complicated NN models.

2 Classification problem formulation

In supervised machine learning, a classification problem can be represented as a set of samples {(x_1, y_1), (x_2, y_2),...,(x_n, y_n)}, where x_i is an m-dimensional vector that contains features of sample i and y_i is the class to which x_i belongs. The goal is to build a model that predicts the label y_i for each input sample x_i. There are three types of classification problems:

binary classification - the label y_i can assume one of two values (0 - negative class, 1 - positive class)
multi-class classification - the label y_i can assume exactly one of k values, where k > 2 is the number of classes
multi-label classification - the label y_i can assume any number of the k values (zero, one, or several), where k is the number of class labels
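
To make the difference between the three label formats concrete, here is a small sketch with NumPy. The arrays are made up for illustration (4 samples, k = 3) and are not part of the dataset used later.

```python
import numpy as np

# Hypothetical labels for 4 samples; the values are made up for illustration.
# binary: one value per sample, 0 or 1
y_binary = np.array([0, 1, 1, 0])
# multi-class (k = 3): exactly one class id per sample
y_multiclass = np.array([2, 0, 1, 2])
# multi-label (k = 3): one 0/1 indicator per class, any number may be set
y_multilabel = np.array([[1, 0, 1],
                         [0, 0, 0],
                         [0, 1, 0],
                         [1, 1, 1]])

print(y_binary.shape, y_multiclass.shape, y_multilabel.shape)
print(y_multilabel.sum(axis=1))  # number of labels attached to each sample
```

Note that the second multi-label sample has no labels at all and the last one has all three, which is not possible in the binary or multi-class case.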

Moreover, there are two main types of classifiers:

probabilistic classifiers - output the probability of each class, and the class label is assigned based on the highest class probability. Examples: Naive Bayes, logistic regression, neural networks
deterministic classifiers - output a class label without probability estimates. Examples: k-nearest neighbors, SVM
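
As a rough illustration of how a probabilistic classifier turns probabilities into labels, the sketch below picks the class with the highest probability. The probability values are invented for the example.

```python
import numpy as np

# Hypothetical class probabilities for 3 samples and 2 classes (each row sums to 1).
probabilities = np.array([[0.9, 0.1],
                          [0.3, 0.7],
                          [0.5, 0.5]])
# A probabilistic classifier assigns the class with the highest probability.
# On a tie, argmax returns the first class.
labels = probabilities.argmax(axis=1)
print(labels)
```

A deterministic classifier would give you only something like `labels`, without the `probabilities` array behind it.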

Examples of binary classification tasks

Binary classification can be applied to real-life problems:

Classifying emails as spam or not spam
Classifying objects in an image between two classes (dog or cat)
Classifying a patient as having a certain disease or not
Identifying a customer as a returning or new customer
Determining whether a loan application will default or be repaid

3 Activation and loss functions for binary classification

As discussed before, in the binary classification you are given:

a set of samples {(x_1, y_1), (x_2, y_2),...,(x_n, y_n)}
x_i is an m-dimensional vector that contains features of sample i
y_i is the class to which x_i belongs
y_i can assume one of two values (0 - negative class, 1 - positive class).

To build a binary classification neural network as a probabilistic classifier we need:

an output linear layer with a size of 1
output values should be in the range [0,1]. The model outputs the probability p that the input sample belongs to the positive class. Note that if p is the probability that the input sample belongs to class 1 (positive class), then (1-p) is the probability that input belongs to class 0 (negative class)
a loss function that has the lowest values when the prediction and the ground truth are the same: (0,0) and (1,1)
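
The first two requirements can be sketched in a few lines of PyTorch. This is only an illustration under my own assumptions: the feature count `m = 4` and the random inputs are arbitrary, and `torch.sigmoid` (introduced in the next section) is used to squeeze the raw output into [0,1].

```python
import torch

torch.manual_seed(0)
m = 4                                  # number of input features (arbitrary for this sketch)
x = torch.rand(3, m)                   # 3 hypothetical samples
output_layer = torch.nn.Linear(m, 1)   # output layer of size 1
raw = output_layer(x)                  # raw values, shape [3, 1], unbounded
p = torch.sigmoid(raw)                 # probability of class 1, in [0, 1]
print(p.shape)
print(torch.cat([1 - p, p], dim=1))    # columns: P(class 0), P(class 1)
```

Each row of the printed tensor sums to 1, because the probability of class 0 is simply 1-p.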

3.1 The Sigmoid Activation Function

The final linear layer of a neural network outputs a vector of "raw output values". In the case of classification, the output values represent the model's confidence that the input belongs to one of the classes. As discussed before, the output layer needs to have a size of 1 and its output value should be converted into a probability p. To obtain the probability you can use the sigmoid activation function, which maps any real input to an output between 0 and 1. The sigmoid function is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

An example of input-output values for sigmoid is provided in the table below.

Input	-5	-4	-3	-2	-1	0	1	2	3	4	5
Output	0.007	0.018	0.047	0.119	0.269	0.5	0.731	0.881	0.953	0.982	0.993
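
The values in the table can be reproduced with a few lines of NumPy. The helper name `sigmoid_np` is my own and is used only in this sketch.

```python
import numpy as np

def sigmoid_np(x):
    # sigmoid(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

inputs = np.arange(-5, 6)
outputs = np.round(sigmoid_np(inputs), 3)
for x, s in zip(inputs, outputs):
    print(f'{x:>3} -> {s:.3f}')
```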

Let's plot this table with input values as the x-axis and output values as the y-axis to visualize the sigmoid function.

As you can see, sigmoid maps any input value into the range from 0 to 1, so we can use it for the binary classification task with an output layer of size 1.

3.2 Binary Cross-Entropy Loss

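The most common loss function for probabilistic binary classifiers is the binary cross-entropy loss, which is defined as

$$BCE = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$

where N is the number of input samples, y_i is the ground truth label of sample i, and p_i is the predicted probability for sample i.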

The table below shows the loss values when the ground truth is 1 and the prediction ranges from 0 to 1. From the table we can make several observations:

BCE loss becomes very large when the prediction is close to the opposite of the ground truth value
if the ground truth and the prediction have the same value, the loss is 0
log(0) is undefined (the loss goes to infinity), so in practice a very small value such as 0.0000001 (called epsilon) is used to keep the argument of the logarithm away from 0

ground truth	1	1	1	1	1	1
prediction	0	0.2	0.4	0.6	0.8	1
BCE loss	inf	1.609	0.916	0.511	0.223	0
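
The table can be checked with a short NumPy sketch. The function name `bce_np` and the clipping with `eps=1e-7` are my own choices for the illustration. With clipping, the infinite loss at prediction 0 becomes a large finite number (about 16.1).

```python
import numpy as np

def bce_np(y, p, eps=1e-7):
    # keep p inside [eps, 1 - eps] so that log never receives 0
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

predictions = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
losses = bce_np(np.ones_like(predictions), predictions)
print(np.round(losses, 3))
```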

Let's remove the sum from the equation and analyze the term inside:

$$-\left[y \log(p) + (1 - y)\log(1 - p)\right]$$

The plot of -log(x) below shows that, on the interval (0, 1], the function has its minimum value of 0 at x=1.

There are two things that can be observed from the plot and the formula:

If y=0 then the loss function is reduced to -log(1-p) and -log(1-p) has the minimum value when p=0 (the same value as the ground truth)
if y=1 then the loss function is reduced to -log(p) and -log(p) has the minimum value when p=1 (the same value as the ground truth)
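
Both observations can be verified numerically. The sketch below evaluates the per-sample term on a grid of probabilities and looks for the probability with the lowest loss. The grid bounds 0.01 and 0.99 are arbitrary and only keep both logarithms defined.

```python
import numpy as np

def bce_term(y, p):
    # per-sample BCE term: -(y*log(p) + (1-y)*log(1-p))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# probabilities strictly inside (0, 1) so both logarithms are defined
p_grid = np.linspace(0.01, 0.99, 99)
best_p_for_y0 = p_grid[np.argmin(bce_term(0, p_grid))]
best_p_for_y1 = p_grid[np.argmin(bce_term(1, p_grid))]
print(best_p_for_y0, best_p_for_y1)
```

The lowest loss is found at the end of the grid closest to the ground truth: the smallest p for y=0 and the largest p for y=1.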

These properties make BCE a natural loss function for binary classification problems.

4 Binary Classification NN example with PyTorch

Before heading to the code let's summarize what we need to implement a probabilistic binary classification NN:

ground truth and predictions should have dimensions [N,1] where N is the number of input samples
the final linear layer size should be 1
outputs from the final layer should be processed with sigmoid activation to obtain class probability
BCE loss should be applied to predicted class probabilities and ground truth values
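
The first requirement is easy to overlook. The sketch below (with made-up values and a hypothetical helper `bce_terms`) shows what can happen with a hand-written BCE when predictions have shape [N,1] but labels have shape [N]: broadcasting silently produces an [N,N] result instead of one loss term per sample.

```python
import torch

pred = torch.tensor([[0.9], [0.2], [0.7], [0.1]])   # model output, shape [4, 1]
y_flat = torch.tensor([1., 0., 1., 0.])             # labels, shape [4]
y_col = y_flat.unsqueeze(1)                         # labels, shape [4, 1]

# hand-written element-wise BCE term
def bce_terms(p, y):
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p))

print(bce_terms(pred, y_col).shape)   # one term per sample
print(bce_terms(pred, y_flat).shape)  # broadcast to a square matrix, not what we want
```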

Let's code a neural network for binary classification with the PyTorch framework.

First, install torchmetrics - this package will be used later to compute classification accuracy and confusion matrix.

# used for accuracy metric and confusion matrix
!pip install torchmetrics

Import packages that will be used later in the code

from sklearn.datasets import make_classification
import numpy as np
import torch
import torchmetrics

import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
from sklearn.decomposition import PCA

4.1 Dataset functions

Set global variable with the number of classes

number_of_classes=2

I will use sklearn.datasets.make_classification to generate a binary classification dataset:

n_samples - is the number of generated samples
n_features - sets the number of dimensions of generated samples X
n_classes - the number of classes in the generated dataset. In the binary classification problem, there should be only 2 classes

The generated dataset will have X with shape [n_samples, n_features] and Y with shape [n_samples, ].

def get_dataset(n_samples=10000, n_features=20, n_classes=2):
    # https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification
    data_X, data_y = make_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes,
                                         n_informative=n_classes, n_redundant=0, n_clusters_per_class=2,
                                         random_state=42,
                                         class_sep=2)
    return data_X, data_y

Define functions to visualize and print out dataset statistics.

The show_dataset function uses PCA to reduce the dimensionality of X from any number down to 2, so that X can be visualized with a 2D plot.

def print_dataset(X, y):
    print(f'X shape: {X.shape}, min: {X.min()}, max: {X.max()}')
    print(f'y shape: {y.shape}')
    print(y[:10])

def show_dataset(X, y, title=''):
    if X.shape[1] > 2:
        X_pca = PCA(n_components=2).fit_transform(X)
    else:
        X_pca = X
    fig = plt.figure(figsize=(4, 4))
    plt.scatter(x=X_pca[:, 0], y=X_pca[:, 1], c=y, alpha=0.5)
    # generate colors for all classes
    colors = plt.cm.rainbow(np.linspace(0, 1, number_of_classes))
    # iterate over classes and visualize them with the dedicated color
    for class_id in range(number_of_classes):
        class_mask = np.argwhere(y == class_id)
        X_class = X_pca[class_mask[:, 0]]
        plt.scatter(x=X_class[:, 0], y=X_class[:, 1],
                    c=np.full((X_class[:, 0].shape[0], 4), colors[class_id]),
                    label=class_id, alpha=0.5)
    plt.title(title)
    plt.legend(loc="best", title="Classes")
    plt.xticks()
    plt.yticks()
    plt.show()

Scale the dataset features X to range [0,1] with min max scaler. This is usually done for faster and more stable training.

def scale(x_in):
    return (x_in - x_in.min(axis=0))/(x_in.max(axis=0)-x_in.min(axis=0))
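
To see what the scaler does, here is the same function applied to a tiny made-up array (`X_toy` is my own example, not part of the dataset). Each column is scaled independently.

```python
import numpy as np

def scale(x_in):
    # min max scaling: each column is mapped to [0, 1] independently
    return (x_in - x_in.min(axis=0)) / (x_in.max(axis=0) - x_in.min(axis=0))

X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 200.0]])
print(scale(X_toy))
```

One caveat of this simple version: a column where all values are equal has max equal to min, so the division is by zero. The generated dataset used here does not have such columns.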

Let's print out the generated dataset statistics and visualize it with the functions from above.

X, y = get_dataset(n_classes=number_of_classes, n_features=2)
print('before scaling')
print_dataset(X, y)
show_dataset(X, y, 'before')

X_scaled = scale(X)
print('after scaling')
print_dataset(X_scaled, y)
show_dataset(X_scaled, y, 'after')

The outputs you should get are below.

before scaling
X shape: (10000, 2), min: -6.049090666105036, max: 5.311074029997754
y shape: (10000,)
[0 0 1 1 0 1 1 0 1 0]

after scaling
X shape: (10000, 2), min: 0.0, max: 1.0
y shape: (10000,)
[0 0 1 1 0 1 1 0 1 0]

As you can see, min max scaling does not distort the dataset features; it just rescales each of them into the range [0,1].

Create PyTorch data loaders. sklearn.datasets.make_classification generates the dataset as two numpy arrays. To create PyTorch dataloaders we need to transform the numpy dataset into torch.tensor first.

def get_data_loaders(dataset, batch_size=32, shuffle=True):
    data_X, data_y = dataset
    # https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset
    torch_dataset = torch.utils.data.TensorDataset(torch.tensor(data_X, dtype=torch.float32),
                                                   torch.tensor(data_y, dtype=torch.float32))
    # https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split
    train_dataset, val_dataset = torch.utils.data.random_split(torch_dataset, [int(len(torch_dataset)*0.8),
                                                                               int(len(torch_dataset)*0.2)],
                                                               torch.Generator().manual_seed(42))
    # https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
    loader_train = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle)
    loader_val = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=shuffle)
    return loader_train, loader_val
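
The 80/20 split above works because 10000 samples divide evenly. For a dataset size where int(n*0.8) + int(n*0.2) does not add up to n, random_split raises an error. A possible variation, sketched below with a made-up dataset of 1001 samples, is to compute the validation size as the remainder.

```python
import torch

n_samples = 1001   # hypothetical dataset size that does not split evenly
dataset = torch.utils.data.TensorDataset(torch.rand(n_samples, 2), torch.rand(n_samples))

n_train = int(n_samples * 0.8)
n_val = n_samples - n_train    # remainder, so the two lengths always sum to n_samples
train_dataset, val_dataset = torch.utils.data.random_split(
    dataset, [n_train, n_val], torch.Generator().manual_seed(42))
print(len(train_dataset), len(val_dataset))
```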

Test PyTorch data loaders

dataloader_train, dataloader_val = get_data_loaders(get_dataset(n_classes=number_of_classes), batch_size=32)
train_batch_0 = next(iter(dataloader_train))
print(f'Batches in the train dataloader: {len(dataloader_train)}, X: {train_batch_0[0].shape}, Y: {train_batch_0[1].shape}')
val_batch_0 = next(iter(dataloader_val))
print(f'Batches in the validation dataloader: {len(dataloader_val)}, X: {val_batch_0[0].shape}, Y: {val_batch_0[1].shape}')

The output:

Batches in the train dataloader: 250, X: torch.Size([32, 20]), Y: torch.Size([32])
Batches in the validation dataloader: 63, X: torch.Size([32, 20]), Y: torch.Size([32])

Create pre and postprocessing functions. As you may have noticed, the current Y shape is [N,], but we need it to be [N,1]. To do that we can expand the Y shape to [N,1] with numpy.expand_dims or torch.unsqueeze, depending on the type of Y.

def preprocessing(y):
    '''
    expand input labels shape [N,] to [N,1]
    input: y - [N,] numpy array or pytorch Tensor
    output: [N, 1] the same type as input
    '''
    assert type(y) == np.ndarray or torch.is_tensor(
        y), f'input should be numpy array or torch tensor. Received input is: {type(y)}'
    assert len(y.shape) == 1, f'input shape should be [N,]. Received input shape is: {y.shape}'
    if torch.is_tensor(y):
        return torch.unsqueeze(y, dim=1)
    else:
        return np.expand_dims(y, axis=1)

Postprocessing is simply thresholding the input values: if a value is greater than or equal to the threshold, it is set to 1, otherwise it is set to 0. Postprocessing is used to output class 0 or 1 based on the model's output probability.

def postprocessing(y, threshold=0.5):
    '''
    set input y with values greater than or equal to threshold to 1 and lower than threshold to 0
    input: y - [N,1] numpy array or pytorch Tensor
    output: int array [N,1] of the same type (numpy array or pytorch Tensor) as input
    '''
    assert type(y) == np.ndarray or torch.is_tensor(
        y), f'input should be numpy array or torch tensor. Received input is: {type(y)}'
    assert len(y.shape) == 2, f'input shape should be [N,classes]. Received input shape is: {y.shape}'
    if torch.is_tensor(y):
        return (y >= threshold).int()
    else:
        return (y >= threshold).astype(int)

Test the defined pre and postprocessing functions.

y = np.random.rand(10, )
y_preprocessed = preprocessing(y)
print(f'y shape: {y.shape}, y preprocessed shape: {y_preprocessed.shape}')
y_postprocessed = postprocessing(y_preprocessed, threshold=0.5)
print(f'y preprocessed shape: {y_preprocessed.shape},y postprocessed shape: {y_postprocessed.shape}')
print('Postprocessing sets array elements>=threshold to 1 and elements<threshold to 0:')
for i in range(10):
    print(f'\t{y_preprocessed[i, 0]:.2f} >> {y_postprocessed[i, 0]}')

The output:

y shape: (10,), y preprocessed shape: (10, 1)
y preprocessed shape: (10, 1),y postprocessed shape: (10, 1)
Postprocessing sets array elements>=threshold to 1 and elements<threshold to 0:
	0.81 >> 1
	0.67 >> 1
	0.66 >> 1
	0.10 >> 0
	0.39 >> 0
	0.50 >> 1
	0.54 >> 1
	0.06 >> 0
	0.92 >> 1
	0.93 >> 1
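
The threshold does not have to be 0.5. The sketch below applies the same thresholding rule to a few made-up probabilities with three different thresholds, to show how the number of positive predictions changes.

```python
import numpy as np

# hypothetical predicted probabilities for 5 samples
probs = np.array([[0.10], [0.35], [0.50], [0.65], [0.90]])
for threshold in (0.3, 0.5, 0.7):
    labels = (probs >= threshold).astype(int)
    print(threshold, labels[:, 0], 'positives:', labels.sum())
```

A higher threshold makes the classifier more conservative about predicting class 1.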

4.2 Creating and training Binary classification model

This section shows an implementation of all functions required to train a binary classification model.

4.2.1 Sigmoid activation

The PyTorch-based implementation of the sigmoid formula

def sigmoid(x):
  return 1/(1+torch.exp(-x))

Let's test sigmoid:

generate test_input as a PyTorch tensor in the range [-10, 10] with step 1
preprocess it - extend test_input shape from [21,] to [21,1]
process test_input with the implemented sigmoid function and the PyTorch implementation torch.nn.functional.sigmoid
compare the results (they should match)
plot the sigmoid output against test_input

test_input = torch.arange(-10, 11, 1, dtype=torch.float32)
test_input = preprocessing(test_input)
sigmoid_output = sigmoid(test_input)
print(f'Input data shape: {test_input.shape}')
print(f'input data range: [{test_input.min():.3f}, {test_input.max():.3f}]')
print(f'sigmoid output data range: [{sigmoid_output.min():.3f}, {sigmoid_output.max():.3f}]')
print(test_input[:2])
print(sigmoid_output[:2])
# compare the sigmoid implementation with pytorch implementation
torch_sigmoid_output = torch.nn.functional.sigmoid(test_input)
# allclose tolerates tiny floating-point differences between the two implementations
print(f'sigmoid output is the same with pytorch implementation: {torch.allclose(torch_sigmoid_output, sigmoid_output)}')

fig = plt.figure(figsize=(4, 2), facecolor=(0.0, 1.0, 0.0))

ax = fig.add_subplot(1, 1, 1)
ax.plot(test_input, sigmoid_output, color='red')
ax.set_ylim([0, 1])
ax.set_title('sigmoid')
ax.set_facecolor((0.0, 1.0, 0.0))
fig.show()

The output of the code above:

Input data shape: torch.Size([21, 1])
input data range: [-10.000, 10.000]
sigmoid output data range: [0.000, 1.000]
tensor([[-10.],
        [ -9.]])
tensor([[4.5398e-05],
        [1.2339e-04]])
sigmoid output is the same with pytorch implementation: True

4.2.2 Loss function: Binary-cross-entropy

The PyTorch-based implementation of the BCE formula

To make sure that the argument of log is never 0, use torch.clamp with min=epsilon and max=1-epsilon.

def binary_cross_entropy(pred, y):
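    # log(0)=-inf
    # to prevent that clamp NN output values into [eps, 1-eps] values
    # eps must be large enough that 1 - eps is still below 1 in float32
    # (1 - 1e-8 rounds to exactly 1.0 in float32, so 1e-7 is used here)
    eps = 1e-7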
    pred = torch.clamp(pred, min=eps, max=1 - eps)
    loss = -y * torch.log(pred) - (1 - y) * torch.log(1 - pred)
    return loss.mean()
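
To see why the clamp matters, the sketch below compares the same formula with and without it on saturated predictions. The function `bce_sketch` and its optional `eps` argument are my own variation written for this comparison.

```python
import torch

def bce_sketch(pred, y, eps=None):
    # eps=None skips clamping to show what goes wrong without it
    if eps is not None:
        pred = torch.clamp(pred, min=eps, max=1 - eps)
    return (-y * torch.log(pred) - (1 - y) * torch.log(1 - pred)).mean()

pred = torch.tensor([[0.0], [1.0]])   # saturated predictions
y = torch.tensor([[1.0], [1.0]])      # ground truth
print(bce_sketch(pred, y))            # not finite: log(0) appears in the computation
print(bce_sketch(pred, y, eps=1e-7))  # finite
```

Note that the second sample is a correct prediction (p=1, y=1), yet without clamping it still breaks the loss, because the (1-y)*log(1-p) part evaluates 0 * log(0).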

Test BCE implementation:

generate test_input, a tensor with shape [10,1] and values in the range [0,1), with torch.rand
threshold test_input to set all values to 0 or 1 and use the result as ground truth values
compute the loss with the implemented binary_cross_entropy function and the PyTorch implementation torch.nn.functional.binary_cross_entropy
compare the results (they should match)

test_input = torch.rand(10, 1, dtype=torch.float32)
# get "ground truth" for test input by thresholding test_input
test_input_gt = postprocessing(test_input).float()
print(f'test input shape: {test_input.shape}, gt shape: {test_input_gt.shape}')
print(f'test_input range: [{test_input.min().numpy():.2f}, {test_input.max().numpy():.2f}]')
print(f'test_input gt range: [{test_input_gt.min().numpy()}, {test_input_gt.max().numpy()}]')
# get loss with the binary_cross_entropy implementation
loss = binary_cross_entropy(test_input, test_input_gt)
# get loss with pytorch binary_cross_entropy implementation
loss_pytorch = torch.nn.functional.binary_cross_entropy(test_input, test_input_gt)
print(f'loss outputs are the same: {torch.isclose(loss, loss_pytorch).item()}')

The expected output

test input shape: torch.Size([10, 1]), gt shape: torch.Size([10, 1])
test_input range: [0.02, 0.80]
test_input gt range: [0.0, 1.0]
loss outputs are the same: True
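
As a side note, PyTorch also provides torch.nn.functional.binary_cross_entropy_with_logits, which takes the raw output of the final linear layer and applies sigmoid internally. This article keeps sigmoid and BCE separate for clarity. The sketch below, on made-up logits and targets, only checks that the two routes give the same loss up to floating-point tolerance.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(10, 1)                       # raw outputs of a final linear layer (made up)
targets = (torch.rand(10, 1) >= 0.5).float()      # made-up ground truth

loss_from_probs = F.binary_cross_entropy(torch.sigmoid(logits), targets)
loss_from_logits = F.binary_cross_entropy_with_logits(logits, targets)
print(torch.allclose(loss_from_probs, loss_from_logits, atol=1e-6))
```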

4.2.3 Accuracy metric

I will use torchmetrics implementation to compute accuracy based on model predictions and ground truth.

To create binary classification accuracy metric two parameters are required:

task="binary"
the threshold value that will be used to threshold model predictions

# https://torchmetrics.readthedocs.io/en/stable/classification/accuracy.html#module-interface
accuracy_metric=torchmetrics.classification.Accuracy(task="binary", threshold=0.5)

def compute_accuracy(y_pred, y):
  return accuracy_metric(y_pred, y)
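
For intuition, binary accuracy is just the fraction of thresholded predictions that equal the labels. The sketch below is a hand-written version with a hypothetical name `accuracy_sketch`. It is not the torchmetrics implementation, and the handling of values exactly at the threshold may differ from it.

```python
import torch

def accuracy_sketch(y_pred, y, threshold=0.5):
    # fraction of samples whose thresholded prediction equals the label
    y_label = (y_pred >= threshold).int()
    return (y_label == y.int()).float().mean()

y_pred = torch.tensor([[0.9], [0.2], [0.6], [0.4]])
y_true = torch.tensor([[1.0], [0.0], [0.0], [0.0]])
print(accuracy_sketch(y_pred, y_true))  # 3 of 4 predictions are correct
```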

4.2.4 NN model

The NN used in this example is a small fully connected network: an input layer and two hidden layers, all followed by ReLU activation, and a final output layer of size 1. The final layer uses the activation function passed to the class constructor (here, the sigmoid activation function that was implemented before).

class ClassifierNN(torch.nn.Module):
    def __init__(self, loss_function, activation_function, input_dims=2, output_dims=1):
        super().__init__()
        self.linear1 = torch.nn.Linear(input_dims, input_dims * 4)
        self.linear2 = torch.nn.Linear(input_dims * 4, input_dims * 8)
        self.linear3 = torch.nn.Linear(input_dims * 8, input_dims * 4)
        self.output = torch.nn.Linear(input_dims * 4, output_dims)
        self.loss_function = loss_function
        self.activation_function = activation_function

    def forward(self, x):
        x = torch.nn.functional.relu(self.linear1(x))
        x = torch.nn.functional.relu(self.linear2(x))
        x = torch.nn.functional.relu(self.linear3(x))
        x = self.activation_function(self.output(x))
        return x
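
To get a feeling for the size of this network, the sketch below builds the same layer stack with torch.nn.Sequential (my own rewrite for illustration, using the built-in torch.nn.Sigmoid instead of the ClassifierNN class) and counts its parameters for 20 input features.

```python
import torch

input_dims = 20   # number of features in the generated dataset
sketch = torch.nn.Sequential(
    torch.nn.Linear(input_dims, input_dims * 4), torch.nn.ReLU(),
    torch.nn.Linear(input_dims * 4, input_dims * 8), torch.nn.ReLU(),
    torch.nn.Linear(input_dims * 8, input_dims * 4), torch.nn.ReLU(),
    torch.nn.Linear(input_dims * 4, 1), torch.nn.Sigmoid(),
)
n_params = sum(p.numel() for p in sketch.parameters())
print(n_params)
print(sketch(torch.rand(5, input_dims)).shape)
```

Each linear layer contributes in_features * out_features weights plus out_features biases: 1680 + 12960 + 12880 + 81 = 27601 parameters in total.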

4.2.5 Train model for a single epoch

The figure above depicts the binary classification training logic for a single batch. Later, the train_epoch function will be called once per epoch for the chosen number of epochs.

def train_epoch(model, optimizer, dataloader_train):
    # set the model to the training mode
    # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.train
    model.train()
    losses = []
    accuracies = []
    for step, (X_batch, y_batch) in enumerate(dataloader_train):
        ### forward propagation
        # get model output and use loss function
        y_pred = model(X_batch) # get class probabilities with shape [N,1]
        # apply loss function on predicted probabilities and ground truth
        loss = model.loss_function(y_pred, y_batch)

        ### backward propagation
        # set gradients to zero before backpropagation
        # https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html
        optimizer.zero_grad()
        # compute gradients
        # https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html
        loss.backward()
        # update weights
        # https://pytorch.org/docs/stable/optim.html#taking-an-optimization-step
        optimizer.step()  # update model weights
        # calculate batch accuracy
        acc = compute_accuracy(y_pred, y_batch)
        # append batch loss and accuracy to corresponding lists for later use
        accuracies.append(float(acc))
        losses.append(loss.item())
    # compute average epoch accuracy
    train_acc = np.array(accuracies).mean()
    # compute average epoch loss
    loss_epoch = np.array(losses).mean()
    return train_acc, loss_epoch
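
The same forward / zero_grad / backward / step cycle can be tried in isolation on a toy problem. Everything in the sketch below is an assumption made for illustration: the data is made up, the model is a single linear layer with the built-in torch.nn.Sigmoid and torch.nn.BCELoss, and plain SGD is used on the full dataset instead of mini-batches.

```python
import torch

torch.manual_seed(0)
# toy data (made up): the label is 1 when the two features sum to more than 1
X = torch.rand(200, 2)
y = (X.sum(dim=1, keepdim=True) > 1).float()

model = torch.nn.Sequential(torch.nn.Linear(2, 1), torch.nn.Sigmoid())
loss_fn = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

losses = []
for step in range(100):
    y_pred = model(X)            # forward pass, probabilities with shape [200, 1]
    loss = loss_fn(y_pred, y)    # BCE between probabilities and labels
    optimizer.zero_grad()        # clear gradients from the previous step
    loss.backward()              # compute gradients
    optimizer.step()             # update weights
    losses.append(loss.item())
print(f'first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}')
```

The loss after 100 steps should be lower than the loss at the first step, which is the basic sanity check for any training loop.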

4.2.6 Evaluate the model with the provided data loader

The evaluate function iterates over the provided PyTorch dataloader, computes the loss and accuracy for each batch, and returns the average accuracy and the average loss.

def evaluate(model, dataloader_in):
    # set the model to the evaluation mode
    # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.eval
    model.eval()

    losses = []
    accuracies = []
    # disable gradient calculation for evaluation
    # https://pytorch.org/docs/stable/generated/torch.no_grad.html
    with torch.no_grad():
        for step, (X_batch, y_batch) in enumerate(dataloader_in):
            # get predictions
            y_pred = model(X_batch)
            # calculate loss
            loss = model.loss_function(y_pred, y_batch)
            # calculate batch accuracy
            acc = compute_accuracy(y_pred, y_batch)
            accuracies.append(float(acc))
            losses.append(loss.item())
    # compute average accuracy
    val_acc = np.array(accuracies).mean()
    # compute average loss
    loss_epoch = np.array(losses).mean()
    return val_acc, loss_epoch
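
One detail worth knowing: averaging per-batch accuracies is not exactly the same as the accuracy over all samples when the last batch is smaller (with 2000 validation samples and batch size 32, the last batch has 16 samples). The numbers in the sketch below are made up to exaggerate the effect.

```python
import numpy as np

# hypothetical per-batch results: the last batch is smaller than the others
batch_sizes = np.array([32, 32, 16])
batch_accuracies = np.array([1.0, 1.0, 0.5])

mean_of_batches = batch_accuracies.mean()
sample_weighted = (batch_accuracies * batch_sizes).sum() / batch_sizes.sum()
print(f'{mean_of_batches:.4f} {sample_weighted:.4f}')
```

In practice the difference is small when there are many batches, so the simple mean is usually good enough for monitoring training.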

4.2.7 Get predictions for the provided dataloader

The predict function iterates over the provided dataloader, collects post-processed model predictions, ground truth values, and input samples into PyTorch tensors, and returns all three. Predictions and ground truth have shape [N,1]. Later this function will be used to compute the confusion matrix and visualize predictions.

def predict(model, dataloader):
    # set the model to the evaluation mode
    # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.eval
    model.eval()
    xs, ys = next(iter(dataloader))
    y_pred = torch.empty([0, ys.shape[1]])
    x = torch.empty([0, xs.shape[1]])
    y = torch.empty([0, ys.shape[1]])
    # disable gradient calculation for evaluation
    # https://pytorch.org/docs/stable/generated/torch.no_grad.html
    with torch.no_grad():
        for step, (X_batch, y_batch) in enumerate(dataloader):
            # get predictions
            y_batch_pred = model(X_batch)
            y_pred = torch.cat([y_pred, y_batch_pred])
            y = torch.cat([y, y_batch])
            x = torch.cat([x, X_batch])
            # print(y_pred.shape, y.shape)
    y_pred = postprocessing(y_pred)
    y = postprocessing(y)
    return y_pred, y, x

4.2.8 Model training for the given number of epochs

To train the model we just need to call the train_epoch function N times, where N is the number of epochs. The evaluate function is called to log the current model accuracy on the validation dataset. Finally, the best model is updated based on the validation accuracy. The model_train function returns the best validation accuracy and the training history.

def model_train(model, optimizer, dataloader_train, dataloader_val, n_epochs=50):
    best_acc = 0
    best_weights = None
    history = {'loss': {'train': [], 'validation': []},
               'accuracy': {'train': [], 'validation': []}}
    for epoch in range(n_epochs):
        # train on dataloader_train
        acc_train, loss_train = train_epoch(model, optimizer, dataloader_train)
        # evaluate on dataloader_val
        acc_val, loss_val = evaluate(model, dataloader_val)
        print(f'Epoch: {epoch} | Accuracy: {acc_train:.3f} / {acc_val:.3f} | ' +
              f'loss: {loss_train:.5f} / {loss_val:.5f}')
        # save epoch losses and accuracies in history dictionary
        history['loss']['train'].append(loss_train)
        history['loss']['validation'].append(loss_val)
        history['accuracy']['train'].append(acc_train)
        history['accuracy']['validation'].append(acc_val)
        # Save the best validation accuracy model
        if acc_val >= best_acc:
            print(f'\tBest weights updated. Old accuracy: {best_acc:.4f}. New accuracy: {acc_val:.4f}')
            best_acc = acc_val
            torch.save(model.state_dict(), 'best_weights.pt')
    # restore model and return best accuracy
    model.load_state_dict(torch.load('best_weights.pt'))
    return best_acc, history
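
If you prefer not to write a file, a possible alternative is to keep the best weights in memory. The sketch below is not the approach used in model_train. It only shows the idea on a tiny stand-in model, with copy.deepcopy used to take the snapshot.

```python
import copy
import torch

torch.manual_seed(0)
model = torch.nn.Linear(2, 1)
# snapshot of the current weights; deepcopy is needed because state_dict()
# returns references to the live parameter tensors
best_weights = copy.deepcopy(model.state_dict())

# simulate further training that changes the weights
with torch.no_grad():
    model.weight.add_(1.0)

model.load_state_dict(best_weights)   # restore the snapshot
print(torch.equal(model.weight, best_weights['weight']))
```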

4.2.9 Plot training history

def plot_history(history):
    fig = plt.figure(figsize=(8, 4), facecolor=(0.0, 1.0, 0.0))
    ax = fig.add_subplot(1, 2, 1)
    ax.plot(np.arange(0, len(history['loss']['train'])), history['loss']['train'], color='red', label='train')
    ax.plot(np.arange(0, len(history['loss']['validation'])), history['loss']['validation'], color='blue',
            label='validation')
    ax.set_title('Loss history')
    ax.set_facecolor((0.0, 1.0, 0.0))
    ax.legend()
    ax = fig.add_subplot(1, 2, 2)
    ax.plot(np.arange(0, len(history['accuracy']['train'])), history['accuracy']['train'], color='red', label='train')
    ax.plot(np.arange(0, len(history['accuracy']['validation'])), history['accuracy']['validation'], color='blue',
            label='validation')
    ax.set_title('Accuracy history')
    ax.legend()
    fig.tight_layout()
    ax.set_facecolor((0.0, 1.0, 0.0))
    fig.show()

4.3 Get the dataset, create the model, and train it

Let's put everything together and train the binary classification model.

#########################################
# Get the dataset
X, y = get_dataset(n_classes=number_of_classes)
print(f'Generated dataset shape. X:{X.shape}, y:{y.shape}')
# change y numpy array shape from [N,] to [N, 1] for binary classification
y = preprocessing(y)
print(f'Dataset shape prepared for binary classification with sigmoid  activation and BCE loss.')
print(f'X:{X.shape}, y:{y.shape}')
# Get train and validation dataloaders
dataloader_train, dataloader_val = get_data_loaders(dataset=(scale(X), y), batch_size=32)

# get a batch from dataloader and output input and output shape
X_0, y_0 = next(iter(dataloader_train))
print(f'Model input data shape: {X_0.shape}, output (ground truth) data shape: {y_0.shape}')

#########################################
# Create ClassifierNN for binary classification problem
# input dims: [N, features]
# output dims: [N, 1]
# activation - sigmoid to output probability p in range [0,1]
# loss - binary cross-entropy
model = ClassifierNN(loss_function=binary_cross_entropy,
                     activation_function=sigmoid,
                     input_dims=X.shape[1],
                     output_dims=y.shape[1])

#########################################
# create optimizer and train the model on the dataset
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
print(f'Model size: {sum([x.reshape(-1).shape[0] for x in model.parameters()])} parameters')
print('#' * 10)
print('Start training')
acc, history = model_train(model, optimizer, dataloader_train, dataloader_val, n_epochs=20)
print('Finished training')
print('#' * 10)
print("Model accuracy: %.2f%%" % (acc * 100))
plot_history(history)

The expected output should be similar to the one provided below.

Generated dataset shape. X:(10000, 20), y:(10000,)
Dataset shape prepared for binary classification with sigmoid  activation and BCE loss.
X:(10000, 20), y:(10000, 1)
Model input data shape: torch.Size([32, 20]), output (ground truth) data shape: torch.Size([32, 1])
Model size: 27601 parameters
##########
Start training
Epoch: 0 | Accuracy: 0.690 / 0.952 | loss: 0.65095 / 0.53560
	Best weights updated. Old accuracy: 0.0000. New accuracy: 0.9524
Epoch: 1 | Accuracy: 0.956 / 0.970 | loss: 0.33146 / 0.18328
	Best weights updated. Old accuracy: 0.9524. New accuracy: 0.9702
Epoch: 2 | Accuracy: 0.965 / 0.973 | loss: 0.14162 / 0.11417
	Best weights updated. Old accuracy: 0.9702. New accuracy: 0.9732
Epoch: 3 | Accuracy: 0.970 / 0.975 | loss: 0.10551 / 0.09519
	Best weights updated. Old accuracy: 0.9732. New accuracy: 0.9752
Epoch: 4 | Accuracy: 0.972 / 0.976 | loss: 0.09295 / 0.09127
	Best weights updated. Old accuracy: 0.9752. New accuracy: 0.9762
Epoch: 5 | Accuracy: 0.974 / 0.977 | loss: 0.08666 / 0.08467
	Best weights updated. Old accuracy: 0.9762. New accuracy: 0.9772
Epoch: 6 | Accuracy: 0.976 / 0.977 | loss: 0.08243 / 0.08312
	Best weights updated. Old accuracy: 0.9772. New accuracy: 0.9772
Epoch: 7 | Accuracy: 0.977 / 0.979 | loss: 0.07981 / 0.08914
	Best weights updated. Old accuracy: 0.9772. New accuracy: 0.9787
Epoch: 8 | Accuracy: 0.977 / 0.981 | loss: 0.07876 / 0.08224
	Best weights updated. Old accuracy: 0.9787. New accuracy: 0.9807
Epoch: 9 | Accuracy: 0.978 / 0.979 | loss: 0.07692 / 0.08362
Epoch: 10 | Accuracy: 0.979 / 0.979 | loss: 0.07478 / 0.07739
Epoch: 11 | Accuracy: 0.980 / 0.980 | loss: 0.07375 / 0.07708
Epoch: 12 | Accuracy: 0.980 / 0.980 | loss: 0.07253 / 0.07613
Epoch: 13 | Accuracy: 0.981 / 0.979 | loss: 0.07119 / 0.07788
Epoch: 14 | Accuracy: 0.982 / 0.982 | loss: 0.07148 / 0.07483
	Best weights updated. Old accuracy: 0.9807. New accuracy: 0.9816
Epoch: 15 | Accuracy: 0.982 / 0.981 | loss: 0.06973 / 0.07474
Epoch: 16 | Accuracy: 0.981 / 0.982 | loss: 0.06900 / 0.07401
	Best weights updated. Old accuracy: 0.9816. New accuracy: 0.9821
Epoch: 17 | Accuracy: 0.982 / 0.979 | loss: 0.06850 / 0.08130
Epoch: 18 | Accuracy: 0.982 / 0.980 | loss: 0.06796 / 0.07966
Epoch: 19 | Accuracy: 0.982 / 0.981 | loss: 0.06714 / 0.07458
Finished training
##########
Model accuracy: 98.21%

4.4 Evaluate the model

acc_train, _ = evaluate(model, dataloader_train)
acc_validation, _ = evaluate(model, dataloader_val)
print(f'Accuracy - Train: {acc_train:.4f} | Validation: {acc_validation:.4f}')

Accuracy - Train: 0.9816 | Validation: 0.9816

val_preds, val_y, _ = predict(model, dataloader_val)
print(val_preds.shape, val_y.shape)
binary_confusion_matrix = torchmetrics.classification.ConfusionMatrix('binary')
cm = binary_confusion_matrix(val_preds, val_y)
print(cm)

df_cm = pd.DataFrame(cm.numpy())
plt.figure(figsize=(6, 5), facecolor=(0.0,1.0,0.0))
sn.heatmap(df_cm, annot=True, fmt='d')
plt.show()
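
To read the confusion matrix, it helps to build one by hand. The sketch below uses made-up labels and assumes the layout where rows are the true class and columns are the predicted class (I believe this matches torchmetrics, but it is worth checking against its documentation).

```python
import numpy as np

# made-up labels for 8 samples
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1          # row = true class, column = predicted class
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()
print(cm)
print(f'accuracy: {accuracy:.3f}')
```

The diagonal holds the correct predictions (true negatives and true positives), so accuracy is the sum of the diagonal divided by the total number of samples.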

val_preds, val_y, val_x = predict(model, dataloader_val)
show_dataset(val_x.numpy(), postprocessing(val_y).numpy(), 'Ground Truth')
show_dataset(val_x.numpy(), postprocessing(val_preds).numpy(), 'Predictions')

Conclusion

Binary classification is a foundation for many deep learning tasks. A binary classification neural network typically uses sigmoid activation on a final layer of size 1 together with binary cross-entropy loss. If you understand how these two functions work, you will be able to understand not only classification NN models but also more complicated NN architectures.