In the last decade, Artificial Intelligence (AI) has stepped firmly into the public spotlight, in large part owing to advances in Machine Learning (ML) and Artificial Neural Networks (ANNs).
But with promising new technologies comes a whole lot of buzz, and there is now an overwhelming amount of noise in the field. That’s why I thought it would be useful to get back to basics and actually implement a single neuron from scratch using Python.
Before we dive in, I just wanted to quickly talk about what a neuron is in the first place. Early proponents of artificial intelligence noticed that the biological neuron was capable of conceptualizing and learning from large volumes of data, and postulated that modelling this neuron in a machine might allow for a similar capability.¹ To that end, the neuron was abstracted into a model of inputs, outputs, and weights.
In machine learning terminology, each input to the neuron (x1, x2, … xn) is known as a feature, and each feature is weighted with a number to represent the strength of that input (w1j, w2j, … wnj). The weighted sum of inputs (netj) then passes through an activation function, whose general purpose is to model the “firing rate” of a biological neuron by converting the weighted sum into a new number according to a formula. Although we don’t need to go into the mechanics of the activation function just yet, here is some reading material by Avinash Sharma V for those who are curious.
So if that’s how a neuron works, let’s look at how it learns. In simple terms, training a neuron refers to iteratively updating the weights associated with each of its inputs so that it can progressively approximate the underlying relationship in the dataset it’s been given. Once properly trained, a neuron can be used to do things like correctly sort entirely new samples — say, images of cats and dogs — into separate buckets, just like people can. In machine learning terminology, this is known as classification.
To train a simple classifier, let’s use the publicly available sklearn breast cancer dataset, which has the following properties:
| Property | Value |
| --- | --- |
| Classes | 2 |
| Num Samples | 569 |
| Num Benign | 357 |
| Num Malignant | 212 |
Each sample in the dataset is an image of a breast mass that has been translated into a set of 30 numbers (features). Using a portion of samples to train our neuron, we will see if it is able to categorize the unseen portion of breast masses as either malignant or benign. In other words, we need to perform a supervised learning task, using explicitly labelled data points as teachers for the neuron to learn relevant patterns.
To run and modify the code below, check out the script here: single-layer-perceptron.py
First we load the dataset and randomly shuffle the malignant and benign samples together while keeping each individual sample labelled. This is because we do not want our neuron to draw conclusions based on the order of samples it sees—only on the features of each individual sample.
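As a rough sketch of this step (not necessarily how the linked script does it), the samples and labels can be shuffled with one shared permutation so that each sample keeps its label. The helper names `load_data` and `shuffle_together` and the seed are assumptions made for illustration; `load_breast_cancer` is the scikit-learn loader for this dataset.

```python
import numpy as np


def load_data():
    """Load the sklearn breast cancer dataset as (features, labels)."""
    # Imported here so the shuffle helper below works without scikit-learn.
    from sklearn.datasets import load_breast_cancer
    features, labels = load_breast_cancer(return_X_y=True)
    # In sklearn's encoding, label 0 is malignant and label 1 is benign.
    return features, labels.reshape(-1, 1)


def shuffle_together(features, labels, seed=0):
    """Shuffle samples and labels with one shared permutation."""
    rng = np.random.default_rng(seed)  # hypothetical seed, for repeatability
    order = rng.permutation(len(features))
    return features[order], labels[order]


# Tiny stand-in data: each label equals its row index, so pairing is easy to check.
demo_features = np.arange(12).reshape(6, 2)
demo_labels = np.arange(6).reshape(6, 1)
shuffled_features, shuffled_labels = shuffle_together(demo_features, demo_labels)
```

On the real data, the same helper would be called as `shuffle_together(*load_data())`.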
To train our neuron, we basically need to do three things:
1. Ask the neuron to classify a sample.
2. Update the neuron’s weights based on how wrong the prediction is.
3. Repeat.
Since a neuron is essentially just a collection of weights, we can use Python’s matrix manipulation package numpy to randomly initialize a vector of weights.² The number of weights initialized corresponds to the number of features (inputs) to the neuron, as each feature is weighted before being summed. The activation is a static function, and so does not need a specific representation in software.
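A minimal sketch of that initialization is shown here. The uniform range of [-1, 1) and the seed are assumptions chosen for illustration; the text only says the weights are random.

```python
import numpy as np

n_features = 30  # one weight per input feature in the breast cancer dataset
rng = np.random.default_rng(seed=42)  # the seed is an arbitrary choice

# Column vector of weights drawn uniformly from [-1, 1).
weights = rng.uniform(-1.0, 1.0, size=(n_features, 1))
```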
In the first step of training, we ask the neuron to make a prediction about the training samples. This is known as a forward pass, and it involves taking a weighted sum of the input features and passing that sum through the activation function.
l0 in the above code snippet is a matrix of features with the shape (n_samples * n_features). The weights representing our single neuron are of shape (n_features * 1). Therefore, a matrix multiplication between these two matrices will provide a weighted sum of all features for each sample. (If you try this out by hand, you’ll see it’s not as complicated as it sounds.)
When passed through an activation function, these weighted sums will effectively become class predictions for each training sample.
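To make the shapes concrete, here is a small sketch of a forward pass on made-up numbers, using the sigmoid activation introduced next. The function name `forward` and the toy values are assumptions for illustration only.

```python
import numpy as np


def sigmoid(x):
    """Standard logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def forward(l0, weights):
    """Weighted sum of each sample's features, squashed into (0, 1)."""
    net = l0 @ weights   # (n_samples, n_features) @ (n_features, 1) -> (n_samples, 1)
    return sigmoid(net)  # one prediction per sample


# Three toy samples with two features each.
l0 = np.array([[1.0, 2.0], [0.0, 0.0], [-1.0, 3.0]])
weights = np.array([[0.5], [-0.25]])
l1 = forward(l0, weights)
```

The first two samples have a weighted sum of 0, so their prediction sits exactly at 0.5; the third has a negative weighted sum, so its prediction falls below 0.5.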
The sigmoid function is a special case of the logistic function, and is chosen here as our activation function for a few reasons: it is easily differentiable, non-linear, and bounded, with the following shape and definition:
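For reference, the standard logistic (sigmoid) function, written with the same f(x) notation used below, is:

```latex
f(x) = \frac{1}{1 + e^{-x}}, \qquad 0 < f(x) < 1
```

It approaches 0 for large negative x, approaches 1 for large positive x, and equals 0.5 at x = 0.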
The function is implemented with a single line as follows:
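A minimal numpy sketch of such a one-line implementation could look like this (the function name `sigmoid` is an assumption):

```python
import numpy as np


def sigmoid(x):
    """Standard logistic function, applied element-wise to scalars or arrays."""
    return 1.0 / (1.0 + np.exp(-x))
```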
Conveniently, a standard logistic function has an easily calculated derivative of the form
where f(x) represents the logistic function. As we will see shortly, this property is extremely useful when trying to minimize the error in our neuron’s predictions.
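Written out, the derivative property of the standard logistic function described here is:

```latex
f'(x) = f(x)\,\bigl(1 - f(x)\bigr)
```

In other words, once the output f(x) has been computed in the forward pass, its derivative costs only one subtraction and one multiplication.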
The derivative of the sigmoid function is implemented as follows:
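One possible sketch is shown here, along with a finite-difference check that the formula holds. This version takes the raw weighted sum `x` as its argument; an implementation that instead takes the already-activated output would simply return `output * (1 - output)`. The function names are assumptions.

```python
import numpy as np


def sigmoid(x):
    """Standard logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid with respect to its input x."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Compare against a central finite difference at a few points.
x = np.array([-2.0, 0.0, 1.5])
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
```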
Now here comes the fun (and tricky) part — actually having our neuron learn the underlying relationship in the dataset. Now that we have bounded predictions for each of the training samples, we can calculate the error/loss in these predictions and update our neuron weights proportional to this error.
We will use the gradient descent optimization algorithm for this weight update. In order to use this algorithm, we need an error function to represent the gap between our neuron’s predictions and ground truth. This error function is defined to be a scaled version of mean squared error (scaling makes differentiation easy).
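One common way to write such a scaled mean squared error is given below. This exact form is an assumption, with N samples, targets t_k, and neuron outputs o_k as symbols introduced here for illustration:

```latex
E = \frac{1}{2N} \sum_{k=1}^{N} \left( t_k - o_k \right)^2
```

The factor of 1/2 cancels the 2 that appears when the square is differentiated, which is why the scaling makes differentiation easy.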
In code, this mean-squared-error function is implemented as follows:
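A sketch of one such implementation is shown here, assuming the scaling factor is 1/2 (a common choice, but an assumption about the original). The function name `mse_loss` and the toy values are illustrative.

```python
import numpy as np


def mse_loss(predictions, targets):
    """Half the mean squared error between predictions and targets."""
    return 0.5 * np.mean((predictions - targets) ** 2)


# Toy example: three predictions against their ground-truth labels.
predictions = np.array([[0.9], [0.2], [0.6]])
targets = np.array([[1.0], [0.0], [1.0]])
loss = mse_loss(predictions, targets)
```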
A gradient, in mathematics, is the derivative vector of a function that is dependent on more than one variable. Recall that vectors have both a magnitude and a direction. Our neuron’s error is dependent on all of the weights feeding into it. So, the gradient is the set of partial derivatives of error with respect to each of the weights.
As a vector, the gradient points in the direction of the greatest rate of increase of a function. Moving in the opposite direction to the gradient, therefore, decreases the function as quickly as possible, at least locally. If we are able to calculate the gradient of the error function of our neuron relative to each of its weights, we can update the weights proportionally to reduce the error. Think of the error function as a surface with ridges and valleys. By descending opposite to the gradient, we are moving into the valley where error is lower.
Below is a simple derivation of the gradient of the error function using the chain rule. A more rigorous derivation can be found here.
Here, E is the error function, wij is one particular weight, oj is the output of a neuron, and netj is the weighted sum of inputs to a neuron. Indices i and j correspond to the weight and the neuron, respectively. We will calculate each of the factors of the partial derivative separately.
The first factor is simple, and is the derivative of error with respect to the neuron’s output:
The second factor is also straightforward, and is the derivative of the sigmoid function we described above in Figure 4.
The third and final factor simplifies to equal the inputs to a particular neuron.
In Figure 9, oi is the vector of inputs to that neuron, which in our case are the features from our training set.
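Collecting the chain rule and its three factors in one place gives the following. This assumes a per-sample error of E = ½(t_j − o_j)², where the target t_j is a symbol introduced here for illustration:

```latex
\frac{\partial E}{\partial w_{ij}}
  = \frac{\partial E}{\partial o_j}\,
    \frac{\partial o_j}{\partial \mathrm{net}_j}\,
    \frac{\partial \mathrm{net}_j}{\partial w_{ij}}

\frac{\partial E}{\partial o_j} = o_j - t_j, \qquad
\frac{\partial o_j}{\partial \mathrm{net}_j} = o_j\,(1 - o_j), \qquad
\frac{\partial \mathrm{net}_j}{\partial w_{ij}} = o_i

\frac{\partial E}{\partial w_{ij}} = (o_j - t_j)\; o_j\,(1 - o_j)\; o_i
```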
Tying the partial derivatives we just saw together with gradient descent gives us a rule for updating the weights representing our neuron:
Figure 10 shows that each weight will be updated in the negative direction of the gradient, proportional to an additional term, η. A scaling factor, η determines how large a step we take when updating neuron weights, effectively controlling the rate at which the neuron learns. We call this the learning rate.
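Writing the learning rate as η and assuming the per-sample error E = ½(t_j − o_j)² (with t_j, the target, introduced here for illustration), the update rule described above can be written as:

```latex
\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}}
              = -\eta\, (o_j - t_j)\; o_j\,(1 - o_j)\; o_i
```

A larger η takes bigger steps and learns faster but risks overshooting the valley; a smaller η is more stable but needs more updates.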
Below is the implementation of the gradient calculation and weight update for our single neuron. You can follow the comments to find each step of the derivatives required for the weight update rule.
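The following is a sketch of how one such update might look in numpy, not the original script. It assumes the error is half the mean squared error, so the gradient is averaged over samples. The name `gradient_step` and the toy data are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Standard logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def gradient_step(l0, targets, weights, learning_rate):
    """One gradient descent update for a single sigmoid neuron."""
    l1 = sigmoid(l0 @ weights)         # forward pass: neuron outputs
    d_error_d_output = l1 - targets    # first factor: dE/do
    d_output_d_net = l1 * (1.0 - l1)   # second factor: sigmoid derivative
    delta = d_error_d_output * d_output_d_net
    # Third factor: dnet/dw is the input itself, so multiply by l0
    # and average the result over all samples.
    gradient = l0.T @ delta / len(l0)
    return weights - learning_rate * gradient


# Toy data in which the label equals the second feature.
l0 = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
targets = np.array([[1.0], [0.0], [1.0], [0.0]])
weights = np.zeros((2, 1))
new_weights = gradient_step(l0, targets, weights, learning_rate=0.5)
```

Starting from zero weights, the first feature carries no information about the label and its weight stays at 0, while the weight on the second feature moves in the positive direction.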
While training neural networks, the same training data is run through the network many times, with each full pass being referred to as an epoch. (SAGAR SHARMA explains well why we use the same data many times with neural networks in this post.) With each epoch, the weights update further to try to lower the error. For our simple example, the number of epochs and the learning rate are selected through trial and error, by watching the MSE loss decrease and converge.
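Putting the pieces together, a training loop over many epochs might be sketched as follows. To keep the example self-contained it runs on small synthetic data, not the breast cancer dataset, and the epoch count, learning rate, seeds, and the name `train` are all assumptions picked for illustration.

```python
import numpy as np


def sigmoid(x):
    """Standard logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def train(l0, targets, epochs, learning_rate, seed=0):
    """Train a single sigmoid neuron with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    weights = rng.uniform(-1.0, 1.0, size=(l0.shape[1], 1))
    losses = []
    for _ in range(epochs):
        l1 = sigmoid(l0 @ weights)                         # forward pass
        losses.append(0.5 * np.mean((l1 - targets) ** 2))  # scaled MSE
        delta = (l1 - targets) * l1 * (1.0 - l1)           # dE/do * do/dnet
        weights -= learning_rate * (l0.T @ delta) / len(l0)
    return weights, losses


# Synthetic, linearly separable data: the label is 1 when the features sum to > 0.
rng = np.random.default_rng(1)
l0 = rng.normal(size=(200, 2))
targets = (l0[:, :1] + l0[:, 1:] > 0).astype(float)

weights, losses = train(l0, targets, epochs=500, learning_rate=1.0)
predictions = sigmoid(l0 @ weights) > 0.5
accuracy = np.mean(predictions == (targets > 0.5))
```

Plotting `losses` against the epoch number is one simple way to watch the loss decrease and level off while tuning the two hyperparameters.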
Figure 11 shows that over hundreds of epochs, loss decreases and accuracy increases on both training and test datasets. Next, we check if this training process is repeatable by training 10 different randomly initialized neurons on the same dataset. At the end of 10 training runs, average test accuracy is 90.49% (s=2.40%) and the average total accuracy is 90.33% (s=0.304%).
If we had seen a large discrepancy between training and test accuracy, or if the test loss had increased while the training loss decreased, we would have reason to believe that the neuron was memorizing the training samples (overfitting) rather than learning the pattern hidden in the dataset. Although this level of validation is not nearly enough to put this neuron into a production environment, the signs indicate that the neuron has learned a pattern in the dataset.
We looked here at the simplest form of an artificial neural network, namely one with a single neuron powered by gradient descent. Networks can be made of many neurons or other trainable filters/units, and use a variety of loss and activation functions based on their purpose. All of these extensions allow ANNs to perform a broad range of tasks such as object detection, language translation, time-series forecasting, and more.
In our next post, we will explore the limitations of a single neuron and dig deeper into the flow of error backwards through a chain of neurons, or layers. The backward flow of error through a neural network is what allows a collection of neurons to converge on a solution together. Consequently, we will be able to pass much larger and much more complex datasets through our neural networks.
As always, leave comments and call out any mistakes you find here. I will work to fix them as quickly as possible!