Implementing a Single Neuron in Python from Scratch

In the last decade, Artificial Intelligence (AI) has stepped firmly into the public spotlight, in large part owing to advances in Machine Learning (ML) and Artificial Neural Networks (ANNs). But with promising new technologies comes a whole lot of buzz, and there is now an overwhelming amount of noise in the field. That's why I thought it would be useful to get back to basics and actually implement a single neuron from scratch using Python.

The Artificial Neuron

Before we dive in, I want to quickly talk about what a neuron is in the first place. Early proponents of artificial intelligence noticed that the biological neuron was capable of conceptualizing and learning from large volumes of data, and postulated that modelling this neuron in a machine might allow for a similar capability.¹ To that end, the neuron was abstracted into a model of inputs, outputs, and weights.

Figure 1: Simple neuron model.

In machine learning terminology, each input to the neuron (x1, x2, … xn) is known as a feature, and each feature is weighted with a number to represent the strength of that input (w1j, w2j, … wnj). The weighted sum of inputs (netj) then passes through an activation function, whose general purpose is to model the "firing rate" of a biological neuron by converting the weighted sum into a new number according to a formula. Although we don't need to go into the mechanics of the activation function just yet, here is some reading material by Avinash Sharma V for those who are curious.

So if that's how a neuron works, let's look at how it learns. In simple terms, training a neuron refers to iteratively updating the weights associated with each of its inputs so that it can progressively approximate the underlying relationship in the dataset it's been given. Once properly trained, a neuron can be used to do things like correctly sort entirely new samples (say, images of cats and dogs) into separate buckets, just like people can.
In machine learning terminology, this is known as classification.

Training

To train a simple classifier, let's use the publicly available sklearn breast cancer dataset, which has the following properties:

+---------------+-----+
| Classes       | 2   |
+---------------+-----+
| Num Samples   | 569 |
+---------------+-----+
| Num Benign    | 357 |
+---------------+-----+
| Num Malignant | 212 |
+---------------+-----+

Each sample in the dataset is derived from an image of a breast mass that has been translated into a set of 30 numbers (features). Using a portion of the samples to train our neuron, we will see if it is able to categorize the unseen portion of breast masses as either malignant or benign. In other words, we need to perform a supervised learning task, using explicitly labelled data points as teachers for the neuron to learn relevant patterns.

To run and modify the code below, check out the script here: single-layer-perceptron.py

First we load the dataset and randomly shuffle the malignant and benign samples together while keeping each individual sample labelled. This is because we do not want our neuron to draw conclusions based on the order of samples it sees, only on the features of each individual sample.

To train our neuron, we basically need to do three things:

1. Ask the neuron to classify a sample.
2. Update the neuron's weights based on how wrong the prediction is.
3. Repeat.

Since a neuron is essentially just a collection of weights, we can use Python's matrix manipulation package numpy to randomly initialize a vector of weights.² The number of weights initialized corresponds to the number of features (inputs) to the neuron, as each feature is weighted before being summed. The activation function is fixed (it has no trainable parameters), so it does not need to be stored as part of the neuron's state.

Forward Pass

In the first step of training, we ask the neuron to make a prediction about the training samples.
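To make the setup concrete, here is a minimal sketch of shuffling, splitting, and weight initialization, ending with the weighted sum that starts the prediction step. It is not the code from single-layer-perceptron.py. The helper names, the 80/20 split, and the seeds are assumptions, and random stand-in data with the dataset's shape (569 samples, 30 features) is used so the sketch runs without scikit-learn.

```python
import numpy as np


def shuffle_and_split(features, labels, train_fraction=0.8, seed=0):
    """Shuffle samples and labels together, then split into train/test.

    The 80/20 split is an assumption; the article only says "a portion".
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))  # one permutation for both arrays
    features, labels = features[order], labels[order]
    n_train = int(train_fraction * len(features))
    train = (features[:n_train], labels[:n_train])
    test = (features[n_train:], labels[n_train:])
    return train, test


def init_weights(n_features, seed=0):
    """One weight per feature, drawn uniformly from [-1, 1) (mean zero)."""
    rng = np.random.default_rng(seed)
    return 2.0 * rng.random((n_features, 1)) - 1.0


# Stand-in data shaped like the breast cancer dataset (569 x 30).
rng = np.random.default_rng(1)
features = rng.normal(size=(569, 30))
labels = rng.integers(0, 2, size=(569, 1))

(train_x, train_y), (test_x, test_y) = shuffle_and_split(features, labels)
weights = init_weights(train_x.shape[1])

# Weighted sum of features for every training sample: (455, 30) @ (30, 1).
weighted_sums = train_x @ weights
print(train_x.shape, test_x.shape, weights.shape, weighted_sums.shape)
```

With scikit-learn installed, the stand-in arrays could be replaced by the output of `sklearn.datasets.load_breast_cancer(return_X_y=True)`, reshaping the labels into a column with `labels.reshape(-1, 1)`.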
This prediction step is known as a forward pass, and it involves taking a weighted sum of the input features and passing that sum through the activation function.

In the script, l0 is a matrix of features with the shape (n_samples × n_features). The weights representing our single neuron are of shape (n_features × 1). Therefore, a matrix multiplication between these two matrices will provide a weighted sum of all features for each sample, with shape (n_samples × 1). (If you try this out by hand, you'll see it's not as complicated as it sounds.)

When passed through an activation function, these weighted sums will effectively become class predictions for each training sample.

The Sigmoid Function

The sigmoid function is a special case of the logistic function, and is chosen here as our activation function for a few reasons: it is easily differentiable, non-linear, and bounded, with the following shape and definition:

Figure 2: Sigmoid function shape.

Figure 3: Sigmoid function definition.

In numpy, the function can be implemented in a single line, for example as 1 / (1 + np.exp(-x)).

Conveniently, a standard logistic function has an easily calculated derivative of the form f'(x) = f(x)(1 − f(x)):

Figure 4: Derivative of a Sigmoid Function

where f(x) represents the logistic function. As we will see shortly, this property is extremely useful when trying to minimize the error in our neuron's predictions. In code, the derivative follows directly from this formula.

Gradient Descent

Now here comes the fun (and tricky) part: actually having our neuron learn the underlying relationship in the dataset. Now that we have bounded predictions for each of the training samples, we can calculate the error/loss in these predictions and update our neuron weights in proportion to this error.

We will use the gradient descent optimization algorithm for this weight update. In order to use this algorithm, we need an error function to represent the gap between our neuron's predictions and ground truth.
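The sigmoid and its derivative described above can be sketched as follows. This is an illustration, not the script's own code. Here `sigmoid_derivative` takes the weighted sum as its argument, which is an assumption (some implementations pass in the already-activated output instead). The finite-difference check at the end is added only to show that the closed form f(x)(1 − f(x)) agrees with a numerical derivative.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid at x, using f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Compare the closed form with a central finite difference at a few points.
points = np.array([-2.0, 0.0, 3.0])
step = 1e-6
numeric = (sigmoid(points + step) - sigmoid(points - step)) / (2 * step)
max_gap = np.max(np.abs(numeric - sigmoid_derivative(points)))
print(sigmoid(0.0), sigmoid_derivative(0.0), max_gap)
```

The derivative peaks at x = 0 and shrinks towards zero for large positive or negative inputs, so weighted sums with a very large magnitude produce very small weight updates.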
The error function used here is a scaled version of mean squared error (the scaling makes differentiation easy).

Figure 5: Mean-squared-error function.

In code, this mean-squared-error function takes the neuron's predictions and the true labels and returns a single number.

A gradient, in mathematics, is the derivative of a function that is dependent on more than one variable. Recall that vectors have both a magnitude and a direction. Our neuron's error is dependent on all of the weights feeding into it. So, the gradient is the vector of partial derivatives of the error with respect to each of the weights.

As a vector, the gradient points in the direction of the greatest rate of increase of a function. Moving in the opposite direction to the gradient therefore reduces the function's value most rapidly. Think of the error function as a surface with ridges and valleys. By descending opposite to the gradient, we are moving into the valley where error is lower.

If we are able to calculate the gradient of the error function of our neuron relative to each of its weights, we can update the weights proportionally to minimize the error. Below is a simple derivation of the gradient of the error function using the chain rule. A more rigorous derivation can be found here.

Figure 6: Partial derivative of error with respect to each neuron weight.

Here, E is the error function, wij is one particular weight, oj is the output of a neuron, and netj is the weighted sum of inputs to a neuron. Indices i and j correspond to the weight and the neuron, respectively. We will calculate each of the factors of the partial derivative separately.

The first factor is simple, and is the derivative of error with respect to the neuron's output:

Figure 7: Derivative of output error with respect to neuron output.

The second factor is also straightforward, and is the derivative of the sigmoid function we described above in Figure 4.

Figure 8: Derivative of neuron output with respect to the weighted sum.
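Since the figures themselves are not reproduced here, the chain rule and the first two factors can be written out as a sketch. It assumes the error for one sample has the form E = ½(t_j − o_j)², where t_j is a symbol introduced here for the true label, and that the output is o_j = f(net_j) with f the sigmoid. The exact scaling in the original figure may differ.

```latex
\begin{align*}
\frac{\partial E}{\partial w_{ij}}
  &= \frac{\partial E}{\partial o_j}\,
     \frac{\partial o_j}{\partial \mathrm{net}_j}\,
     \frac{\partial \mathrm{net}_j}{\partial w_{ij}} \\[4pt]
\frac{\partial E}{\partial o_j}
  &= \frac{\partial}{\partial o_j}\,\tfrac{1}{2}\,(t_j - o_j)^2
   = o_j - t_j \\[4pt]
\frac{\partial o_j}{\partial \mathrm{net}_j}
  &= f(\mathrm{net}_j)\,\bigl(1 - f(\mathrm{net}_j)\bigr)
   = o_j\,(1 - o_j)
\end{align*}
```

Under this assumption, the factor of ½ cancels the 2 produced by differentiating the square, which is the sense in which the scaling makes differentiation easy.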
The third and final factor simplifies to equal the inputs to a particular neuron.

Figure 9: Partial derivative of the weighted sum of inputs with respect to each weight.

In Figure 9, oi is the vector of inputs to that neuron, which in our case are the features from our training set.

Weight Update Rule

Tying the partial derivatives we just saw together with gradient descent gives us a rule for updating the weights representing our neuron:

Figure 10: Weight update rule.

Figure 10 shows that each weight is updated in the negative direction of the gradient, scaled by an additional term, n. This scaling factor determines how large a step we take when updating neuron weights, effectively controlling the rate at which the neuron learns. We call this the learning rate.

Implementing Weight Updates

The script implements the gradient calculation and weight update for our single neuron. You can follow its comments to find each step of the derivatives required for the weight update rule.

While training neural networks, the same training data is run through the network many times, with each full pass being referred to as an epoch. (Sagar Sharma explains well why we use the same data many times with neural networks in this post.) With each epoch, the weights update further to try and lower the error. For our simple example, the number of epochs and the learning rate are selected through trial and error, by observing whether the MSE loss decreases and converges.

Results

Figure 11: Training results.

Figure 11 shows that over hundreds of epochs, loss decreases and accuracy increases on both training and test datasets. Next, we check if this training process is repeatable by training 10 different randomly initialized neurons on the same dataset. At the end of 10 training runs, average test accuracy is 90.49% (standard deviation s = 2.40%) and the average total accuracy is 90.33% (s = 0.304%).
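For reference, the weight update rule and epoch loop described above can be sketched as follows. This is a minimal illustration, not the script that produced the reported accuracies. The function names, the toy dataset, the learning rate of 0.5, and the 500 epochs are all assumptions, and the loss is taken to be half the mean squared error.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mse_loss(predictions, targets):
    # Scaled mean squared error; the 1/2 cancels when differentiating.
    return 0.5 * np.mean((predictions - targets) ** 2)


def train(features, targets, epochs=500, learning_rate=0.5, seed=0):
    """Train a single sigmoid neuron with batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = features.shape
    weights = 2.0 * rng.random((n_features, 1)) - 1.0  # centred on zero
    losses = []
    for _ in range(epochs):
        outputs = sigmoid(features @ weights)        # forward pass
        losses.append(mse_loss(outputs, targets))
        d_error_d_output = outputs - targets         # first factor
        d_output_d_net = outputs * (1.0 - outputs)   # second factor
        delta = d_error_d_output * d_output_d_net
        # Third factor is the inputs; average the gradient over samples.
        gradient = features.T @ delta / n_samples
        weights -= learning_rate * gradient          # step against gradient
    return weights, losses


# Toy, linearly separable data standing in for the real dataset.
rng = np.random.default_rng(42)
toy_features = rng.normal(size=(200, 3))
toy_targets = (toy_features[:, :1] + toy_features[:, 1:2] > 0).astype(float)

trained_weights, loss_history = train(toy_features, toy_targets)
predictions = sigmoid(toy_features @ trained_weights) > 0.5
accuracy = np.mean(predictions == toy_targets)
print(loss_history[0], loss_history[-1], accuracy)
```

Calling `train` with several different `seed` values and comparing the resulting accuracies is one simple way to run the kind of repeatability check described above.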
If we had seen a large discrepancy between training and test accuracy, or if the test loss had increased while the training loss decreased, we would have reason to believe that the neuron was not learning the pattern hidden in the dataset. Although this level of validation is not nearly enough to put this neuron into a production environment, signs indicate that the neuron has learned a pattern in the dataset.

Conclusion

We looked here at the simplest form of an artificial neural network, namely one with a single neuron powered by gradient descent. Networks can be made of many neurons or other trainable filters/units, and use a variety of loss and activation functions based on their purpose. All of these extensions allow ANNs to perform a broad range of tasks such as object detection, language translation, time-series forecasting, and more.

In our next post, we will explore the limitations of a single neuron and dig deeper into the flow of error backwards through a chain of neurons, or layers. The backward flow of error through a neural network is what allows a collection of neurons to converge on a solution together. Consequently, we will be able to pass much larger and much more complex datasets through our neural networks.

Author's Remarks

I wanted to give a huge shoutout to Eli Burnstein, William Wen, Guy Tonye, and Thomas Aston for helping proofread and edit this article through multiple revisions.

As always, leave comments and call out any mistakes you find here. I will work to fix them as quickly as possible!

Footnotes

¹ The Perceptron, A Perceiving and Recognizing Automaton, by Frank Rosenblatt.

² In our very simple example, we centred the weights around a mean of zero. However, there are better ways to initialize weights for larger models. Here is a great introduction to best practices for weight initialization by Neerja Doshi.

Dhruv is an AI Software Engineer at Connected Lab, a product development firm that works with clients to drive impact through software-powered products. www.connectedlab.com