Building a Feedforward Neural Network from Scratch in Python

Photo by Chris Ried on Unsplash

In this post, we will see how to implement a feedforward neural network from scratch in Python. This is a follow-up to my previous post on feedforward neural networks.

Feedforward Neural Networks

Feedforward neural networks are also known as a Multi-layered Network of Neurons (MLN). These models are called feedforward because information only travels forward through the network: through the input nodes, then through the hidden layers (one or many), and finally through the output nodes.

Figure: Generic Network with Connections

The capacity of traditional models such as the McCulloch-Pitts neuron, the perceptron and the sigmoid neuron is limited to linear functions. To handle a complex non-linear decision boundary between the input and the output, we use the Multi-layered Network of Neurons.

To understand the feedforward neural network learning algorithm and the computations present in the network, kindly refer to my previous post, "Deep Learning: Feedforward Neural Networks Explained".

Coding Part

Photo by Hitesh Choudhary on Unsplash

In the coding section, we will be covering the following topics.

1. Generate data that is not linearly separable
2. Train with Sigmoid Neuron and see performance
3. Write from scratch our first feedforward network
4. Train the FF network on the data and compare with Sigmoid Neuron
5. Write a generic class for a FF network
6. Train generic class on binary classification
7. Train a FF network for multi-class data using a cross-entropy loss function

If you want to skip the theory part and get into the code right away, see the GitHub repository Niranjankumar-c/Feedforward_NeuralNetworrks.

PS: If you are interested in converting the code into R, send me a message once it is done. I will feature your work here and also on the GitHub page.

Before we start building our network, first we need to import the required libraries.
We are importing numpy to evaluate the matrix multiplication and dot product between two vectors, matplotlib to visualize the data, and from the sklearn package we are importing functions to generate data and evaluate the network performance.

    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.colors
    from sklearn.model_selection import train_test_split
    # log_loss is added here because SigmoidNeuron.fit uses it for the "ce" loss
    from sklearn.metrics import accuracy_score, mean_squared_error, log_loss
    from tqdm import tqdm_notebook
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.datasets import make_blobs

Generate Dummy Data

Remember that we are using feedforward neural networks because we want to deal with non-linearly separable data. In this section, we will see how to randomly generate non-linearly separable data.

    #creating my own color map for better visualization
    my_cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["red", "yellow", "green"])

    #Generating 1000 observations with 4 labels - multi class
    data, labels = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=0)
    print(data.shape, labels.shape)

    #visualize the data
    plt.scatter(data[:, 0], data[:, 1], c=labels, cmap=my_cmap)
    plt.show()

    #converting the multi-class to binary
    labels_orig = labels
    labels = np.mod(labels_orig, 2)
    plt.scatter(data[:, 0], data[:, 1], c=labels, cmap=my_cmap)
    plt.show()

    #split the binary data
    X_train, X_val, Y_train, Y_val = train_test_split(data, labels, stratify=labels, random_state=0)
    print(X_train.shape, X_val.shape)

To generate data randomly we will use make_blobs, which generates blobs of points with a Gaussian distribution. I have generated 1000 data points in 2D space with four blobs (centers=4) as a multi-class classification problem. Each data point has two inputs and a class label of 0, 1, 2 or 3. The first plt.scatter call in the code above visualizes the data using a scatter plot. We can see that there are 4 centers and that the data is (almost) linearly separable.

Figure: Multi-Class Data

In the above plot, I was able to represent 3 dimensions (2 inputs, plus the class labels as colors) using a simple scatter plot. Note that the make_blobs() function generates linearly separable data, but we need non-linearly separable data for binary classification.

    labels_orig = labels
    labels = np.mod(labels_orig, 2)

One way to convert the 4 classes to binary classification is to take the remainder of each class label when it is divided by 2, so that the new labels are 0 and 1.

Figure: Binary Class Data

From the plot, we can see that the blobs are merged such that we now have a binary classification problem where the decision boundary is not linear. Once we have our data ready, I have used the train_test_split function to split the data into training and validation sets in the ratio of 75:25 (the train_test_split default when no test_size is given).

Train with Sigmoid Neuron

Before we start training the data on the sigmoid neuron, we will build our model inside a class called SigmoidNeuron.
    class SigmoidNeuron:

        #initialization
        def __init__(self):
            self.w = None
            self.b = None

        #forward pass
        def perceptron(self, x):
            return np.dot(x, self.w.T) + self.b

        def sigmoid(self, x):
            return 1.0/(1.0 + np.exp(-x))

        #updating the gradients using mean squared error loss
        def grad_w_mse(self, x, y):
            y_pred = self.sigmoid(self.perceptron(x))
            return (y_pred - y) * y_pred * (1 - y_pred) * x

        def grad_b_mse(self, x, y):
            y_pred = self.sigmoid(self.perceptron(x))
            return (y_pred - y) * y_pred * (1 - y_pred)

        #updating the gradients using cross entropy loss
        def grad_w_ce(self, x, y):
            y_pred = self.sigmoid(self.perceptron(x))
            if y == 0:
                return y_pred * x
            elif y == 1:
                return -1 * (1 - y_pred) * x
            else:
                raise ValueError("y should be 0 or 1")

        def grad_b_ce(self, x, y):
            y_pred = self.sigmoid(self.perceptron(x))
            if y == 0:
                return y_pred
            elif y == 1:
                return -1 * (1 - y_pred)
            else:
                raise ValueError("y should be 0 or 1")

        #model fit method
        def fit(self, X, Y, epochs=1, learning_rate=1, initialise=True, loss_fn="mse", display_loss=False):

            # initialise w, b
            if initialise:
                self.w = np.random.randn(1, X.shape[1])
                self.b = 0

            if display_loss:
                loss = {}

            for i in tqdm_notebook(range(epochs), total=epochs, unit="epoch"):
                dw = 0
                db = 0
                for x, y in zip(X, Y):
                    if loss_fn == "mse":
                        dw += self.grad_w_mse(x, y)
                        db += self.grad_b_mse(x, y)
                    elif loss_fn == "ce":
                        dw += self.grad_w_ce(x, y)
                        db += self.grad_b_ce(x, y)

                # note: X.shape[1] is the number of features, so the summed gradient
                # is scaled by that and not by the number of samples (X.shape[0])
                m = X.shape[1]
                self.w -= learning_rate * dw/m
                self.b -= learning_rate * db/m

                if display_loss:
                    Y_pred = self.sigmoid(self.perceptron(X))
                    if loss_fn == "mse":
                        loss[i] = mean_squared_error(Y, Y_pred)
                    elif loss_fn == "ce":
                        # log_loss is sklearn.metrics.log_loss
                        loss[i] = log_loss(Y, Y_pred)

            if display_loss:
                plt.plot(loss.values())
                plt.xlabel('Epochs')
                if loss_fn == "mse":
                    plt.ylabel('Mean Squared Error')
                elif loss_fn == "ce":
                    plt.ylabel('Log Loss')
                plt.show()

        def predict(self, X):
            Y_pred = []
            for x in X:
                y_pred = self.sigmoid(self.perceptron(x))
                Y_pred.append(y_pred)
            return np.array(Y_pred)

In the class SigmoidNeuron we have 9 functions. I will walk you through these functions one by one and explain what they are doing.

    def __init__(self):
        self.w = None
        self.b = None

The __init__ function (constructor function) initializes the parameters of the sigmoid neuron, the weights w and the bias b, to None.

    #forward pass
    def perceptron(self, x):
        return np.dot(x, self.w.T) + self.b

    def sigmoid(self, x):
        return 1.0/(1.0 + np.exp(-x))

Next, we define two functions, perceptron and sigmoid, which characterize the forward pass. In the case of a sigmoid neuron, the forward pass involves two steps:

perceptron: computes the dot product between the input x and the weights w, and adds the bias b.
sigmoid: takes the output of perceptron and applies the sigmoid (logistic) function on top of it.

    #updating the gradients using mean squared error loss
    def grad_w_mse(self, x, y):
        ...
    def grad_b_mse(self, x, y):
        ...

    #updating the gradients using cross entropy loss
    def grad_w_ce(self, x, y):
        ...
    def grad_b_ce(self, x, y):
        ...

The next four functions characterize the gradient computation. I have written two separate pairs of functions for updating the weights w and the bias b: one pair using the mean squared error loss and one using the cross-entropy loss.

    def fit(self, X, Y, epochs=1, learning_rate=1, initialise=True, loss_fn="mse", display_loss=False):
        ...

Next, we define the fit method, which accepts a few parameters:

X: Inputs.
Y: Labels.
epochs: Number of epochs we allow our algorithm to iterate over the data; the default value is 1.
learning_rate: The magnitude of change for our weights during each step through our training data; the default value is 1.
initialise: Whether to randomly initialize the parameters of the model. If it is set to True, the weights are initialized; set it to False if you want to continue training an already trained model.
loss_fn: Selects the loss function the algorithm uses to update the parameters. It can be "mse" or "ce".
display_loss: Boolean variable indicating whether to show the decrease of the loss over the epochs.

In the fit method, we go through the data passed in X and Y and compute the update values for the parameters using either the mean squared error loss or the cross-entropy loss. Once we have the update values, we update the weights and the bias term (the two lines that follow m = X.shape[1] in the code above).

    def predict(self, X):

Now we define our predict function, which takes the inputs X as an argument and expects them to be a numpy array. In the predict function, we compute the forward pass of each input with the trained model and send back a numpy array which contains the predicted value for each input.

    #create a class object
    sn = SigmoidNeuron()

    #train the model
    sn.fit(X_train, Y_train, epochs=1000, learning_rate=0.5, display_loss=True)

    #prediction on training data
    Y_pred_train = sn.predict(X_train)
    Y_pred_binarised_train = (Y_pred_train >= 0.5).astype("int").ravel()

    #prediction on testing data
    Y_pred_val = sn.predict(X_val)
    Y_pred_binarised_val = (Y_pred_val >= 0.5).astype("int").ravel()

    #model accuracy
    accuracy_train = accuracy_score(Y_pred_binarised_train, Y_train)
    accuracy_val = accuracy_score(Y_pred_binarised_val, Y_val)
    print("Training accuracy", round(accuracy_train, 2))
    print("Validation accuracy", round(accuracy_val, 2))

    #visualizing the results
    plt.scatter(X_train[:, 0], X_train[:, 1], c=Y_pred_binarised_train, cmap=my_cmap, s=15*(np.abs(Y_pred_binarised_train-Y_train)+.2))
    plt.show()

Now we will train the sigmoid neuron which we created on our data. First, we instantiate the SigmoidNeuron class and then call the fit method on the training data with 1000 epochs and the learning rate set to 0.5. (These values are arbitrary, not the optimal values for this data; you can play around with them to find the best number of epochs and learning rate.) By default, the loss function is set to the mean squared error loss, but you can change it to the cross-entropy loss as well.

Figure: Sigmoid Neuron Loss Variation

As you can see, the loss of the sigmoid neuron is decreasing, but there are a lot of oscillations, possibly because of the large learning rate. You can decrease the learning rate and check the loss variation. Once we have trained the model, we can make predictions on the validation data and binarise those predictions by taking 0.5 as the threshold. We can compute the training and validation accuracy of the model to evaluate its performance and check for any scope of improvement by changing the number of epochs or the learning rate.
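The expressions returned by grad_w_mse and grad_b_mse can be sanity-checked against a numerical derivative. The following is only a sketch and not part of the original code: it assumes a per-sample loss of 0.5 * (y_pred - y)^2 (the factor 0.5 is my assumption, chosen so that the derivative has no leading 2), and the helper names loss, analytic_grads and numeric_grads are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # Assumed per-sample loss: 0.5 * (y_pred - y)^2
    y_pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return 0.5 * (y_pred - y) ** 2

def analytic_grads(w, b, x, y):
    # Same expressions as grad_w_mse / grad_b_mse in SigmoidNeuron
    y_pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    common = (y_pred - y) * y_pred * (1 - y_pred)
    return [common * xi for xi in x], common

def numeric_grads(w, b, x, y, eps=1e-6):
    # Central finite differences, one parameter at a time
    gw = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += eps
        w_minus = list(w)
        w_minus[i] -= eps
        gw.append((loss(w_plus, b, x, y) - loss(w_minus, b, x, y)) / (2 * eps))
    gb = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
    return gw, gb

# Arbitrary illustrative values, not taken from the dataset above
w, b, x, y = [0.3, -0.8], 0.1, [1.5, -2.0], 1
gw_a, gb_a = analytic_grads(w, b, x, y)
gw_n, gb_n = numeric_grads(w, b, x, y)
max_diff = max(abs(a - n) for a, n in zip(gw_a + [gb_a], gw_n + [gb_n]))
print("largest analytic/numeric gap:", max_diff)
```

If the two sets of gradients agree to many decimal places, the closed-form expressions are consistent with the assumed loss. The same kind of check could be repeated for the cross-entropy gradients.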
    #visualizing the results
    plt.scatter(X_train[:, 0], X_train[:, 1], c=Y_pred_binarised_train, cmap=my_cmap, s=15*(np.abs(Y_pred_binarised_train-Y_train)+.2))
    plt.show()

To see which points in the training set the model predicts correctly and which it does not, we will use the scatter plot function from matplotlib.pyplot. The function takes the first and second features as its two inputs; for the color I have used Y_pred_binarised_train and defined a custom 'cmap' for visualization. As you can see in the plot below, the size of each point is different.

Figure: 4D Scatter Plot

The size of each point in the plot is given by a formula,

    s=15*(np.abs(Y_pred_binarised_train-Y_train)+.2)

The formula takes the absolute difference between the predicted value and the actual value.

If the ground truth is equal to the predicted value, then size = 3.
If the ground truth is not equal to the predicted value, then size = 18.

All the small points in the plot indicate that the model is predicting those observations correctly, and the large points indicate that those observations are incorrectly classified.

In this plot, we are able to represent 4 dimensions: two input features, color to indicate the different labels, and the size of the point to indicate whether it is predicted correctly or not. The important takeaway from the plot is that the sigmoid neuron is not able to handle non-linearly separable data.

If you want to learn the sigmoid neuron learning algorithm in detail with math, check out my previous post, "Sigmoid Neuron Learning Algorithm Explained With Math".

Write First Feedforward Neural Network

In this section, we will take a very simple feedforward neural network and build it from scratch in Python. The network has three neurons in total: two in the first hidden layer and one in the output layer. For each of these neurons, the pre-activation is represented by 'a' and the post-activation is represented by 'h'.
In the network, we have a total of 9 parameters: 6 weight parameters and 3 bias terms.

Figure: Simple Feedforward Network

Similar to the Sigmoid Neuron implementation, we will write our neural network in a class called FirstFFNetwork.

    class FirstFFNetwork:

        #initialize the parameters
        def __init__(self):
            self.w1 = np.random.randn()
            self.w2 = np.random.randn()
            self.w3 = np.random.randn()
            self.w4 = np.random.randn()
            self.w5 = np.random.randn()
            self.w6 = np.random.randn()
            self.b1 = 0
            self.b2 = 0
            self.b3 = 0

        def sigmoid(self, x):
            return 1.0/(1.0 + np.exp(-x))

        def forward_pass(self, x):
            #forward pass - preactivation and activation
            self.x1, self.x2 = x
            self.a1 = self.w1*self.x1 + self.w2*self.x2 + self.b1
            self.h1 = self.sigmoid(self.a1)
            self.a2 = self.w3*self.x1 + self.w4*self.x2 + self.b2
            self.h2 = self.sigmoid(self.a2)
            self.a3 = self.w5*self.h1 + self.w6*self.h2 + self.b3
            self.h3 = self.sigmoid(self.a3)
            return self.h3

        def grad(self, x, y):
            #back propagation
            self.forward_pass(x)

            self.dw5 = (self.h3-y) * self.h3*(1-self.h3) * self.h1
            self.dw6 = (self.h3-y) * self.h3*(1-self.h3) * self.h2
            self.db3 = (self.h3-y) * self.h3*(1-self.h3)

            self.dw1 = (self.h3-y) * self.h3*(1-self.h3) * self.w5 * self.h1*(1-self.h1) * self.x1
            self.dw2 = (self.h3-y) * self.h3*(1-self.h3) * self.w5 * self.h1*(1-self.h1) * self.x2
            self.db1 = (self.h3-y) * self.h3*(1-self.h3) * self.w5 * self.h1*(1-self.h1)

            self.dw3 = (self.h3-y) * self.h3*(1-self.h3) * self.w6 * self.h2*(1-self.h2) * self.x1
            self.dw4 = (self.h3-y) * self.h3*(1-self.h3) * self.w6 * self.h2*(1-self.h2) * self.x2
            self.db2 = (self.h3-y) * self.h3*(1-self.h3) * self.w6 * self.h2*(1-self.h2)

        def fit(self, X, Y, epochs=1, learning_rate=1, initialise=True, display_loss=False):

            # initialise w, b
            if initialise:
                self.w1 = np.random.randn()
                self.w2 = np.random.randn()
                self.w3 = np.random.randn()
                self.w4 = np.random.randn()
                self.w5 = np.random.randn()
                self.w6 = np.random.randn()
                self.b1 = 0
                self.b2 = 0
                self.b3 = 0

            if display_loss:
                loss = {}

            for i in tqdm_notebook(range(epochs), total=epochs, unit="epoch"):
                dw1, dw2, dw3, dw4, dw5, dw6, db1, db2, db3 = [0]*9
                for x, y in zip(X, Y):
                    self.grad(x, y)
                    dw1 += self.dw1
                    dw2 += self.dw2
                    dw3 += self.dw3
                    dw4 += self.dw4
                    dw5 += self.dw5
                    dw6 += self.dw6
                    db1 += self.db1
                    db2 += self.db2
                    db3 += self.db3

                # note: X.shape[1] is the number of features, not the number of samples
                m = X.shape[1]
                self.w1 -= learning_rate * dw1 / m
                self.w2 -= learning_rate * dw2 / m
                self.w3 -= learning_rate * dw3 / m
                self.w4 -= learning_rate * dw4 / m
                self.w5 -= learning_rate * dw5 / m
                self.w6 -= learning_rate * dw6 / m
                self.b1 -= learning_rate * db1 / m
                self.b2 -= learning_rate * db2 / m
                self.b3 -= learning_rate * db3 / m

                if display_loss:
                    Y_pred = self.predict(X)
                    loss[i] = mean_squared_error(Y_pred, Y)

            if display_loss:
                plt.plot(loss.values())
                plt.xlabel('Epochs')
                plt.ylabel('Mean Squared Error')
                plt.show()

        def predict(self, X):
            #predicting the results on unseen data
            Y_pred = []
            for x in X:
                y_pred = self.forward_pass(x)
                Y_pred.append(y_pred)
            return np.array(Y_pred)

In the class FirstFFNetwork we have 6 functions; we will go over these functions one by one.

    def __init__(self):
        ...

The __init__ function initializes all the parameters of the network, including weights and biases. Unlike the sigmoid neuron, where we have only two parameters (w and b), in this neural network we have 9 parameters to initialize. All 6 weights are initialized randomly and the 3 biases are set to zero.

    def sigmoid(self, x):
        return 1.0/(1.0 + np.exp(-x))

Next, we define the sigmoid function used for post-activation for each of the neurons in the network.
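One practical detail about the sigmoid as written: for very negative inputs, np.exp(-x) can overflow and NumPy may emit a runtime warning, even though the final value is still sensible. The sketch below shows an alternative formulation that I am swapping in for illustration; it is not used anywhere in this post. It relies on the identity sigmoid(x) = 0.5 * (1 + tanh(x / 2)), and the function names naive_sigmoid and tanh_sigmoid are hypothetical.

```python
import numpy as np

def naive_sigmoid(x):
    # Same formula as the sigmoid method used in the classes of this post
    return 1.0 / (1.0 + np.exp(-x))

def tanh_sigmoid(x):
    # Identity: sigmoid(x) = 0.5 * (1 + tanh(x / 2)); tanh saturates instead of overflowing
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(x, dtype=float)))

# On moderate inputs both formulations agree
xs = np.linspace(-10.0, 10.0, 41)
agree = np.allclose(naive_sigmoid(xs), tanh_sigmoid(xs))

# Extreme inputs stay finite and warning-free with the tanh form
extremes = tanh_sigmoid([-1000.0, 0.0, 1000.0])
print(agree, extremes)
```

For the toy data in this post the plain formula is adequate; the alternative only matters when pre-activations become very large in magnitude.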
    def forward_pass(self, x):
        #forward pass - preactivation and activation
        self.x1, self.x2 = x
        self.a1 = self.w1*self.x1 + self.w2*self.x2 + self.b1
        self.h1 = self.sigmoid(self.a1)
        self.a2 = self.w3*self.x1 + self.w4*self.x2 + self.b2
        self.h2 = self.sigmoid(self.a2)
        self.a3 = self.w5*self.h1 + self.w6*self.h2 + self.b3
        self.h3 = self.sigmoid(self.a3)
        return self.h3

Now we have the forward pass function, which takes an input x and computes the output. First, I have unpacked the input x, which has 2 features, into two variables, self.x1 and self.x2.

For each of these 3 neurons, two things happen:

Pre-activation, represented by 'a': a weighted sum of the inputs plus the bias.
Activation, represented by 'h': the activation function is the sigmoid function.

The pre-activation for the first neuron is given by,

a₁ = w₁ * x₁ + w₂ * x₂ + b₁

To get the post-activation value for the first neuron, we simply apply the logistic function to the pre-activation a₁.

h₁ = sigmoid(a₁)

Repeat the same process for the second neuron to get a₂ and h₂.

The outputs of the two neurons in the first hidden layer act as the inputs to the third neuron. The pre-activation for the third neuron is given by,

a₃ = w₅ * h₁ + w₆ * h₂ + b₃

and applying the sigmoid to a₃ gives the final predicted output.

    def grad(self, x, y):
        #back propagation
        ...

Next, we have the grad function, which takes the inputs x and y as arguments and computes the forward pass. Based on the forward pass, it computes the partial derivatives of the loss function, which is the mean squared error loss in this case, with respect to each of the weights and biases.

Note: In this post, I am not explaining how we arrive at these partial derivatives for the parameters. Just consider this function as a black box for now; in my next article I will explain how we compute these partial derivatives in backpropagation.

    def fit(self, X, Y, epochs=1, learning_rate=1, initialise=True, display_loss=False):
        ...

Then, we have the fit function, similar to the one in the sigmoid neuron. In this function, we iterate through each data point, compute the partial derivatives by calling the grad function, and accumulate those values in a separate variable for each parameter (the inner loop over zip(X, Y)). Then, we go ahead and update the values of all the parameters (the block that follows m = X.shape[1]). We also have the display_loss condition: if it is set to True, the function displays the plot of the network loss variation across all the epochs.

    def predict(self, X):
        #predicting the results on unseen data
        ...

Finally, we have the predict function, which takes a large set of values as inputs and computes the predicted value for each input by calling the forward_pass function on it.

Train the FF network on the data

We will now train the feedforward network which we created on our data. First, we instantiate the FirstFFNetwork class and then call the fit method on the training data with 2000 epochs and the learning rate set to 0.01.

    ffn = FirstFFNetwork()

    #train the model on the data
    ffn.fit(X_train, Y_train, epochs=2000, learning_rate=.01, display_loss=True)

    #predictions
    Y_pred_train = ffn.predict(X_train)
    Y_pred_binarised_train = (Y_pred_train >= 0.5).astype("int").ravel()
    Y_pred_val = ffn.predict(X_val)
    Y_pred_binarised_val = (Y_pred_val >= 0.5).astype("int").ravel()
    accuracy_train = accuracy_score(Y_pred_binarised_train, Y_train)
    accuracy_val = accuracy_score(Y_pred_binarised_val, Y_val)

    #model performance
    print("Training accuracy", round(accuracy_train, 2))
    print("Validation accuracy", round(accuracy_val, 2))

    #visualize the predictions
    plt.scatter(X_train[:, 0], X_train[:, 1], c=Y_pred_binarised_train, cmap=my_cmap, s=15*(np.abs(Y_pred_binarised_train-Y_train)+.2))
    plt.show()

To get a better idea about the performance of the neural network, we will use the same 4D visualization plot that we used for the sigmoid neuron (the plt.scatter call at the end of the code above) and compare it with the sigmoid neuron model.

Figure: Single Sigmoid Neuron (Left) & Neural Network (Right)

As you can see, most of the points are classified correctly by the neural network. The key takeaway is that just by combining three sigmoid neurons we are able to solve the problem of non-linearly separable data.

Generic Class for Feedforward Neural Network

In this section, we will write a generic class that can generate a neural network by taking the number of hidden layers and the number of neurons in each hidden layer as input parameters. The generic class also takes the number of inputs as a parameter: earlier we had only two inputs, but now we can have 'n'-dimensional inputs as well.

Note: In this case, I am considering the network for binary classification only.

Figure: Generic Feedforward Network

Before we start to write code for the generic neural network, let us understand the format of the indices used to represent the weights and biases associated with a particular neuron.

W(Layer number)(Neuron number in the layer)(Input number)
b(Layer number)(Bias number associated for that input)
a(Layer number)(Input number)

W₁₁₁: Weight associated with the first neuron present in the first hidden layer, connected to the first input.
W₁₁₂: Weight associated with the first neuron present in the first hidden layer, connected to the second input.
b₁₁: Bias associated with the first neuron present in the first hidden layer.
b₁₂: Bias associated with the second neuron present in the first hidden layer.
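To make the index notation concrete, here is a small sketch of how it can map onto NumPy arrays. It assumes each layer's weights are stored as an (inputs x neurons) matrix and each layer's biases as a (1 x neurons) row, which is the layout the generic class in this post uses; the helper functions weight and bias and the example layer sizes are hypothetical and only for illustration.

```python
import numpy as np

# Example architecture: 2 inputs, hidden layers of 2 and 3 neurons, 1 output
n_inputs, hidden_sizes, n_outputs = 2, [2, 3], 1
sizes = [n_inputs] + hidden_sizes + [n_outputs]

rng = np.random.default_rng(0)
W = {}
B = {}
for i in range(len(hidden_sizes) + 1):
    # Layer i+1 maps sizes[i] inputs to sizes[i+1] neurons
    W[i + 1] = rng.standard_normal((sizes[i], sizes[i + 1]))
    B[i + 1] = np.zeros((1, sizes[i + 1]))

def weight(layer, neuron, inp):
    # W(layer)(neuron)(input) with 1-based indices, as in the notation above
    return W[layer][inp - 1, neuron - 1]

def bias(layer, neuron):
    # b(layer)(neuron) with 1-based indices
    return B[layer][0, neuron - 1]

shapes = {k: v.shape for k, v in W.items()}
n_params = sum(w.size for w in W.values()) + sum(b.size for b in B.values())
print(shapes, n_params)
print("W112 =", weight(1, 1, 2), " b12 =", bias(1, 2))
```

Note the transposition: because the rows of each matrix correspond to inputs and the columns to neurons, W₁₁₂ lives at row 1, column 0 of W[1] in zero-based indexing.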
The Code

    class FFSNNetwork:

        def __init__(self, n_inputs, hidden_sizes=[2]):
            #initialize the inputs
            self.nx = n_inputs
            self.ny = 1
            self.nh = len(hidden_sizes)
            self.sizes = [self.nx] + hidden_sizes + [self.ny]

            self.W = {}
            self.B = {}
            for i in range(self.nh+1):
                self.W[i+1] = np.random.randn(self.sizes[i], self.sizes[i+1])
                self.B[i+1] = np.zeros((1, self.sizes[i+1]))

        def sigmoid(self, x):
            return 1.0/(1.0 + np.exp(-x))

        def forward_pass(self, x):
            self.A = {}
            self.H = {}
            self.H[0] = x.reshape(1, -1)
            for i in range(self.nh+1):
                self.A[i+1] = np.matmul(self.H[i], self.W[i+1]) + self.B[i+1]
                self.H[i+1] = self.sigmoid(self.A[i+1])
            return self.H[self.nh+1]

        def grad_sigmoid(self, x):
            return x*(1-x)

        def grad(self, x, y):
            self.forward_pass(x)
            self.dW = {}
            self.dB = {}
            self.dH = {}
            self.dA = {}
            L = self.nh + 1
            # (H - y) is the output-layer gradient of a sigmoid output under cross-entropy loss
            self.dA[L] = (self.H[L] - y)
            for k in range(L, 0, -1):
                self.dW[k] = np.matmul(self.H[k-1].T, self.dA[k])
                self.dB[k] = self.dA[k]
                self.dH[k-1] = np.matmul(self.dA[k], self.W[k].T)
                self.dA[k-1] = np.multiply(self.dH[k-1], self.grad_sigmoid(self.H[k-1]))

        def fit(self, X, Y, epochs=1, learning_rate=1, initialise=True, display_loss=False):

            # initialise w, b
            if initialise:
                for i in range(self.nh+1):
                    self.W[i+1] = np.random.randn(self.sizes[i], self.sizes[i+1])
                    self.B[i+1] = np.zeros((1, self.sizes[i+1]))

            if display_loss:
                loss = {}

            for e in tqdm_notebook(range(epochs), total=epochs, unit="epoch"):
                dW = {}
                dB = {}
                for i in range(self.nh+1):
                    dW[i+1] = np.zeros((self.sizes[i], self.sizes[i+1]))
                    dB[i+1] = np.zeros((1, self.sizes[i+1]))
                for x, y in zip(X, Y):
                    self.grad(x, y)
                    for i in range(self.nh+1):
                        dW[i+1] += self.dW[i+1]
                        dB[i+1] += self.dB[i+1]

                # note: X.shape[1] is the number of features, not the number of samples
                m = X.shape[1]
                for i in range(self.nh+1):
                    self.W[i+1] -= learning_rate * dW[i+1] / m
                    self.B[i+1] -= learning_rate * dB[i+1] / m

                if display_loss:
                    Y_pred = self.predict(X)
                    loss[e] = mean_squared_error(Y_pred, Y)

            if display_loss:
                plt.plot(loss.values())
                plt.xlabel('Epochs')
                plt.ylabel('Mean Squared Error')
                plt.show()

        def predict(self, X):
            Y_pred = []
            for x in X:
                y_pred = self.forward_pass(x)
                Y_pred.append(y_pred)
            return np.array(Y_pred).squeeze()

Function by function explanation,

    def __init__(self, n_inputs, hidden_sizes=[2]):
        #initialize the inputs
        self.nx = n_inputs
        self.ny = 1  #one final neuron for binary classification.
        self.nh = len(hidden_sizes)
        self.sizes = [self.nx] + hidden_sizes + [self.ny]
        ...

The __init__ function takes a few arguments:

n_inputs: Number of inputs going into the network.
hidden_sizes: Expects a list of integers, each representing the number of neurons present in a hidden layer.
    [2]: One hidden layer with 2 neurons.
    [2, 3]: Two hidden layers, with 2 neurons in the first layer and 3 neurons in the second layer.

In this function, we initialize two dictionaries, W and B, to store the randomly initialized weights and the zero-initialized biases for each layer in the network.

    def forward_pass(self, x):
        self.A = {}
        self.H = {}
        self.H[0] = x.reshape(1, -1)
        ...

In the forward_pass function, we have initialized two dictionaries, A and H. Instead of representing the inputs as X, I am representing them as H₀ so that we can save them in the post-activation dictionary H. Then, we loop through all the layers, compute the pre-activation and post-activation values, and store them in their respective dictionaries. The post-activation output of the final layer is the predicted value of our network. The function returns this value so that we can use it to calculate the loss of the network.

Remember that in the previous class, FirstFFNetwork, we hardcoded the computation of the pre-activation and post-activation for each neuron separately, but this is not the case in our generic class.

    def grad_sigmoid(self, x):
        return x*(1-x)

    def grad(self, x, y):
        self.forward_pass(x)
        ...

Next, we define two functions which help to compute the partial derivatives of the loss function with respect to the parameters.

    def fit(self, X, Y, epochs=1, learning_rate=1, initialise=True, display_loss=False):
        # initialise w, b
        if initialise:
            for i in range(self.nh+1):
                self.W[i+1] = np.random.randn(self.sizes[i], self.sizes[i+1])
                self.B[i+1] = np.zeros((1, self.sizes[i+1]))

Then, we define our fit function, which is essentially the same as before, except that here we loop through each of the inputs and update the weights and biases in a generalized fashion rather than updating each individual parameter.

    def predict(self, X):
        #predicting the results on unseen data
        ...

Finally, we have the predict function, which takes a large set of values as inputs and computes the predicted value for each input by calling the forward_pass function on it.

Train Generic Class for Feedforward Neural Network

We will now train the generic feedforward network which we created on our data. First, we instantiate the FFSNNetwork class and then call the fit method on the training data with 1000 epochs and the learning rate set to 0.001.

    #train the network with two hidden layers - 2 neurons and 3 neurons
    ffsnn = FFSNNetwork(2, [2, 3])
    ffsnn.fit(X_train, Y_train, epochs=1000, learning_rate=.001, display_loss=True)

    Y_pred_train = ffsnn.predict(X_train)
    Y_pred_binarised_train = (Y_pred_train >= 0.5).astype("int").ravel()
    Y_pred_val = ffsnn.predict(X_val)
    Y_pred_binarised_val = (Y_pred_val >= 0.5).astype("int").ravel()
    accuracy_train = accuracy_score(Y_pred_binarised_train, Y_train)
    accuracy_val = accuracy_score(Y_pred_binarised_val, Y_val)

    print("Training accuracy", round(accuracy_train, 2))
    print("Validation accuracy", round(accuracy_val, 2))

    #visualize the results
    plt.scatter(X_train[:, 0], X_train[:, 1], c=Y_pred_binarised_train, cmap=my_cmap, s=15*(np.abs(Y_pred_binarised_train-Y_train)+.2))
    plt.show()

The variation of the loss of the neural network on the training data is shown in the loss plot produced by fit. From the plot, we see that the loss falls a bit more slowly than for the previous network, because in this case we have two hidden layers with 2 and 3 neurons respectively. Because it is a larger network with more parameters, the learning algorithm takes more time to learn all the parameters and propagate the loss through the network.

Again we will use the same 4D plot (the plt.scatter call at the end of the code above) to visualize the predictions of our generic network. Remember that small points indicate observations that are correctly classified and large points indicate observations that are misclassified.

You can play with the number of epochs and the learning rate and see if you can push the error lower than the current value. Also, you can create a much deeper network with many neurons in each layer and see how that network performs.
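Since the grad method of the generic class is treated as a black box in this post, a numerical gradient check is one way to gain some confidence in it. The sketch below is a standalone re-implementation, not the class itself: it reuses the same backward recurrence and compares it with central finite differences. It assumes binary cross-entropy as the loss being differentiated (my assumption, based on the output-layer term H[L] - y), and the function names forward, bce and backprop are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, B):
    # Post-activations H[0..L] of an all-sigmoid network; H[0] is the input row
    H = {0: x.reshape(1, -1)}
    for k in range(1, len(W) + 1):
        H[k] = sigmoid(H[k - 1] @ W[k] + B[k])
    return H

def bce(x, y, W, B):
    # Assumed loss: binary cross-entropy on the single sigmoid output
    p = forward(x, W, B)[len(W)][0, 0]
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def backprop(x, y, W, B):
    # Same recurrence as the grad method of the generic class
    L = len(W)
    H = forward(x, W, B)
    dW, dB = {}, {}
    dA = H[L] - y
    for k in range(L, 0, -1):
        dW[k] = H[k - 1].T @ dA
        dB[k] = dA
        dH = dA @ W[k].T
        dA = dH * H[k - 1] * (1 - H[k - 1])
    return dW, dB

# Small random network: 2 inputs, hidden layers of 2 and 3 neurons, 1 output
rng = np.random.default_rng(0)
sizes = [2, 2, 3, 1]
W = {k: rng.standard_normal((sizes[k - 1], sizes[k])) for k in range(1, len(sizes))}
B = {k: np.zeros((1, sizes[k])) for k in range(1, len(sizes))}
x, y = np.array([0.5, -1.2]), 1  # arbitrary sample, not from the dataset

dW, dB = backprop(x, y, W, B)

# Compare every weight gradient with a central finite difference
eps = 1e-6
max_diff = 0.0
for k in W:
    for idx in np.ndindex(W[k].shape):
        old = W[k][idx]
        W[k][idx] = old + eps
        up = bce(x, y, W, B)
        W[k][idx] = old - eps
        down = bce(x, y, W, B)
        W[k][idx] = old
        max_diff = max(max_diff, abs((up - down) / (2 * eps) - dW[k][idx]))
print("largest analytic/numeric gap:", max_diff)
```

Only the weight gradients are checked here; the biases could be perturbed in the same way. A tiny gap suggests the recurrence matches the assumed cross-entropy loss, even though the class reports mean squared error when plotting.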
Generic FF Class for Multi-Class Classification

In this section, we will extend the generic class written in the previous section to support multi-class classification. Before we proceed to build our generic class, we need to do some data preprocessing.

Remember that initially we generated the data with 4 classes and then converted that multi-class data to binary class data. In this section, we will use the original data to train our multi-class neural network.

    #remember that we have labels_orig with four classes.
    #split that data into train and val
    X_train, X_val, Y_train, Y_val = train_test_split(data, labels_orig, stratify=labels_orig, random_state=0)
    print(X_train.shape, X_val.shape, labels_orig.shape)

    #one hot encoder
    enc = OneHotEncoder()
    # 0 -> (1, 0, 0, 0), 1 -> (0, 1, 0, 0), 2 -> (0, 0, 1, 0), 3 -> (0, 0, 0, 1)
    y_OH_train = enc.fit_transform(np.expand_dims(Y_train, 1)).toarray()
    # reuse the encoder fitted on the training labels for the validation labels
    y_OH_val = enc.transform(np.expand_dims(Y_val, 1)).toarray()
    print(y_OH_train.shape, y_OH_val.shape)

Here we have 4 different classes, so we encode each label so that the machine can understand it and do computations on top of it. To encode the labels, we use sklearn.preprocessing.OneHotEncoder on the training and validation labels.

We will write our generic feedforward network for multi-class classification in a class called FFSN_MultiClass.
    class FFSN_MultiClass:

        def __init__(self, n_inputs, n_outputs, hidden_sizes=[3]):
            self.nx = n_inputs
            self.ny = n_outputs
            self.nh = len(hidden_sizes)
            self.sizes = [self.nx] + hidden_sizes + [self.ny]

            # weights and biases, keyed by layer index 1..nh+1
            self.W = {}
            self.B = {}
            for i in range(self.nh+1):
                self.W[i+1] = np.random.randn(self.sizes[i], self.sizes[i+1])
                self.B[i+1] = np.zeros((1, self.sizes[i+1]))

        def sigmoid(self, x):
            return 1.0/(1.0 + np.exp(-x))

        def softmax(self, x):
            exps = np.exp(x)
            return exps / np.sum(exps)

        def forward_pass(self, x):
            self.A = {}
            self.H = {}
            self.H[0] = x.reshape(1, -1)
            # hidden layers use the sigmoid activation
            for i in range(self.nh):
                self.A[i+1] = np.matmul(self.H[i], self.W[i+1]) + self.B[i+1]
                self.H[i+1] = self.sigmoid(self.A[i+1])
            # output layer uses softmax
            self.A[self.nh+1] = np.matmul(self.H[self.nh], self.W[self.nh+1]) + self.B[self.nh+1]
            self.H[self.nh+1] = self.softmax(self.A[self.nh+1])
            return self.H[self.nh+1]

        def predict(self, X):
            Y_pred = []
            for x in X:
                y_pred = self.forward_pass(x)
                Y_pred.append(y_pred)
            return np.array(Y_pred).squeeze()

        def grad_sigmoid(self, x):
            # x is already a sigmoid output, so the derivative is x*(1-x)
            return x*(1-x)

        def cross_entropy(self, label, pred):
            # keep only the probability predicted for the true class
            yl = np.multiply(pred, label)
            yl = yl[yl != 0]
            yl = -np.log(yl)
            yl = np.mean(yl)
            return yl

        def grad(self, x, y):
            self.forward_pass(x)
            self.dW = {}
            self.dB = {}
            self.dH = {}
            self.dA = {}
            L = self.nh + 1
            # gradient of cross-entropy w.r.t. the softmax pre-activation
            self.dA[L] = (self.H[L] - y)
            for k in range(L, 0, -1):
                self.dW[k] = np.matmul(self.H[k-1].T, self.dA[k])
                self.dB[k] = self.dA[k]
                self.dH[k-1] = np.matmul(self.dA[k], self.W[k].T)
                self.dA[k-1] = np.multiply(self.dH[k-1], self.grad_sigmoid(self.H[k-1]))

        def fit(self, X, Y, epochs=100, initialize=True, learning_rate=0.01, display_loss=False):

            if display_loss:
                loss = {}

            if initialize:
                for i in range(self.nh+1):
                    self.W[i+1] = np.random.randn(self.sizes[i], self.sizes[i+1])
                    self.B[i+1] = np.zeros((1, self.sizes[i+1]))

            for epoch in tqdm_notebook(range(epochs), total=epochs, unit="epoch"):
                dW = {}
                dB = {}
                for i in range(self.nh+1):
                    dW[i+1] = np.zeros((self.sizes[i], self.sizes[i+1]))
                    dB[i+1] = np.zeros((1, self.sizes[i+1]))
                # accumulate gradients over every training example
                for x, y in zip(X, Y):
                    self.grad(x, y)
                    for i in range(self.nh+1):
                        dW[i+1] += self.dW[i+1]
                        dB[i+1] += self.dB[i+1]

                # note: X.shape[1] is the number of input features;
                # X.shape[0] would average the gradients over the samples instead
                m = X.shape[1]
                for i in range(self.nh+1):
                    self.W[i+1] -= learning_rate * (dW[i+1]/m)
                    self.B[i+1] -= learning_rate * (dB[i+1]/m)

                if display_loss:
                    Y_pred = self.predict(X)
                    loss[epoch] = self.cross_entropy(Y, Y_pred)

            if display_loss:
                plt.plot(loss.values())
                plt.xlabel('Epochs')
                plt.ylabel('CE')
                plt.show()

I will now explain the changes made to our previous class, FFSNetwork, to make it work for multi-class classification.

First, we have our forward_pass function.

    def forward_pass(self, x):
        self.A = {}
        self.H = {}
        self.H[0] = x.reshape(1, -1)
        for i in range(self.nh):
            self.A[i+1] = np.matmul(self.H[i], self.W[i+1]) + self.B[i+1]
            self.H[i+1] = self.sigmoid(self.A[i+1])
        self.A[self.nh+1] = np.matmul(self.H[self.nh], self.W[self.nh+1]) + self.B[self.nh+1]
        self.H[self.nh+1] = self.softmax(self.A[self.nh+1])
        return self.H[self.nh+1]

Since the network has a multi-class output, we use the softmax activation instead of the sigmoid activation at the output layer. The last two assignments in forward_pass compute the output-layer pre-activation and pass it through softmax, so the network returns one probability per class.

Next, we have our loss function.

    def cross_entropy(self, label, pred):
        yl = np.multiply(pred, label)
        yl = yl[yl != 0]
        yl = -np.log(yl)
        yl = np.mean(yl)
        return yl

In this case, instead of the mean squared error, we use the cross-entropy loss. Cross-entropy measures the difference between the predicted probability distribution and the actual (one-hot) distribution: multiplying the predictions by the one-hot label keeps only the probability assigned to the true class, and the loss is the mean of the negative log of those probabilities.

Train Generic Class for Multi-Class Classification

We will now train the generic multi-class feedforward network we created on our data.
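Before training, it can help to check the softmax and cross-entropy computations on a single example. The sketch below is not part of the FFSN_MultiClass class: the function names stable_softmax and cross_entropy, the example logits, and the finite-difference check are all assumptions made for illustration. It uses a max-subtracted softmax, which gives the same result as the plain version in the class but avoids overflow for large inputs, and it checks numerically that the gradient of the loss with respect to the output pre-activation is the predicted probabilities minus the one-hot label, which is the expression the grad method uses for dA[L].

```python
import numpy as np

def stable_softmax(a):
    """Softmax over the last axis; subtracting the max avoids overflow in exp."""
    shifted = a - np.max(a, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

def cross_entropy(one_hot, probs):
    """Mean of -log(probability assigned to the true class)."""
    true_class_probs = np.sum(one_hot * probs, axis=-1)
    return float(np.mean(-np.log(true_class_probs)))

# Hypothetical example: one sample, 4 output pre-activations, true class 2.
logits = np.array([[1.0, 2.0, 3.0, 0.5]])
label = np.array([[0.0, 0.0, 1.0, 0.0]])

probs = stable_softmax(logits)
loss = cross_entropy(label, probs)

# Central finite differences: perturb each logit and measure the change in loss.
eps = 1e-6
numeric_grad = np.zeros_like(logits)
for j in range(logits.shape[1]):
    up = logits.copy()
    down = logits.copy()
    up[0, j] += eps
    down[0, j] -= eps
    loss_up = cross_entropy(label, stable_softmax(up))
    loss_down = cross_entropy(label, stable_softmax(down))
    numeric_grad[0, j] = (loss_up - loss_down) / (2 * eps)

# Analytic gradient used in backpropagation: probabilities minus one-hot label.
analytic_grad = probs - label

print("probabilities:", probs)
print("loss:", loss)
print("gradients match:", np.allclose(numeric_grad, analytic_grad, atol=1e-5))
```

If the two gradients agree, the simple form of dA[L] in the grad method is consistent with the softmax and cross-entropy pair used in the forward pass.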
First, we instantiate the FFSN_MultiClass class and then call the fit method on the training data with 2000 epochs and the learning rate set to 0.005. Remember that our data has two inputs and 4 encoded labels.

    #train the network
    ffsn_multi = FFSN_MultiClass(2, 4, [2, 3])
    ffsn_multi.fit(X_train, y_OH_train, epochs=2000, learning_rate=.005, display_loss=True)

    #take the class with the highest predicted probability
    Y_pred_train = ffsn_multi.predict(X_train)
    Y_pred_train = np.argmax(Y_pred_train, 1)

    Y_pred_val = ffsn_multi.predict(X_val)
    Y_pred_val = np.argmax(Y_pred_val, 1)

    accuracy_train = accuracy_score(Y_pred_train, Y_train)
    accuracy_val = accuracy_score(Y_pred_val, Y_val)

    print("Training accuracy", round(accuracy_train, 2))
    print("Validation accuracy", round(accuracy_val, 2))

    #visualize: misclassified points are drawn larger
    plt.scatter(X_train[:, 0], X_train[:, 1], c=Y_pred_train, cmap=my_cmap, s=15*(np.abs(np.sign(Y_pred_train-Y_train))+.1))
    plt.show()

[Figure: variation of the cross-entropy loss over the training epochs]

Again, we use the same 4D plot to visualize the predictions of our generic network. To plot the graph, we need a single predicted label for each point, so I applied the argmax function to the network output to pick the label with the highest probability. Using that label, we can draw the 4D plot and compare it with the scatter plot of the actual input data.

[Figure: Original Labels (Left) & Predicted Labels (Right)]

There you have it: we have successfully built our generic neural network for multi-class classification from scratch.

Photo by Vasily Koloda on Unsplash

What's Next?

LEARN BY CODING

In this article, we used the make_blobs function to generate toy data, and we saw that make_blobs generates (almost) linearly separable data. If you want to generate more complex, non-linearly separable data to train your feedforward neural network, you can use the make_moons function from the sklearn package.
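As a minimal sketch of that suggestion, the snippet below generates a two-moons dataset and splits it the same way as the blob data earlier in the post. The sample count, the noise level, and the random_state values are assumptions chosen for illustration, not settings from the article.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Two interleaving half circles; noise is the standard deviation of the
# Gaussian noise added to the points (0.2 is an illustrative choice).
data, labels = make_moons(n_samples=1000, noise=0.2, random_state=0)

# Stratified split so both classes appear in the same proportion in each part.
X_train, X_val, Y_train, Y_val = train_test_split(
    data, labels, stratify=labels, random_state=0
)

print(data.shape, labels.shape)
print(X_train.shape, X_val.shape)
print("classes:", np.unique(labels))
```

Because make_moons produces binary labels, the resulting arrays can be used with a binary classifier directly, or one-hot encoded with two output neurons for the multi-class network.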
[Figure: Make Moons Function Data]

The make_moons function generates two interleaving half circles, which gives you data that is not linearly separable. You can also add some Gaussian noise to the data to make it harder for the neural network to arrive at a good non-linear decision boundary.

Using our generic neural network class, you can create a much deeper network with more neurons in each layer (and a different number of neurons in each layer), and play with the learning rate and the number of epochs to see which settings let the network arrive at the best possible decision boundary.

The entire code discussed in the article is available in this GitHub repository. Feel free to fork it or download it.

Niranjankumar-c/Feedforward_NeuralNetworrk

Conclusion

In this post, we built a simple neural network from scratch and saw that it performs well on data that our sigmoid neuron couldn't handle because it is not linearly separable. Then we saw how to write a generic class that can take 'n' inputs and 'L' hidden layers (with many neurons in each layer) for binary classification, using mean squared error as the loss function. After that, we extended our generic class to handle multi-class classification using softmax at the output and cross-entropy as the loss function, and saw that it performs reasonably well.

Continue Learning

If you are interested in learning more about artificial neural networks, check out the Artificial Neural Networks course by Abhishek and Pukhraj from Starttechacademy. The course is taught in the latest version of Tensorflow 2.0 (Keras backend). They also have a very good bundle on machine learning (Basics + Advanced) in both Python and R.

Recommended Reading

Deep Learning: Feedforward Neural Networks Explained

In my next post, I will explain backpropagation in detail along with some math. So make sure you follow me on Medium to get notified as soon as it drops.
Until then, Peace :)

NK.

Disclaimer: There might be some affiliate links in this post to relevant resources. You can purchase the bundle at the lowest price possible. I will receive a small commission if you purchase the course.