After completing the deeplearning.ai Deep Learning specialization taught by Andrew Ng, I decided to work through some of its assignments again and write the code myself instead of only filling in the missing parts. By doing so, I want to deepen my understanding of neural networks and help others gain intuition by documenting my progress in articles. The complete notebook is available here.
In this article, I’m going to build a neural network in Python using only NumPy, based on the project structure proposed in the deeplearning.ai Deep Learning specialization:
1. Define the structure of the neural network
2. Initialize the parameters of the neural network defined in step one
3. Loop through the following steps for a pre-specified number of iterations: forward propagation, loss computation, backpropagation, and parameter update
First, I have to define a function that computes the structure of the neural network I want to build. In my case, I will restrict the function to only being able to define a neural network with one hidden layer. The result will look something like this:
Translating this visualization into Python results in the following function:
This function takes the number of rows of X (X.shape[0]) as the size of the input layer and does the same with the response array for the output layer. The size of the hidden layer can be set manually using the hidden_size parameter of this function.
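A minimal sketch of such a function, assuming that X and Y store one example per column. The function name network_structure and the argument name Y are illustrative choices, not necessarily the ones used in the notebook.

```python
import numpy as np


def network_structure(X, Y, hidden_size):
    """Return the layer sizes of a network with one hidden layer.

    Assumes X has shape (n_features, n_samples) and Y has shape
    (n_outputs, n_samples), i.e. one example per column.
    """
    input_size = X.shape[0]   # rows of X = number of input features
    output_size = Y.shape[0]  # rows of Y = number of output units
    return input_size, hidden_size, output_size


# Example: 2 features, 5 samples, 1 output unit, 4 hidden units
X = np.zeros((2, 5))
Y = np.zeros((1, 5))
print(network_structure(X, Y, hidden_size=4))
```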
The next step consists of initializing our parameters. To do so, I am going to use NumPy’s random.randn() function, which draws samples from a standard normal distribution (mean zero, standard deviation one). It is important to initialize the parameters randomly so that the hidden units do not all compute the same function. Multiplying the randomly generated numbers by 0.001 keeps the initial weights small, so the inputs to the tanh activation of the hidden layer stay close to zero, where its slope is steepest. Large initial weights would push tanh into its flat regions and slow down gradient descent.
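The initialization described above could be sketched as follows. The function name, the dictionary keys W1, b1, W2, b2 and the zero-initialized biases are assumptions on my part. Zero biases are a common choice because the random weights already break the symmetry between hidden units.

```python
import numpy as np


def initialize_parameters(input_size, hidden_size, output_size, scale=0.001):
    """Small random weights and zero biases for a one-hidden-layer network."""
    # Weights: standard normal samples, scaled down to stay near zero
    W1 = np.random.randn(hidden_size, input_size) * scale
    W2 = np.random.randn(output_size, hidden_size) * scale
    # Biases: zeros (assumption, see the note above)
    b1 = np.zeros((hidden_size, 1))
    b2 = np.zeros((output_size, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}


parameters = initialize_parameters(input_size=2, hidden_size=4, output_size=1)
for name, value in parameters.items():
    print(name, value.shape)
```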
Before defining functions for forward and backpropagation, I’m going to define the activation functions for both layers. Assuming a binary classification problem, I’m going to use a tanh activation function for the hidden layer and a sigmoid function for the output layer that would be able to classify the output into binary labels given a cutoff probability.
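As a rough sketch, the two activation functions can be written in a few lines of NumPy. The cutoff of 0.5 in the usage lines is only an example value.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def tanh(z):
    """Hyperbolic tangent: maps any real number into (-1, 1)."""
    return np.tanh(z)


# Turning sigmoid outputs into binary labels with an example cutoff of 0.5
z = np.array([-2.0, 0.0, 3.0])
probabilities = sigmoid(z)
labels = (probabilities > 0.5).astype(int)
print(probabilities, labels)
```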
Forward propagation consists of two parts for each layer: computing a linear output and then transforming it with an activation function. Essentially, the process is very similar to that of a logistic regression, except that I am only going to use the sigmoid function on the output layer and the tanh function in the hidden layer. The tanh function usually works better than the sigmoid function in hidden layers because of its functional form. Both functions saturate, so their derivatives approach zero for inputs of large magnitude, but the outputs of the sigmoid function lie between zero and one and are therefore never centered around zero. The outputs of the tanh function, on the other hand, lie between minus one and one and are centered around zero, which centers the input for the next layer. To better understand this, let’s visually compare the two activation functions:
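The comparison can also be made numerically. The sketch below evaluates both functions and their derivatives on a grid of inputs that is symmetric around zero. The grid itself is an arbitrary choice for illustration.

```python
import numpy as np

z = np.linspace(-4, 4, 9)  # inputs symmetric around zero

sigmoid_out = 1.0 / (1.0 + np.exp(-z))
tanh_out = np.tanh(z)

# Derivatives expressed through the function values themselves
sigmoid_grad = sigmoid_out * (1.0 - sigmoid_out)
tanh_grad = 1.0 - tanh_out ** 2

print("mean sigmoid output:", sigmoid_out.mean())
print("mean tanh output:   ", tanh_out.mean())
print("max sigmoid slope:  ", sigmoid_grad.max())
print("max tanh slope:     ", tanh_grad.max())
```

For symmetric inputs, the sigmoid outputs average 0.5 while the tanh outputs average zero. The steepest slope of tanh (one, at zero) is also four times that of the sigmoid (0.25, at zero).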
Let’s see what that looks like in Python:
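A minimal sketch of forward propagation for this architecture, assuming the parameters are stored in a dictionary with the keys W1, b1, W2 and b2 and that X holds one example per column. The function name and the cache dictionary are illustrative.

```python
import numpy as np


def forward_propagation(X, parameters):
    """Compute the network output A2 and keep the activations for backpropagation."""
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]

    # Hidden layer: linear step followed by tanh
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)

    # Output layer: linear step followed by sigmoid
    Z2 = W2 @ A1 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))

    cache = {"A1": A1, "A2": A2}
    return A2, cache
```

With all parameters set to zero, every prediction is exactly 0.5, which is a quick sanity check for the implementation.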
After forward propagation, I’m going to compute the loss. All this does is calculate how far off the predictions are from the actual response variable.
CCE stands for categorical cross-entropy, where N is the number of samples and j is either zero or one depending on the class of the response variable.
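One common way to write this loss for two classes is given below. The notation is an assumption on my part: y_{ij} is one if sample i belongs to class j and zero otherwise, \hat{y}_{ij} is the predicted probability of class j, and a_i is the sigmoid output for sample i, so that \hat{y}_{i1} = a_i and \hat{y}_{i0} = 1 - a_i.

```latex
\begin{align*}
\mathrm{CCE}
  &= -\frac{1}{N} \sum_{i=1}^{N} \sum_{j \in \{0, 1\}} y_{ij} \log \hat{y}_{ij} \\
  &= -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log a_i + (1 - y_i) \log (1 - a_i) \right]
\end{align*}
```

The second line is the familiar binary cross-entropy, where y_i is the zero/one label of sample i.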
In Python, the loss function looks as follows:
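A minimal sketch of the loss, assuming A2 holds the sigmoid outputs and Y the zero/one labels, both with shape (1, N). The function name and the small clipping constant eps, which guards against taking the logarithm of zero, are my own additions.

```python
import numpy as np


def compute_loss(A2, Y, eps=1e-12):
    """Cross-entropy loss averaged over the N samples (columns)."""
    N = Y.shape[1]
    # Keep predictions strictly inside (0, 1) so that log() stays finite
    A2 = np.clip(A2, eps, 1.0 - eps)
    # The matrix products sum over the samples and give a (1, 1) array
    loss = -(Y @ np.log(A2).T + (1.0 - Y) @ np.log(1.0 - A2).T) / N
    return np.squeeze(loss)


# Predicting 0.5 for every sample gives a loss of log(2)
print(compute_loss(np.full((1, 4), 0.5), np.array([[0.0, 1.0, 1.0, 0.0]])))
```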
In the function above, np.squeeze() removes the size-one dimensions of the result, so the loss is returned as a scalar (a zero-dimensional NumPy array) rather than a nested array.
Next, I’m going to use the outputs computed during forward propagation to calculate the derivatives of the parameters. Following backpropagation, these derivatives are then going to be used to update the parameters in order to reduce the loss in the next iteration.
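The derivatives for this architecture (tanh hidden layer, sigmoid output, cross-entropy loss) could be computed as in the sketch below. It assumes the dictionary layout with keys W1, b1, W2, b2 for the parameters and A1, A2 for the cached activations. These names are illustrative.

```python
import numpy as np


def backward_propagation(X, Y, parameters, cache):
    """Gradients of the averaged cross-entropy loss with respect to all parameters."""
    N = X.shape[1]
    W2 = parameters["W2"]
    A1, A2 = cache["A1"], cache["A2"]

    # Output layer: sigmoid combined with cross-entropy simplifies to A2 - Y
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / N
    db2 = np.sum(dZ2, axis=1, keepdims=True) / N

    # Hidden layer: the derivative of tanh is 1 - tanh^2 = 1 - A1^2
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
    dW1 = dZ1 @ X.T / N
    db1 = np.sum(dZ1, axis=1, keepdims=True) / N

    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
```

Each gradient has the same shape as the parameter it belongs to, which makes the update step a simple element-wise operation.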
As mentioned above, the next and final step before putting all of the functions together is to update the parameters using the derivatives retrieved in the backpropagation in order to reduce the loss in the following iteration. Updating the parameters is very straightforward and always follows the same pattern:
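In standard notation, the gradient descent update can be written as follows. The symbols are my own choice: \theta stands for any of the four parameter arrays, \mathcal{L} for the loss and \alpha for the step size.

```latex
\theta := \theta - \alpha \, \frac{\partial \mathcal{L}}{\partial \theta},
\qquad \theta \in \{ W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]} \}
```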
Since I’ve already described gradient descent in an earlier article, I’m not going to explain it thoroughly here and will only provide a link to a good explanation of gradient descent. Alpha is the learning rate and is set manually in this function:
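A minimal sketch of the update step, assuming the gradients are stored under the parameter name prefixed with "d" (for example dW1 for W1). The function name and the default learning rate are placeholders.

```python
import numpy as np


def update_parameters(parameters, gradients, learning_rate=0.01):
    """Take one gradient descent step for every parameter."""
    return {
        name: value - learning_rate * gradients["d" + name]
        for name, value in parameters.items()
    }


parameters = {"W1": np.ones((2, 2)), "b1": np.zeros((2, 1))}
gradients = {"dW1": np.ones((2, 2)), "db1": np.ones((2, 1))}
print(update_parameters(parameters, gradients, learning_rate=0.1))
```

The function returns a new dictionary and leaves the input arrays unchanged, which makes it easier to compare parameters before and after a step.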
Now, after having separately defined all required functions, I can finally create a single function combining all of them that will represent the neural network. All that’s needed in this step is to sequentially add all previously defined functions to this final function as well as add a for-loop that will run over a pre-specified number of iterations.
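Putting the pieces together could look like the compact sketch below. To keep it self-contained, the individual steps are written inline instead of calling separate functions. The names train_network and predict, the default values and the toy data are assumptions for illustration and not the notebook's code.

```python
import numpy as np


def train_network(X, Y, hidden_size, iterations=1000, learning_rate=0.3,
                  scale=0.001, seed=0):
    """Train a one-hidden-layer network with batch gradient descent.

    Assumes X has shape (n_features, N) and Y has shape (1, N) with 0/1 labels.
    Returns the learned parameters and the loss of every iteration.
    """
    np.random.seed(seed)
    n_x, n_y, N = X.shape[0], Y.shape[0], X.shape[1]

    # Steps 1 and 2: layer sizes and small random initial weights
    W1 = np.random.randn(hidden_size, n_x) * scale
    b1 = np.zeros((hidden_size, 1))
    W2 = np.random.randn(n_y, hidden_size) * scale
    b2 = np.zeros((n_y, 1))

    losses = []
    for _ in range(iterations):
        # Forward propagation
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

        # Loss (predictions clipped so that log() stays finite)
        A2c = np.clip(A2, 1e-12, 1.0 - 1e-12)
        losses.append(float(-np.mean(Y * np.log(A2c) + (1 - Y) * np.log(1 - A2c))))

        # Backpropagation
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / N
        db2 = dZ2.sum(axis=1, keepdims=True) / N
        dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
        dW1 = dZ1 @ X.T / N
        db1 = dZ1.sum(axis=1, keepdims=True) / N

        # Parameter update
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2

    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}, losses


def predict(X, parameters, cutoff=0.5):
    """Binary labels from the trained network, given a cutoff probability."""
    A1 = np.tanh(parameters["W1"] @ X + parameters["b1"])
    A2 = 1.0 / (1.0 + np.exp(-(parameters["W2"] @ A1 + parameters["b2"])))
    return (A2 > cutoff).astype(int)


# Toy data: label is 1 when the two features sum to a positive number
rng = np.random.RandomState(1)
X = rng.randn(2, 200)
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)

parameters, losses = train_network(X, Y, hidden_size=4)
print("first loss:", losses[0], "last loss:", losses[-1])
print("training accuracy:", np.mean(predict(X, parameters) == Y))
```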
Although this particular network is not very powerful, it can still give you an edge in classification tasks by picking up on non-linear patterns that a regular logistic regression would not. Moreover, it provides a great opportunity to grasp the basic mechanics of neural networks without letting TensorFlow or Keras do all the work behind the scenes.
As always, if you have any questions or found mistakes, please do not hesitate to reach out to me. A link to the notebook with my code is provided at the beginning of this article.
 Andrew Ng, Deep Learning Specialization, Coursera