Photo of Gwanghwamun from Pixabay

So yesterday I found the paper “Dilated Recurrent Neural Networks” from NIPS 2017 and implemented it here. But then something hit me: ResNet and Highway Networks are built in a way that allows a direct connection between the input data X and the transformed data X'.

Why can't we do the exact same thing for back propagation as well? Connect the gradients from later layers to earlier layers directly. I mean, if you are not using a framework to perform auto differentiation, why not pass the gradients from the latest layers to deeper layers and just see how that goes? In this post we'll do exactly that, and let's also go one step further and compare it with a model that applies Google Brain's Gradient Noise.

Since I got the inspiration after reading the Dilated RNN paper, I'll just call this Dilated Back Propagation. However, if anyone knows other papers where back propagation was performed in this fashion, please let me know in the comment section. Also, I will assume you have already read my blog post about implementing Dilated RNNs; if not, please click here.

Network Architecture (Feed Forward Direction)

Side View / Front View

Red Circle → Final Output of the Network, a (1*10) vector for the one-hot encoding of the predicted number
Brown Circle → Hidden State 0 for Layer 2
Lime Circle → Hidden States for Layer 2
Pink Circle → Hidden State 0 for Layer 1
Black Circle → Hidden States for Layer 1
Blue Numbers 1,2,3,4,5 → Input for each Time Stamp (Please TAKE NOTE of this, since I am going to use this knowledge to explain the Training / Test Data)
Pinkish? Arrow → Direction of the Feed Forward Operation

As seen above, the network architecture is exactly the same as in the previous post. However, there is one thing I changed, and that is the input data for each time stamp.

Training Data / Test Data

Pink Box → Input at Time Stamp 1 (Vectorized 14*14 Pixel Image)
Yellow Box → Input at Time Stamp 2 (Vectorized 14*14 Pixel Image)
Blue Box → Input at Time Stamp 3 (Vectorized 14*14 Pixel Image)
Purple Box → Input at Time Stamp 4 (Vectorized 14*14 Pixel Image)
Green Box → Input at Time Stamp 5 (Vectorized 14*14 Pixel Image)

Despite some images looking bigger than others, all of them are (14*14) pixel images, and each of them is made by applying a different kind of pooling operation to the original (28*28) pixel image. The pooling operations are described below.

Pink Box → Mean Pooling using the np.mean function
Yellow Box → Variance Pooling using the np.var function
Blue Box → Max Pooling using the np.max function
Purple Box → Standard Deviation Pooling using the np.std function
Green Box → Median Pooling using the np.median function

Below is the code for achieving this.
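Roughly, the preprocessing looks like the minimal NumPy sketch below. It assumes non-overlapping 2*2 blocks (the natural way to get a 14*14 image out of a 28*28 one), a random array standing in for a real MNIST sample, and a helper name block_pool that is only illustrative, not taken from the original code.

```
import numpy as np

def block_pool(image, pool_func, size=2):
    """Apply pool_func to each non-overlapping (size x size) block of a 2D image."""
    h, w = image.shape
    # Put every 2*2 block on its own pair of axes, then reduce over those axes.
    blocks = image.reshape(h // size, size, w // size, size)
    return pool_func(blocks, axis=(1, 3))

# Stand-in for one (28*28) MNIST image.
original = np.random.rand(28, 28)

# The five pooled (14*14) inputs, one per time stamp.
pooled_inputs = [
    block_pool(original, np.mean),    # Time Stamp 1 - Mean Pooling     (Pink Box)
    block_pool(original, np.var),     # Time Stamp 2 - Variance Pooling (Yellow Box)
    block_pool(original, np.max),     # Time Stamp 3 - Max Pooling      (Blue Box)
    block_pool(original, np.std),     # Time Stamp 4 - Std Pooling      (Purple Box)
    block_pool(original, np.median),  # Time Stamp 5 - Median Pooling   (Green Box)
]

# Vectorize each (14*14) image into a (1, 196) row vector for the network input.
pooled_inputs = [img.reshape(1, -1) for img in pooled_inputs]
print([x.shape for x in pooled_inputs])  # five (1, 196) vectors
```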
With that in mind, let's take a look at the other training data. Finally, the reason I did this is, simply put, because I wanted to.

Image of 0 / Image of 3

Case 1: Normal Back Propagation

Purple Arrow → Standard Direction of Gradient Flow

Above is normal (standard) back propagation, where we compute the gradient at each layer and pass it on to the next layer. The weights at each time stamp use it to perform their update, and the gradient flow continues on.

Case 2: Google Brain Gradient Noise

Purple Arrow → Standard Direction of Gradient Flow
Yellow Arrow → Added Gradient Noise

Again, the purple arrows represent the standard gradient flow; however, this time, before updating each weight, we are going to add some noise to the gradient. A sketch of how we can achieve this is shown after the case descriptions below.

Case 3: Dilated Back Propagation

Purple Arrow → Standard Direction of Gradient Flow
Black Arrows → Dilated Back Propagation, where we pass on some portion of the gradient to previous layers that are not directly connected

Now, here we introduce our new theory, in the hope of improving the accuracy of the model. There are two things to note here.

1. We only pass on some portion of the gradient to the previous layers. As seen above, we have a variable called the ‘decay proportion rate’, and we use it to decrease the amount of gradient that can be passed on to the previous layers as training goes on. As seen in the Green Box, since we multiply the gradients from future layers by the decay proportion rate, which follows an inverse time decay, the amount of dilated gradient flow decreases as training continues. (A sketch of this is shown after the case descriptions below.)

2. The dilated gradient flow skips every 2 layers. As seen above in the Red Box, the gradient at time stamp 5 ONLY goes to the gradient at time stamp 3. However, this architecture can be explored further to make the gradient flow much denser.

Case 4: Dilated Back Propagation + Google Brain Gradient Noise

Purple Arrow → Standard Direction of Gradient Flow
Black Arrows → Dilated Back Propagation, where we pass on some portion of the gradient to previous layers that are not directly connected
Yellow Arrow → Added Gradient Noise

Here we are not only adding gradient noise to each weight update, but also making the gradient flow better.
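The original post shows the gradient noise step as a code screenshot. Below is a minimal NumPy sketch of the annealed Gaussian gradient noise from Neelakantan et al. (2015); the defaults eta = 0.01 and gamma = 0.55 come from that paper, while the weight shapes and the helper name add_gradient_noise are only illustrative assumptions.

```
import numpy as np

def add_gradient_noise(grad, t, eta=0.01, gamma=0.55):
    """Add annealed Gaussian noise to a gradient (Neelakantan et al., 2015).

    The noise variance decays over training iterations t as eta / (1 + t)**gamma.
    """
    sigma = np.sqrt(eta / (1.0 + t) ** gamma)
    return grad + np.random.normal(0.0, sigma, size=grad.shape)

# Hypothetical usage inside a manual training loop.
learning_rate = 0.001
w1 = np.random.randn(196, 128)           # assumed weight shape, for illustration only
grad_w1 = np.random.randn(196, 128)      # stand-in for a gradient computed by hand
for iteration in range(3):
    noisy_grad = add_gradient_noise(grad_w1, iteration)
    w1 = w1 - learning_rate * noisy_grad  # standard SGD step with the noisy gradient
```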
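And here is a rough sketch of the dilated gradient flow itself, under the same caveat: the shapes, the initial proportion of 0.1, the decay constants, and the exact accumulation order are assumptions for illustration (the original implementation is shown as screenshots in the post). The decay proportion rate mirrors the formula of tf.train.inverse_time_decay.

```
import numpy as np

def inverse_time_decay(initial_rate, iteration, decay_rate=1.0, decay_step=100.0):
    """Same formula as tf.train.inverse_time_decay (non-staircase)."""
    return initial_rate / (1.0 + decay_rate * iteration / decay_step)

# Hypothetical gradients at time stamps 1..5, as computed by normal back propagation.
grads = {t: np.random.randn(128, 128) for t in range(1, 6)}

iteration = 10                                     # current training iteration
proportion = inverse_time_decay(0.1, iteration)    # the 'decay proportion rate'

# Dilated back propagation: each time stamp also receives a decayed portion of the
# gradient from the time stamp two steps ahead (5 -> 3, 4 -> 2, 3 -> 1).
# I assume the passed-on portion comes from the original gradients, not the
# already-augmented ones.
dilated_grads = dict(grads)
for t in range(5, 2, -1):
    dilated_grads[t - 2] = dilated_grads[t - 2] + proportion * grads[t]

# The weight update at each time stamp then uses dilated_grads[t] instead of grads[t].
```

Case 4 is then simply the combination of the two sketches: apply add_gradient_noise() to each entry of dilated_grads before performing the weight update.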
Training and Results (Google Colab, Local Setting)

Above are the results when running the code on Google Colab. The accuracy bar represents the model's correct guesses on 100 test images. Unfortunately, I forgot to print out the exact accuracy rate, but we can see that Case 2 (Google Brain Gradient Noise) had the highest accuracy. Also, the cases with non-standard back propagation performed better than standard back propagation. In the cost-over-time plot, we can see that standard back propagation had the highest cost.

Above are the results when running the code on my local laptop. The accuracy bar again represents the model's correct guesses on 100 test images. Here it was interesting to see Case 3 (Dilated Back Propagation) underperforming compared to standard back propagation. However, the combination of Dilated Back Propagation and Google Brain's Gradient Noise outperformed every other model.

Interactive Code

I moved to Google Colab for interactive code! You will need a Google account to view the code, and since you can't run read-only scripts in Google Colab, please make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy coding! Please click here to access the interactive code.

Final Words

I love frameworks such as TensorFlow and Keras. However, I strongly believe we need to explore more different ways to perform back propagation.

If any errors are found, please email me at jae.duk.seo@gmail.com. If you wish to see the list of all of my writing, please view my website here. Meanwhile, follow me on my twitter here, and visit my website or my Youtube channel for more content. I also did a comparison of Decoupled Neural Networks here if you are interested.

Reference

Chang, S., Zhang, Y., Han, W., Yu, M., Guo, X., Tan, W., … & Huang, T. S. (2017). Dilated recurrent neural networks. In Advances in Neural Information Processing Systems (pp. 76–86).

Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., & Martens, J. (2015). Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807.

Seo, J. D. (2018, February 14). Only Numpy: NIPS 2017 — Implementing Dilated Recurrent Neural Networks with Interactive Code. Retrieved February 15, 2018, from https://towardsdatascience.com/only-numpy-nips-2017-implementing-dilated-recurrent-neural-networks-with-interactive-code-e83abe8c9b27

Index. (n.d.). Retrieved February 15, 2018, from https://docs.scipy.org/doc/numpy/genindex.html

“tf.train.inverse_time_decay | TensorFlow”, TensorFlow, 2018. [Online]. Available: https://www.tensorflow.org/api_docs/python/tf/train/inverse_time_decay. [Accessed: 16-Feb-2018].