Zen of Stochastic Gradient Descent Principle

Written by samzer | Published 2019/08/12

TLDR: Stochastic Gradient Descent is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It is called stochastic because the method uses randomly selected (or shuffled) samples to evaluate the gradients. The key is to act on limited information: the result will not be perfect, but it will be close enough to minimise your cost function. Do some basic research, choose one option and start.

A world of Information

We are living in an age where almost any information is just a swipe away. There are innumerable blogs, videos, articles, papers and even podcasts on any topic you care to read about. If you want to live a healthier life, for example, the questions pile up quickly:
  • What is the best diet? - Keto vs Vegan
  • Which strength training to opt for? - Cross-fit vs Gym
  • Which type of meditation should I do?
What about Eastern philosophy vs Western philosophy? There are many schools of thought in each, and a lot of information about every one of them on the internet. Living healthy is just one example. The same applies to any other area, say, for instance, tech:
  • Django vs Rails
  • GitHub vs GitLab
  • React vs Angular vs Vue
  • Postgres vs MySQL vs Mongo
You probably get the gist of where I am going with this. There is so much information out there that it causes analysis paralysis, and we become mere consumers of information with nothing to act on.

A brief introduction to SGD

The definition of Stochastic Gradient Descent according to Wikipedia:
"Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It is called stochastic because the method uses randomly selected (or shuffled) samples to evaluate the gradients."
I know that is quite a mouthful if you don't have a mathematics or machine learning background, and that is okay. I'll do my best to explain the concept at a high level, and I'll also point to a 3Blue1Brown video that explains it in detail for those who want to dive deeper.
There is a cost function C (also called an objective function). The role of gradient descent is to minimise C by changing a variable w that C depends on.
Picture a red ball resting on the curve of C: its height is the value of C for a particular value of w. Gradient descent rolls the ball towards the bottom by changing the value of w, which minimises C.
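The rolling-ball picture can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the article: the cost function C(w) = (w - 3)^2, the starting point, the learning rate and the number of steps are all made up for the example.

```python
def cost(w):
    """Toy cost function C(w) = (w - 3)^2, which is lowest at w = 3 (an assumption)."""
    return (w - 3.0) ** 2


def gradient(w):
    """Slope of C at w: dC/dw = 2 * (w - 3)."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, learning_rate=0.1, steps=100):
    """Repeatedly nudge w against the slope so that C(w) keeps shrinking."""
    w = w_start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


# Start the "ball" far from the bottom and let it roll.
w_final = gradient_descent(w_start=-4.0)
print(w_final, cost(w_final))
```

Each pass through the loop is one small roll of the ball: the slope says which way is downhill, and the learning rate decides how far to move.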
If there are a million data points, then gradient descent has to go through all million of them to take a single small step towards the lowest point. And if there is more than one variable to change, then it's "a million multiplied by the number of variables" computations to take just a single step. Basically, it will take forever to minimise C because of this information overload.
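To make the counting concrete, here is one common way to write it down. It assumes (the article does not say this explicitly) that C is the average of a per-data-point cost C_i over N data points, and it uses a hypothetical step size η:

```latex
C(w) = \frac{1}{N} \sum_{i=1}^{N} C_i(w),
\qquad
w \leftarrow w - \eta \, \nabla C(w)
  = w - \frac{\eta}{N} \sum_{i=1}^{N} \nabla C_i(w)
```

With N = 1,000,000 and d variables in w, every single step sums N gradients of d entries each, which is where the "million multiplied by the number of variables" comes from.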
This is where the stochastic approach excels. Instead of using all million data points, you take a limited number, say 30 randomly selected data points, in order to take that step towards the bottom. You keep repeating this with 30 newly selected random data points each time until you reach the lowest value of C. It might not be perfect, but it will be close enough.
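Here is a minimal sketch of that idea using only Python's standard library. Everything specific in it is an assumption made for illustration: the data is synthetic (y is roughly 2 * x), the dataset has 100,000 points rather than a million so that it runs quickly, the model has a single variable w, and the learning rate and step count are arbitrary. Only the batch size of 30 comes from the text.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: y is roughly TRUE_W * x plus a little noise.
TRUE_W = 2.0
N_POINTS = 100_000  # smaller stand-in for the "million data points"
data = []
for _ in range(N_POINTS):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, TRUE_W * x + rng.gauss(0.0, 0.1)))


def batch_gradient(w, batch):
    """Gradient of the mean squared error of the model y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd(points, batch_size=30, learning_rate=0.1, steps=500):
    """Take each step using only a small random batch instead of all points."""
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(points, batch_size)  # 30 randomly selected points
        w -= learning_rate * batch_gradient(w, batch)
    return w


w_sgd = sgd(data)
points_used = 30 * 500  # data points looked at across the whole run
print(w_sgd, points_used)
```

The whole run looks at 15,000 data points in total, fewer than a single full gradient descent step over this dataset would need, and the estimate of w still lands close to the value used to generate the data. It is not exact, because every step is based on a noisy sample.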
"Perfection Is Overrated"
The basic concept of SGD is that you don't need to analyse the whole dataset at once to move forward. You only need a subset of the data to take a step, and you keep repeating the same process until you reach the intended destination, or at least get close to it.
If you want to dive deeper into how SGD works, the 3Blue1Brown video explains it really well.

Zen of SGD Principle

To act, limit the information in order to gain enough insight to take the first step.
Let's say you want to lose weight. The cost function you would like to minimise is the difference between your actual weight and your target weight. There might be other hidden variables in your cost function, such as your emotional state and the willpower you spend. You really don't want your emotions to go haywire, and willpower is generally limited unless you are someone like David Goggins. Taken as a whole, this is the cost function you need to minimise.
Next, consume just enough information to take the first step. This could be deciding which diet to start with. After doing some basic research, let's say you arrive at diet X and try it out. After a week, you can judge whether it's working based on the cost function you have defined for yourself. If it's working, you stick to it; otherwise you move on to the next diet that makes sense. The next step could be choosing which exercise to start with: running, cycling, strength training and so on. Do the basic research, choose one and start. The step after that could be to combine it with another form of exercise, or to optimise the existing one. Again, do the basic research, choose one and start.

Conclusion

I find this a good approach for avoiding analysis paralysis. You just have to define your cost function -> Analyse -> Choose direction -> Iterate. There is a lot of information available and it's easy to get overwhelmed. The key is to do something actionable. It might not be perfect, but it will eventually be close enough.
 

Written by samzer | Founder at Modelchimp. 9 years in data science.
Published by HackerNoon on 2019/08/12