How machine learning and optimization theory can change your perspective
Recently, I’ve been learning about Gradient Descent. It’s a really beautiful algorithm that helps find the minimum of a function, and it forms the foundation of how we train intelligent systems today. It’s based on a very simple idea: rather than figuring out the best solution to a problem immediately, it guesses an initial solution and steps it in a direction that’s closer to a better solution. The algorithm repeats this procedure until the solution is good enough.
My theory is that this is the way we humans work. We all follow some form of gradient descent, both passively and actively, when living our lives. Understanding this framework of gradient descent explicitly can help with improving the setup of your goals, speeding up progress on those goals, understanding other people, and seeing the bigger picture. The objective of this post is twofold: 1) to view life through the lens of gradient descent and 2) to view gradient descent through the lens of life. Throughout the post, I will provide examples that use this terminology precisely. Let’s begin by understanding gradient descent.
For the mathematically inclined
If you want to minimize the function x² (whose derivative is 2x), and your guess for the solution is 3, then you can take a baby step (a step size of 0.1) in the direction opposite the gradient. The gradient at x=3 is 6, so you move by 0.1 × (-6) = -0.6. The next guess is 2.4, the next one 1.92, the next about 1.54, and so on, getting ever closer to zero. This is drastically simplifying the use of gradient descent because we can solve for the solution directly in this case, but bear with me for now.
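For those who prefer code, here is a minimal sketch of that same x² walk in Python. The function name `gradient_descent`, the step count of 50, and the variable names are my own choices for illustration, not part of any library.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient, recording every guess."""
    guesses = [x0]
    x = x0
    for _ in range(steps):
        # Move a small amount in the direction opposite the gradient.
        x = x - learning_rate * grad(x)
        guesses.append(x)
    return guesses

# f(x) = x**2 has derivative 2x; start from the guess x = 3.
path = gradient_descent(lambda x: 2 * x, x0=3.0)
print([round(x, 3) for x in path[:4]])  # the first few guesses
print(path[-1])                         # very close to zero, but never exactly zero
```

Each step shrinks the guess by a constant factor of 0.8, which is why the marble slows down as it nears the bottom of the bowl.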
For the visually inclined
Imagine a marble at the rim of a large bowl. This is the first guess of a solution. It starts far from the bottom center of the bowl, but it eventually gets there by rolling there, bit by bit. Instead of teleporting instantaneously from the rim to the bottom, it takes a gradual path, following the path of least resistance.
How standard supervised machine learning works
In machine learning, this method is used to iterate a solution towards the minimal cost using labeled data. For example, if we want the classifier to predict whether an image has a cat or not, we’d provide it with a labeled set of images that do and don’t have cats. To carry out one iteration of gradient descent, we use the classifier in its current state to classify all the images, use the predictions and the labels to compute the cost, and then compute the gradient to figure out how to shift the classifier weights to predict more images correctly. You can view this process as figuring out where the classifier is on the hill and rolling it a bit in the right direction.
In life’s gradient descent, we are these marbles in a multitude of different hills. These hills vary across different emotions and measures, such as happiness, wealth, love, status, craft, etc. Let’s call these functions. In each of these functions, we are sitting somewhere on a hillside trying to get to the valley or the peak (let’s call this the optimum). As long as we have a measurable cost/reward function, and some relevant data that we can supervise, we have a good basis for practicing gradient descent.
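To make that loop concrete, here is a rough sketch under some loud assumptions: real image classifiers are far bigger, so I am standing in a tiny logistic-regression classifier with a single made-up "cat-likeness" feature per image. The data, the learning rate of 0.3, and all function names are hypothetical.

```python
import math

# Hypothetical toy data: one invented feature per image; label 1 means "has a cat".
features = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 1, 1]

def predict(w, b, x):
    """Probability that the image with feature x contains a cat."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def cost(w, b):
    """Average cross-entropy between the predictions and the labels."""
    total = 0.0
    for x, y in zip(features, labels):
        p = predict(w, b, x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(features)

def gradient(w, b):
    """Gradient of the cost with respect to the weight w and the bias b."""
    n = len(features)
    dw = sum((predict(w, b, x) - y) * x for x, y in zip(features, labels)) / n
    db = sum(predict(w, b, x) - y for x, y in zip(features, labels)) / n
    return dw, db

w, b = 0.0, 0.0                      # the initial guess
initial_cost = cost(w, b)
for _ in range(2000):                # each pass looks at ALL the labeled images
    dw, db = gradient(w, b)
    w -= 0.3 * dw                    # roll the classifier a bit downhill
    b -= 0.3 * db
final_cost = cost(w, b)
predictions = [1 if predict(w, b, x) > 0.5 else 0 for x in features]
print(initial_cost, final_cost, predictions)
```

Every iteration follows the same three beats described above: predict, measure the cost, then nudge the weights against the gradient.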
Leo wants to be a better striker in soccer.
Leo’s reward function is determined by things like the number of goals he scores, the minutes he plays, the teams he scores them against, and the trophies he wins. Currently he’s scoring about 10 to 15 goals a season, but he thinks he could hit 25 to 30 with a solid improvement. He drills for 2 hours longer a day than everyone else on the team and takes penalties for an hour every Sunday. This brings his goal count to 18 the next year. After looking at his goal count, he sees that almost none of his goals are coming from headers on crosses. This sets the direction for his progress. He does all sorts of calf raises and box jump workouts. His aggression in the box increases. All these changes culminate in a 22-goal season.
In parallel, Leo also has a cost function he’s trying to minimize: injuries. Any injury would be a gigantic setback for any of his reward functions. On a daily basis with his trainer, he analyzes his body for any pains, stresses, and inflexibilities, and assuages them. At any point in time, Leo is evaluating a set of different fitness functions and meanwhile assessing the fitness functions themselves. Maybe his objective of being a better striker should change to being a better attacking midfielder. That way he can maximize goals for his team.
The key ingredients to Leo’s gradient descent algorithm are functions that can be measured (number of goals in a season or days out on injury), data that contributes towards or detracts from the function (headers, key passes, strains, fractures, etc.), and baby steps in his routines after having analyzed the data.
Batch vs Stochastic Gradient Descent
Let’s jump back to machine learning for a sec. In my image classification example, we computed the predictions for all of the images and used the results of all of those to iterate our solution. This is actually a specific variant of gradient descent called batch gradient descent. We could just as well take a single image (randomly picked), predict, compute the cost, and iterate our classifier in the right direction. This is called stochastic gradient descent; it basically says that we might step in the wrong direction once in a while, but on average, we’ll step in the right direction. Stochastic gradient descent is cheap to compute, but can be noisy. How does this translate to life?
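Here is a small sketch of the difference, assuming a deliberately simple cost: the average squared distance between a guess x and a handful of made-up samples, which is minimized at their mean. The data, the learning rate of 0.01, and the function names are all hypothetical.

```python
import random

# Hypothetical samples; their mean (4.0) is the minimizer of the cost.
data = [1.0, 2.0, 3.0, 4.0, 10.0]

def batch_step(x, learning_rate):
    """One step using the gradient averaged over ALL samples."""
    grad = sum(2 * (x - d) for d in data) / len(data)
    return x - learning_rate * grad

def stochastic_step(x, learning_rate, rng):
    """One step using the gradient from a single randomly picked sample."""
    d = rng.choice(data)
    return x - learning_rate * 2 * (x - d)

rng = random.Random(0)   # seeded so the run is repeatable
x_batch = 0.0
x_sgd = 0.0
for _ in range(1000):
    x_batch = batch_step(x_batch, 0.01)
    x_sgd = stochastic_step(x_sgd, 0.01, rng)

print(x_batch)  # settles at the mean of the data
print(x_sgd)    # hovers near the mean, but keeps jittering
```

The batch version does five times the work per step here, while the stochastic version never fully settles: any single sample can pull it the wrong way.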
View each day as a sample and our actions as the features for each sample. By looking at our actions with respect to our goals, we can see how we’ve erred and adjust our behaviors. This is actually ingrained in our culture in the form of New Year’s resolutions. However, humans suck at this. We don’t keep track of our goals. We have notoriously terrible memories; we’re not computers. We heavily overweight the most recent samples. We heavily overweight the most emotional samples. Kahneman and Fredrickson demonstrated this in their studies, and the effect is known as the peak-end rule.
Okay, you say, batch gradient descent doesn’t work. Why not use stochastic gradient descent? We can observe our behaviors on a daily basis, and adjust towards our goals. The trouble is that we humans are busy and lazy. If we had enough willpower to monitor and change our behaviors on a daily basis, then we’d probably be at the optimal solution to all our goals. This reminds me of a joke that Todd Barry told. He described a TV special about a wardrobe organization strategy that involved meticulously arranging the direction of the hangers based on what you wear, and he thought: anyone who manages to keep that habit is already the most organized person in the world.
Fortunately, machine learning researchers and practitioners have already come up with a solution to this issue. We can combine the best of both worlds using mini-batch gradient descent. Analyzing your errors on a weekly or biweekly basis is a good balance between human psychology and changing directions toward a local optimum. In addition, we have computers to help record each sample uniformly. Most successful weightlifters keep a journal logging every single one of their lifts and meals. Data-driven companies are ubiquitous these days. Fitbits, smartphones, and smart scales are all useful tools in helping compute the gradient.
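As a sketch of the weekly-review idea, assume 28 made-up daily measurements reviewed in groups of 7. The helper `minibatches`, the data, the learning rate, and the number of passes are all my own illustrative choices.

```python
import random

def minibatches(samples, batch_size, rng):
    """Shuffle a copy of the samples, then yield them in chunks of batch_size."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

# Hypothetical daily measurements for four weeks; their mean is 14.5.
daily_samples = [float(day) for day in range(1, 29)]

rng = random.Random(0)   # seeded so the run is repeatable
x = 0.0                  # the initial guess
learning_rate = 0.02
for _ in range(200):     # 200 passes over the four weeks of data
    for week in minibatches(daily_samples, 7, rng):
        # Gradient of the average squared distance, using one week only.
        grad = sum(2 * (x - d) for d in week) / len(week)
        x -= learning_rate * grad

print(x)  # lands near the mean, with less jitter than one-sample steps
```

Averaging over a week smooths out the odd bad day without waiting for the whole year's data before changing course.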
Gradient descent is putting the blinders on
The pursuit of a goal can blind us in two ways. First, when taking baby steps in the ‘right’ direction, we might not see that going in the ‘wrong’ direction for a bit or starting from somewhere else could help us get to a better solution. In the diagram, we can see that if the stick figure, Bob, moved to the left, he’d get to a deeper valley.
Be wary that gradient descent is a local optimizer
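Bob's predicament can be sketched in code. I am assuming a made-up landscape, f(x) = x⁴ - 3x² + x, chosen only because it has a shallow valley on the right and a deeper one on the left; the starting points and step size are likewise hypothetical.

```python
def f(x):
    """A hypothetical landscape with two valleys."""
    return x**4 - 3 * x**2 + x

def f_prime(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def descend(x, learning_rate=0.01, steps=1000):
    """Plain gradient descent from the starting point x."""
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x

bob = descend(2.0)     # Bob starts on the right-hand slope
alice = descend(-2.0)  # someone starting on the left-hand slope

print(bob, f(bob))      # settles in the shallow right-hand valley
print(alice, f(alice))  # settles in the deeper left-hand valley
```

Bob follows every rule of gradient descent and still ends up in the worse valley, because each baby step only looks at the slope under his feet.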
We have many different fitness functions, or hills, in our lives. Traversing any of these hills can blind us to all the other hills out there! Imagine Leo the soccer player, scoring goals in machine-gun rapid fire in his local 12-year-old league. As long as he keeps honing his skills there, he won’t be improving his skills in a more challenging and diverse soccer environment, such as Europe. Leo needs to move to a hill with a much deeper valley.
Moreover, the function that we’re looking at might not even be the right one. These functions change through passive or active means. This happens all the time growing up. Remember when you measured yourself by how many Pokemon cards you had or how expensive your shoes were? Those were probably not the right functions to measure yourself by.
This can help you understand people
Two people who live under the same roof, work in the same position, or look identical can be traversing completely different functions. Bob and Alice could both be cashiers, yet Bob measures himself by how many smiles he receives from his customers, while Alice measures herself by the number of customers she gets through per hour.
When thinking deeply about someone, try not to assess them by their base traits (job, wealth, intelligence, etc) but rather by the hills they traverse, the valleys and peaks they hike towards, and their process for selecting these hills.
Speeding up your progress (or hiking faster)
In gradient descent, the learning rate affects how much change we enact to get closer to the local optimum. Once we’ve figured out the right direction, the learning rate is how far we jump in that direction.
This is why so many self-help guides advise taking baby steps. Baby steps imply a small learning rate. Baby steps decrease your chances of failing in your endeavor. If our learning rate is too high, we might actually diverge from the optimum solution. Essentially, drastic changes can move you away from the optimum. You can see what that looks like below:
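In code, a rough sketch of the same effect on the earlier x² example might look like this. The two learning rates (0.1 and 1.1) and the function name `run` are assumptions picked to make the contrast obvious.

```python
def run(learning_rate, x0=3.0, steps=20):
    """Gradient descent on f(x) = x**2, whose derivative is 2x."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

baby_steps = run(0.1)   # each step shrinks x by a factor of 0.8
giant_leaps = run(1.1)  # each step overshoots: x is multiplied by -1.2

print(baby_steps)   # close to the optimum at zero
print(giant_leaps)  # farther from zero than where we started
```

With the large learning rate, every jump lands on the opposite side of the bowl and a little higher up than before, so the guesses bounce outward instead of settling.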
An example of a high learning rate is attempting a black diamond in your first attempt at snowboarding. This increases the chances of injury and repeated failure, which can lead to demotivation. Perhaps the optimal learning rate is a green or blue hill, providing ample learning opportunities without as much risk of diverging from your goal.
In any goal-oriented situation, this framework of gradient descent bubbles up useful insights. You can ask yourself useful questions such as: Am I at a local minimum right now? What’s my batch size (1 week or 1 year)? Is there somewhere else that has a more optimal solution? Can I tweak my learning rate to be a little faster? Am I following the steepest path towards my objective? Is my cost function the one I really want?
People who are good at setting up these functions and learning rates can move along to more complicated learning techniques, such as adapting learning rates on the fly, or restarting from many random positions in the function to get closer to the global optimum. Why do so many people advocate travel? It helps you find new functions to traverse and better solutions to existing ones, using the idea of random restarts.
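A minimal sketch of random restarts, assuming a made-up two-valley landscape (f(x) = x⁴ - 3x² + x), 20 restarts, and starting points drawn uniformly from -2 to 2. All of these choices, and the function names, are hypothetical.

```python
import random

def f(x):
    """A hypothetical landscape: shallow valley on the right, deeper one on the left."""
    return x**4 - 3 * x**2 + x

def f_prime(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def descend(x, learning_rate=0.01, steps=1000):
    """Plain gradient descent from the starting point x."""
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x

def random_restarts(num_restarts, rng):
    """Run gradient descent from several random starts and keep the lowest result."""
    finishes = [descend(rng.uniform(-2.0, 2.0)) for _ in range(num_restarts)]
    return min(finishes, key=f)

rng = random.Random(0)  # seeded so the run is repeatable
best = random_restarts(20, rng)
print(best, f(best))
```

No single run knows whether it found the deepest valley. Comparing the finishes of many runs is what gives you a shot at the global optimum.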
The world’s top companies are experts at adapting these functions, measuring a ton of data, and setting directions to better solutions. Agile software development is a methodology heavily connected to the idea of gradient descent — start with a solution as soon as possible, measure and iterate as frequently as possible.
Real life gradient descent is more possible than ever before
Today, there’s an entire market dedicated to the Internet of Things (IoT). There are wearables to measure your movement, pillows to track your sleep, and water bottles to measure how much you drink. All the activity on your phone and laptops is tracked in a plethora of ways. Analyzing this rich time-series data crafts a more precise compass for gradient descent.
If you didn’t learn anything about self-improvement, I hope you learned a thing or two about gradient descent and machine learning. If you didn’t learn a thing about machine learning, then I hope you learned something about forming goals and progressing towards them. In my next post, I will speak more concretely about setting up these goals in the form of small measurable habits.