Gradient Descent. Image taken from - https://community.deeplearning.ai/t/difference-between-rmsprop-and-adam/310187
What Is Gradient Descent?
Gradient descent is like hiking downhill with your eyes closed, following the slope until you hit the bottom (or at least a nice flat spot to rest). Technically, it is a method to minimize an objective function F(θ), parameterized by a model's parameters θ ∈ ℝⁿ, by updating them in the opposite direction of the gradient ∇F(θ).
The size of each step is controlled by the learning rate α. Think of α as the cautious guide who ensures you don't tumble down too fast. If this sounds like Greek, feel free to check out my previous article, Quick Glance At Gradient Descent In Machine Learning. I promise it's friendlier than it sounds!
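To see how much the learning rate matters, here is a minimal sketch on a toy objective, F(θ) = θ², whose gradient is 2θ. The function name `gradient_descent_1d`, the starting point, and the two learning rates are assumptions chosen purely for illustration.

```python
def gradient_descent_1d(theta, alpha, steps):
    """Minimize the toy objective F(theta) = theta**2 (gradient: 2 * theta)."""
    for _ in range(steps):
        grad = 2.0 * theta
        theta = theta - alpha * grad  # step against the gradient
    return theta

# Same starting point, two different learning rates.
small_step = gradient_descent_1d(theta=5.0, alpha=0.1, steps=50)
large_step = gradient_descent_1d(theta=5.0, alpha=1.1, steps=50)

print(abs(small_step))  # shrinks toward 0, the minimum
print(abs(large_step))  # grows: each step overshoots the minimum
```

With α = 0.1 every step multiplies θ by 0.8, so the hiker settles at the bottom. With α = 1.1 every step multiplies θ by −1.2, so the hiker leaps across the valley and lands higher on the other side each time.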
Circle back: An objective function is the mathematical formula or function that your model aims to minimize (or maximize, depending on the goal) during training. It represents a measure of how far off your model's predictions are from the actual outcomes or desired values.
For example:
In machine learning, it could be a loss function like Mean Squared Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.
The objective function maps the model's predictions to a numerical value, where smaller values indicate better performance.
In simpler terms:
It's like a "fitness tracker" for your model: it tells you how good or bad your model's predictions are. During optimization, gradient descent helps adjust the model's parameters θ to reduce this value step by step, moving closer to an ideal solution. Got it?
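As a concrete example of an objective function, here is a small sketch of Mean Squared Error for a linear model. The helper name `mse_loss` and the toy data are made up for illustration and do not come from any particular library.

```python
import numpy as np

def mse_loss(theta, X, y):
    """Mean Squared Error of the linear model X @ theta against targets y."""
    residuals = X @ theta - y
    return float(np.mean(residuals ** 2))

# Toy data (invented for this example): y is exactly 2 * x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

bad_theta = np.array([0.0])   # predicts 0 for every input
good_theta = np.array([2.0])  # matches the data exactly

print(mse_loss(bad_theta, X, y))   # large value: poor fit
print(mse_loss(good_theta, X, y))  # 0.0: perfect fit
```

The smaller number belongs to the better parameters, which is exactly the signal gradient descent follows.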
Gradient Descent Variants
Gradient descent isn't a one-size-fits-all deal. It comes in three variants, each like a different hiker: some take the scenic route, others sprint downhill, and a few prefer shortcuts (like me). These variants balance accuracy and speed, depending on how much data they use to calculate the gradient.
1. Batch Gradient Descent
Batch gradient descent, the "all-or-nothing" hiker, uses the entire dataset to compute the gradient of the cost function:
θ = θ − α · ∇F(θ)
Imagine stopping to look at every rock, tree, and bird before deciding where to place your foot next. It's thorough, but not ideal if your dataset is as big as, say, the Amazon rainforest. It's also not great if you need to learn on the fly, like updating your hiking route after spotting a bear.
Code Example:
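for i in range(nb_epochs):
    # evaluate_gradient is a placeholder: gradient of the loss over the full dataset
    params_grad = evaluate_gradient(loss_function, data, params)
    # One update per pass over the data, stepping against the gradient
    params = params - learning_rate * params_grad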
Batch gradient descent shines when you have all the time in the world and a dataset that fits neatly into memory. With a suitably small learning rate, it is guaranteed to converge to the global minimum for convex surfaces (smooth hills) and to a local minimum for non-convex surfaces (rugged mountains).
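The loop in the code example leaves evaluate_gradient undefined. Here is one way it could look for a linear regression with an MSE loss. Treat it as a sketch under assumptions: the synthetic data, the simplified `evaluate_gradient` signature, and the hyperparameter values are all invented for this example.

```python
import numpy as np

def evaluate_gradient(X, y, params):
    """Gradient of the MSE loss mean((X @ params - y)**2) with respect to params."""
    residuals = X @ params - y
    return 2.0 * X.T @ residuals / len(y)

# Synthetic, noise-free data generated from known parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_params = np.array([3.0, -1.0])
y = X @ true_params

params = np.zeros(2)
learning_rate = 0.1
nb_epochs = 200

for _ in range(nb_epochs):
    params_grad = evaluate_gradient(X, y, params)  # uses every row of X
    params = params - learning_rate * params_grad

print(params)  # should end up close to [3.0, -1.0]
```

Because this loss surface is convex, the full-dataset gradient always points toward the single minimum, and the parameters settle on the values that generated the data.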
2. Stochastic Gradient Descent (SGD)
SGD is the "impulsive" hiker who takes one step at a time based on the current terrain:
θ = θ − α · ∇F(θ; x_i, y_i)
where (x_i, y_i) is a single training example.
It's faster because it doesn't bother calculating gradients for the entire landscape. Instead, it uses one training example at a time. While this saves time, the frequent updates can make SGD look like it's zigzagging downhill, which can be both exciting and a little chaotic.
Imagine updating your grocery list after each aisle: you get to the essentials faster, but your cart might look wild in the process. However, with a learning rate that slows down over time, SGD can eventually reach the bottom (or at least a local minimum).
Code Example:
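import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)  # visit the examples in a new random order each epoch
    for example in data:
        # evaluate_gradient is a placeholder: gradient of the loss on one example
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad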
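For a version you can run end to end, here is a sketch of SGD on a small linear regression, with a learning rate that shrinks each epoch. The synthetic data, the 1 / (1 + 0.1 * epoch) decay schedule, and the hyperparameter values are assumptions picked for this example, not recommendations.

```python
import numpy as np

# Synthetic data: a linear relationship plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_params = np.array([3.0, -1.0])
y = X @ true_params + 0.1 * rng.normal(size=200)

params = np.zeros(2)
initial_lr = 0.05
nb_epochs = 50

for epoch in range(nb_epochs):
    # Slow the steps down over time so the zigzagging settles.
    learning_rate = initial_lr / (1.0 + 0.1 * epoch)
    for i in rng.permutation(len(y)):  # new random order each epoch
        x_i, y_i = X[i], y[i]
        # Gradient of the squared error on this single example.
        params_grad = 2.0 * (x_i @ params - y_i) * x_i
        params = params - learning_rate * params_grad

print(params)  # should land near [3.0, -1.0], up to noise
```

Each update only sees one example, so individual steps are noisy, but the decaying learning rate lets the parameters settle near the values that generated the data.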
3. The Hero of the Day: Adam
Now let's talk about Adam, the "hiking guru" who combines the wisdom of Momentum and RMSprop. Adaptive Moment Estimation (Adam) is like having a smart guide who tracks the terrain and adjusts your steps based on past experiences and current conditions. It's the go-to optimizer when you want to train neural networks and still have time for coffee.
Why You'll Love Adam
- Low Memory Requirements: It's like carrying a lightweight backpack, efficient but still packed with essentials.
- Minimal Hyperparameter Tuning: Adam works well out of the box, so you won't need to fiddle with too many knobs (just keep an eye on the learning rate).
- Practical Use: From improving product recommendations to recognizing images of cats on the internet, Adam powers machine learning systems that make your everyday tech smarter.
How Adam Works
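Adam maintains moving averages of gradients (m_t) and squared gradients (v_t) to adapt learning rates for each parameter:
m_t = β₁ · m_(t−1) + (1 − β₁) · g_t
v_t = β₂ · v_(t−1) + (1 − β₂) · g_t²
where g_t is the gradient at step t, and β₁ and β₂ are decay rates between 0 and 1.
Here's the cool part: because m_t and v_t start at zero, they are biased toward zero early in training. Adam corrects this bias and then updates the parameters using this formula:
m̂_t = m_t / (1 − β₁^t),  v̂_t = v_t / (1 − β₂^t)
θ_(t+1) = θ_t − α · m̂_t / (√v̂_t + ϵ)
where m̂_t and v̂_t are the bias-corrected estimates, and ϵ is a small number to prevent division by zero.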
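The update described above fits in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function name `adam_minimize`, the toy quadratic objective, and the learning rate of 0.1 are assumptions made for the example, while beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are the commonly used default values.

```python
import numpy as np

def adam_minimize(grad_fn, theta, alpha=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run Adam for a fixed number of steps and return the final parameters."""
    m = np.zeros_like(theta)  # moving average of gradients
    v = np.zeros_like(theta)  # moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        # Bias correction: m and v start at zero, so early values are too small.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy objective (hypothetical): F(theta) = sum((theta - target)**2).
target = np.array([1.0, -2.0])

def grad_fn(theta):
    return 2.0 * (theta - target)

result = adam_minimize(grad_fn, np.zeros(2))
print(result)  # should end up close to [1.0, -2.0]
```

Dividing by the square root of v_hat gives every parameter its own effective step size, which is the RMSprop side of Adam. Stepping along m_hat rather than the raw gradient is the Momentum side.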
Conclusion
Adam is the Swiss Army knife of optimizers: versatile, efficient, and reliable. Whether you're training neural networks to detect fraud or create next-gen chatbots, Adam helps you get there faster and with fewer headaches. So, embrace Adam, take confident steps, and enjoy the view from the summit of machine learning success!
References:
https://www.ceremade.dauphine.fr/~waldspurger/tds/22_23_s1/advanced_gradient_descent.pdf
https://www.geeksforgeeks.org/rmsprop-optimizer-in-deep-learning/