AI and machine learning are very hot topics these days. Self-driving cars, real-time automatic translation, voice recognition, etc. These are only some of the applications that cannot exist without machine learning. But how can machines learn? I will show you how the magic works in this article, but I won’t talk about neural networks! I will show you what is in the deepest deep of machine learning. One of the best presentations about machine learning is . Fei Fei Li’s TED talk As Li said in her talk, in the early days, programmers tried to solve computer vision tasks by algorithms. These algorithms were looking for shapes and tried to identify things by preprogrammed rules. But they could not solve it in this way. Computer vision is a very complex problem, and we cannot solve it algorithmically. So, here comes machine learning into the picture. If we cannot write the program, let’s write a program that writes it instead of us. Machine learning algorithms do this. If we have enough data, they can write the algorithm that calculates the expected output from the inputs for us. In the case of image recognition, the input is the image, and the output is a label or a description of the content of the image. This is why Li and her team worked really hard to build ImageNet, which was the biggest labeled image database in the world, with 15 million images and 22,000 categories. Thanks to ImageNet, we had enough data, but how did programmers use it to solve the problem of image recognition? This is the point where I should talk about neural networks like Li does, but I won’t. The biological brain indeed inspires neural networks, but today’s transformers are far from the biological model. This is why I think the name “neural network” is misleading. But what would be a better name? My favorite is Karpathy’s (head of AI @ Tesla) . Software 2.0 As Karpathy said , Software 1.0 is the classical software, where the programmer writes the code, but in the case of Software 2.0, another software finds it based on big data. But how can a software write another software? in this talk Let me quote Karpathy: “Gradient descent can write code better than you. I’m sorry.” is the name of the magic that is used by most machine learning systems. To understand it, let’s imagine a machine learning system. It’s a black box with inputs and outputs. If the purpose of the system is image recognition, then the input is an array of pixels, and the output is a probability distribution vector. Gradient descent If the system can identify cats and dogs, and the input is an image of a cat, the output will be like 10% dog and 90% cat. So, the inputs are numbers, the outputs are also numbers, and the content of the black box is a giant (really huge) mathematical expression. The black box calculates the output vector from the pixel data. I promised to talk about programs writing programs, and now I’m talking about mathematical expression. But in fact, programs are mathematical expressions. CPUs are built from logic gates. They use the binary representation of numbers and logic operators. Any existing algorithm can be implemented by these logical expressions, so a long logical expression can represent any program that can run on classical computers. As you see, classical programs are also not more than binary calculations from a binary input to a binary output. Machine learning systems use real numbers and operators instead of binary numbers and operators, but basically, these mathematical expressions are also “programs.” Alan Turing about “B-type unorganized machines.” These machines are built from interconnected logical NAND gates and can be trained by enabling/disabling the wires between the nodes. In binary algebra, NAND is a universal operator, because every other operator can be expressed by it. These B-type unorganized machines are universal computers because every algorithm can be implemented on them. wrote a paper in 1948 These B-type machines are similar to nowadays neural networks, but they are implemented upon logic gates like today’s CPUs, so the algorithms that are implemented by these B-type machines would be more like our today’s algorithms. Unfortunately, Turing never published the paper, this is why we do not know him as an inventor of early neural networks. The problem with these B-type networks is that they cannot be trained efficiently. If we have a set of inputs and outputs and a parameterized expression, how can we find the correct parameters that calculate the outputs from the inputs with the fewest error? It is something like a black box that has many potentiometers. Every combination of the potentiometer positions is a program, and we are searching for the correct positions. To solve this problem, let’s imagine the error function. It looks like a landscape with hills and valleys. Every single parameter of the expression is a dimension of the landscape, and the height of the current point is the error with the given parameters. When we initialize the expression with random numbers, we are at a random location on the landscape. To minimize the errors, we have to come down to the lowest point (that represents the lowest error). The problem is that we are completely blind. How can we come down from the hill? We can grope around to find the steepest slope and go there. But how can we determine the slope of a function? Here comes the gradient in the picture. Gradient shows how steep the slope on the function is at a given point. This is why we call this method “gradient descent.” The gradient at the given point can be calculated by partial derivation. A function has to satisfy some requirements to be derivable. If we want to use gradient descent for optimization, we have to use these types of functions. If the functions are derivable, then the chain of functions will also be derivable, and there is an algorithmic method to calculate the gradient. Now we have all the knowledge to understand how and (the two most popular machine learning frameworks) find the correct parameters for our expression. TensorFlow PyTorch First, let’s look at TensorFlow. In Tensorflow, there is a gradient registry where gradient functions are registered for the operators by the method. In the learning phase, in every step, a has to be started. is something like a video recorder. When TensorFlow does any operation, the gradient tape logs it. RegisterGradient GradientTape GradientTape After the end of the forward phase (when the output is generated from the input), we stop the gradient tape that goes backward on the log by using the error and calculating the gradients using the registered gradient functions. We can modify the parameters and repeat the process until we reach the minimum error by using the gradients. Let’s look at the code in Python: # Linear regression using GradientTape
# based on https://sanjayasubedi.com.np/deeplearning/tensorflow-2-linear-regression-from-scratch/

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

class Model:
    def __init__(self):
        self.W = tf.Variable(16.0)
        self.b = tf.Variable(10.0)

    def __call__(self, x):
        return self.W * x + self.b

TRUE_W = 3.0 # slope
TRUE_b = 0.5 # intercept

NUM_EXAMPLES = 1000

X = tf.random.normal(shape=(NUM_EXAMPLES,))
noise = tf.random.normal(shape=(NUM_EXAMPLES,))
y = X * TRUE_W + TRUE_b + noise

model = Model()

plt.figure()
plt.scatter(X, y, label="true")
plt.scatter(X, model(X), label="predicted")
plt.legend()
plt.show()

def loss(y, y_pred):
    return tf.reduce_mean(tf.square(y - y_pred))

def train(model, X, y, lr=0.01):
    with tf.GradientTape() as t:
        current_loss = loss(y, model(X))

    dW, db = t.gradient(current_loss, [model.W, model.b])
    model.W.assign_sub(lr * dW)
    model.b.assign_sub(lr * db)

Ws, bs = [], []
epochs = 20
for epoch in range(epochs):
    Ws.append(model.W.numpy()) # eager execution allows us to do this
    bs.append(model.b.numpy())

    current_loss = loss(y, model(X))

    train(model, X, y, lr=0.1)
    print(f"Epoch {epoch}: Loss: {current_loss.numpy()}")

plt.figure()
plt.plot(range(epochs), Ws, 'r', range(epochs), bs, 'b')
plt.plot([TRUE_W] * epochs, 'r--', [TRUE_b] * epochs, 'b--')
plt.legend(['W', 'b', 'true W', 'true b'])
plt.show()

plt.figure()
plt.scatter(X, y, label="true")
plt.scatter(X, model(X), label="predicted")
plt.legend()
plt.show() This code shows how we can solve the simplest problem (linear regression) by using gradient descent in TensorFlow. The aim of linear regression is to find the parameters of a line that is closest to each point, where closest means that the sum of squares of the distances is minimal. TensorFlow uses tensor operations for the calculations. Tensors are generalizations of matrices. From the programmer’s perspective, tensors are simple arrays. A zero-dimensional tensor is a scalar, a one-dimensional tensor is a vector, a two-dimensional tensor is a matrix, and from third dimension, tensors are simply tensors. Tensor operations can be done in parallel to run efficiently, especially on GPUs or TPUs. At the beginning of the code, we define our model, which is a linear expression. It has two scalar parameters, and , and the expression is . The default value of is , and is . This is in our black box, our code will change the and the to minimize the error. Real-world models have millions or billions of parameters, but these two parameters are enough to understand the method. W b y=W*x+b W 16 b 10 W b From line 21 to line 23, we define the random point set. The method generates a vector with 1,000 random numbers in a normal distribution, and we use that to generate the points near a line. tf.random.normal The loss function is defined in line 34. and parameters are vectors. is the actual output of our model, and is the expected output. The square function calculates the square of every vector element, and the output is also a vector with the squares. The function calculates the mean of elements, and its result is a scalar. This is the error itself that we want to minimize. y y_pred y_pred y reduce_mean The gradient descent is from line 36 to line 42. This is the essence of the code where the learning happens. The in line 37 is a Python expression. It calls the parameter object’s method at the beginning of the block and at the end. with __enter__ __exit__ In the case of , the method starts the recording, and stops it. In the block (line 38), we calculate the model output by and the error. In line 40, the calculates the gradients for the parameters ( and ), and in lines 41 and 42, modify the parameters. GradientTape __enter__ __exit__ model(X) GradientTape dW db There are different optimization strategies. We are using the simplest, where the gradients are multiplied by a fixed learning rate ( ). lr In a nutshell, this is how gradient descent and TensorFlow’s work. You can find many . Neural networks for image recognition, reinforcement learning, etc., but keep in your mind, there are always tensor operations and a GradientTape. GradientTape tutorials on TensorFlow’s webpage Now, let’s see how gradient descent works in the other big framework, PyTorch. PyTorch uses the system for gradient calculation, which is embedded into the torch tensors. If a tensor is a result of an operator, it contains a back pointer to the operator and the source tensors. The source tensors also contain back pointers, etc., and the full operator chain is traceable. autograd Every operator can calculate its own gradient. When you call the backward method on the last tensor, it goes back on the chain and calculates the gradients to the tensors. Let’s see the previous linear regression code in PyTorch: import torch
import matplotlib.pyplot as plt
from torch.autograd import Variable


class Model:
    def __init__(self):
        self.W = Variable(torch.as_tensor(16.), requires_grad=True)
        self.b = Variable(torch.as_tensor(10.), requires_grad=True)

    def __call__(self, x):
        return self.W * x + self.b


TRUE_W = 3.0  # slope
TRUE_b = 0.5  # intercept

NUM_EXAMPLES = 1000

X = torch.normal(0.0, 1.0, size=(NUM_EXAMPLES,))
noise = torch.normal(0.0, 1.0, size=(NUM_EXAMPLES,))
y = X * TRUE_W + TRUE_b + noise

model = Model()

plt.figure()
plt.scatter(X, y, label="true")
plt.scatter(X, model(X).detach().numpy(), label="predicted")
plt.legend()
plt.show()


def loss(y, y_pred):
    return torch.square(y_pred - y).mean()


def train(model, X, y, lr=0.01):
    current_loss = loss(y, model(X))
    current_loss.backward()

    with torch.no_grad():
        model.W -= model.W.grad.data * lr
        model.b -= model.b.grad.data * lr

    model.W.grad.data.zero_()
    model.b.grad.data.zero_()


Ws, bs = [], []
epochs = 20
for epoch in range(epochs):
    with torch.no_grad():
        Ws.append(model.W.numpy().item())
        bs.append(model.b.numpy().item())

        current_loss = loss(y, model(X))

    train(model, X, y, lr=0.1)
    print(f"Epoch {epoch}: Loss: {current_loss.numpy()}")

plt.figure()
plt.plot(range(epochs), Ws, 'r', range(epochs), bs, 'b')
plt.plot([TRUE_W] * epochs, 'r--', [TRUE_b] * epochs, 'b--')
plt.legend(['W', 'b', 'true W', 'true b'])
plt.show()

plt.figure()
plt.scatter(X, y, label="true")
plt.scatter(X, model(X).detach().numpy(), label="predicted")
plt.legend()
plt.show() The model and the loss part are very similar to TensorFlow. You find the difference in the method from line 37 to line 46. After the calculation of the tensor, we call the backward method on it. It recursively goes back on the chain and calculates the gradient for the and tensors. train current_loss W b From line 41 to line 43, we modify the and tensors. It’s important that this calculation is in a block. The method temporarily disables the gradient calculation for the operators, which is not needed when we modify the parameters. W b torch.no_grad() no_grad() At the end of the method calling the zero method clears the gradients. Without this, PyTorch will sum up the gradients, which results in strange behavior. The other parts of the code are very similar to TensorFlow. Like TensorFlow, PyTorch also has good tutorials, community, and documentation. train You will find everything to build any neural network, but the most important part is the autograd system, which is the base of the training. Next time, when you see a picture that DALL-E generates, a car that drives itself, or simply wonder how Google Assistant understands what you say, you will know how the magic works, and how an algorithm (the gradient descent) wrote these cool algorithms for us. First Published here

Chain

Google

Scalar

Tesla

YouTube

How to Make Your WordPress Site Safe and Fast With Amazon Cloudfront

How to Run MacOS on VirtualBox in Less Than 30 mins

Nominated for 2022 - HackerNoon Contributor of the Year - Neural Networks

Nominated for 2022 - HackerNoon Contributor of the Year - Proof Of Stake

Nominated for 2022 - HackerNoon Contributor of the Year - Dao

Nominated for 2022 - Ios Writer of the Year

Too Long; Didn't Read

Training Machine Learning Models Using TensorFlow or PyTorch

Training Machine Learning Models Using TensorFlow or PyTorch

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

A Brief History of Money: From the Economy of Favors to Bitcoin

The Noonification: How Often Do NFTs Pass The Howey Test? (1/13/2023)

Darwin's Hybrid Intelligence to Align AI & Human Goals for Startups & VCs

The Noonification: White Man (11/26/2022)

The Noonification: The Metaverse is a Sh*tshow (11/2/2022)

100 Days of AI Day 1: From Newsletter to Podcast, Leveraging AI for Audio Transformation

A Brief History of Money: From the Economy of Favors to Bitcoin

The Noonification: How Often Do NFTs Pass The Howey Test? (1/13/2023)

Darwin's Hybrid Intelligence to Align AI & Human Goals for Startups & VCs

The Noonification: White Man (11/26/2022)

The Noonification: The Metaverse is a Sh*tshow (11/2/2022)

100 Days of AI Day 1: From Newsletter to Podcast, Leveraging AI for Audio Transformation

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps