# Uni-Variate, Polynomial and Multi-Variate Regression using OLS/Normal Equation Approach (A-Z)

## Learn, Code and Tune…

As I mentioned in my previous article:

apart from Gradient Descent Optimization, there is another approach known as Ordinary Least Squares or Normal Equation Method. This approach, by far is the most successful and adopted in many Machine Learning Toolboxes. There is a descriptive reason behind OLS/Normal Equation Method being successful and widely preferred for Gradient Descent, to which I will be coming in the latter half of this article. Firstly, let’s move on to the Approach and its Python Implementation, secondly, applying it on Practice Datasets and finally, describing intuitively that why this over-performs Gradient Descent.

The Ordinary Least Squares approach for finding the list/vector of parameters which is the minima of the Cost Function, theta is given by:

It can be expressed in the form of Vector Algebra known as Normal Equation.

Now, let’s implement this in Python for Uni-Variate Linear Regression, Polynomial Regression and Multi-Variate Linear Regression:

OLS Uni-Variate Linear Regression using the General Form of OLS:

A 2-module Implementation of Uni-Variate Linear Regression using OLS is formulated:

=>hypothesis(): It is the function that calculates and outputs the hypothesis value of the Target Variable, given theta (theta_0, theta_1, theta_2, …., theta_n) and Feature X as input. The implementation of hypothesis() is given below:

def hypothesis(theta, X):
h = np.ones((X.shape[0],1))
X = X.reshape(X.shape[0], 1)
one_column = np.ones((X.shape[0],1))
X = np.concatenate((one_column, X), axis = 1)
for i in range(0,X.shape[0]):
h[i] = float(np.matmul(theta, X[i]))
h = h.reshape(X.shape[0])
return h

=>coeff_cal(): It is the principal function that takes the Feature Set, X and Label Set, y as input and returns the vector of parameters, theta that is the minima of the Cost Function. The implementation of coeff_cal() is given below:

def coeff_cal(X,y):
num = sum(np.multiply((X - np.mean(X)), (y - np.mean(y))))
din = sum(np.square(X - np.mean(X)))
theta_1 = num/din
theta_0 = np.mean(y) - theta_1 * np.mean(X)
return np.array([theta_0,theta_1])

Now, applying the OLS Linear Regression on a Practice Dataset. Considering the Profit Prediction from Population Dataset available at

Problem Statement: “Given the population of a city, analyze and predict the profit of a company using Linear Regression

Data Reading into Numpy Arrays :

X_train = data[:,0] #the feature_set
y_train = data[:,1] #the labels

Using the 2-module-OLS Uni-Variate Linear Regression:

theta = coeff_cal(X_train, y_train)
# getting the predictions...
predictions = hypothesis(theta, X_train)

The Regression Line Visualization of the obtained theta is done on Scatter Plot:

scatter = plt.scatter(X_train, y_train, label="training data")
regression_line = plt.plot(X_train, predictions,
label="linear regression")
plt.legend()
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in \$10,000s')

The Regression Line Visualization comes out to be:

Now, let’s analyze the Model Performance and Statistically Compare OLS and Batch Gradient Descent in which the Performance Analysis of Gradient Descent Uni-Variate Linear Regression is obtained from:

Here, OLS performs better than BGD in terms of Mean Absolute Error.

OLS Polynomial Regression using Vector Algebra Form of OLS

A 1-module Implementation of Uni-Variate Linear Regression using OLS is formulated:

=>hypothesis(): It is the function that calculates and outputs the hypothesis value of the Target Variable, given theta (theta_0, theta_1, theta_2, …., theta_n), Feature X and Degree of the Polynomial n as input. The implementation of the hypothesis() is given below:

def hypothesis(theta, X, n):
h = np.ones((X.shape[0],1))
theta = theta.reshape(1,n+1)
for i in range(0,X.shape[0]):
x_array = np.ones(n+1)
for j in range(0,n+1):
x_array[j] = pow(X[i],j)
x_array = x_array.reshape(n+1,1)
h[i] = float(np.matmul(theta, x_array))
h = h.reshape(X.shape[0])
return h

=>Obtaining the minima of Cost Function as theta:

from numpy.linalg import inv
theta = np.matmul(np.matmul(inv(np.matmul(x_array.transpose()
, x_array)), x_array.transpose()), y_train)

Now, applying the OLS Polynomial (here Quadratic) Regression on the same dataset.

X_train = data[:,0] #the feature_set
y_train = data[:,1] #the labels
x_array = np.ones((X_train.shape[0], n+1))
for i in range(0,X_train.shape[0]):
for j in range(0,n+1):
x_array[i][j] = pow(X_train[i],j)
theta = np.matmul(np.matmul(inv(np.matmul(x_array.transpose()
, x_array)), x_array.transpose()), y_train)

The Regression Line Visualization of the obtained theta is done on Scatter Plot:

import matplotlib.pyplot as plt
#getting the predictions...
training_predictions = hypothesis(theta, X_train, 2)
scatter = plt.scatter(X_train, y_train, label="training data")
regression_line = plt.plot(X_train, training_predictions, label="polynomial (degree 2) regression")
plt.legend()
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in \$10,000s')

The Regression Line Visualization comes out to be:

Performance Analysis:

The Model Performance is analyzed and 2-way comparison is done. In other words, the performance of OLS Polynomial Regression is compared with Gradient Descent Quadratic Regression and is also compared with OLS Linear Regression.

From the table, the following conclusions can be drawn,

2. OLS Quadratic Regression fetches the least Mean Square Error and Root Mean Square Error
3. OLS Quadratic and Linear Regression fetch the highest R Square Score.

Based on the above 3 points, it can be confirmed that

OLS Approach is more successful than Gradient Descent Optimization

Reason : The possible reason is that in Gradient Descent, if the Algorithm, given in

is carefully studied, it can be seen that there is no fixed number of iterations mentioned. It is just mentioned “until convergence”. So, by giving a random number of iterations will not fetch best performance. The implementation should be done in such a way that the Algorithm itself finds the required number of iterations for convergence. Another important potential threat of Gradient Descent is

What will happen if Local Minima is obtained instead of Global Minima ?

Well, for this issue, the learning rate of Gradient Descent has to be chosen appropriately, in order to avoid the wiggliness of Cost Function.

Modified Implementation of Gradient Descent (for Polynomial Regression) is given below:

def modified_BGD(theta, alpha, h, X, y, n):
# n = 1 for Linear Regression
k = 0 # number of iterations by the algorithm set as counter
theta[0] = theta[0] - (alpha/X.shape[0]) * sum(h - y)
for j in range(1,n+1):
theta[j] = theta[j]-(alpha/X.shape[0])*sum((h-y)*pow(X,j))
h = hypothesis(theta, X, n)
cost = (1 / X.shape[0]) * 0.5 * sum(np.square(h - y))
while(1):
theta[0] = theta[0] - (alpha/X.shape[0]) * sum(h - y)
for j in range(1,n+1):
theta[j]=theta[j]-(alpha/X.shape[0])*sum((h-y)*pow(X,j))
h = hypothesis(theta, X, n)
if ((1/X.shape[0])*0.5*sum(np.square(h - y))) >= cost:
break
cost = (1 / X.shape[0]) * 0.5 * sum(np.square(h - y))
k = k + 1
theta = theta.reshape(1,n+1)
return theta, cost, k

Following this implementation of BGD, there will statistically be very small or no difference in performance measures of Multi-Variate and Polynomial Regression.

Also from the above 2-way comparison table, it can also be confirmed that:

General Polynomial Regression over-performs Linear Regression, if proper degree of the polynomial is chosen

OLS Multi-Variate Linear Regression using Vector Algebra Form of OLS

A 1-module Implementation of Uni-Variate Linear Regression using OLS is formulated:

=>hypothesis(): It is the function that calculates and outputs the hypothesis value of the Target Variable, given theta (theta_0, theta_1, theta_2, …., theta_n) and Feature Set X (vector of 1’s of dimension = number of samples) as input. The implementation of hypothesis() is given below:

def hypothesis(theta, X):
n = X.shape[1]
h = np.ones((X.shape[0],1))
theta = theta.reshape(1,n)
for i in range(0,X.shape[0]):
h[i] = float(np.matmul(theta, X[i]))
h = h.reshape(X.shape[0])
return h

=>Obtaining the minima of Cost Function as theta:

from numpy.linalg import inv
theta = np.matmul(np.matmul(inv(np.matmul(X_train.transpose(), X_train)), X_train.transpose()), y_train)

Application of OLS/Normal Equation Method Linear Regression on Housing Price Prediction Dataset of Portland, Oregon in which the housing price depends on 2 features, size of the house (in sq.feet) and number of bedrooms. The Data-Set is available at,

Problem Statement: “Given the size of the house and number of bedrooms, analyze and predict the possible price of the house”

X_train = data[:,[0,1]] #feature set
y_train = data[:,2] # label set
one_column = np.ones((X_train.shape[0],1))
X_train = np.concatenate((one_column, X_train), axis = 1)
from numpy.linalg import inv
theta = np.matmul(np.matmul(inv(np.matmul(X_train.transpose()
, X_train)), X_train.transpose()), y_train)

Side-by-Side Visualization of Features and Target Variable Actual and Prediction using 3-D Scatter Plots :

=>Actual Target Variable Visualization:

from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D
sequence_containing_x_vals = list(X_train.transpose()[1])
sequence_containing_y_vals = list(X_train.transpose()[2])
sequence_containing_z_vals = list(y_train)
fig = pyplot.figure()
ax = Axes3D(fig)
ax.scatter(sequence_containing_x_vals, sequence_containing_y_vals,
sequence_containing_z_vals)
ax.set_xlabel('Living Room Area', fontsize=10)
ax.set_ylabel('Number of Bed Rooms', fontsize=10)
ax.set_zlabel('Actual Housing Price', fontsize=10)

=>Prediction Target Variable Visualization:

# Getting the predictions...
predictions = hypothesis(theta, X_train)
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D
sequence_containing_x_vals = list(X_train.transpose()[1])
sequence_containing_y_vals = list(X_train.transpose()[2])
sequence_containing_z_vals = list(predictions)
fig = pyplot.figure()
ax = Axes3D(fig)
ax.scatter(sequence_containing_x_vals, sequence_containing_y_vals,
sequence_containing_z_vals)
ax.set_xlabel('Living Room Area', fontsize=10)
ax.set_ylabel('Number of Bed Rooms', fontsize=10)
ax.set_zlabel('Housing Price Predictions', fontsize=10)

Performance Analysis and Statistical Comparison with Gradient Descent Multi-Variate Linear Regression:

The Performance of Gradient Descent Multi-Variate Linear Regression is obtained from

Similarly, OLS/Normal Equation Method performed better than Gradient Descent. The Modified Implementation of Gradient Descent for Multi-Variate Linear Regression is given below:

def modified_BGD(theta, alpha, h, X, y, n):
# n is the number of features
k = 0 # required number of iterations set as counter variable
theta[0] = theta[0] - (alpha/X.shape[0]) * sum(h - y)
for j in range(1,n+1):
theta[j] = theta[j] - (alpha/X.shape[0]) * sum((h - y)
* X.transpose()[j])
h = hypothesis(theta, X, n)
cost = (1 / X.shape[0]) * 0.5 * sum(np.square(h - y))
while(1):
theta[0] = theta[0] - (alpha/X.shape[0]) * sum(h - y)
for j in range(1,n+1):
theta[j] = theta[j] - (alpha/X.shape[0]) * sum((h - y)
* X.transpose()[j])
h = hypothesis(theta, X, n)
if((1/X.shape[0])*0.5*sum(np.square(h-y_train))) >= cost:
break
cost = (1 / X.shape[0]) * 0.5 * sum(np.square(h - y))
k = k + 1
theta = theta.reshape(1,n+1)
return theta, cost, k

A major disadvantage of the Modified Implementation of Gradient Descent is that, it takes a lot of time to converge i.e., it requires to iterate a large number say, 600k iterations to converge or to reach the Global Minima. But on the other hand, by running a single statement, we are getting the theta Global Minima in case of OLS/Normal Equation Method.

That’s all about the Implementation of Uni-Variate, Polynomial and Multi-Variate Regression using OLS/Normal Equation Approach

For Personal Contacts regarding the article or discussions on Machine Learning/Data Mining or any department of Data Science, feel free to reach out to me on LinkedIn.

# More by Navoneel Chakrabarty

Topics of interest