Mastering Logistic Regression: A Comprehensive Guide with Practical Example

by Sahil, September 20th, 2023

Too Long; Didn't Read

Master logistic regression with this comprehensive guide. From the underlying math to real-world applications and a worked example in Python, you'll learn to make accurate binary predictions, whether you're a novice or a seasoned data scientist.

Logistic regression is a statistical method used for analyzing a dataset in which there are one or more independent variables that can be used to predict the outcome of a categorical dependent variable. In simpler terms, it's a technique used for classification problems where the target variable is binary (i.e., it can take on two possible values, such as 0 or 1).


The main goal of logistic regression is to find the best-fitting model to describe the relationship between the independent variables and the probability of a particular outcome. It uses the logistic function (also known as the sigmoid function) to map predicted values to probabilities.

Difference Between Linear Regression and Logistic Regression

Linear regression is used when the target variable is continuous and can take any numerical value. It predicts a value that is unbounded, which means it can be any real number. Linear regression uses a straight line to model the relationship between the independent variables and the dependent variable.


Logistic regression, on the other hand, is used when the target variable is categorical and represents two classes (binary classification). It predicts the probability that a given instance belongs to a particular class. The output of logistic regression is bounded between 0 and 1, thanks to the logistic (sigmoid) function.
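
To make the difference concrete, here is a minimal sketch using scikit-learn on a tiny made-up dataset (values chosen purely for illustration). The linear model's prediction drifts well above 1 for a large input, while the logistic model's output is always a probability between 0 and 1.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data: one feature, binary 0/1 target (hypothetical values)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear regression treats the labels as numbers; its prediction is unbounded
lin = LinearRegression().fit(X, y)
print(lin.predict([[10]]))                   # roughly 2.2 -- not a valid probability

# Logistic regression models P(y = 1); predict_proba stays between 0 and 1
log_reg = LogisticRegression().fit(X, y)
print(log_reg.predict_proba([[10]])[:, 1])   # close to 1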


Real World Applications of Logistic Regression

Logistic regression finds applications in various fields due to its simplicity, interpretability, and effectiveness in binary classification problems. Some real-world applications include:


  1. Medical Diagnostics:
    • Predicting whether a patient has a particular disease based on symptoms and test results.
  2. Finance:
    • Credit scoring to determine if a borrower is likely to default on a loan.
    • Predicting the likelihood of a customer subscribing to a service based on demographic data.


The logistic regression model uses the logistic function (also known as the sigmoid function) to model the relationship between the independent variables and the probability of a particular outcome.

Mathematical Formula for Logistic Regression

The logistic regression model can be mathematically expressed as:

P(Y = 1 | X) = 1 / (1 + e^(-(β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ)))

where P(Y = 1 | X) is the probability that the outcome Y equals 1 given the independent variables X₁, …, Xₙ, and β₀, β₁, …, βₙ are the coefficients the model learns.

Logistic Function (Sigmoid Function):

The logistic function is defined as:

S(t) = 1 / (1 + e^(-t))

Here, S(t) is the output (probability), and t is the linear combination of the independent variables and their coefficients:

t = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ

The logistic function transforms any value of t to a value between 0 and 1. This is crucial in logistic regression because we want to model probabilities.

Transformation of Linear Combination into Probabilities

The linear combination t represents the weighted sum of the independent variables. When we apply the logistic function to t, we obtain a value between 0 and 1. This value represents the estimated probability of the event Y=1 given the values of the independent variables.
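
As a quick numeric illustration, here is that transformation in Python, with the coefficients and feature values made up purely for this example:

import math

# Hypothetical coefficients and feature values (for illustration only)
beta_0, beta_1, beta_2 = -1.5, 0.8, 0.3
x1, x2 = 2.0, 1.0

# Linear combination t = beta_0 + beta_1*x1 + beta_2*x2
t = beta_0 + beta_1 * x1 + beta_2 * x2    # 0.4

# Logistic (sigmoid) function squashes t into a probability
p = 1 / (1 + math.exp(-t))
print(round(p, 3))    # ~0.599 -> estimated probability that Y = 1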

In logistic regression, the goal is to find the best-fitting set of coefficients that maximizes the likelihood of the observed data. This is typically done through optimization techniques like gradient descent.
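
As a rough sketch of that idea, here is a plain-NumPy gradient descent loop that minimizes the log-loss (the negative log-likelihood); this is only an illustration, not how libraries such as scikit-learn implement it internally:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit logistic regression coefficients by gradient descent on the log-loss."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)   # feature coefficients
    beta_0 = 0.0                  # intercept
    for _ in range(epochs):
        p = sigmoid(beta_0 + X @ beta)            # predicted probabilities
        error = p - y                             # gradient of log-loss w.r.t. t
        beta -= lr * (X.T @ error) / n_samples    # update coefficients
        beta_0 -= lr * error.mean()               # update intercept
    return beta_0, beta

# Tiny made-up dataset: one feature, binary labels
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_logistic(X, y))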


Binary and Multinomial Logistic Regression

Binary Logistic Regression:

Binary logistic regression is used when the dependent variable has two possible categories or classes. These classes are typically coded as 0 and 1, representing the absence or presence of an outcome or event.

For example, binary logistic regression can be used for tasks like:

  • Predicting whether an email is spam (1) or not spam (0).
  • Determining if a patient has a certain disease (1) or does not have it (0).

In binary logistic regression, the logistic function is used to model the probability of belonging to one of the two classes.

Multinomial Logistic Regression:

Multinomial logistic regression, also known as softmax regression, is used when the dependent variable has more than two categories. It's an extension of logistic regression that can handle multiple classes.

For example, multinomial logistic regression can be used for tasks like:

  • Predicting the type of fruit in a basket (e.g., apple, banana, orange).
  • Classifying different species of animals (e.g., cat, dog, bird).

In multinomial logistic regression, the logistic function is extended to handle multiple classes. The softmax function is used to compute the probabilities of each class, and the class with the highest probability is predicted.

Here's how it works (a short NumPy sketch follows the list below):

  1. Softmax Function:
    • The softmax function takes a vector of scores (one for each class) and transforms them into probabilities that sum to 1.
    • It assigns a probability to each class based on the scores.
  2. Prediction:
    • The class with the highest probability is predicted as the outcome.
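
Here is a minimal NumPy sketch of the softmax step, with the class scores made up purely for illustration:

import numpy as np

def softmax(scores):
    # Subtract the max score before exponentiating for numerical stability
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

# Hypothetical scores for three classes: apple, banana, orange
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)

print(probs)             # roughly [0.66, 0.24, 0.10] -- probabilities summing to 1
print(np.argmax(probs))  # 0 -> the first class (apple) has the highest probability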


Code

Step 1: Import necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Step 2: Load and prepare the Titanic dataset

data = pd.read_csv('train.csv')

Step 3: Select relevant features and target variable

# .copy() avoids pandas' SettingWithCopyWarning when we modify X below
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch']].copy()
y = data['Survived']

Step 4: Data Cleaning and Encoding

# Convert the categorical 'Sex' column to numeric (male -> 0, female -> 1)
X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})

# Fill missing 'Age' values with the column mean
X['Age'] = X['Age'].fillna(X['Age'].mean())

Step 5: Split Dataset

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 6: Initialize and Train the Model

# Fit a logistic regression classifier on the training set
model = LogisticRegression()
model.fit(X_train, y_train)

Step 7: Evaluate the Model

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate accuracy: the fraction of test passengers classified correctly
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# ------------------ OUTPUT -------------- #
Accuracy: 82.68%
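
With the model trained, you can also inspect the learned coefficients and query class probabilities directly. The short follow-up below reuses the X, model, and pd objects from the steps above; the passenger values are made up for illustration:

# Learned coefficient for each feature, plus the intercept
print(dict(zip(X.columns, model.coef_[0])))
print(model.intercept_)

# Predicted survival probability for a hypothetical passenger:
# 2nd class, female, age 28, 0 siblings/spouses aboard, 1 parent/child aboard
passenger = pd.DataFrame(
    [[2, 1, 28.0, 0, 1]],
    columns=['Pclass', 'Sex', 'Age', 'SibSp', 'Parch']
)
print(model.predict_proba(passenger)[0, 1])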