Logistic regression is a statistical method used for analyzing a dataset in which there are one or more independent variables that can be used to predict the outcome of a categorical dependent variable. In simpler terms, it's a technique used for classification problems where the target variable is binary (i.e., it can take on two possible values, such as 0 or 1).
The main goal of logistic regression is to find the best-fitting model to describe the relationship between the independent variables and the probability of a particular outcome. It uses the logistic function (also known as the sigmoid function) to map predicted values to probabilities.
Linear regression is used when the target variable is continuous and can take any numerical value. It predicts a value that is unbounded, which means it can be any real number. Linear regression uses a straight line to model the relationship between the independent variables and the dependent variable.
Logistic regression, on the other hand, is used when the target variable is categorical and represents two classes (binary classification). It predicts the probability that a given instance belongs to a particular class. The output of logistic regression is bounded between 0 and 1, thanks to the logistic (sigmoid) function.
Logistic regression finds applications in various fields due to its simplicity, interpretability, and effectiveness in binary classification problems. Some real-world applications include:
- Spam detection (spam vs. not spam)
- Credit risk scoring (default vs. no default)
- Medical diagnosis (disease present vs. absent)
- Customer churn prediction (will leave vs. will stay)
The logistic regression model uses the logistic function (also known as the sigmoid function) to model the relationship between the independent variables and the probability of a particular outcome.
The logistic regression model can be mathematically expressed as:

P(Y=1 | X) = 1 / (1 + e^-(β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ))

The logistic function is defined as:

S(t) = 1 / (1 + e^-t)

Here, S(t) is the output (probability), and t is the linear combination of the independent variables and their coefficients:

t = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
The logistic function transforms any value of t to a value between 0 and 1. This is crucial in logistic regression because we want to model probabilities.
The linear combination t represents the weighted sum of the independent variables. When we apply the logistic function to t, we obtain a value between 0 and 1. This value represents the estimated probability of the event Y=1 given the values of the independent variables.
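The mapping above can be sketched in a few lines of code. The coefficients and feature values here are made up purely for illustration:

```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) function: maps any real t to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical coefficients (first entry is the intercept) and one observation
beta = np.array([-1.5, 0.8, 0.3])   # [intercept, coef_1, coef_2]
x = np.array([1.0, 2.0, 1.0])       # [1 for the intercept, x1, x2]

t = beta @ x        # linear combination t = b0 + b1*x1 + b2*x2
p = sigmoid(t)      # estimated probability that Y = 1
print(f"t = {t:.2f}, P(Y=1) = {p:.3f}")
```

Note that sigmoid(0) = 0.5, so the sign of t decides which class is more likely: a positive linear combination pushes the probability above 0.5, a negative one below.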
In logistic regression, the goal is to find the best-fitting set of coefficients that maximizes the likelihood of the observed data. This is typically done through optimization techniques like gradient descent.
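The fitting step can be sketched as plain gradient descent on a toy dataset. The data, learning rate, and iteration count below are arbitrary choices for illustration, not a production recipe:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy data: one feature, labels separable around x = 0
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Prepend an intercept column of ones
Xb = np.hstack([np.ones((len(X), 1)), X])
beta = np.zeros(Xb.shape[1])

lr = 0.5
for _ in range(2000):
    p = sigmoid(Xb @ beta)            # predicted probabilities
    grad = Xb.T @ (p - y) / len(y)    # gradient of the negative log-likelihood
    beta -= lr * grad                 # gradient descent step

preds = (sigmoid(Xb @ beta) >= 0.5).astype(int)
print(beta)    # positive slope: higher x means higher P(Y=1)
print(preds)   # matches y on this separable toy data
```

In practice, libraries such as scikit-learn use more robust solvers (e.g. L-BFGS) rather than fixed-step gradient descent, but the objective being optimized is the same likelihood.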
Binary logistic regression is used when the dependent variable has two possible categories or classes. These classes are typically coded as 0 and 1, representing the absence or presence of an outcome or event.
For example, binary logistic regression can be used for tasks like:
- Classifying an email as spam or not spam
- Predicting whether a tumor is malignant or benign
- Predicting whether a customer will churn or stay
In binary logistic regression, the logistic function is used to model the probability of belonging to one of the two classes.
Multinomial logistic regression, also known as softmax regression, is used when the dependent variable has more than two categories. It's an extension of logistic regression that can handle multiple classes.
For example, multinomial logistic regression can be used for tasks like:
- Classifying handwritten digits (0 through 9)
- Categorizing news articles by topic (sports, politics, technology, etc.)
- Predicting a flower's species from its measurements
In multinomial logistic regression, the logistic function is extended to handle multiple classes. The softmax function is used to compute the probabilities of each class, and the class with the highest probability is predicted.
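The softmax step can be sketched directly. The scores below are made-up class scores for a single instance:

```python
import numpy as np

def softmax(z):
    """Convert a vector of class scores into probabilities that sum to 1."""
    z = z - np.max(z)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
probs = softmax(scores)
print(probs)             # higher score -> higher probability
print(np.argmax(probs))  # predicted class: the one with the highest probability
```

Subtracting the maximum score before exponentiating changes nothing mathematically (it cancels in the ratio) but prevents overflow when scores are large.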
Here's how it works:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Load the Titanic training data
data = pd.read_csv('train.csv')

# Step 2: Select features and target (.copy() avoids SettingWithCopyWarning)
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch']].copy()
y = data['Survived']

# Convert categorical variable 'Sex' to numeric
X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})

# Handle missing values in 'Age' (replace with mean; direct assignment
# instead of inplace fillna, which is deprecated on chained access)
X['Age'] = X['Age'].fillna(X['Age'].mean())

# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 5: Make predictions
y_pred = model.predict(X_test)

# Step 6: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
# ------------------ OUTPUT -------------- #
Accuracy: 82.68%