
Understanding Principal Component Analysis

by Sahil, August 29th, 2023

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction in data analysis and machine learning. It aims to transform high-dimensional data into a lower-dimensional representation while retaining as much relevant information as possible. The transformed dimensions, known as principal components, are new axes in the feature space that capture the maximum variance present in the original data.


Importance of PCA in Dimensionality Reduction:

  1. Curse of Dimensionality: As the number of features (dimensions) in a dataset increases, the complexity of the data also increases. This can lead to various problems such as increased computational requirements, higher risk of overfitting, and difficulties in visualization and interpretation. PCA helps mitigate these issues by reducing the dimensionality while preserving the essence of the data.
  2. Variance Concentration: In high-dimensional spaces, data points can be scattered in various directions. PCA identifies the directions (principal components) along which the data has the highest variance. By retaining the top principal components, which capture most of the variance, PCA allows us to summarize the data effectively.
  3. Noise Reduction: High-dimensional data often contains noise or irrelevant information. PCA tends to diminish the impact of noisy dimensions by giving them lower importance in terms of variance. This results in a more robust representation of the data.
  4. Visualization: It's challenging to visualize and interpret data with many dimensions. By reducing the data to two or three principal components, PCA facilitates visualization without significant loss of information. This is particularly valuable for understanding patterns and relationships in the data.
  5. Efficient Computation: With fewer dimensions, computational tasks become less resource-intensive and faster. This is beneficial in machine learning, where training models on high-dimensional data can be time-consuming and memory-intensive.
  6. Feature Engineering: PCA can be used as a preprocessing step to transform the original features into a more compact representation. This transformed data can then be used as input for machine learning algorithms, potentially improving their performance. In my previous article, I used PCA for feature engineering.


In essence, PCA addresses the challenge of handling high-dimensional data by projecting it onto a lower-dimensional subspace that captures the most significant variations. This reduction in dimensionality often results in a more manageable, interpretable, and computationally efficient representation of the data, making PCA a powerful tool in various data analysis and machine learning scenarios.

Applications

Principal Component Analysis (PCA) has a wide range of applications across various fields thanks to its ability to reduce dimensionality while preserving important information.


PCA is widely used for:

  • Dimensionality reduction in machine learning to improve model efficiency and prevent overfitting.
  • Data visualization by projecting high-dimensional data into lower dimensions.
  • Noise reduction by focusing on the principal components with the highest signal-to-noise ratio.


In summary, PCA is a fundamental technique that simplifies high-dimensional data by identifying the most important patterns and variations. It's a valuable tool for understanding data, improving computational efficiency, and enhancing model performance.

Understanding Principal Component Analysis

Principal Component Analysis (PCA) is a powerful dimensionality reduction technique used in data analysis and machine learning to transform high-dimensional data into a lower-dimensional representation, while retaining as much relevant information as possible. It achieves this by identifying new axes, called principal components, in the feature space that capture the most significant variations in the data.


Interpreting Results:

  • The first principal component captures the direction of maximum variance in the data.
  • Subsequent principal components are orthogonal to the previous ones and capture decreasing amounts of variance.
  • By selecting fewer principal components, you can achieve dimensionality reduction while preserving a significant portion of the data's variance.

Mathematical Background

Covariance Matrix

The covariance matrix is at the core of PCA. It captures the relationships and interactions between different features (variables) in the dataset. Mathematically, for a dataset with 'n' data points and 'm' features, the covariance matrix C is calculated as:


C = (1 / (n - 1)) * X^T * X


Where:

  • X is the centered data matrix (each column represents a feature and each row represents a data point).
  • X^T is the transpose of the centered data matrix.
  • n is the number of data points.


Each entry C_ij in the covariance matrix represents the covariance between feature i and feature j. If C_ij is positive, it means that when feature i increases, feature j tends to increase as well. If C_ij is negative, it means that when feature i increases, feature j tends to decrease.
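
As a concrete illustration, here is a small NumPy sketch of this computation (the toy data and variable names are my own; X is assumed to have one row per data point and one column per feature):

import numpy as np

# Toy data: 5 data points, 3 features (one row per data point)
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 0.9],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 1.4]])

# Center each feature by subtracting its mean
X_centered = X - X.mean(axis=0)

# Sample covariance matrix: C = (1 / (n - 1)) * X^T * X
n = X_centered.shape[0]
C = (X_centered.T @ X_centered) / (n - 1)

# np.cov treats columns as variables when rowvar=False, so this should match
print(np.allclose(C, np.cov(X, rowvar=False)))  # True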

Eigendecomposition

Eigendecomposition is a mathematical operation that decomposes a matrix into its constituent eigenvectors and eigenvalues. For the covariance matrix C, eigendecomposition involves solving the equation:


C v = λ v

Where:

  • v is an eigenvector of C.
  • λ (lambda) is the eigenvalue associated with the eigenvector v.


Solving this equation for v and λ yields a set of eigenvectors and their corresponding eigenvalues. Eigenvectors are directions in the feature space, and eigenvalues represent the amount of variance captured along these directions.
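
In NumPy this is a single call; the sketch below continues from the covariance matrix C computed above (np.linalg.eigh is used because a covariance matrix is symmetric):

# Eigendecomposition of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)

# eigh returns eigenvalues in ascending order; each column of `eigenvectors`
# is the unit-length eigenvector paired with the matching eigenvalue
print(eigenvalues)

# Verify C v = λ v for the largest eigenpair
v, lam = eigenvectors[:, -1], eigenvalues[-1]
print(np.allclose(C @ v, lam * v))  # True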

Choosing Number of Components Based on Eigen Values

When applying PCA, we sort the eigenvectors in descending order based on their corresponding eigenvalues. The eigenvector with the highest eigenvalue captures the direction of maximum variance in the data, and it becomes the first principal component. Subsequent eigenvectors capture decreasing amounts of variance and become subsequent principal components.


Choosing how many principal components to retain is based on the cumulative explained variance: the running sum of the sorted eigenvalues divided by the total sum of all eigenvalues. Retaining, for instance, 95% of the cumulative explained variance means that you're preserving the directions along which 95% of the data's variance is captured.
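
Continuing the NumPy sketch, this selection can be written as follows (the 95% threshold is just the example value from the text):

# Sort eigenvalues (and matching eigenvectors) in descending order
order = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[order]
sorted_eigenvectors = eigenvectors[:, order]

# Cumulative explained variance: running sum divided by the total sum
cumulative_variance = np.cumsum(sorted_eigenvalues) / np.sum(sorted_eigenvalues)

# Smallest number of components that reaches the 95% threshold
k = np.argmax(cumulative_variance >= 0.95) + 1
print(k, cumulative_variance)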

Interpretation

  • Eigenvectors are normalized, meaning their length is 1, and they represent directions of maximal variance.
  • Eigenvalues indicate the importance of the corresponding eigenvectors in capturing the data's variance.
  • By selecting fewer principal components, you're choosing to represent the data in a lower-dimensional subspace that preserves most of the variance while discarding less important variations.


In summary, PCA is a mathematical technique that revolves around covariance, eigendecomposition, and the selection of principal components based on eigenvalues. It's a method to find the most informative directions (principal components) in the data, allowing for effective dimensionality reduction.

Step-by-Step PCA Algorithm:

  1. Data Preparation:
    • Center the data by subtracting the mean of each feature from the dataset.
    • Compute the covariance matrix of the centered data.
  2. Eigendecomposition:
    • Compute the eigenvectors and eigenvalues of the covariance matrix.
    • Sort the eigenvectors in descending order based on their corresponding eigenvalues.
  3. Choosing Principal Components:
    • Decide on the number of principal components to retain based on the cumulative explained variance.
  4. Transformation:
    • Select the top 'k' eigenvectors (principal components).
    • Form a transformation matrix 'W' with these 'k' eigenvectors as columns.
    • Transform the centered data by multiplying it with 'W'.
  5. Interpretation and Analysis:
    • The transformed data 'X_new' is the lower-dimensional representation of the original data.
    • These transformed dimensions (principal components) are orthogonal and uncorrelated.


By following these steps, PCA effectively reduces the dimensionality of the data while capturing the most important patterns and variations present in the original dataset.
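
To make these steps concrete, here is a minimal from-scratch sketch in NumPy. The function and variable names are my own, and it is meant as an illustration of the algorithm rather than a replacement for scikit-learn's PCA:

import numpy as np

def pca_from_scratch(X, k):
    """Project X onto its top-k principal components."""
    # Step 1: center the data and compute the covariance matrix
    X_centered = X - X.mean(axis=0)
    C = np.cov(X_centered, rowvar=False)

    # Step 2: eigendecomposition, sorted by descending eigenvalue
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Steps 3-4: keep the top-k eigenvectors as the transformation matrix W
    W = eigenvectors[:, :k]
    X_new = X_centered @ W

    # Step 5: return the lower-dimensional data and the explained-variance ratios
    return X_new, eigenvalues[:k] / eigenvalues.sum()

# Example: reduce random 10-dimensional data to 2 dimensions
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))
X_new, explained = pca_from_scratch(X_demo, k=2)
print(X_new.shape, explained)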

Visualizing PCA

Visualizing High-Dimensional Data Using PCA


Visualizing high-dimensional data is a challenging task, but PCA can help by reducing the data to a lower-dimensional space while preserving essential information. Let's go through the process of visualizing high-dimensional data in both 2D and 3D using PCA, and then demonstrate scatter plots and biplots.


Visualizing Data in 2D

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Convert the dataset to a pandas DataFrame for better visualization
iris_df = pd.DataFrame(data=np.c_[X, y], columns=feature_names + ['species'])
species_names = iris.target_names
iris_df['species'] = iris_df['species'].map({i: species_names[i] for i in range(len(species_names))})

# Display the first few rows of the DataFrame
print(iris_df.head())

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize in 2D using a scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA: Iris Dataset in 2D')
plt.colorbar(label='Species')
plt.show()


2D PCA Visualization


This plot shows how the data points look in a two-dimensional space formed by the first two principal components. Each data point is represented by a dot, and the color of the dot indicates the class or category of the data point.
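
It is also worth checking how much of the original variance these two components retain; the explained_variance_ratio_ attribute of the fitted PCA object (pca from the snippet above) reports exactly this:

# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.2%}")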

Visualizing Data in 3D

# Apply PCA for visualization
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Visualize in 3D using a scatter plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y, cmap='viridis')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
plt.title('PCA: Iris Dataset in 3D')
plt.colorbar(scatter, ax=ax, label='Species')
plt.show()



3D PCA Visualization


In this plot, we extend the 2D scatter plot to three dimensions by adding the third principal component. We use the Axes3D module from Matplotlib to create a 3D subplot.

Biplots for Visualizing Features and Principal Components

A biplot combines a scatter plot of the data points with vectors indicating the direction of the original features and the principal components.

def plot_biplot(pca, X_pca, feature_names):
    plt.figure(figsize=(10, 8))
    # Scatter the data in the plane of the first two principal components
    plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.7)

    # Draw one arrow per original feature, showing its loading on PC1 and PC2
    for i, feature in enumerate(feature_names):
        plt.arrow(0, 0, pca.components_[0, i], pca.components_[1, i], color='r', alpha=0.5)
        plt.text(pca.components_[0, i] * 1.3, pca.components_[1, i] * 1.3, feature, color='g')

    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.title('Biplot: Feature and Principal Component Visualization')
    plt.grid()
    plt.show()

# Refit PCA with two components and visualize the biplot on the transformed data
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plot_biplot(pca, X_pca, feature_names)



Bi-plot


In the biplot, the data points are scattered in the plane of the first two principal components, and the length and direction of the feature vectors indicate which original features contribute most to each principal component.


Checking Correlation Between Features

import seaborn as sns
# Calculate the correlation matrix of the numeric feature columns
# (the string 'species' column is dropped so .corr() works in all pandas versions)
correlation_matrix = iris_df.drop(columns=['species']).corr()

# Plot the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Iris Features')
plt.show()


Correlation of all the features


In these examples, we used PCA to reduce the dimensions of the Iris dataset and then visualized the transformed data in both 2D and 3D using scatter plots. Additionally, we demonstrated how to create a biplot to visualize the relationships between the original features and principal components. These visualization techniques provide insights into the structure and patterns of high-dimensional data after dimensionality reduction.

PCA With Scikit-Learn

Understanding scikit-learn (sklearn):

Scikit-learn is a widely used Python library for machine learning, data mining, and data analysis. It provides a comprehensive set of tools and functions to handle a variety of tasks in these domains, including classification, regression, clustering, dimensionality reduction, and more. Scikit-learn is built on top of other scientific computing libraries like NumPy, SciPy, and Matplotlib, making it seamless to integrate into data science workflows.

Implementing PCA on a Kaggle Dataset:

For this demonstration, let's use the "Breast Cancer Wisconsin (Diagnostic)" dataset, which is available on Kaggle and also ships with scikit-learn (loaded below via load_breast_cancer). It contains features computed from digitized images of fine needle aspirates (FNA) of breast masses.

Step 1: Importing Libraries and Loading Data

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target


In this block, you start by importing the necessary libraries and modules. You import numpy for numerical computations, pandas for data manipulation (though it's not used in this example), and various modules from sklearn for loading the dataset, preprocessing data, applying PCA, building a logistic regression model, and evaluating model performance.

Step 2: Splitting Data and Standardizing Features

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


Here, you split the loaded dataset into training and testing sets using the train_test_split function. The test_size parameter indicates the proportion of the data that will be used for testing (30% in this case), and random_state ensures reproducibility of results.


You then standardize the features using the StandardScaler to ensure that each feature has a mean of 0 and a standard deviation of 1. Standardization is important for many machine learning algorithms, especially when features are on different scales.

Step 3: Building Models (Before and After PCA)

# Build a Logistic Regression model without PCA
model_before_pca = LogisticRegression(random_state=42)
model_before_pca.fit(X_train_scaled, y_train)
y_pred_before_pca = model_before_pca.predict(X_test_scaled)

# Apply PCA and build a Logistic Regression model after PCA
n_components = 10  # Choose the number of components
pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

model_after_pca = LogisticRegression(random_state=42)
model_after_pca.fit(X_train_pca, y_train)
y_pred_after_pca = model_after_pca.predict(X_test_pca)


In this block, you build two logistic regression models: one without PCA (model_before_pca) and another after applying PCA (model_after_pca).


  • For the model before PCA:
    • You create a logistic regression model using LogisticRegression from sklearn.
    • You fit the model on the standardized training data using fit.
    • You predict the target values for the standardized test data using predict.
  • For the model after PCA:
    • You specify the desired number of principal components (n_components) for PCA.
    • You create a PCA object using PCA from sklearn.decomposition.
    • You fit PCA on the training data and transform both the training and test data using its fit_transform and transform methods.
    • You create a logistic regression model using LogisticRegression.
    • You fit the model on the transformed training data using fit.
    • You predict the target values for the transformed test data using predict.

Step 4: Comparing Model Performances

# Compare model performances
accuracy_before_pca = accuracy_score(y_test, y_pred_before_pca)
accuracy_after_pca = accuracy_score(y_test, y_pred_after_pca)

print(f"Accuracy before PCA: {accuracy_before_pca:.4f}")
print(f"Accuracy after PCA: {accuracy_after_pca:.4f}")

# -------------------------- OUTPUT -------------------- #
Accuracy before PCA: 0.9825
Accuracy after PCA: 0.9942


Here, you compare the performance of the two logistic regression models:

  • You calculate the accuracy of both models using accuracy_score from sklearn.metrics.
  • You print out the accuracy before PCA and after PCA.


This last step provides insights into how applying PCA affects the accuracy of the model. Higher accuracy after PCA suggests that the dimensionality reduction process is still preserving the important patterns in the data while simplifying the feature space.

Interpreting Model Accuracy Results Before and After PCA:

In your experiment, you applied Principal Component Analysis (PCA) on a machine learning model using the "Breast Cancer Wisconsin (Diagnostic)" dataset. After evaluating the models, you obtained the following accuracy results:


  • Accuracy before PCA: 0.9825
  • Accuracy after PCA: 0.9942


These accuracy values reveal intriguing insights into the impact of PCA on the performance of your machine learning model. Let's break down these results and what they mean for your analysis:

Accuracy before PCA (0.9825):

  • This accuracy score represents the performance of your machine learning model without any dimensionality reduction using PCA.
  • The model achieved an impressive accuracy of approximately 98.25%.
  • In this context, "accuracy" refers to the proportion of correctly predicted outcomes (diagnostic results) out of the total predictions made by the model.
  • The high accuracy score suggests that the model, when trained on the original high-dimensional data, was capable of capturing the underlying patterns and relationships present in the dataset.
  • This result serves as a benchmark against which you can compare the performance of the model after applying PCA.

Accuracy after PCA (0.9942):

  • This accuracy score represents the performance of your machine learning model after applying PCA for dimensionality reduction.
  • The model achieved an even higher accuracy of approximately 99.42%.
  • The improvement in accuracy indicates that PCA played a beneficial role in enhancing the model's predictive capabilities on this dataset.
  • By reducing the dataset's dimensionality while retaining most of the relevant information, PCA helped the model focus on the most meaningful features, reducing the potential impact of noise and irrelevant variations.
  • The increased accuracy suggests that the model's generalization to new, unseen data may have improved after applying PCA.


The results clearly demonstrate the positive influence of PCA on the machine learning model's performance. The increase in accuracy from 98.25% to 99.42% after PCA showcases how dimensionality reduction techniques like PCA can contribute to building more robust and efficient machine learning models. By retaining the essential information while reducing the complexity of the data, PCA enhances the model's ability to generalize and make accurate predictions on new, unseen data points. This finding underlines the importance of thoughtful feature engineering and preprocessing techniques like PCA in the data analysis and machine learning pipeline.

How to Choose Number of Components

Scree Plot

A scree plot is a graphical representation of the eigenvalues of the principal components. Eigenvalues indicate the amount of variance explained by each component. In a scree plot, you plot the eigenvalues against the component index. The "elbow point" is where the eigenvalues start to level off, suggesting that the significant components have been captured.

import numpy as np
import matplotlib.pyplot as plt

# `pca` here is the PCA object fitted earlier; fitting PCA() without limiting
# n_components lets you inspect the eigenvalue of every component
eigenvalues = pca.explained_variance_
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker='o')
plt.xlabel('Principal Component')
plt.ylabel('Eigenvalue')
plt.title('Scree Plot')
plt.show()

Scree Plot

Cumulative Explained Variance Plot

This plot shows the cumulative proportion of explained variance as you add more components. It helps you understand how much variance is retained by including different numbers of components. You can look for the point where adding more components only marginally increases the cumulative variance.


cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance Plot')
plt.show()


Cumulative Explained Variance Plot

Retaining a Specific Amount of Variance

You can choose to retain a specific amount of variance (e.g., 95%). By plotting the cumulative explained variance, you can see how many components are needed to achieve that threshold.


desired_variance = 0.95
num_components = np.argmax(cumulative_variance >= desired_variance) + 1
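
scikit-learn can also make this selection for you: passing a float between 0 and 1 as n_components tells PCA to keep just enough components to reach that fraction of explained variance. A minimal sketch, assuming X_train_scaled is the standardized training data from the earlier example:

from sklearn.decomposition import PCA

# Keep the smallest number of components explaining at least 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver='full')
X_train_reduced = pca_95.fit_transform(X_train_scaled)

print("Components retained:", pca_95.n_components_)
print(f"Variance explained: {pca_95.explained_variance_ratio_.sum():.4f}")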

Cross-Validation and Model Performance

In some cases, you can use cross-validation to evaluate model performance with different numbers of components. Choose the number of components that results in the best model performance metric, such as accuracy or mean squared error.
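
One way to set this up with scikit-learn is to put PCA and the estimator into a Pipeline and grid-search over n_components with cross-validation. A sketch, reusing X_train_scaled and y_train from the earlier breast-cancer example (the candidate values in the grid are arbitrary):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Chain PCA and logistic regression so each CV fold fits its own PCA
pipe = Pipeline([
    ('pca', PCA()),
    ('clf', LogisticRegression(random_state=42, max_iter=1000)),
])

# Candidate numbers of components to evaluate
param_grid = {'pca__n_components': [2, 5, 10, 15, 20]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train_scaled, y_train)

print("Best number of components:", search.best_params_['pca__n_components'])
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")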

Domain Knowledge

Your understanding of the data and the problem domain can guide your choice of components. If you know that certain features are irrelevant or redundant, you might choose a smaller number of components.

Rule of Thumb

A common rule of thumb is to choose the number of components that capture a significant portion of the variance, such as 95% or 99%.


Keep in mind that there's no one-size-fits-all approach, and the choice of the number of components can depend on the specific goals of your analysis, the complexity of the data, and the trade-off between simplicity and information retention. It's often helpful to experiment with different methods and consider the overall context of your analysis.