How Likely Was One to Survive on the Titanic?

by Sahil (@dotslashbit), August 9th, 2023

The sinking of the RMS Titanic in 1912 remains one of the deadliest maritime disasters in history. More than 1,500 people lost their lives when the ship struck an iceberg and sank in the North Atlantic Ocean. In the years that followed, extensive research has been conducted to understand the factors that contributed to the high death toll.


Only about 38% of the passengers in the Kaggle training data survived this devastating event, prompting me to wonder about the individuals who were aboard the Titanic that fateful night.


Did class distinctions play a role in determining the fate of those onboard, creating a divide between privilege and danger? How did age and gender influence who survived and who succumbed to the relentless sea? Amidst the chaos, did the presence of family members provide comfort and support, urging passengers to face the storm together? And did the port of embarkation influence the destinies of those who boarded from different locations?


Thankfully, we can answer these questions using the Titanic dataset available at Kaggle. More than a hundred years after it sank, we can use the data to understand how a ticket's price may have influenced survival and whether certain cabin locations provided refuge during the tragic events. This Exploratory Data Analysis (EDA) will reveal insights into the passengers’ experiences, shedding light on their stories of courage and loss on that fateful night.

Questions/Insights

Here are some of the insights that we will be exploring during the analysis:


  1. What is the overall survival rate of passengers on the Titanic?
  2. How does the survival rate vary by gender? Are females more likely to survive than males?
  3. What is the distribution of passenger ages on the Titanic? Are there any notable patterns?
  4. Did passengers in different passenger classes (1st, 2nd, 3rd) have different survival rates?
  5. What is the survival rate among different age groups (e.g., children, adults, elderly)?
  6. Did the port of embarkation affect the chances of survival?
  7. How does the presence of family affect the survival rate?
  8. Did passengers with higher fares have a better chance of survival?
  9. What is the distribution of passenger cabin locations? Did passengers in certain cabins have a higher survival rate?

Data

The Titanic dataset is a collection of data about passengers of the RMS Titanic, which sank in 1912. The dataset contains information about each passenger's name, age, gender, ticket class, fare, cabin, port of embarkation, and family members aboard, and whether they survived the sinking. The Titanic dataset is a popular dataset for machine learning and data science projects. It is often used to train models to predict whether passengers survived the sinking based on their characteristics. The Titanic dataset is also used to study social networks and human behavior.


The Titanic dataset is hosted by Kaggle, a data science community, as part of its Titanic competition. The dataset is available for free download on the Kaggle website.


You can get the dataset here

Prerequisites

  • Pandas
  • Matplotlib
  • Seaborn

Loading The Data

Let’s start our analysis by loading the necessary modules and the Titanic dataset.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset
df = pd.read_csv('/kaggle/input/titanic/train.csv')
df.head()

titanic dataset first 5 rows

You can see that this dataset contains all the information about each passenger that I have discussed in the introduction section of this article.


Now, let's start our analysis using pandas and visualize the insights using matplotlib and seaborn.

Exploratory Data Analysis

What is the overall survival rate of passengers on the Titanic?

# Calculate the overall survival rate
survival_rate = df['Survived'].mean() * 100

# Create a bar plot to visualize the survival rate
sns.set(style='darkgrid')
plt.figure(figsize=(6, 4))
sns.countplot(x='Survived', data=df)
plt.xlabel('Survived')
plt.ylabel('Passenger Count')
plt.title('Survival Rate: {:.2f}%'.format(survival_rate))
plt.xticks([0, 1], ['No', 'Yes'])
plt.show()

percentage of passengers survived

The bar plot provides a clear and concise visual representation of the overall survival rate of passengers on the Titanic. It reveals that only 38% of the passengers managed to survive the disaster.


The height of the bars represents the number of passengers in each category (0 for non-survivors and 1 for survivors). The survival rate, indicated in the title of the plot, highlights the percentage of passengers who survived the tragic event.

How does the survival rate vary by gender? Are females more likely to survive than males?

# Calculate the survival rate by gender
survival_by_gender = df.groupby('Sex')['Survived'].mean() * 100

# Create a bar plot to visualize the survival rate by gender
sns.set(style='darkgrid')
plt.figure(figsize=(6, 4))
sns.barplot(x=survival_by_gender.index, y=survival_by_gender.values)
plt.xlabel('Gender')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Gender')
plt.show()

percentage of passengers survived by gender

You can see that over 70% of female passengers survived, compared with fewer than 20% of male passengers. The significantly higher survival rate for females compared to males is a striking observation.


This discrepancy suggests that gender played a crucial role in determining the chances of survival during the Titanic tragedy.
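A survival rate by gender ("what share of each gender survived") is easy to confuse with the gender share among survivors ("what share of survivors were female"). The sketch below shows both with `pd.crosstab`. It uses a small made-up `toy` DataFrame, not the real Titanic records, so the numbers are only illustrative.

```python
import pandas as pd

# Toy data for illustration only (NOT the real Titanic records)
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'female',
                 'male', 'male', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Survival rate within each gender: every row sums to 100%
rate_within_gender = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index') * 100

# Gender share within each outcome: every column sums to 100%
share_of_outcome = pd.crosstab(toy['Sex'], toy['Survived'], normalize='columns') * 100

print(rate_within_gender.round(1))
print(share_of_outcome.round(1))
```

In the toy data, 75% of females survive, but females make up only 60% of the survivors. The same two calls can be run on `df` to get both views of the real data.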

What is the distribution of passenger ages on the Titanic? Are there any notable patterns?

# Plot the distribution of passenger ages
sns.set(style='darkgrid')
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='Age', bins=20, kde=True)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Distribution of Passenger Ages')
plt.show()

distribution of passenger ages

The histogram with KDE plot illustrates the distribution of passenger ages onboard the Titanic. The data shows a right-skewed distribution, indicating that there were more young adults, particularly between the ages of 18 and 35, compared to older adults or children among the passengers, with a long tail toward older ages.


To analyze the distribution of passenger ages, we utilized a histogram with KDE (Kernel Density Estimation) plot. The x-axis represents different age intervals (bins), while the y-axis displays the count of passengers falling into each age group. By visualizing the data in this manner, we were able to discern the skewed nature of the age distribution on the Titanic, highlighting the prevalence of young adults among the passengers.
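The direction of the skew can also be checked numerically rather than read off the plot. The sketch below uses a short made-up list of ages (an assumption for illustration, not the real column) and relies on `Series.skew()`, which skips missing values by default.

```python
import pandas as pd

# Made-up ages for illustration only; None stands in for a missing Age value
ages = pd.Series([2, 18, 20, 22, 24, 25, 27, 28, 30, 32, 35, 40, 48, 60, 71, None],
                 name='Age')

skewness = ages.skew()          # positive means a longer tail toward older ages
mean_age = ages.mean()
median_age = ages.median()

print(f"skewness={skewness:.2f}, mean={mean_age:.1f}, median={median_age:.1f}")
print("missing ages:", ages.isna().sum())
```

A positive skewness together with a mean above the median points to a right-skewed distribution. On the real data, the same calls would be applied to `df['Age']`.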


Did passengers in different passenger classes (1st, 2nd, 3rd) have different survival rates?

# Calculate the survival rates by passenger class
survival_by_class = df.groupby('Pclass')['Survived'].mean() * 100

# Create a bar plot to visualize the survival rates by passenger class
sns.set(style='darkgrid')
plt.figure(figsize=(6, 4))
sns.barplot(x=survival_by_class.index, y=survival_by_class.values)
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Passenger Class')
plt.show()

survival rate by passenger class

The bar plot showcases the survival rates based on passenger class for the passengers in the Titanic dataset. It is evident from the plot that the higher the passenger class, the higher the survival rate. This observation aligns with the historical understanding that passengers in higher classes (1st class) had better access to lifeboats and safety measures, which likely contributed to their higher chances of survival. In contrast, passengers in lower classes (3rd class) faced more challenges during the evacuation process, potentially leading to a lower survival rate for that group.


To analyze the survival rates based on passenger class, we created a bar plot. Each bar represents the percentage of passengers who survived for each class category (1st, 2nd, or 3rd class). By visually examining the plot, we were able to identify any disparities in survival rates among the different passenger classes. The heights of the bars indicate the survival rates, with higher bars indicating higher percentages of survivors for specific passenger classes.

What is the survival rate among different age groups (e.g., children, adults, elderly)?

# Create age groups
age_bins = [0, 12, 18, 30, 50, 100]  # Define the age group boundaries
age_labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Elderly']  # Define the age group labels
df['AgeGroup'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)

# Calculate the survival rates by age group
survival_by_age_group = df.groupby('AgeGroup')['Survived'].mean() * 100

# Create a bar plot to visualize the survival rates by age group
sns.set(style='darkgrid')
plt.figure(figsize=(8, 6))
sns.barplot(x=survival_by_age_group.index, y=survival_by_age_group.values)
plt.xlabel('Age Group')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Age Group')
plt.show()

survival rate by age group

The bar plot illustrates the survival rates for different age groups among the passengers in the Titanic dataset. It reveals that children had the highest chances of survival during the disaster, while the rates for the older age groups were lower. This observation is consistent with priority being given to children during the evacuation process. Passengers with a missing age are not assigned to any age group and are therefore excluded from this plot.


To analyze the survival rates based on age groups, we created a bar plot. Each bar represents the percentage of survivors in a specific age group. By examining the plot, we were able to observe the variations in survival rates among different age groups. This allowed us to infer that children likely received priority and had better chances of survival during the tragic event. The methodology employed here provided insight into the impact of age on survival outcomes, reflecting efforts to protect the most vulnerable passengers during the disaster.
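The age groups depend on how `pd.cut` treats the bin edges. With `right=False`, each bin includes its left edge and excludes its right edge, so a 12-year-old counts as a Teenager, not a Child. Here is a minimal sketch using the same bins and labels on a few hand-picked ages (the `sample_ages` values are assumptions chosen to sit on the boundaries):

```python
import pandas as pd

age_bins = [0, 12, 18, 30, 50, 100]
age_labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Elderly']

# Hand-picked boundary ages for illustration; None stands in for a missing Age
sample_ages = pd.Series([11, 12, 17, 18, 30, 50, None])

# right=False makes the intervals [0, 12), [12, 18), [18, 30), [30, 50), [50, 100)
groups = pd.cut(sample_ages, bins=age_bins, labels=age_labels, right=False)

for age, group in zip(sample_ages, groups):
    print(age, '->', group)
```

The missing age gets no group at all, which is why such passengers drop out of the `groupby` that follows.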

Did the port of embarkation affect the chances of survival?

# Calculate the survival rates by port of embarkation
survival_by_embarkation = df.groupby('Embarked')['Survived'].mean() * 100

# Create a bar plot to visualize the survival rates by port of embarkation
sns.set(style='darkgrid')
plt.figure(figsize=(6, 4))
sns.barplot(x=survival_by_embarkation.index, y=survival_by_embarkation.values)
plt.xlabel('Port of Embarkation')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Port of Embarkation')
plt.show()

survival rate by port of embarkation

The bar plot showcases the survival rates based on the port of embarkation for the passengers in the Titanic dataset. By analyzing the plot, we can clearly observe that the highest survival rate is associated with passengers who embarked from Cherbourg, while the lowest survival rate is linked to those who embarked from Southampton.


This visualization offers valuable insights into the variations in survival rates based on the port of embarkation, suggesting potential factors that may have influenced the passengers' chances of survival.


To analyze the survival rates based on the port of embarkation, we created a bar plot. Each bar represents the percentage of survivors for each port of embarkation category (Cherbourg, Queenstown, Southampton). By visually examining the plot, we were able to identify significant differences in survival rates among the different embarkation points.

Did passengers with higher fares have a better chance of survival?

# Create fare groups
fare_bins = [0, 50, 100, 150, 200, 300, 1000]  # Define the fare group boundaries
fare_labels = ['0-50', '50-100', '100-150', '150-200', '200-300', '300+']  # Define the fare group labels
df['FareGroup'] = pd.cut(df['Fare'], bins=fare_bins, labels=fare_labels, right=False)

# Calculate the survival rates by fare group
survival_by_fare_group = df.groupby('FareGroup')['Survived'].mean() * 100

# Create a bar plot to visualize the survival rates by fare group
sns.set(style='darkgrid')
plt.figure(figsize=(8, 6))
sns.barplot(x=survival_by_fare_group.index, y=survival_by_fare_group.values)
plt.xlabel('Fare Group')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Fare Group')
plt.show()

survival rate by fare

The graph illustrates that the survival rate was highest for passengers in the highest fare group (300+) and lowest for passengers in the lowest fare group (0-50). Overall, the higher fare groups showed higher survival rates than the lowest fare group, although the upper groups contain only a few passengers each, so the ordering among them is less reliable.


This analysis suggests that the fare paid played a significant role in determining the chances of survival on the Titanic. Passengers who paid higher fares were more likely to be in first class, which had a higher survival rate overall.


Additionally, these higher-paying passengers might have been given priority during the rescue efforts, contributing to their higher survival rate.


To analyze the relationship between fare groups and survival rates, we created a bar plot showcasing the survival percentages for each fare group. The graph allowed us to observe the trend of survival rates based on fare groups, revealing that higher fares were associated with higher survival rates.
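A survival percentage is easier to judge when the group size is shown next to it, because a rate computed from a handful of passengers can swing a lot. The sketch below reports both per fare group. The `toy` DataFrame is made up for illustration and is not the real data. The bins and labels are the ones used above.

```python
import pandas as pd

# Toy fares and outcomes for illustration only (NOT the real Titanic records)
toy = pd.DataFrame({
    'Fare':     [7.25, 8.05, 13.0, 26.0, 30.0, 71.28, 83.0, 151.55, 263.0, 512.33],
    'Survived': [0, 0, 1, 0, 1, 1, 1, 0, 1, 1],
})

fare_bins = [0, 50, 100, 150, 200, 300, 1000]
fare_labels = ['0-50', '50-100', '100-150', '150-200', '200-300', '300+']
toy['FareGroup'] = pd.cut(toy['Fare'], bins=fare_bins, labels=fare_labels, right=False)

# observed=False keeps empty fare groups in the result (count 0, rate NaN)
summary = toy.groupby('FareGroup', observed=False)['Survived'].agg(rate='mean', n='count')
summary['rate'] = summary['rate'] * 100

print(summary)
```

Running the same aggregation on `df` shows how many passengers stand behind each bar in the plot.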

What is the distribution of passenger cabin locations? Did passengers in certain cabins have a higher survival rate?

# Extract the cabin deck from the Cabin column
df['CabinDeck'] = df['Cabin'].str.extract(r'([A-Za-z])')

# Plot the distribution of passenger cabin locations
sns.set(style='darkgrid')
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='CabinDeck', order=sorted(df['CabinDeck'].dropna().unique()))
plt.xlabel('Cabin Deck')
plt.ylabel('Count')
plt.title('Distribution of Passenger Cabin Locations')
plt.show()

# Calculate the survival rates by cabin deck
survival_by_cabin_deck = df.groupby('CabinDeck')['Survived'].mean() * 100

# Create a bar plot to visualize the survival rates by cabin deck
sns.set(style='darkgrid')
plt.figure(figsize=(8, 6))
sns.barplot(x=survival_by_cabin_deck.index, y=survival_by_cabin_deck.values)
plt.xlabel('Cabin Deck')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Cabin Deck')
plt.show()

distribution of passenger cabin locations

survival rate by cabin deck

The graph does not show a simple trend by deck level. Decks B, D, and E have the highest survival rates, while passengers on A deck, the uppermost of the lettered decks, have one of the lowest. The T entry represents a single passenger.


This analysis suggests that the location of a passenger's cabin may have been a factor in their chances of survival on the Titanic, but the evidence is limited: the Cabin value is missing for most passengers in the dataset, and the known cabins belong mostly to first-class passengers. Passengers on the upper decks were likely to have better access to lifeboats, and they may have been given priority during the rescue operations, though the plot alone cannot confirm this.


To analyze the relationship between cabin deck levels and survival rates, we created a bar plot displaying the survival percentages for each deck. By examining the graph, we could compare the survival rates across the cabin decks.
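Before reading much into the deck plots, it helps to check how many passengers have a Cabin value at all, and in which classes. The sketch below does this on a made-up `toy` DataFrame (not the real records). It also passes `expand=False` to `str.extract`, a small variation on the code above that returns a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Toy data for illustration only (NOT the real Titanic records)
toy = pd.DataFrame({
    'Cabin':  ['C85', None, 'E46', None, None, 'B28', None, 'F G73'],
    'Pclass': [1, 3, 1, 3, 2, 1, 3, 3],
})

# First letter of the cabin string is taken as the deck; missing cabins stay NaN
toy['CabinDeck'] = toy['Cabin'].str.extract(r'([A-Za-z])', expand=False)

# Share of passengers with no recorded cabin
missing_share = toy['Cabin'].isna().mean() * 100

# Share of passengers with a known cabin, per passenger class
known_by_class = toy['Cabin'].notna().groupby(toy['Pclass']).mean() * 100

print(f"Cabin missing for {missing_share:.1f}% of passengers")
print(known_by_class.round(1))
```

If the known cabins are concentrated in one class, the deck plot mostly describes that class and not the ship as a whole.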

Survival Rate by Age Group and Gender

# Create age groups
age_bins = [0, 12, 18, 30, 50, 100]  # Define the age group boundaries
age_labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Elderly']  # Define the age group labels
df['AgeGroup'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)

# Calculate the survival rates by age group and gender
survival_by_age_gender = df.groupby(['AgeGroup', 'Sex'])['Survived'].mean() * 100

# Convert the survival rates into a pivot table for easier visualization
survival_pivot = survival_by_age_gender.unstack()

# Create a heatmap to visualize the survival rates by age group and gender
sns.set(style='darkgrid')
plt.figure(figsize=(8, 6))
sns.heatmap(data=survival_pivot, annot=True, cmap='coolwarm', fmt=".2f", cbar=True)
plt.xlabel('Gender')
plt.ylabel('Age Group')
plt.title('Survival Rate by Age Group and Gender')
plt.show()

survival rate by age group and gender

The heatmap provides a visual representation of the survival rates for different age groups and genders among the passengers in the Titanic dataset. By analyzing the heatmap, we can observe patterns and trends in survival rates based on age and gender, allowing us to identify which age and gender groups had higher or lower chances of survival during the disaster. The color scale in the heatmap serves as a clear indicator: with the coolwarm colormap, warmer (red) cells represent higher survival rates and cooler (blue) cells represent lower survival rates.


From the heatmap, we can draw several insights. Elderly female passengers had the highest likelihood of survival, while elderly male passengers had the lowest chances of survival. Additionally, the heatmap reveals that, in general, females were more likely to survive across all age categories. These observations shed light on the significant influence of age and gender in determining the passengers' survival outcomes during this historical tragedy.


To visualize the survival rates based on age groups and genders, we used a heatmap. Each cell in the heatmap represents the survival rate (percentage) for a specific age group and gender combination. By examining the color intensity in the heatmap, we were able to discern the differences in survival rates among various age and gender groups.

Survival Rate by Family Size

# Calculate the total number of family members for each passenger
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Calculate the survival rates by family size
survival_by_family_size = df.groupby('FamilySize')['Survived'].mean() * 100

# Create a bar plot to visualize the survival rates by family size
sns.set(style='darkgrid')
plt.figure(figsize=(10, 6))
sns.barplot(x=survival_by_family_size.index, y=survival_by_family_size.values)
plt.xlabel('Family Size')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Family Size')
plt.show()

survival rate by family size

The graph illustrates that passengers traveling with a small family were more likely to survive the sinking of the Titanic than those traveling alone. The survival rate increased as family size grew from 1 to 4, then dropped sharply for families of 5 or more.


This insight is significant as it highlights family size as a useful predictor of survival on the Titanic, although the relationship is not simply "the larger, the better". Passengers in small families were likely traveling together and may have been able to help one another reach the lifeboats, while very large groups may have found it harder to stay together during the evacuation.


Additionally, having family members aboard may have facilitated mutual support during the evacuation, which could have contributed to the higher chances of survival for small families. Many of the largest families were traveling in third class, which had a lower survival rate overall.


To analyze the relationship between family size and survival rates, we created a bar plot displaying the survival percentages for each family size category. By visually examining the graph, we could identify how family size influenced the passengers' survival outcomes.
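Individual family sizes at the upper end contain very few passengers, so one option is to bucket `FamilySize` into a few coarse categories before comparing rates. The sketch below is one possible bucketing. The bucket edges and names (`Alone`, `Small (2-4)`, `Large (5+)`) are assumptions for illustration, and the `toy` DataFrame is made up, not the real data.

```python
import pandas as pd

# Toy data for illustration only (NOT the real Titanic records)
toy = pd.DataFrame({
    'SibSp':    [0, 1, 0, 1, 4, 5, 0, 2],
    'Parch':    [0, 0, 0, 2, 2, 2, 0, 1],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 1],
})

# Same definition as above: siblings/spouses + parents/children + the passenger
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# Hypothetical buckets: [1, 2) alone, [2, 5) small family, [5, 12) large family
size_bins = [1, 2, 5, 12]
size_labels = ['Alone', 'Small (2-4)', 'Large (5+)']
toy['FamilyType'] = pd.cut(toy['FamilySize'], bins=size_bins, labels=size_labels, right=False)

# Survival rate and passenger count per bucket
summary = toy.groupby('FamilyType', observed=False)['Survived'].agg(rate='mean', n='count')
summary['rate'] = summary['rate'] * 100

print(summary)
```

Applied to `df`, this gives three bars with more passengers behind each, at the cost of hiding the detail within each bucket.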

Survival Rate by Port of Embarkation and Passenger Class

# Calculate the survival rates by port of embarkation and passenger class
survival_by_embark_class = df.groupby(['Embarked', 'Pclass'])['Survived'].mean() * 100

# Convert the survival rates into a pivot table for easier visualization
survival_pivot = survival_by_embark_class.unstack()

# Create a heatmap to visualize the survival rates
sns.set(style='darkgrid')
plt.figure(figsize=(8, 6))
sns.heatmap(data=survival_pivot, annot=True, cmap='coolwarm', fmt=".2f", cbar=True)
plt.xlabel('Passenger Class')
plt.ylabel('Port of Embarkation')
plt.title('Survival Rate by Port of Embarkation and Passenger Class')
plt.show()

survival rate by port and embarkment

The graph reveals several important patterns regarding the survival rates based on the port of embarkation and passenger class. Passengers who embarked from Cherbourg had a higher survival rate compared to those from Queenstown or Southampton.


This is possibly due to a larger share of first-class passengers embarking at Cherbourg, the first port of call after Southampton. Additionally, as Cherbourg is a French port, one might speculate about a bias towards saving French passengers, but the dataset offers no evidence for this.


Furthermore, the graph indicates that passengers in first class had a higher survival rate compared to those in second or third class. This can be attributed to first-class passengers being seen as a priority for rescue and having better access to lifeboats due to their higher social status and potential ability to afford life-saving measures.


Lastly, a notable trend is the decreasing survival rate as the passenger class decreases, seen most clearly for Cherbourg and Southampton. Queenstown has very few first- and second-class passengers, so its rates for those classes are unreliable. This suggests that the passenger class was a significant determinant of survival on the Titanic, with first-class passengers having the highest chances of survival.


To analyze the relationship between the port of embarkation, passenger class, and survival rates, we created a heatmap. The heatmap represents the survival rate percentages for different combinations of port of embarkation and passenger class. By examining the heatmap, we were able to identify patterns and trends in survival rates based on these two factors.
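Differences between ports may partly reflect which classes boarded where, so it is worth looking at the class mix per port and at the number of passengers in each cell of the heatmap. The sketch below shows one way to do that with `pd.crosstab`. The `toy` DataFrame is made up for illustration and is not the real data.

```python
import pandas as pd

# Toy data for illustration only (NOT the real Titanic records)
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'Q', 'Q', 'S', 'S', 'S', 'S', 'S'],
    'Pclass':   [1, 1, 3, 3, 3, 1, 2, 3, 3, 3],
})

# Number of passengers behind each (port, class) cell
cell_counts = pd.crosstab(toy['Embarked'], toy['Pclass'])

# Class mix per port: every row sums to 100%
class_mix = pd.crosstab(toy['Embarked'], toy['Pclass'], normalize='index') * 100

print(cell_counts)
print(class_mix.round(1))
```

Cells with very small counts should be read with caution, and a port with a high share of first-class passengers can show a high overall survival rate for that reason alone.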