When embarking on the voyage of machine learning and data analysis, it's crucial to have a reliable map to navigate the data's intricate terrain. Enter Exploratory Data Analysis (EDA), the compass that guides us through the data wilderness and reveals the structure, relationships, and patterns hidden beneath the surface. EDA is an essential step in the data science process: we summarize, visualize, and transform the data to extract insights and to inform decisions about the next steps in data analysis and modeling.
In this article, we set sail on a captivating journey through the EDA process, using the legendary Titanic dataset from Kaggle as our North Star 🌟.
To construct effective models, it's essential to:
You can also access my Titanic EDA Notebook on Kaggle!
Introduction
Domain information
Loading libraries:
Loading data
First look: variables, NAs
5.1 Variables
5.2 Types of the variables
5.3 Check data for NA
Exploring the data
6.1 Survivals - target value
6.2 AGE
6.3 What is in the name?
6.4 Cabin
6.5 Family
6.6 Class
6.7 Gender
6.8 Embarked
6.9 Fare
Conclusion
The Titanic was a British passenger liner operated by the White Star Line. It was on its way from Southampton to New York City when it collided with an iceberg and sank in the North Atlantic Ocean in the early morning hours of 15 April 1912. The ship carried 2224 people, counting passengers and crew; 1514 of them died.
Titanic carried 16 wooden lifeboats and 4 collapsibles, which could accommodate 1178 people, only one-third of Titanic's total capacity (and 53% of the number of people actually aboard).
At the time, lifeboats were intended to ferry survivors from a sinking ship to a rescuing ship, not to keep the whole population afloat or carry them to shore. If the SS Californian had responded to Titanic's distress calls, the lifeboats might have been adequate to ferry the passengers to safety as planned, but that didn't happen, and the only way to survive was to get on a lifeboat.
The main question we will try to answer is “Which passengers were more likely to survive?”
List of libraries I am using:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
As input information I have two CSV files:
In this notebook, I will use all available information (train + test datasets) to perform exploratory data analysis.
# path to train dataset
train_path = '../input/titanic/train.csv'
# path to test dataset
test_path = '../input/titanic/test.csv'
# Read a comma-separated values (csv) file into pandas DataFrame
train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)
# shape of the data
print('Train shape: ', train_data.shape)
print('Test shape: ', test_data.shape)
Output:
Train shape: (891, 12)
Test shape: (418, 11)
The training part contains information about 891 passengers, described by 12 variables, including one target variable.
The testing part contains 418 observations, i.e. information about passengers, described by 11 variables (the test dataset doesn't contain target value.)
# create a sequence of DataFrame objects
frames = [train_data, test_data]
# Concatenate pandas objects along a particular axis
all_data = pd.concat(frames, sort = False)
# shape of the data
print('All data shape: ', all_data.shape)
# Show first 4 rows of the concatenated DataFrame
all_data.head(4)
Overall, we have information about 1309 passengers. I am guessing this dataset contains data only about passengers, not crew members (we know that Titanic carried 2224 people).
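Once the two parts are concatenated, it is easy to lose track of which rows came from which file, and the combined frame keeps repeated index labels from both parts. A minimal sketch of one way to handle this, using tiny made-up stand-ins for the two CSV files; the `source` column and the `*_demo` names are hypothetical helpers, not part of the original notebook:

```python
import pandas as pd

# Tiny stand-ins for train.csv / test.csv (values are made up for illustration)
train_demo = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
test_demo = pd.DataFrame({"PassengerId": [4, 5]})

# Tag each part before concatenating; 'source' is a hypothetical helper column
combined = pd.concat(
    [train_demo.assign(source="train"), test_demo.assign(source="test")],
    sort=False,
    ignore_index=True,  # avoid repeated index labels from the two frames
)

print(combined.shape)
print(combined["source"].value_counts())
```

With a column like this, the training rows can be recovered later with a simple filter instead of relying on missing values in Survived.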
From the data overview of the competition, we have a description of each variable:
PassengerId - unique identifier
Survived:
0 = No
1 = Yes
Pclass: Ticket class
1 = 1st, Upper
2 = 2nd, Middle
3 = 3rd, Lower
Name: full name with a title
Sex: gender
Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
SibSp: Number of siblings/spouses aboard the Titanic. The dataset defines family relations in this way:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
Parch: Number of parents/children aboard the Titanic. The dataset defines family relations in this way:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
Ticket: Ticket number.
Fare: Passenger fare.
Cabin: Cabin number.
Embarked: Port of Embarkation:
C = Cherbourg
Q = Queenstown
S = Southampton
Data types, non-null values count:
all_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 1309 non-null int64
1 Survived 891 non-null float64
2 Pclass 1309 non-null int64
3 Name 1309 non-null object
4 Sex 1309 non-null object
5 Age 1046 non-null float64
6 SibSp 1309 non-null int64
7 Parch 1309 non-null int64
8 Ticket 1309 non-null object
9 Fare 1308 non-null float64
10 Cabin 295 non-null object
11 Embarked 1307 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB
Age and Fare are continuous numeric variables.
Pclass is an integer, but in fact, it is a categorical variable represented by three numbers.
After previous manipulations, the Survived variable has type 'float'; it's not correct since it's a categorical variable, too, but it will not influence my EDA process, so I will let it float for now.
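If the types did matter for a later step, they could be converted explicitly. The sketch below is only an illustration on a made-up four-row frame (the `demo` name and its values are assumptions): it turns Pclass into an ordered categorical and Survived into pandas' nullable integer type, which keeps 0/1 as integers while still allowing missing targets.

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Survived": [1.0, 0.0, np.nan, 1.0],  # NaN stands for a test-set row
})

# Ordered categorical: 1st < 2nd < 3rd
demo["Pclass"] = pd.Categorical(demo["Pclass"], categories=[1, 2, 3], ordered=True)
# Nullable integer keeps 0/1 as integers and the missing target as <NA>
demo["Survived"] = demo["Survived"].astype("Int64")

print(demo.dtypes)
```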
To check the dataset for NAs, I am using the **isna()** DataFrame method, which returns a boolean object of the same size indicating whether each value is NA, and then I sum the True values for each variable.
NA values for each dataframe (train, test, all) are presented in the table below:
# check data for NA values
all_data_NA = all_data.isna().sum()
train_NA = train_data.isna().sum()
test_NA = test_data.isna().sum()
pd.concat([train_NA, test_NA, all_data_NA],
axis=1,
sort = False,
keys = ['Train NA', 'Test NA', 'All NA'])
There are 263 missing Age values, 1 missing Fare, 1014 NAs in Cabin variable, and 2 in Embarked variable.
There are 418 NAs in the Survived variable because this information is absent from the test dataset. I will not impute these missing values in the current notebook :) So, when I use this variable for visualization, there will be information only for the training part of the data.
In this notebook, I will do some missing data handling for the combined dataset. But in the second part of my work (ML solution), this should be done based on what we know only about training data, to avoid any data leakage.
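To make the leakage point concrete, here is a minimal sketch of imputation that learns the fill value from the training part only and then reuses it for the test part. The two small frames and their values are made up for illustration, and the median is just one possible choice of statistic:

```python
import pandas as pd
import numpy as np

train_demo = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0]})
test_demo = pd.DataFrame({"Age": [np.nan, 47.0]})

# Learn the fill value from the training part only
train_median = train_demo["Age"].median()

# ...and apply that same value to both parts
train_filled = train_demo["Age"].fillna(train_median)
test_filled = test_demo["Age"].fillna(train_median)

print(train_median)
```

The test rows never contribute to `train_median`, so nothing about the test distribution reaches the model through the imputed values.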
Let's calculate and visualize the distribution of our target variable, 'Survived'.
The countplot function of the seaborn module is a very useful way to show the counts of observations in each category.
Since we have a target only for the training part, these numbers don't include all passengers.
# set size of the plot
plt.figure(figsize=(6, 4.5))
# countplot shows the counts of observations in each categorical bin using bars.
# x - name of the categorical variable
ax = sns.countplot(x = 'Survived', data = all_data, palette=["#3f3e6fd1", "#85c6a9"])
# set the current tick locations and labels of the x-axis.
plt.xticks( np.arange(2), ['drowned', 'survived'] )
# set title
plt.title('Overall survival (training dataset)',fontsize= 14)
# set x label
plt.xlabel('Passenger status after the tragedy')
# set y label
plt.ylabel('Number of passengers')
# calculate passengers for each category
labels = (all_data['Survived'].value_counts())
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v-40, str(v), horizontalalignment = 'center', size = 14, color = 'w', fontweight = 'bold')
plt.show()
We have 891 passengers in the train dataset. 549 (61.6%) of them drowned and only 342 (38.4%) survived.
But we know that the lifeboats could carry 53% of the people on board.
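These percentages follow directly from the counts. A quick check in plain Python, using only figures already quoted in the article (the variable names are mine):

```python
# Counts quoted in the text
drowned, survived = 549, 342
total = drowned + survived  # 891 passengers in the training data

share_drowned = round(100 * drowned / total, 1)
share_survived = round(100 * survived / total, 1)

# Lifeboat capacity against everyone aboard (figures from the domain section)
lifeboat_share = round(100 * 1178 / 2224)

print(share_drowned, share_survived, lifeboat_share)
```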
What are the ages of the passengers, how does age relate to the chances of survival, and how does it change depending on class and gender?
We have 263 missing values:
Overall age distribution (seaborn distplot) and descriptive statistics:
# set plot size
plt.figure(figsize=(15, 3))
# plot a univariate distribution of Age observations
sns.distplot(all_data[(all_data["Age"] > 0)].Age, kde_kws={"lw": 3}, bins = 50)
# set titles and labels
plt.title('Distribution of passengers age (all data)',fontsize= 14)
plt.xlabel('Age')
plt.ylabel('Frequency')
# clean layout
plt.tight_layout()
# Descriptive statistics include those that summarize the central tendency,
# dispersion and shape of a dataset’s distribution, excluding NaN values.
age_distr = pd.DataFrame(all_data['Age'].describe())
# Transpose index and columns.
age_distr.transpose()
The distribution of Age is slightly right-skewed. Age varies from about 0.17 years to 80 years with mean = 29.88, and there don't seem to be any obvious outliers, but we will check.
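Both claims, the right skew and the absence of outliers, can be checked numerically as well as visually. The sketch below uses a short made-up list of ages rather than the real column, and the 1.5 × IQR fence is one common convention that I am assuming here, not something the notebook prescribes:

```python
import pandas as pd

# Made-up ages for illustration only
ages = pd.Series([0.17, 2, 18, 21, 22, 24, 28, 30, 35, 40, 48, 76])

q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for a closer look
outliers = ages[(ages < lower) | (ages > upper)]

print("skewness:", round(ages.skew(), 2))  # a positive value suggests a right skew
print("fences:", lower, upper)
print("flagged:", outliers.tolist())
```

A flagged value is not necessarily an error; it only marks a record worth verifying by hand.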
Did age have a big influence on chances to survive?
To visualize two age distributions grouped by surviving status, I am using boxplot and stripplot shown together:
plt.figure(figsize=(15, 3))
# Draw a box plot to show Age distributions with respect to survival status.
sns.boxplot(y = 'Survived', x = 'Age', data = train_data,
palette=["#3f3e6fd1", "#85c6a9"], fliersize = 0, orient = 'h')
# Add a scatterplot for each category.
sns.stripplot(y = 'Survived', x = 'Age', data = train_data,
linewidth = 0.6, palette=["#3f3e6fd1", "#85c6a9"], orient = 'h')
plt.yticks( np.arange(2), ['drowned', 'survived'])
plt.title('Age distribution grouped by surviving status (train data)',fontsize= 14)
plt.ylabel('Passenger status after the tragedy')
plt.tight_layout()
# Descriptive statistics:
pd.DataFrame(all_data.groupby('Survived')['Age'].describe())
The mean age of the passengers who survived is 28.34, which is 2.28 years lower than the mean age of those who drowned (counting only passengers whose survival status we know).
The minimum age of drowned passengers is 1 y.o., which is very sad.
The maximum age of surviving passengers is 80 y.o. Let's check whether this is a mistake.
all_data[all_data['Age'] == all_data['Age'].max()]
Actually, Mr Algernon Henry Barkworth was born on 4 June 1864. He was 48 in 1912 and died in 1945 at 80 y.o.
train_data.loc[train_data['PassengerId'] == 631, 'Age'] = 48
all_data.loc[all_data['PassengerId'] == 631, 'Age'] = 48
# Descriptive statistics:
pd.DataFrame(all_data.groupby('Survived')['Age'].describe())
Let's update our description:
The mean age of the passengers who survived is 28.23, which is 2.39 years lower than the mean age of those who drowned (counting only passengers whose survival status we know). The maximum age of surviving passengers is 63 y.o.
It looks like there is a slightly bigger chance to survive for younger people.
Here, I will compare three age distributions grouped by class of the passenger.
As visualisations, I will use 2 graphs:
# set size
plt.figure(figsize=(20, 6))
# set palette
palette = sns.cubehelix_palette(5, start = 3)
plt.subplot(1, 2, 1)
sns.boxplot(x = 'Pclass', y = 'Age', data = all_data,
palette = palette, fliersize = 0)
sns.stripplot(x = 'Pclass', y = 'Age', data = all_data,
linewidth = 0.6, palette = palette)
plt.xticks( np.arange(3), ['1st class', '2nd class', '3rd class'])
plt.title('Age distribution grouped by ticket class (all data)',fontsize= 16)
plt.xlabel('Ticket class')
plt.subplot(1, 2, 2)
# To use kdeplot I need to create variables with filtered data for each category
age_1_class = all_data[(all_data["Age"] > 0) &
(all_data["Pclass"] == 1)]
age_2_class = all_data[(all_data["Age"] > 0) &
(all_data["Pclass"] == 2)]
age_3_class = all_data[(all_data["Age"] > 0) &
(all_data["Pclass"] == 3)]
# Plotting the 3 variables that we created
sns.kdeplot(age_1_class["Age"], shade=True, color='#eed4d0', label = '1st class')
sns.kdeplot(age_2_class["Age"], shade=True, color='#cda0aa', label = '2nd class')
sns.kdeplot(age_3_class["Age"], shade=True, color='#a2708e', label = '3rd class')
plt.title('Age distribution grouped by ticket class (all data)',fontsize= 16)
plt.xlabel('Age')
plt.xlim(0, 90)
plt.tight_layout()
plt.show()
# Descriptive statistics:
pd.DataFrame(all_data.groupby('Pclass')['Age'].describe())
The 1st class has a wider age distribution than the 2nd and 3rd, and it is almost symmetric.
Both the 2nd and 3rd class age distributions are right-skewed.
The youngest passenger has a 3rd class ticket, age = 0.17.
The oldest passenger has a 1st class ticket, age = 76.
The 3rd class mean age is 24.8, the 2nd class mean age is 29.5, and the 1st class mean age is 39.1.
Since surviving passengers, on average, were younger than those who drowned, does it mean that 3rd class passengers had a better chance of surviving? We will find out later.
From the graphs, we can see the difference in age distribution between classes. So when I do missing data imputation, I will take class into account.
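Taking class into account during imputation could look like the sketch below: each missing age is replaced by the median age of the passenger's own class. The frame is made up, the `Age_filled` column is a hypothetical name, and the median is an assumption on my part; the notebook does not say which statistic it will use.

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 38.0, 22.0, 26.0, np.nan],
})

# Median age per class, broadcast back to every row of that class
class_median = demo.groupby("Pclass")["Age"].transform("median")
demo["Age_filled"] = demo["Age"].fillna(class_median)

print(demo)
```

For modeling, the per-class medians would be computed on the training rows only, for the same data leakage reason mentioned earlier.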
Since there is such a noticeable age difference between classes, I will compare the age distributions by gender separately for each class.
# Descriptive statistics:
age_1_class_stat = pd.DataFrame(age_1_class.groupby('Sex')['Age'].describe())
age_2_class_stat = pd.DataFrame(age_2_class.groupby('Sex')['Age'].describe())
age_3_class_stat = pd.DataFrame(age_3_class.groupby('Sex')['Age'].describe())
pd.concat([age_1_class_stat, age_2_class_stat, age_3_class_stat], axis=0, sort = False, keys = ['1st', '2nd', '3rd'])
The oldest and the youngest passengers are female.
In each class, the average age of female passengers is slightly lower than the average age of male passengers.
Each passenger's Name value contains a title, which we can extract and explore.
To create a new variable "Title":
To visualize how many passengers hold each title, I chose countplot.
all_data['Title'] = all_data['Name'].str.split(',', expand = True)[1].str.split('.', expand = True)[0].str.strip(' ')
plt.figure(figsize=(6, 5))
ax = sns.countplot( x = 'Title',
                    data = all_data,
                    palette = "hls",
                    order = all_data['Title'].value_counts().index)
_ = plt.xticks(
rotation=45,
horizontalalignment='right',
fontweight='light'
)
plt.title('Passengers distribution by titles',fontsize= 14)
plt.ylabel('Number of passengers')
# calculate passengers for each category
labels = (all_data['Title'].value_counts())
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+10, str(v),
            horizontalalignment = 'center',
            size = 10,
            color = 'black')
plt.tight_layout()
plt.show()
The most frequent title among passengers is Mister (Mr.), the general title of respect for an adult male. The second most frequent title is Miss (unmarried woman), and the third is Mrs. (married woman).
Other titles are less frequent. I will explore whether I can combine them into particular groups. I am going to use titles as a feature, but if they split the data too much, leaving just a few observations in each group, it can lead to overfitting. And for a general understanding of the data, it will be more convenient to put titles in clearer groups.
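As a side note, the chained split used to build the Title variable can also be written as a single regular expression. This is an alternative sketch, not the notebook's own code; it assumes every name follows the "Surname, Title. Given names" pattern and runs on three sample names in that format:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())
```

Names that do not match the pattern come back as NaN, which makes malformed entries easy to spot.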
"Military" group of titles:
"Honor" group of titles:
Sir - is a formal English honorific address for men. Sir is used for men titled knights i.e. of orders of chivalry, and later also to baronets and other offices.
The Countess - is a historical title of nobility
Lady - a formal title in the United Kingdom for a woman with a title of nobility or an honorary title.
Jonkheer - is an honorific in the Low Countries denoting the lowest rank within the nobility.
Don - is an honorific prefix primarily used in Spain and the former Spanish Empire, Italy, Portugal, the Philippines, Latin America, Croatia, and Goa. (male)
Dona - Feminine form for don (honorific), a Spanish, Portuguese, southern Italian, and Filipino title, given as a mark of respect
I am not sure about the title Ms. We have only two passengers with this title, so I will convert it to Miss.
I created a dictionary of titles, and I am using the "map" method to create the variable "Title_category".
all_data[all_data['Title']=='Ms']
title_dict = { 'Mr': 'Mr',
'Mrs': 'Mrs',
'Miss': 'Miss',
'Master': 'Master',
'Ms': 'Miss',
'Mme': 'Mrs',
'Mlle': 'Miss',
'Capt': 'military',
'Col': 'military',
'Major': 'military',
'Dr': 'Dr',
'Rev': 'Rev',
'Sir': 'honor',
'the Countess': 'honor',
'Lady': 'honor',
'Jonkheer': 'honor',
'Don': 'honor',
'Dona': 'honor' }
# map titles to category
all_data['Title_category'] = all_data['Title'].map(title_dict)
fig = plt.figure(figsize=(12, 5))
ax1 = fig.add_subplot(121)
ax = sns.countplot(x = 'Title_category',
data = all_data, palette = "hls",
order = all_data['Title_category'].value_counts().index)
_ = plt.xticks(
rotation=45,
horizontalalignment='right',
fontweight='light'
)
plt.title('Passengers distribution by titles',fontsize= 12)
plt.ylabel('Number of passengers')
# calculate passengers for each category
labels = (all_data['Title_category'].value_counts())
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+10, str(v), horizontalalignment = 'center', size = 10, color = 'black')
plt.tight_layout()
ax2 = fig.add_subplot(122)
surv_by_title_cat = all_data.groupby('Title_category')['Survived'].value_counts(normalize = True).unstack()
surv_by_title_cat = surv_by_title_cat.sort_values(by=1, ascending = False)
surv_by_title_cat.plot(kind='bar',
stacked='True',
color=["#3f3e6fd1", "#85c6a9"], ax = ax2)
plt.legend( ( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(
rotation=45,
horizontalalignment='right',
fontweight='light'
)
plt.title('Proportion of survived/drowned by titles (train data)',fontsize= 12)
plt.tight_layout()
plt.show()
Training data:
The biggest proportion of survivors is in the "Mrs" group - married women.
More than 80% drowned in the "Mr." group.
Nobody survived among the Reverend group.
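The proportions read off the stacked bars can also be computed as plain numbers. With a 0/1 target, the mean of Survived within a group is the share of survivors, and rows without a target are skipped automatically. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({
    "Title_category": ["Mr", "Mr", "Mrs", "Mrs", "Rev", "Mr"],
    "Survived": [0, 0, 1, 1, 0, np.nan],  # NaN = test-set row, ignored by mean()
})

# With a 0/1 target, the group mean is the share of survivors
survival_rate = (demo.groupby("Title_category")["Survived"]
                     .mean()
                     .sort_values(ascending=False))
print(survival_rate)
```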
category_survived = sns.catplot(x="Title_category", col="Survived",
data = all_data, kind="count",
height=4, aspect=.7)
category_survived.set_xticklabels(rotation=45,
horizontalalignment='right',
fontweight='light')
plt.tight_layout()
If we consider the survivors not by percentage within each group but by comparing the number of survivors between groups, then the "Miss" title category is the luckiest one. The "Mr" category lost the biggest number of passengers.
Let's also visualize how Title categories and ticket classes are related:
class_by_title_cat = all_data.groupby('Title_category')['Pclass'].value_counts(normalize = True)
class_by_title_cat = class_by_title_cat.unstack().sort_values(by = 1, ascending = False)
class_by_title_cat.plot(kind='bar',
stacked='True',
color = ['#eed4d0', '#cda0aa', '#a2708e'])
plt.legend(loc=(1.04,0))
_ = plt.xticks(
rotation = 45,
horizontalalignment = 'right',
fontweight = 'light'
)
plt.title('Proportion of 1st/2nd/3rd ticket class in each title category',fontsize= 14)
plt.xlabel('Category of the Title')
plt.tight_layout()
All honor and military titles occupied the 1st class.
For sure, there is a relationship between variables, and survival was influenced not only by the title itself but by a combination of factors that are, to some extent, interrelated. How could class relate to survival? Let's go further and find out.
From the cabin number, we can extract the first letter, which tells us where the cabin was located on the ship! This seems like very important information to me.
I found some descriptions of each Titanic deck:
There were 8 decks: the upper-deck - for lifeboats, other 7 were under it and had letter symbols:
For passengers without deck information, I will impute the letter U (for unknown).
# take the first character of the cabin string as the deck letter (NaN stays NaN)
all_data['deck'] = all_data['Cabin'].str[0]
all_data.loc[all_data['deck'].isna(), 'deck'] = 'U'
print('Unique deck letters from the cabin numbers:', all_data['deck'].unique())
Unique deck letters from the cabin numbers: ['U' 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
fig = plt.figure(figsize=(20, 5))
ax1 = fig.add_subplot(131)
sns.countplot(x = 'deck',
data = all_data,
palette = "hls",
order = all_data['deck'].value_counts().index, ax = ax1)
plt.title('Passengers distribution by deck',fontsize= 16)
plt.ylabel('Number of passengers')
ax2 = fig.add_subplot(132)
deck_by_class = all_data.groupby('deck')['Pclass'].value_counts(normalize = True).unstack()
deck_by_class.plot(kind='bar',
stacked='True',
color = ['#eed4d0', '#cda0aa', '#a2708e'], ax = ax2)
plt.legend(('1st class', '2nd class', '3rd class'), loc=(1.04,0))
plt.title('Proportion of classes on each deck',fontsize= 16)
plt.xticks(rotation = False)
ax3 = fig.add_subplot(133)
deck_by_survived = all_data.groupby('deck')['Survived'].value_counts(normalize = True).unstack()
deck_by_survived = deck_by_survived.sort_values(by = 1, ascending = False)
deck_by_survived.plot(kind='bar',
stacked='True',
color=["#3f3e6fd1", "#85c6a9"], ax = ax3)
plt.title('Proportion of survived/drowned passengers by deck',fontsize= 16)
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
plt.xticks(rotation = False)
plt.tight_layout()
plt.show()
Most passengers don't have cabin numbers ('U').
The largest group of passengers with known cabin numbers was located on the 'C' deck and had 1st class tickets. The 'C' deck ranks fifth by percentage of survivors.
The highest survival rate (among passengers with known cabin numbers in the training dataset) belongs to passengers from deck 'D'.
Deck A was the closest to the deck with lifeboats, but it had the lowest survival rate (apart from the unknown and T decks). How did that happen?
all_data[(all_data['deck']=='A') & (all_data['Survived']==0)]
I got curious, so I read a bit about some of these passengers:
John Hugo Ross: when he boarded on 10 April 1912, he was so ill from dysentery he had to be carried to his cabin on a stretcher. When Ross was told the ship had struck an iceberg and that he should get dressed, Ross refused to believe the trouble was serious. "Is that all?" he told Peuchen. "It will take more than an iceberg to get me off this ship." Presumably, Ross drowned in his bed.
Andrews, Mr. Thomas Jr. was a managing director of H&W (the firm that built the Titanic), in charge of design, and was familiar with every detail of the construction of the firm's ships. He helped to evacuate people.
Roebling, Mr. Washington Augustus II helped to evacuate people as well.
Clearly, no algorithm can predict survival with 100 percent accuracy based on factors such as the passenger's location on the ship or age, since the human factor and the unpredictability of the emergency were involved in the rescue process.
For the training process, it will be better to merge the passengers from the T deck into the A deck group.
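The deck handling described in this section, including the suggested merge of T into A, can be sketched in a few lines. The cabin strings below are sample values in the dataset's format, and `deck_grouped` is a hypothetical name:

```python
import pandas as pd
import numpy as np

cabins = pd.Series(["C85", np.nan, "T", "A6", "F G73"])

# First character of the cabin string; missing cabins become 'U' (unknown)
deck = cabins.str[0].fillna("U")

# Fold the 'T' deck into 'A', as suggested above
deck_grouped = deck.replace({"T": "A"})

print(deck_grouped.tolist())
```

Note that a cabin value listing several cabins, such as "F G73", is reduced to its first letter only, which is a simplification.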
Does the size of the family on board affect the chances of surviving the disaster? Does having children increase the chance of getting into a boat, or is it easier to survive being single?
I calculate the family size by summing the number of siblings/spouses (SibSp) and the number of parents/children (Parch), plus 1 (the passenger himself).
Family size = SibSp + Parch + 1
all_data['Family_size'] = all_data['SibSp'] + all_data['Parch'] + 1
family_size = all_data['Family_size'].value_counts()
print('Family size and number of passengers:')
print(family_size)
Family size and number of passengers:
1 790
2 235
3 159
4 43
6 25
5 22
7 16
11 11
8 8
It looks strange that there are 16 passengers with a family size of 7, for example, since 16 is not a multiple of 7. Let's check!
Also, I will add a surname variable by extracting the part of the name before the comma.
all_data['Surname'] = all_data['Name'].str.split(',', expand = True)[0]
all_data[all_data['Family_size'] == 7]['Surname'].value_counts()
Andersson 9
Asplund 7
all_data[(all_data['Family_size'] == 7) & (all_data['Surname']=='Andersson')]
Let's group the Anderssons with a family size of 7 by ticket number.
Seven of them used the same ticket and travelled together: 5 children (each of them has 4 siblings) and 2 parents. Two passengers used separate tickets.
all_data[(all_data['Family_size'] == 7) & (all_data['Surname']=='Andersson')].Ticket.value_counts()
347082 7
3101281 1
347091 1
all_data[(all_data['Ticket'] == '3101281') | (all_data['Ticket'] == '347091')]
It looks like they actually traveled alone, so I will correct that data:
all_data.loc[all_data['PassengerId'] == 69, ['SibSp', 'Parch', 'Family_size']] = [0,0,1]
all_data.loc[all_data['PassengerId'] == 1106, ['SibSp', 'Parch', 'Family_size']] = [0,0,1]
all_data[(all_data['Ticket'] == '3101281') | (all_data['Ticket'] == '347091')]
There are some inconsistencies in other categories, with fewer relatives.
Let's check people with 5-size family and group them by Surname:
all_data[all_data['Family_size'] == 5]['Surname'].value_counts()
Palsson 5
Ryerson 5
Ford 5
Lefebre 5
Kink-Heilmann 1
Hocking 1
all_data[(all_data['Surname'] == 'Kink-Heilmann')&(all_data['Family_size'] == 5)]
Kink-Heilmann, Mr. Anton had 2 siblings on the ship. For his wife, these relatives do not fit the dataset's definition of family relations, so her family size differs from his. We will assume that all other "mismatches" in the groups are similar to this. Since I plan to bin family sizes into groups, this will smooth out most of these inconsistencies.
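The manual checks above could be partly automated by comparing each passenger's declared family size with the number of people travelling on the same ticket. This is only a heuristic sketch: the frame is made up (the ticket numbers are just example strings), `Ticket_group` and `suspicious` are hypothetical names, and a shared ticket is merely a proxy for travelling together.

```python
import pandas as pd

demo = pd.DataFrame({
    "Ticket": ["347082", "347082", "347082", "3101281", "113803", "113803"],
    "SibSp":  [1, 1, 0, 4, 1, 1],
    "Parch":  [1, 1, 2, 2, 0, 0],
})
demo["Family_size"] = demo["SibSp"] + demo["Parch"] + 1

# Number of passengers sharing each ticket, broadcast to every row
demo["Ticket_group"] = demo.groupby("Ticket")["Ticket"].transform("count")

# Rows whose declared family is larger than the group on their ticket
suspicious = demo[demo["Family_size"] > demo["Ticket_group"]]
print(suspicious)
```

Families sometimes bought separate tickets, so flagged rows still need a manual look before any correction.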
fig = plt.figure(figsize = (12,4))
ax1 = fig.add_subplot(121)
ax = sns.countplot(x = 'Family_size', data = all_data, ax = ax1)
# calculate passengers for each category (sorted by family size to match the bar order)
labels = (all_data['Family_size'].value_counts().sort_index())
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+6, str(v), horizontalalignment = 'center', size = 10, color = 'black')
plt.title('Passengers distribution by family size')
plt.ylabel('Number of passengers')
ax2 = fig.add_subplot(122)
d = all_data.groupby('Family_size')['Survived'].value_counts(normalize = True).unstack()
d.plot(kind='bar', color=["#3f3e6fd1", "#85c6a9"], stacked='True', ax = ax2)
plt.title('Proportion of survived/drowned passengers by family size (train data)')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
plt.xticks(rotation = False)
plt.tight_layout()
We can observe that the percentage of survivors among people with a family of 2, 3, or 4 is greater than among singles; beyond that, the percentage of survivors decreases as the family size increases.
I will create a 'Family_size_group' variable with four categories:
all_data['Family_size_group'] = all_data['Family_size'].map(lambda x: 'f_single' if x == 1
else ('f_usual' if 5 > x >= 2
else ('f_big' if 8 > x >= 5
else 'f_large' )
))
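The nested conditional above can also be expressed with `pd.cut`, which keeps the bin edges in one place. This is an alternative sketch rather than the notebook's code; the bin edges are chosen to reproduce the same four groups (1, 2 to 4, 5 to 7, 8 and more), and the sample sizes are arbitrary:

```python
import pandas as pd

family_size = pd.Series([1, 2, 4, 5, 7, 8, 11])

# Right-closed bins: (0, 1], (1, 4], (4, 7], (7, inf)
groups = pd.cut(
    family_size,
    bins=[0, 1, 4, 7, float("inf")],
    labels=["f_single", "f_usual", "f_big", "f_large"],
)

print(groups.tolist())
```

The result is an ordered categorical, so the groups also sort from smallest to largest family in later plots and tables.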
fig = plt.figure(figsize = (14,5))
ax1 = fig.add_subplot(121)
d = all_data.groupby('Family_size_group')['Survived'].value_counts(normalize = True).unstack()
d = d.sort_values(by = 1, ascending = False)
d.plot(kind='bar', stacked='True', color = ["#3f3e6fd1", "#85c6a9"], ax = ax1)
plt.title('Proportion of survived/drowned passengers by family size (training data)')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=False)
ax2 = fig.add_subplot(122)
d2 = all_data.groupby('Family_size_group')['Pclass'].value_counts(normalize = True).unstack()
d2 = d2.sort_values(by = 1, ascending = False)
d2.plot(kind='bar', stacked='True', color = ['#eed4d0', '#cda0aa', '#a2708e'], ax = ax2)
plt.legend(('1st class', '2nd class', '3rd class'), loc=(1.04,0))
plt.title('Proportion of 1st/2nd/3rd ticket class in family group size')
_ = plt.xticks(rotation=False)
plt.tight_layout()
Large families are all from the 3rd class, and none of their members in the training part of the dataset survived.
The proportion of 1st class passengers is highest in the usual-size family group, and this group also has the highest proportion of survivors.
We have made a lot of assumptions about the survival rate depending on the classes. Let's now look closely at this variable.
ax = sns.countplot(x = 'Pclass', data = all_data, palette = ['#eed4d0', '#cda0aa', '#a2708e'])
# calculate passengers for each category (sorted by class to match the bar order)
labels = (all_data['Pclass'].value_counts().sort_index())
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+2, str(v), horizontalalignment = 'center',
            size = 12, color = 'black', fontweight = 'bold')
plt.title('Passengers distribution by ticket class')
plt.ylabel('Number of passengers')
plt.tight_layout()
Most of the Titanic's passengers were traveling third class (709).
The second class is the smallest in terms of the number of passengers.
fig = plt.figure(figsize=(14, 5))
ax1 = fig.add_subplot(121)
sns.countplot(x = 'Pclass', hue = 'Survived',
data = all_data, palette=["#3f3e6fd1", "#85c6a9"], ax = ax1)
plt.title('Number of survived/drowned passengers by class (train data)')
plt.ylabel('Number of passengers')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=False)
ax2 = fig.add_subplot(122)
d = all_data.groupby('Pclass')['Survived'].value_counts(normalize = True).unstack()
d.plot(kind='bar', stacked='True', ax = ax2, color =["#3f3e6fd1", "#85c6a9"])
plt.title('Proportion of survived/drowned passengers by class (train data)')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=False)
plt.tight_layout()
Earlier observations pointed the other way: on average, older people were more likely to die, the average age in the first class is higher than in the other classes, and deck A, which is 100% first class, has a large proportion of drowned passengers. Despite that, the first class has the largest number of survivors, and the proportion of survivors within the class is the largest.
The third class had the highest number of drowned passengers, and most of the third-class passengers drowned.
sns.catplot(x = 'Pclass', hue = 'Survived',
col = 'Sex', kind = 'count',
data = all_data , palette=["#3f3e6fd1", "#85c6a9"])
plt.tight_layout()
However, most of the male passengers in the first class drowned, while almost all of the females survived. In the third class, half of the females survived.
For a better understanding of how the combination of some factors influences the chances of survival, let us break passengers into 18 imaginary groups separated by:
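The same class-by-sex comparison can be put into a small table of survival shares. A minimal sketch on made-up rows (the numbers are illustrative and do not reproduce the real rates):

```python
import pandas as pd

demo = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male"] * 2,
    "Survived": [1, 1, 0, 1, 1, 0, 0, 0],
})

# Mean of the 0/1 target per (class, sex) cell = share of survivors
rates = demo.pivot_table(index="Pclass", columns="Sex",
                         values="Survived", aggfunc="mean")
print(rates)
```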
To do so, I will create 6 stripplots (3 for male, 3 for female), with values grouped by Surviving status, and add background colour to separate age groups:
plt.figure(figsize=(20, 10))
palette=["#3f3e6fd1", "#85c6a9"]
plt.subplot(2, 3, 1)
sns.stripplot(x = 'Survived', y = 'Age', data = age_1_class[age_1_class['Sex']=='male'],
linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#e1f3f6")
plt.axhspan(16, 40, color = "#bde6dd")
plt.axhspan(40, 80, color = "#83ceb9")
plt.title('Age distribution (males, 1st class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)
plt.subplot(2, 3, 2)
sns.stripplot(x = 'Survived', y = 'Age', data = age_2_class[age_2_class['Sex']=='male'],
linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#e1f3f6")
plt.axhspan(16, 40, color = "#bde6dd")
plt.axhspan(40, 80, color = "#83ceb9")
plt.title('Age distribution (males, 2nd class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)
plt.subplot(2, 3, 3)
sns.stripplot(x = 'Survived', y = 'Age', data = age_3_class[age_3_class['Sex']=='male'],
linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#e1f3f6")
plt.axhspan(16, 40, color = "#bde6dd")
plt.axhspan(40, 80, color = "#83ceb9")
plt.title('Age distribution (males, 3rd class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)
plt.subplot(2, 3, 4)
sns.stripplot(x = 'Survived',
y = 'Age', data = age_1_class[age_1_class['Sex']=='female'],
linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#ffff9978")
plt.axhspan(16, 40, color = "#ffff97bf")
plt.axhspan(40, 80, color = "#ffed97bf")
plt.title('Age distribution (females, 1st class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)
plt.subplot(2, 3, 5)
sns.stripplot(x = 'Survived', y = 'Age', data = age_2_class[age_2_class['Sex']=='female'],
linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#ffff9978")
plt.axhspan(16, 40, color = "#ffff97bf")
plt.axhspan(40, 80, color = "#ffed97bf")
plt.title('Age distribution (females, 2nd class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)
plt.subplot(2, 3, 6)
sns.stripplot(x = 'Survived', y = 'Age', data = age_3_class[age_3_class['Sex']=='female'],
linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#ffff9978")
plt.axhspan(16, 40, color = "#ffff97bf")
plt.axhspan(40, 80, color = "#ffed97bf")
plt.title('Age distribution (females, 3rd class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)
plt.show()
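The shaded bands in these panels (0-16, 16-40, 40-80) can also be summarised numerically. Below is a minimal sketch, assuming a frame with `Sex`, `Pclass`, `Age` and `Survived` columns; the `sample` rows are made up for illustration, and the helper name `survival_by_age_band` is my own, not part of the original notebook.

```python
import pandas as pd

# Made-up rows for illustration; the training data would take their place.
sample = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female", "male", "female"],
    "Pclass": [1, 1, 3, 1, 3, 3, 3, 1],
    "Age": [4.0, 35.0, 22.0, 29.0, 2.0, 45.0, 60.0, 50.0],
    "Survived": [1, 0, 0, 1, 1, 0, 0, 1],
})

def survival_by_age_band(df):
    """Share of survivors per (Sex, Pclass, age band)."""
    # Same cut points as the shaded bands in the plots: 0-16, 16-40, 40-80.
    bands = pd.cut(df["Age"], bins=[0, 16, 40, 80],
                   labels=["0-16", "16-40", "40-80"])
    return (df.assign(Age_band=bands)
              .groupby(["Sex", "Pclass", "Age_band"], observed=True)["Survived"]
              .mean())

rates = survival_by_age_band(sample)
print(rates)
```

Rows with a missing `Age` fall outside every band and are simply left out of the summary.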
From these graphs (training data only), we can see that
Let's discover gender a little bit more:
plt.figure(figsize = (15,4))
plt.subplot (1,3,1)
# calculate passengers for each category
labels = all_data['Sex'].value_counts()
# pass x by keyword and fix the bar order so that it matches the labels
ax = sns.countplot(x = 'Sex', data = all_data, order = labels.index, palette="Set3")
plt.title('Number of passengers by Sex')
plt.ylabel('Number of passengers')
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+10, str(v), horizontalalignment = 'center', size = 10, color = 'black')
plt.subplot (1,3,2)
sns.countplot( x = 'Pclass', data = all_data, hue = 'Sex', palette="Set3")
plt.title('Number of male/female passengers by class')
plt.ylabel('Number of passengers')
plt.legend( loc=(1.04,0))
plt.subplot (1,3,3)
sns.countplot( x = 'Family_size_group', data = all_data, hue = 'Sex',
order = all_data['Family_size_group'].value_counts().index , palette="Set3")
plt.title('Number of male/female passengers by family size')
plt.ylabel('Number of passengers')
plt.legend( loc=(1.04,0))
plt.tight_layout()
Overall, there were more males than females on board. This holds for each ticket class, but in the 3rd class there were more than twice as many males as females.
Almost 600 male passengers traveled without family members, compared with only about 200 females, but in usual and big families there were slightly more female passengers.
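The male-to-female ratio per class can be checked directly with a cross-tabulation. This is a minimal sketch on made-up rows (the `sample` frame is an assumption; in the notebook, `all_data` would take its place):

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Sex": ["male", "male", "female", "male", "male", "female", "male", "female"],
    "Pclass": [1, 3, 3, 3, 3, 1, 2, 2],
})

# Rows: ticket class, columns: sex, cells: passenger counts.
counts = pd.crosstab(sample["Pclass"], sample["Sex"])

# Number of males per female in each class.
ratio = counts["male"] / counts["female"]
print(counts)
print(ratio)
```

A ratio above 2 for a class would correspond to "more than twice as many males as females".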
Titanic had 3 embarkation points before the ship started its route to New York:
Some passengers could have left Titanic in Cherbourg or Queenstown and avoided the catastrophe. Also, the point of embarkation could have influenced the ticket fare and the passenger's location on the ship.
Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
fig = plt.figure(figsize = (15,4))
ax1 = fig.add_subplot(131)
palette = sns.cubehelix_palette(5, start = 2)
ax = sns.countplot(x = 'Embarked', data = all_data, palette = palette, order = ['C', 'Q', 'S'], ax = ax1)
plt.title('Number of passengers by Embarked')
plt.ylabel('Number of passengers')
# calculate passengers for each category
labels = (all_data['Embarked'].value_counts())
labels = labels.sort_index()
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+10, str(v), horizontalalignment = 'center', size = 10, color = 'black')
ax2 = fig.add_subplot(132)
surv_by_emb = all_data.groupby('Embarked')['Survived'].value_counts(normalize = True)
surv_by_emb = surv_by_emb.unstack().sort_index()
surv_by_emb.plot(kind='bar', stacked=True, color=["#3f3e6fd1", "#85c6a9"], ax = ax2)
plt.title('Proportion of survived/drowned passengers by Embarked (train data)')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=0)
ax3 = fig.add_subplot(133)
class_by_emb = all_data.groupby('Embarked')['Pclass'].value_counts(normalize = True)
class_by_emb = class_by_emb.unstack().sort_index()
class_by_emb.plot(kind='bar', stacked=True, color = ['#eed4d0', '#cda0aa', '#a2708e'], ax = ax3)
plt.legend(('1st class', '2nd class', '3rd class'), loc=(1.04,0))
plt.title('Proportion of classes by Embarked')
_ = plt.xticks(rotation=0)
plt.tight_layout()
sns.catplot(x="Embarked", y="Fare", kind="violin", inner=None,
data=all_data, height = 6, palette = palette, order = ['C', 'Q', 'S'])
plt.title('Distribution of Fare by Embarked')
plt.tight_layout()
# Descriptive statistics:
pd.DataFrame(all_data.groupby('Embarked')['Fare'].describe())
The fare distribution is widest among passengers who embarked in Cherbourg. This makes sense: many first-class passengers boarded the ship there, although the share of third-class passengers is also quite significant.
Fares vary the least among passengers who boarded in Queenstown (Q), and their average fare is also the lowest. I think this is because their route was supposed to be the shortest and almost all of them were third-class passengers.
Let's check the NA values of the Embarked variable:
train_data[train_data['Embarked'].isna()]
These two passengers traveled together (same ticket number). To impute the missing values, we can use the mode of Embarked among passengers with the closest fare and the same Pclass.
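One way to express this imputation idea in code is sketched below. It is only an illustration: the helper name `impute_embarked`, the neighbourhood size `k`, and the `sample` rows are all assumptions of mine rather than part of the original notebook.

```python
import pandas as pd

def impute_embarked(df, k=5):
    """Fill missing Embarked with the mode among the k passengers of the
    same Pclass whose Fare is closest. The value of k is an arbitrary choice."""
    out = df.copy()
    known = out[out["Embarked"].notna()]
    for idx, row in out[out["Embarked"].isna()].iterrows():
        same_class = known[known["Pclass"] == row["Pclass"]]
        # Rank candidates by distance to this passenger's fare.
        nearest = (same_class["Fare"] - row["Fare"]).abs().nsmallest(k).index
        out.loc[idx, "Embarked"] = same_class.loc[nearest, "Embarked"].mode().iloc[0]
    return out

# Made-up rows for illustration; the last passenger has no port recorded.
sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 1],
    "Fare": [78.0, 82.0, 79.5, 30.0, 8.0, 7.5, 80.0],
    "Embarked": ["C", "C", "S", "S", "Q", "S", None],
})
filled = impute_embarked(sample, k=3)
print(filled.loc[6, "Embarked"])
```

The sketch assumes that every passenger with a missing port has at least one classmate with a known port; otherwise the mode would be empty and the lookup would fail.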
sns.catplot(x="Pclass", y="Fare", kind="swarm",
data=all_data, palette=sns.cubehelix_palette(5, start = 3), height = 6)
plt.tight_layout()
We can observe that the fare distributions for the second and third classes are very similar. First-class fares are distributed very differently: they have a larger spread and are higher on average.
Let's add colors to our points to indicate the surviving status of the passenger (there will be only data from the training part of the dataset):
sns.catplot(x="Pclass", y="Fare", hue = "Survived", kind="swarm", data=all_data,
palette=["#3f3e6fd1", "#85c6a9"], height = 6)
plt.tight_layout()
It looks like the more a passenger paid, the better their chances of survival.
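This impression from the swarm plot can be checked numerically by comparing survival rates across fare groups. Below is a minimal sketch on made-up rows; the `sample` frame and the quartile labels are assumptions, and in the notebook only the training rows (where `Survived` is known) would be used.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Fare": [7.0, 8.0, 10.0, 13.0, 26.0, 30.0, 80.0, 150.0],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Split passengers into four equally sized fare groups (quartiles).
fare_group = pd.qcut(sample["Fare"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Share of survivors in each fare group.
rate_by_fare = sample.groupby(fare_group, observed=True)["Survived"].mean()
print(rate_by_fare)
```

If the survival share rises from the cheapest group to the most expensive one, that supports the observation. Repeating the grouping within each `Pclass` would show whether fare matters beyond class itself.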
What about zero fares in the first class? Is it a mistake?
all_data[all_data['Fare'] == all_data['Fare'].min()]
Some of the passengers have "Line" tickets; perhaps they were somehow involved with the Titanic but were not part of the ship's crew. I don't think we should change these fares; instead, we can add an additional feature for these passengers.
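Such a feature could be a simple flag. A minimal sketch, where the column name `Zero_fare` and the `sample` rows are hypothetical:

```python
import pandas as pd

# Illustrative rows only.
sample = pd.DataFrame({
    "Ticket": ["LINE", "112059", "A/5 21171", "LINE"],
    "Fare": [0.0, 0.0, 7.25, 0.0],
})

# 1 if the passenger paid nothing, 0 otherwise; the fare itself is left untouched.
sample["Zero_fare"] = (sample["Fare"] == 0).astype(int)
print(sample)
```

A passenger with a missing fare gets a 0 here, because a missing value does not compare equal to zero.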
We know that there were not enough boats on board the Titanic for all passengers to evacuate. After studying the information about the passengers, we can make some assumptions about who had a better chance of survival in a shipwreck situation, as well as general observations about the passengers.
You can find the second part of my work with the Titanic dataset on Kaggle, which contains the following: