
Secrets, Relationships, and Patterns: An Introduction to Exploratory Data Analysis

by Nadezda Demidova, September 6th, 2023


1. Introduction

When you embark on a voyage into machine learning and data analysis, you need a reliable map to navigate the data's intricate terrain. Enter Exploratory Data Analysis (EDA), the compass that guides us through the data wilderness. EDA is an essential step in the data science process: it means summarizing, visualizing, and transforming data to understand its underlying structure, relationships, and patterns, and to inform decisions about the next steps in analysis and modeling.


In this article, we set sail on a captivating journey through the EDA process, using the legendary Titanic dataset from Kaggle as our North Star 🌟.


To construct effective models, it's essential to:


  • Understand Your Domain: Familiarize yourself with the specific field or context to which the data belongs.
  • Comprehend Data Characteristics: Gain insights into the data's nature, its structural attributes, and how it is organized.
  • Handle Missing Values: Address and manage any instances where data is absent, ensuring it doesn't hinder your analysis.
  • Identify Outliers: Recognize and address any data points that significantly deviate from the norm, which could distort your modeling efforts.
  • Define the Key Question: Clearly articulate the central question or objective you aim to address through your analysis, including the target value you intend to predict or explain.
  • Test and Validate Your Hypotheses: Continuously assess and refine your assumptions and theories about the data as you delve deeper into your analysis.


You can also access my Titanic EDA Notebook on Kaggle!

Content Overview

  1. Introduction

  2. Domain information

  3. Loading libraries:

  4. Loading data

  5. First look: variables, NAs

    5.1 Variables

    5.2 Types of the variables

    5.3 Check data for NA

  6. Exploring the data

    6.1 Survivals - target value

    6.2 AGE

    6.3 What is in the name?

    6.4 Cabin

    6.5 Family

    6.6 Class

    6.7 Gender

    6.8 Embarked

    6.9 Fare

  7. Conclusion

2. Domain information

The Titanic was a British passenger liner operated by the White Star Line. It was on its way from Southampton to New York City when it collided with an iceberg and sank in the North Atlantic Ocean in the early morning hours of 15 April 1912. The ship carried 2224 people, counting passengers and crew; 1514 of them died.


Titanic carried 16 wooden lifeboats and 4 collapsibles, which could accommodate 1178 people: only one-third of the ship's total capacity, and 53% of the people actually on board.
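
The 53% figure is easy to verify with a few lines of arithmetic. This is just a quick sketch using the numbers quoted in this section; the variable names are my own.

```python
# Figures quoted in the text above
people_aboard = 2224    # passengers and crew
lifeboat_seats = 1178   # 16 wooden lifeboats + 4 collapsibles

# Share of the people aboard who could have had a lifeboat seat
seat_share = lifeboat_seats / people_aboard
print(f"Lifeboat seats cover {seat_share:.1%} of the people aboard")

# People left without a seat even if every boat had been filled
without_seat = people_aboard - lifeboat_seats
print("People without a seat:", without_seat)
```

So even with every boat filled to capacity, more than a thousand people would have had no seat.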


At the time, lifeboats were intended to ferry survivors from a sinking ship to a rescuing ship, not to keep the whole population afloat or carry them to shore. If the SS Californian had responded to Titanic's distress calls, the lifeboats might have been adequate to ferry the passengers to safety as planned. That didn't happen, and the only way to survive was to get into a lifeboat.

The main question we will try to answer is “which passengers were more likely to survive?”

3. Loading libraries

List of libraries I am using:


  • pandas - offers data structures and operations for manipulating numerical tables and time series (imported as pd).
  • seaborn - data visualization library based on matplotlib (imported as sns).
  • matplotlib.pyplot - to create some visualizations (imported as plt).
  • numpy - the fundamental package for scientific computing with Python (imported as np).



import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

4. Loading data

As input information I have two CSV files:

  • train.csv - training part of the dataset contains labels and information about passengers.
  • test.csv - testing part of the dataset doesn't contain labels.

In this notebook, I will use all available information (train + test datasets) to perform exploratory data analysis.


  1. First, load both CSV files into two DataFrames, using pandas read_csv method,
  2. and check the shape of the loaded data:
# path to train dataset
train_path = '../input/titanic/train.csv'
# path to test dataset
test_path = '../input/titanic/test.csv'

# Read a comma-separated values (csv) file into pandas DataFrame
train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)

# shape of the data
print('Train shape: ', train_data.shape)
print('Test shape: ', test_data.shape)
Output:
Train shape:  (891, 12)
Test shape:  (418, 11)

The training part contains information about 891 passengers, described by 12 variables, including one target variable.


The testing part contains 418 observations, i.e. information about passengers, described by 11 variables (the test dataset doesn't contain target value.)


  1. Combine test and train data into one "all_data" DataFrame.
    To do so, I create a sequence of DataFrame objects and use pandas concat method. Target values of testing data in the resulting dataset will be NaN.
    Check the shape of the result DataFrame and take a look at the first 4 rows:
# create a sequence of DataFrame objects
frames = [train_data, test_data]
# Concatenate pandas objects along a particular axis 
all_data = pd.concat(frames, sort = False)
# shape of the data
print('All data shape: ', all_data.shape)
# Show first 4 rows of the concatenated DataFrame
all_data.head(4)

Overall, we have information about 1309 passengers. I am guessing this dataset contains data only about passengers, not crew members (we know that Titanic carried 2224 people).
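
To make the behaviour of concat easier to see, here is a minimal sketch on two tiny made-up frames (toy_train and toy_test are hypothetical stand-ins, not the real files). It shows that the rows add up, that the missing target column is filled with NaN, and that the index labels repeat because both parts start counting at 0.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (made-up rows, not real passengers)
toy_train = pd.DataFrame({"PassengerId": [1, 2, 3],
                          "Survived": [0, 1, 1],
                          "Age": [22.0, 38.0, 26.0]})
toy_test = pd.DataFrame({"PassengerId": [4, 5],
                         "Age": [34.5, 47.0]})

# Same call as above: the missing Survived column in the test part becomes NaN
toy_all = pd.concat([toy_train, toy_test], sort=False)

print(toy_all.shape)                     # rows add up, columns are the union
print(toy_all["Survived"].isna().sum())  # one NaN per test row
print(toy_all.index.is_unique)           # both parts start counting at 0

# ignore_index=True is one way to get a fresh 0..n-1 index if unique labels matter
toy_all_reindexed = pd.concat([toy_train, toy_test], sort=False, ignore_index=True)
print(toy_all_reindexed.index.is_unique)
```

The repeated index labels do no harm for the boolean-mask filtering used in this notebook, but they are worth keeping in mind before selecting rows by label.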

5. First look: variables, NAs

5.1 Variables

From the data overview of the competition, we have a description of each variable:

  • PassengerId - unique identifier

  • Survived:

      0 = No
      1 = Yes
    
  • Pclass: Ticket class

      1 = 1st, Upper
      2 = 2nd, Middle
      3 = 3rd, Lower
    
  • Name: full name with a title

  • Sex: gender

  • Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

  • SibSp: Number of siblings/spouses aboard the Titanic. The dataset defines family relations in this way:

      Sibling = brother, sister, stepbrother, stepsister
      Spouse = husband, wife (mistresses and fiancés were ignored)
    
  • Parch: Number of parents/children aboard the Titanic. The dataset defines family relations in this way:

      Parent = mother, father
      Child = daughter, son, stepdaughter, stepson
      Some children travelled only with a nanny, therefore parch=0 for them.
    
  • Ticket: Ticket number.

  • Fare: Passenger fare.

  • Cabin: Cabin number.

  • Embarked: Port of Embarkation:

      C = Cherbourg
      Q = Queenstown
      S = Southampton
    



5.2 Types of the variables

Data types, non-null values count:

all_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB

Age and Fare are continuous numeric variables.


Pclass is an integer, but in fact, it is a categorical variable represented by three numbers.
After concatenation, the Survived variable has type float (the test rows hold NaN, which forces a float dtype). Strictly speaking that is not correct, since it is a categorical variable too, but it will not influence my EDA process, so I will let it float for now.
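
A small sketch of why the dtype changes, and of two possible ways to express these variables more precisely if it ever matters. The series below are toy values, and the conversions are options I am illustrating, not steps taken in this notebook.

```python
import pandas as pd

# Toy column: labels for three "train" rows, unknown for two "test" rows
survived = pd.Series([0, 1, 1, None, None])
print(survived.dtype)   # float64, because NaN has no plain integer representation

# The nullable integer dtype keeps whole numbers and marks the gaps as <NA>
survived_int = survived.astype("Int64")
print(survived_int.dtype)

# Pclass is stored as int64 but behaves like an ordered category
pclass = pd.Series([3, 1, 3, 2, 1])
pclass_cat = pd.Categorical(pclass, categories=[1, 2, 3], ordered=True)
print(pclass_cat)
```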

5.3 Check data for NA

To check the dataset for NAs, I use the DataFrame method isna(), which returns a boolean object of the same size indicating whether each value is NA. Summing the True values gives the number of NAs for each variable.


NA values for each dataframe (train, test, all) are presented in the table below:

# check data for NA values
all_data_NA = all_data.isna().sum()
train_NA = train_data.isna().sum()
test_NA = test_data.isna().sum()

pd.concat([train_NA, test_NA, all_data_NA], 
            axis=1, 
            sort = False,
            keys = ['Train NA', 'Test NA', 'All NA'])

There are 263 missing Age values, 1 missing Fare, 1014 NAs in Cabin variable, and 2 in Embarked variable.
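
Counts are easier to judge as shares of the whole dataset. The sketch below converts the counts reported above into percentages of the 1309 rows; on the real frame, all_data.isna().mean() gives the same fractions directly.

```python
# Missing-value counts reported above for the combined dataset
total_rows = 1309
missing = {"Age": 263, "Fare": 1, "Cabin": 1014, "Embarked": 2}

# Convert each count to a percentage of all rows
missing_pct = {col: round(100 * n / total_rows, 1) for col, n in missing.items()}
for col, pct in missing_pct.items():
    print(f"{col}: {pct}% missing")
```

Roughly a fifth of the ages and more than three quarters of the cabin numbers are missing, which is why these two variables get the most attention below.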


The 418 NAs in the Survived variable come from the test dataset, which has no labels. I will not impute these missing values in the current notebook :) So, when I use this variable for visualization, there will be information only for the training part of the data.


In this notebook, I will do some missing data handling for the combined dataset. But in the second part of my work (ML solution), this should be done based on what we know only about training data, to avoid any data leakage.
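
To illustrate what "based only on training data" could look like, here is a minimal sketch on made-up ages. The median is just one possible statistic, chosen here as an assumption for the example: it is learned from the training rows and then reused unchanged for the test rows.

```python
import pandas as pd

# Toy data (made-up ages); None marks a missing value
toy_train = pd.DataFrame({"Age": [20.0, 30.0, None, 40.0]})
toy_test = pd.DataFrame({"Age": [None, 60.0]})

# The statistic is learned from the training rows only
train_median_age = toy_train["Age"].median()

# ...and then reused, unchanged, for both parts
toy_train["Age_filled"] = toy_train["Age"].fillna(train_median_age)
toy_test["Age_filled"] = toy_test["Age"].fillna(train_median_age)

print(train_median_age)
print(toy_test)
```

The test value of 60 never influences the fill value, which is the point: nothing about the test set leaks into the preprocessing.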

6. Exploring the data

6.1 Survivals - target value

Let's calculate and visualize the distribution of our target variable - 'Survived.’


The countplot function of the seaborn module is a very useful way to show the counts of observations in each category.


Since we have a target only for the training part, these numbers don't include all passengers.

# set size of the plot
plt.figure(figsize=(6, 4.5)) 

# countplot shows the counts of observations in each categorical bin using bars.
# x - name of the categorical variable
ax = sns.countplot(x = 'Survived', data = all_data, palette=["#3f3e6fd1", "#85c6a9"])

# set the current tick locations and labels of the x-axis.
plt.xticks( np.arange(2), ['drowned', 'survived'] )
# set title
plt.title('Overall survival (training dataset)',fontsize= 14)
# set x label
plt.xlabel('Passenger status after the tragedy')
# set y label
plt.ylabel('Number of passengers')

# calculate passengers for each category
labels = (all_data['Survived'].value_counts())
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v-40, str(v), horizontalalignment = 'center', size = 14, color = 'w', fontweight = 'bold')
    
plt.show()

We have 891 passengers in the train dataset. 549 (61.6%) of them drowned and only 342 (38.4%) survived.
But we know that the lifeboats could carry 53% of the people on board.
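
The percentages can be reproduced with value_counts(normalize=True). In this sketch I rebuild the target column from the two counts reported above instead of loading the file, so it runs on its own.

```python
import pandas as pd

# Rebuild the target column from the reported counts (0 = drowned, 1 = survived)
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns shares instead of counts; sort_index keeps the 0, 1 order
shares = survived.value_counts(normalize=True).sort_index()
print((shares * 100).round(1))
```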

6.2 AGE

What is the age of passengers, how does it relate to the chances of survival, and how does it change depending on class and gender?

6.2.1 Age distribution

We have 263 missing values:

  • 177 missing in the training dataset
  • 86 in the test dataset

Overall age distribution (seaborn distplot) and descriptive statistics:

# set plot size
plt.figure(figsize=(15, 3))

# plot a univariate distribution of Age observations
# (distplot is deprecated since seaborn 0.11; histplot with kde=True is its replacement)
sns.distplot(all_data[(all_data["Age"] > 0)].Age, kde_kws={"lw": 3}, bins = 50)

# set titles and labels
plt.title('Distribution of passengers age (all data)',fontsize= 14)
plt.xlabel('Age')
plt.ylabel('Frequency')
# clean layout
plt.tight_layout()


# Descriptive statistics include those that summarize the central tendency, 
# dispersion and shape of a dataset’s distribution, excluding NaN values.
age_distr = pd.DataFrame(all_data['Age'].describe())
# Transpose index and columns.
age_distr.transpose()

The distribution of Age is slightly right-skewed. Age varies from about 0.17 years to 80 years with mean = 29.88, and there don't seem to be any obvious outliers, but we will check.

6.2.2 Age by surviving status

Did age have a big influence on chances to survive?


To visualize two age distributions grouped by survival status, I am using a boxplot and a stripplot shown together:

plt.figure(figsize=(15, 3))

# Draw a box plot to show Age distributions with respect to survival status.
sns.boxplot(y = 'Survived', x = 'Age', data = train_data,
     palette=["#3f3e6fd1", "#85c6a9"], fliersize = 0, orient = 'h')

# Add a scatterplot for each category.
sns.stripplot(y = 'Survived', x = 'Age', data = train_data,
     linewidth = 0.6, palette=["#3f3e6fd1", "#85c6a9"], orient = 'h')

plt.yticks( np.arange(2), ['drowned', 'survived'])
plt.title('Age distribution grouped by surviving status (train data)',fontsize= 14)
plt.ylabel('Passenger status after the tragedy')
plt.tight_layout()

# Descriptive statistics:
pd.DataFrame(all_data.groupby('Survived')['Age'].describe())

The mean age of surviving passengers is 28.34, which is 2.28 years lower than the mean age of drowned passengers (counting only passengers whose survival status we know).


The minimum age of drowned passengers is 1 y.o., which is very sad.


The maximum age of surviving passengers is 80 y.o. Let's check whether this is a mistake.


all_data[all_data['Age'] == all_data['Age'].max()]

Actually, Mr Algernon Henry Barkworth was born on 4 June 1864. He was 48 in 1912 and died in 1945 at 80 y.o.

train_data.loc[train_data['PassengerId'] == 631, 'Age'] = 48
all_data.loc[all_data['PassengerId'] == 631, 'Age'] = 48

# Descriptive statistics:
pd.DataFrame(all_data.groupby('Survived')['Age'].describe())

Let's update our description:


The mean age of surviving passengers is 28.23, which is 2.39 years lower than the mean age of drowned passengers (counting only passengers whose survival status we know). The maximum age of surviving passengers is 63 y.o.


It looks like there is a slightly bigger chance to survive for younger people.

6.2.3 Age by class

Here, I will compare three age distributions grouped by passenger class.
As visualisations, I will use 2 graphs:


  1. boxplot+stripplot as before
  2. kdeplot, to plot age density curves for each class. This method can't handle missing values, so I filter the data before using it.
# set size
plt.figure(figsize=(20, 6))

# set palette
palette = sns.cubehelix_palette(5, start = 3)

plt.subplot(1, 2, 1)
sns.boxplot(x = 'Pclass', y = 'Age', data = all_data,
     palette = palette, fliersize = 0)

sns.stripplot(x = 'Pclass', y = 'Age', data = all_data,
     linewidth = 0.6, palette = palette)
plt.xticks( np.arange(3), ['1st class', '2nd class', '3rd class'])
plt.title('Age distribution grouped by ticket class (all data)',fontsize= 16)
plt.xlabel('Ticket class')


plt.subplot(1, 2, 2)

# To use kdeplot I need to create variables with filtered data for each category
age_1_class = all_data[(all_data["Age"] > 0) & 
                              (all_data["Pclass"] == 1)]
age_2_class = all_data[(all_data["Age"] > 0) & 
                              (all_data["Pclass"] == 2)]
age_3_class = all_data[(all_data["Age"] > 0) & 
                              (all_data["Pclass"] == 3)]

# Plotting the 3 variables that we created
sns.kdeplot(age_1_class["Age"], shade=True, color='#eed4d0', label = '1st class')
sns.kdeplot(age_2_class["Age"], shade=True, color='#cda0aa', label = '2nd class')
sns.kdeplot(age_3_class["Age"], shade=True, color='#a2708e', label = '3rd class')
# show the class labels (newer seaborn versions do not add the legend automatically)
plt.legend()
plt.title('Age distribution grouped by ticket class (all data)',fontsize= 16)
plt.xlabel('Age')
plt.xlim(0, 90)
plt.tight_layout()
plt.show()

# Descriptive statistics:
pd.DataFrame(all_data.groupby('Pclass')['Age'].describe())

The 1st class age distribution is wider than those of the 2nd and 3rd classes and is almost symmetric.


The age distributions of both the 2nd and 3rd classes are right-skewed.
The youngest passenger has a 3rd class ticket, age = 0.17.
The oldest passenger has a 1st class ticket, age = 76.
The 3rd class mean age is 24.8, the 2nd class mean age is 29.5, and the 1st class mean age is 39.1.


Since surviving passengers, on average, were younger than those who drowned, does it mean that 3rd class passengers had more chances to survive? We will discover it later.


From the graphs, we can see the difference in age distribution between classes. So when I do missing data imputation, I will take class into account.
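
One possible way to take class into account is to fill each missing age with the median age of the passenger's own class. The sketch below shows the idea on made-up values; the choice of the median and the column name Age_filled are my assumptions for the illustration, not a step this notebook performs.

```python
import pandas as pd

# Toy passengers (made-up values): one missing Age in 1st class, one in 3rd
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, None, 20.0, 24.0, None],
})

# Median age inside each class, broadcast back to the original row order
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill each gap with the median of the passenger's own class
toy["Age_filled"] = toy["Age"].fillna(class_median)
print(toy)
```

With a single overall median, both gaps would get the same value; here the 1st class passenger gets 45 and the 3rd class passenger gets 22.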

6.2.4 Age vs. class vs. gender

Since there is such a noticeable age difference between classes, I will compare the age distribution by gender separately for each class.

# Descriptive statistics:
age_1_class_stat = pd.DataFrame(age_1_class.groupby('Sex')['Age'].describe())
age_2_class_stat = pd.DataFrame(age_2_class.groupby('Sex')['Age'].describe())
age_3_class_stat = pd.DataFrame(age_3_class.groupby('Sex')['Age'].describe())

pd.concat([age_1_class_stat, age_2_class_stat, age_3_class_stat], axis=0, sort = False, keys = ['1st', '2nd', '3rd'])

The oldest and the youngest passengers are female.
In each class, the average age of female passengers is slightly lower than the average age of male passengers.

6.3 What is in the name?

Each passenger Name value contains the title of the passenger, which we can extract and discover.

To create a new variable "Title":


  1. I use the 'split' method with a comma as the separator to divide Name into two parts and keep the second part
  2. I split the saved part by a dot and keep the first part of the result
  3. To remove spaces around the title, I use the 'strip' method

To visualize how many passengers hold each title, I chose countplot.

all_data['Title'] = all_data['Name'].str.split(',', expand = True)[1].str.split('.', expand = True)[0].str.strip(' ')

plt.figure(figsize=(6, 5))
ax = sns.countplot( x = 'Title', 
                    data = all_data, 
                    palette = "hls", 
                    order = all_data['Title'].value_counts().index)
_ = plt.xticks(
    rotation=45, 
    horizontalalignment='right',
    fontweight='light'  
)

plt.title('Passengers distribution by titles',fontsize= 14)
plt.ylabel('Number of passengers')

# calculate passengers for each category
labels = (all_data['Title'].value_counts())
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+10, str(v), 
            horizontalalignment = 'center', 
            size = 10, 
            color = 'black')

plt.tight_layout()
plt.show()

The most frequent title among passengers is Mister (Mr.), the general title of respect for an adult male. The second most frequent title is Miss (unmarried woman), and the third is Mrs. (married woman).
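
As a side note, the same titles can be pulled out with a single regular expression instead of the chained split calls. This is an alternative sketch, not what the notebook uses; the strings below are samples in the "Surname, Title. Given names" layout of the Name column.

```python
import pandas as pd

# Sample strings in the layout of the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first dot
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```

The pattern skips the whitespace after the comma, so no separate strip step is needed, and multi-word titles such as "the Countess" are kept whole.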


Other titles are less frequent. I will check whether I can combine them into particular groups. I am going to use titles as a feature, but if they split the data too much, leaving just a few observations in each group, it can lead to overfitting. And for a general understanding of the data, it will be more convenient to put titles in clearer groups.


  • Master - By the late 19th century, etiquette dictated that men be addressed as Mister and boys as Master.
  • Mme - Madame, a French title of respect equivalent to “Mrs.”, used alone or prefixed to a woman's married name or title. I will add it to the "Mrs" group.
  • Mlle - Mademoiselle is a French courtesy title, abbreviated Mlle, traditionally given to an unmarried woman. The equivalent in English is "Miss". I will add to the "Miss" group.
  • Dr. - Doctor is an academic title.
  • Rev. - Reverend is usually a courtesy title for Protestant Christian ministers or pastors.

"Military" group of titles:

  • Capt. - Captain is a title for the commander of a military unit
  • Major is a military rank of commissioned officer status
  • Col. - The honorary title of Colonel is conferred by several states in the US and certain military units of the Commonwealth of Nations

"Honor" group of titles:

  • Sir - is a formal English honorific address for men. Sir is used for men titled knights i.e. of orders of chivalry, and later also to baronets and other offices.

  • The Countess - is a historical title of nobility

  • Lady - a formal title in the United Kingdom for a woman with a title of nobility or an honorary title

  • Jonkheer - is an honorific in the Low Countries denoting the lowest rank within the nobility.

  • Don - is an honorific prefix primarily used in Spain and the former Spanish Empire, Italy, Portugal, the Philippines, Latin America, Croatia, and Goa. (male)

  • Dona - Feminine form for don (honorific), a Spanish, Portuguese, southern Italian, and Filipino title, given as a mark of respect


I am not sure about the title Ms. We have only two passengers with this title, so I will convert it to Miss.

I created a dictionary of titles, and I use the "map" method to create the variable "Title_category".


all_data[all_data['Title']=='Ms']

title_dict = {  'Mr':     'Mr',
                'Mrs':    'Mrs',
                'Miss':   'Miss',
                'Master': 'Master',
              
                'Ms':     'Miss',
                'Mme':    'Mrs',
                'Mlle':   'Miss',

                'Capt':   'military',
                'Col':    'military',
                'Major':  'military',

                'Dr':     'Dr',
                'Rev':    'Rev',
                  
                'Sir':    'honor',
                'the Countess': 'honor',
                'Lady':   'honor',
                'Jonkheer': 'honor',
                'Don':    'honor',
                'Dona':   'honor' }

# map titles to category
all_data['Title_category'] = all_data['Title'].map(title_dict)


fig = plt.figure(figsize=(12, 5))


ax1 = fig.add_subplot(121)
ax = sns.countplot(x = 'Title_category', 
                   data = all_data, palette = "hls", 
                   order = all_data['Title_category'].value_counts().index)
_ = plt.xticks(
    rotation=45, 
    horizontalalignment='right',
    fontweight='light'  
)
plt.title('Passengers distribution by titles',fontsize= 12)
plt.ylabel('Number of passengers')

# calculate passengers for each category
labels = (all_data['Title_category'].value_counts())
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+10, str(v), horizontalalignment = 'center', size = 10, color = 'black')
    

plt.tight_layout()

ax2 = fig.add_subplot(122)
surv_by_title_cat = all_data.groupby('Title_category')['Survived'].value_counts(normalize = True).unstack()
surv_by_title_cat = surv_by_title_cat.sort_values(by=1, ascending = False)
surv_by_title_cat.plot(kind='bar', 
                       stacked=True, 
                       color=["#3f3e6fd1", "#85c6a9"], ax = ax2)

plt.legend( ( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(
    rotation=45, 
    horizontalalignment='right',
    fontweight='light'  
)


plt.title('Proportion of survived/drowned by titles (train data)',fontsize= 12)

plt.tight_layout()
plt.show()
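
One thing worth checking after mapping with a dictionary: map() silently returns NaN for every value the dictionary does not contain. The sketch below shows a possible sanity check on toy values; 'Sgt' is a made-up title that is deliberately left out of the small example dictionary.

```python
import pandas as pd

# Toy titles; 'Sgt' is a made-up value that is missing from the dictionary
titles = pd.Series(["Mr", "Mlle", "Capt", "Sgt"])
toy_title_dict = {"Mr": "Mr", "Mlle": "Miss", "Capt": "military"}

# map() returns NaN for any key the dictionary does not contain
categories = titles.map(toy_title_dict)

# List the raw titles that fell through, so the dictionary can be extended
unmapped = titles[categories.isna()].unique().tolist()
print(categories.tolist())
print("Unmapped titles:", unmapped)
```

An empty list means every title found a category; anything else points to a key that still needs to be added.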


  • The smallest group is "honor": passengers with nobility-style titles.

Training data:

  • The biggest proportion of survivors is in the "Mrs" group - married women.

  • More than 80% drowned in the "Mr." group.

  • Nobody survived among the Reverend group.


category_survived = sns.catplot(x="Title_category",  col="Survived",
                data = all_data, kind="count",
                height=4, aspect=.7)

category_survived.set_xticklabels(rotation=45, 
    horizontalalignment='right',
    fontweight='light')

plt.tight_layout()

If we consider the survivors not by percentage within each group but by comparing the number of survivors between groups, then the "Miss" title category is the luckiest one. The "Mr" category lost the biggest number of passengers.
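
The difference between the two views (absolute counts versus shares within a group) can be shown with pd.crosstab. The data below is made up purely to illustrate the effect: the larger group has more survivors in absolute terms but a much lower survival rate.

```python
import pandas as pd

# Toy data (made-up): a large group with a low survival rate, a small group with a high one
toy = pd.DataFrame({
    "Title_category": ["Mr"] * 20 + ["Mrs"] * 4,
    "Survived":       [1] * 5 + [0] * 15 + [1, 1, 1, 0],
})

# Absolute number of passengers per group and outcome
counts = pd.crosstab(toy["Title_category"], toy["Survived"])

# Row-wise shares: each group sums to 1
rates = pd.crosstab(toy["Title_category"], toy["Survived"], normalize="index")

print(counts)
print(rates.round(2))
```

This is why both kinds of charts are useful: counts tell us where most of the passengers are, while rates tell us how a typical member of each group fared.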


Let's also visualize how Title categories and ticket classes are related:

class_by_title_cat = all_data.groupby('Title_category')['Pclass'].value_counts(normalize = True)
class_by_title_cat = class_by_title_cat.unstack().sort_values(by = 1, ascending = False)
class_by_title_cat.plot(kind='bar', 
                        stacked=True, 
                        color = ['#eed4d0', '#cda0aa', '#a2708e'])
plt.legend(loc=(1.04,0))
_ = plt.xticks(
    rotation = 45, 
    horizontalalignment = 'right',
    fontweight = 'light'  
)


plt.title('Proportion of 1st/2nd/3rd ticket class in each title category',fontsize= 14)
plt.xlabel('Category of the Title')
plt.tight_layout()


  • All honor and military titles occupied the 1st class.
  • All Reverends occupied 2nd class.
  • The biggest percentage of the 3rd class is in the Master category.

For sure, there is a relationship between variables, and survival was influenced not only by the title itself but by a combination of factors that are, to some extent, interrelated. How could class relate to survival? Let's go further and discover.


6.4 Cabin

From the cabin number, we can extract the first letter, which tells us where the cabin was located on the ship! This seems like very important knowledge to me:


  • How close the cabin was to the lifeboats
  • How far it was from the most damaged parts of the ship
  • How close it was to people who had information about what was happening and how to act
  • How many obstacles passengers had on the way to the lifeboats

I found some descriptions of each Titanic deck:

There were 8 decks: the upper deck held the lifeboats, and the other 7 were below it and were identified by letters:

  • A: it did not run the entire length of the vessel (i.e. it did not reach from the stern to the bow of the vessel), and was intended for passengers of the 1st class.
  • B: it did not run the entire length of the ship (it was interrupted by 37 meters above the C deck, and served as a place for anchors in the front).
  • C: in the front part of the galley, there was a dining room for the crew, as well as a walking area for passengers of the 3rd class.
  • D: a walking area for passengers.
  • E: cabins of the 1st and 2nd class.
  • F: part of the passenger cabins of the 2nd class, most of the cabins of the 3rd class.
  • G: did not run the entire length of the ship, the boiler rooms were located in the center.
  • T - boat deck?

For passengers without deck information, I will impute the letter U (for unknown).

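# the first character of the cabin number is the deck letter (NaN stays NaN)
all_data['deck'] = all_data['Cabin'].str[0]
# passengers without a cabin number get 'U' (unknown)
all_data.loc[all_data['deck'].isna(), 'deck'] = 'U'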
print('Unique deck letters from the cabin numbers:', all_data['deck'].unique())


Unique deck letters from the cabin numbers: ['U' 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
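
Here is a small self-contained check of the deck extraction on a few sample Cabin values, including a multi-cabin entry and a missing one. It uses string indexing (.str[0]) to take the first character; the sample values are chosen for illustration.

```python
import pandas as pd

# Sample Cabin values in the formats that occur in the column, plus a missing one
cabins = pd.Series(["C85", "C23 C25 C27", "F G73", None])

# .str[0] takes the first character; missing cabins stay NaN and are then set to 'U'
deck = cabins.str[0].fillna("U")
print(deck.tolist())
```

Note that for entries listing several cabins, only the first letter is kept, so 'F G73' is counted as deck F.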

fig = plt.figure(figsize=(20, 5))

ax1 = fig.add_subplot(131)
sns.countplot(x = 'deck', 
              data = all_data, 
              palette = "hls", 
              order = all_data['deck'].value_counts().index, ax = ax1)
plt.title('Passengers distribution by deck',fontsize= 16)
plt.ylabel('Number of passengers')

ax2 = fig.add_subplot(132)
deck_by_class = all_data.groupby('deck')['Pclass'].value_counts(normalize = True).unstack()
deck_by_class.plot(kind='bar', 
                    stacked=True,
                    color = ['#eed4d0', '#cda0aa', '#a2708e'], ax = ax2)
plt.legend(('1st class', '2nd class', '3rd class'), loc=(1.04,0))
plt.title('Proportion of classes on each deck',fontsize= 16)
plt.xticks(rotation = False)

ax3 = fig.add_subplot(133)
deck_by_survived = all_data.groupby('deck')['Survived'].value_counts(normalize = True).unstack()
deck_by_survived = deck_by_survived.sort_values(by = 1, ascending = False)
deck_by_survived.plot(kind='bar', 
                      stacked=True, 
                      color=["#3f3e6fd1", "#85c6a9"], ax = ax3)
plt.title('Proportion of survived/drowned passengers by deck',fontsize= 16)
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
plt.xticks(rotation = False)
plt.tight_layout()

plt.show()


Most passengers don't have cabin numbers ('U').


The largest group of passengers with known cabin numbers was located on the 'C' deck and had 1st class tickets. The 'C' deck is fifth by percentage of survivors.

The highest survival rate (among passengers with known cabin numbers in the training dataset) belongs to passengers from deck 'D'.

Deck A was the closest to the deck with lifeboats, but it was last in survival rate (excluding the unknown and T decks). How did that happen?

all_data[(all_data['deck']=='A') & (all_data['Survived']==0)]

I got curious, so I read a bit about some of these passengers:


John Hugo Ross: when he boarded on 10 April 1912, he was so ill from dysentery he had to be carried to his cabin on a stretcher. When Ross was told the ship had struck an iceberg and that he should get dressed, Ross refused to believe the trouble was serious. "Is that all?" he told Peuchen. "It will take more than an iceberg to get me off this ship." Presumably, Ross drowned in his bed.


Andrews, Mr. Thomas Jr. was a managing director of H&W (built the Titanic) in charge of designing and was familiar with every detail of the construction of the firm's ships. He helped to evacuate people.


Roebling, Mr. Washington Augustus II helped to evacuate people as well.


Clearly, no algorithm can predict survival with 100 percent accuracy from a passenger's location on the ship or age alone, since human factors and the unpredictability of the emergency played a part in the rescue process.


For the training process, it will be better to merge passengers from the T deck into the A deck group.


6.5 Family

Does the size of the family on board affect the chances of surviving a disaster? Does having children increase the chance of getting into a boat, or is it easier to survive being single?

I calculate the family size by adding the number of siblings/spouses (SibSp) to the number of parents/children (Parch), plus 1 (the passenger himself).


Family size = SibSp + Parch + 1


6.5.1 Calculate family size

all_data['Family_size'] = all_data['SibSp'] + all_data['Parch'] + 1
family_size = all_data['Family_size'].value_counts()
print('Family size and number of passengers:')
print(family_size)

Family size and number of passengers:
1     790
2     235
3     159
4      43
6      25
5      22
7      16
11     11
8       8


Family size 7

It looks strange that there are 16 passengers with a family size of 7, for example. Let's check!

Also, I will add a Surname variable by extracting the part of the name before the comma.

all_data['Surname'] = all_data['Name'].str.split(',', expand = True)[0]


  1. Let's group people with family size = 7 by Surname. We have 9 Anderssons who have a family size of 7:
all_data[all_data['Family_size'] == 7]['Surname'].value_counts()

Andersson    9
Asplund      7

all_data[(all_data['Family_size'] == 7) & (all_data['Surname']=='Andersson')]

  2. Let's group the Anderssons with a 7-person family by ticket number.


Seven of them used the same ticket and travelled together: 5 children (each of them has 4 siblings) and 2 parents. Two passengers used separate tickets.

all_data[(all_data['Family_size'] == 7) & (all_data['Surname']=='Andersson')].Ticket.value_counts()

347082     7
3101281    1
347091     1
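
The family arithmetic for the shared-ticket group can be checked by hand. The sketch below rebuilds the SibSp and Parch values implied by the description (5 children and 2 parents) and applies the family size formula; the role column is only a label I added for readability.

```python
import pandas as pd

# The shared-ticket group described above: 5 children and 2 parents
# SibSp counts siblings/spouses, Parch counts parents/children
family = pd.DataFrame({
    "role":  ["child"] * 5 + ["parent"] * 2,
    "SibSp": [4] * 5 + [1] * 2,   # each child has 4 siblings; each parent has 1 spouse
    "Parch": [2] * 5 + [5] * 2,   # each child has 2 parents; each parent has 5 children
})

family["Family_size"] = family["SibSp"] + family["Parch"] + 1
print(family)
```

Every member of a consistent 7-person family ends up with Family_size = 7, so the number of passengers in the group and the declared size agree. That is exactly what fails for the two passengers on separate tickets.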

all_data[(all_data['Ticket'] == '3101281') | (all_data['Ticket'] == '347091')]

It looks like they actually traveled alone, so I will correct that data:

all_data.loc[all_data['PassengerId'] == 69, ['SibSp', 'Parch', 'Family_size']] = [0,0,1]
all_data.loc[all_data['PassengerId'] == 1106, ['SibSp', 'Parch', 'Family_size']] = [0,0,1]
all_data[(all_data['Ticket'] == '3101281') | (all_data['Ticket'] == '347091')]

Family size 5

There are some inconsistencies in other categories, with fewer relatives.

Let's check people with 5-size family and group them by Surname:

all_data[all_data['Family_size'] == 5]['Surname'].value_counts()

Palsson          5
Ryerson          5
Ford             5
Lefebre          5
Kink-Heilmann    1
Hocking          1

all_data[(all_data['Surname'] == 'Kink-Heilmann')&(all_data['Family_size'] == 5)]

Mr. Anton Kink-Heilmann had 2 siblings on the ship. They count as relatives for him but not for his wife, because for her they do not fit the dataset's description of relatives, so the couple's family sizes differ. I will assume that all other mismatches are similar to this one. Since I plan to bin family sizes into groups, this will smooth out possible inconsistencies.
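The manual checks above can be sketched as one generic step: count how many passengers share a surname and a family size, and flag the groups where that count differs from the family size itself. This is only a rough heuristic under my own assumptions (relatives can have different surnames, and unrelated people can share one), and the rows below are invented toy data, not values from the Titanic dataset.

```python
import pandas as pd

# Invented toy rows: surnames and sizes are hypothetical
toy = pd.DataFrame({
    'Surname':     ['Smith'] * 4 + ['Jones'] * 3 + ['Brown'],
    'Family_size': [3, 3, 3, 3, 3, 3, 3, 5],
})

# Number of passengers found for each (Surname, Family_size) pair
group_counts = (toy.groupby(['Surname', 'Family_size'])
                   .size()
                   .reset_index(name='Members_found'))

# A group is suspicious when the number of members found differs from the family size
group_counts['Mismatch'] = group_counts['Members_found'] != group_counts['Family_size']
suspicious = group_counts[group_counts['Mismatch']]
print(suspicious)
```

Groups flagged this way still need a manual look (for example by ticket number), exactly as done for the Anderssons.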


6.5.2 Family size and chances for survival

fig = plt.figure(figsize = (12,4))

ax1 = fig.add_subplot(121)
ax = sns.countplot(x = 'Family_size', data = all_data, ax = ax1)

# calculate passengers for each category, ordered by family size to match the bars
labels = all_data['Family_size'].value_counts().sort_index()
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+6, str(v), horizontalalignment = 'center', size = 10, color = 'black')

plt.title('Passengers distribution by family size')
plt.ylabel('Number of passengers')

ax2 = fig.add_subplot(122)
d = all_data.groupby('Family_size')['Survived'].value_counts(normalize = True).unstack()
d.plot(kind='bar', color=["#3f3e6fd1", "#85c6a9"], stacked='True', ax = ax2)
plt.title('Proportion of survived/drowned passengers by family size (train data)')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
plt.xticks(rotation = False)

plt.tight_layout()

  • There were two large families with sizes 8 and 11, and all their members from the training dataset drowned.
  • Most of the passengers were traveling alone, and their survival rate was fairly low.
  • The biggest proportion of survivors is in the group of passengers with a family size of 4.

We can observe that the percentage of survivors among passengers with a family size of 2, 3, or 4 is greater than among those traveling alone, and that it then decreases as the family size increases.
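When comparing survival rates like this, it helps to keep the group sizes in view, because a rate computed from a handful of passengers is much less reliable. A minimal sketch, using invented toy rows rather than the real data:

```python
import pandas as pd

# Invented toy rows: Survived is 1 for survived, 0 for drowned
toy = pd.DataFrame({
    'Family_size': [1, 1, 1, 1, 2, 2, 3, 3, 8, 8],
    'Survived':    [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the survival rate; count shows how many rows support it
rates = toy.groupby('Family_size')['Survived'].agg(['mean', 'count'])
print(rates)
```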

I will create a 'Family_size_group' variable with four categories:

  • single
  • usual (sizes 2, 3, 4)
  • big (5, 6, 7)
  • large (more than 7)
all_data['Family_size_group'] = all_data['Family_size'].map(lambda x: 'f_single' if x == 1 
                                                            else ('f_usual' if 5 > x >= 2
                                                                  else ('f_big' if 8 > x >= 5 
                                                                       else 'f_large' )
                                                             ))
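The nested lambda works, but the same binning can also be sketched with pd.cut, which keeps the boundaries in one place. This is an alternative under the same thresholds as above, shown on a hypothetical series of family sizes; note that it returns a categorical column rather than plain strings.

```python
import pandas as pd

# Hypothetical family sizes, just to exercise every bin
sizes = pd.Series([1, 2, 4, 5, 7, 8, 11])

# Right-closed bins: (0, 1], (1, 4], (4, 7], (7, inf)
groups = pd.cut(sizes,
                bins=[0, 1, 4, 7, float('inf')],
                labels=['f_single', 'f_usual', 'f_big', 'f_large'])
print(groups.astype(str).tolist())
```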


fig = plt.figure(figsize = (14,5))

ax1 = fig.add_subplot(121)
d = all_data.groupby('Family_size_group')['Survived'].value_counts(normalize = True).unstack()
d = d.sort_values(by = 1, ascending = False)
d.plot(kind='bar', stacked='True', color = ["#3f3e6fd1", "#85c6a9"], ax = ax1)
plt.title('Proportion of survived/drowned passengers by family size (training data)')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=False)

ax2 = fig.add_subplot(122)
d2 = all_data.groupby('Family_size_group')['Pclass'].value_counts(normalize = True).unstack()
d2 = d2.sort_values(by = 1, ascending = False)
d2.plot(kind='bar', stacked='True', color = ['#eed4d0', '#cda0aa', '#a2708e'], ax = ax2)
plt.legend(('1st class', '2nd class', '3rd class'), loc=(1.04,0))
plt.title('Proportion of 1st/2nd/3rd ticket class in family group size')
_ = plt.xticks(rotation=False)

plt.tight_layout()


Large families are all from the 3rd class, and none of their members in the training part of the dataset survived.

The usual family size has the biggest proportion of 1st-class passengers, and it also has the biggest proportion of survivors.

6.6 Class

We have made a lot of assumptions about the survival rate depending on the classes. Let's now look closely at this variable.

6.6.1 Passengers by class

ax = sns.countplot(x = 'Pclass', data = all_data, palette = ['#eed4d0', '#cda0aa', '#a2708e'])
# calculate passengers for each category, ordered by class to match the bars
labels = all_data['Pclass'].value_counts().sort_index()
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+2, str(v), horizontalalignment = 'center', 
                            size = 12, color = 'black', fontweight = 'bold')
plt.title('Passengers distribution by class')
plt.ylabel('Number of passengers')
plt.tight_layout()

Most of the Titanic's passengers were traveling third class (709).

The second class is the smallest in terms of the number of passengers.

6.6.2 Class vs surviving status


fig = plt.figure(figsize=(14, 5))

ax1 = fig.add_subplot(121)
sns.countplot(x = 'Pclass', hue = 'Survived',
               data = all_data, palette=["#3f3e6fd1", "#85c6a9"], ax = ax1)
plt.title('Number of survived/drowned passengers by class (train data)')
plt.ylabel('Number of passengers')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=False)

ax2 = fig.add_subplot(122)
d = all_data.groupby('Pclass')['Survived'].value_counts(normalize = True).unstack()
d.plot(kind='bar', stacked='True', ax = ax2, color =["#3f3e6fd1", "#85c6a9"])
plt.title('Proportion of survived/drowned passengers by class (train data)')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=False)

plt.tight_layout()


This comes despite the signals identified earlier: on average, older people are more likely to die, the average age in the first class is higher than in the other classes, and passengers on deck A, which is 100% first class, have a large proportion of drowned passengers. Even so, the first class has the largest number of survivors, and the proportion of survivors within the class is the largest.

Third-class tickets had the highest number of drowned passengers, and most of the third-class passengers drowned.
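The proportions behind a chart like this can also be read as a table. Here is a minimal sketch with pd.crosstab, normalised per class so that each row sums to 1; the rows are invented toy data, not the real passenger list.

```python
import pandas as pd

# Invented toy rows: Survived is 1 for survived, 0 for drowned
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# normalize='index' divides each row by its total, giving proportions within each class
rates = pd.crosstab(toy['Pclass'], toy['Survived'], normalize='index')
print(rates)
```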

6.6.3 Class vs. surviving status vs. gender

sns.catplot(x = 'Pclass', hue = 'Survived', 
            col = 'Sex', kind = 'count', 
            data = all_data , palette=["#3f3e6fd1", "#85c6a9"])
plt.tight_layout()

However, most of the male first-class passengers drowned, while almost all of the female first-class passengers survived. In the third class, half of the females survived.

6.6.4 Class vs. Gender vs. Age -> Surviving status

To better understand how a combination of factors influences the chances of survival, let us break the passengers into 18 imaginary groups separated by:

  • Class (1 / 2 / 3)
  • Gender (male/female)
  • Age (<16 / 16-40 / >40)

To do so, I will create 6 stripplots (3 for male, 3 for female), with values grouped by Surviving status, and add background colour to separate age groups:


plt.figure(figsize=(20, 10))

palette=["#3f3e6fd1", "#85c6a9"]

plt.subplot(2, 3, 1)
sns.stripplot(x = 'Survived', y = 'Age', data = age_1_class[age_1_class['Sex']=='male'],
     linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#e1f3f6")
plt.axhspan(16, 40, color = "#bde6dd")
plt.axhspan(40, 80, color = "#83ceb9")
plt.title('Age distribution (males, 1st class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)

plt.subplot(2, 3, 2)
sns.stripplot(x = 'Survived', y = 'Age', data = age_2_class[age_2_class['Sex']=='male'],
     linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#e1f3f6")
plt.axhspan(16, 40, color = "#bde6dd")
plt.axhspan(40, 80, color = "#83ceb9")
plt.title('Age distribution (males, 2nd class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)

plt.subplot(2, 3, 3)
sns.stripplot(x = 'Survived', y = 'Age', data = age_3_class[age_3_class['Sex']=='male'],
    linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#e1f3f6")
plt.axhspan(16, 40, color = "#bde6dd")
plt.axhspan(40, 80, color = "#83ceb9")
plt.title('Age distribution (males, 3rd class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)

plt.subplot(2, 3, 4)
sns.stripplot(x = 'Survived', y = 'Age', data = age_1_class[age_1_class['Sex']=='female'],
     linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#ffff9978")
plt.axhspan(16, 40, color = "#ffff97bf")
plt.axhspan(40, 80, color = "#ffed97bf")
plt.title('Age distribution (females, 1st class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)

plt.subplot(2, 3, 5)
sns.stripplot(x = 'Survived', y = 'Age', data = age_2_class[age_2_class['Sex']=='female'],
     linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#ffff9978")
plt.axhspan(16, 40, color = "#ffff97bf")
plt.axhspan(40, 80, color = "#ffed97bf")
plt.title('Age distribution (females, 2nd class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)

plt.subplot(2, 3, 6)
sns.stripplot(x = 'Survived', y = 'Age', data = age_3_class[age_3_class['Sex']=='female'],
     linewidth = 0.9, palette = palette)
plt.axhspan(0, 16, color = "#ffff9978")
plt.axhspan(16, 40, color = "#ffff97bf")
plt.axhspan(40, 80, color = "#ffed97bf")
plt.title('Age distribution (females, 3rd class)',fontsize= 14)
plt.xticks( np.arange(2), ['drowned', 'survived'])
plt.ylim(0, 80)

plt.show()

From these graphs (training data only), we can see that:

  • Only one child (<16) from the 1st and 2nd classes drowned (a female from the 1st class).
  • Children from the 3rd class were not so lucky: the chances of survival for passengers under 16 look close to 50/50 for both males and females.
  • Most females from the 1st and 2nd classes survived, without much difference by age.
  • All but one of the 3rd-class females in the 40+ age group drowned.
  • The picture is similar for males in the 2nd and 3rd classes in the 40+ age group: only 2 from each class survived.
  • For 40+ males from the 1st class, the situation was slightly different: more of them survived.
  • The largest concentration of drowned passengers is among 3rd-class males aged 16-40.
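The same 18 groups can also be summarised numerically instead of being read off the plots. The sketch below bins Age with pd.cut and computes the survival rate and group size for each class, gender, and age group combination. The exact bin edges are my assumption (an age of exactly 16 or 40 falls into the older group), and the rows are invented toy data.

```python
import pandas as pd

# Invented toy rows, not the real passenger list
toy = pd.DataFrame({
    'Pclass':   [1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'male', 'male', 'male', 'female', 'female'],
    'Age':      [35.0, 50.0, 22.0, 30.0, 4.0, 10.0],
    'Survived': [1, 0, 0, 0, 1, 0],
})

# Left-closed bins: [0, 16), [16, 40), [40, inf)
toy['Age_group'] = pd.cut(toy['Age'], bins=[0, 16, 40, float('inf')],
                          right=False, labels=['<16', '16-40', '40+'])

# observed=True keeps only the combinations that actually occur in the data
summary = (toy.groupby(['Pclass', 'Sex', 'Age_group'], observed=True)['Survived']
              .agg(['mean', 'count']))
print(summary)
```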

6.7 Gender

Let's explore gender a little more:


plt.figure(figsize = (15,4))

plt.subplot (1,3,1)
ax = sns.countplot(x = 'Sex', data = all_data, palette="Set3")
plt.title('Number of passengers by Sex')
plt.ylabel('Number of passengers')

# calculate passengers for each category
labels = (all_data['Sex'].value_counts())
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+10, str(v), horizontalalignment = 'center', size = 10, color = 'black')

plt.subplot (1,3,2)
sns.countplot( x = 'Pclass', data = all_data, hue = 'Sex', palette="Set3")
plt.title('Number of male/female passengers by class')
plt.ylabel('Number of passengers')
plt.legend( loc=(1.04,0))

plt.subplot (1,3,3)
sns.countplot( x = 'Family_size_group', data = all_data, hue = 'Sex',
               order = all_data['Family_size_group'].value_counts().index , palette="Set3")
plt.title('Number of male/female passengers by family size')
plt.ylabel('Number of passengers')
plt.legend( loc=(1.04,0))

plt.tight_layout()


There were more males than females on board overall, and this holds for each ticket class; in the 3rd class, there were more than twice as many males as females.

Almost 600 male passengers traveled without family members, compared with only about 200 females, but in usual and big families there were slightly more female passengers.


6.8 Embarked

Titanic had 3 embarkation points before the ship started its route to New York:


  • Southampton
  • Cherbourg
  • Queenstown

Some passengers could have left Titanic in Cherbourg or Queenstown and avoided catastrophe. Also, the point of embarkation could have an influence on ticket fare and location on the ship.

Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


fig = plt.figure(figsize = (15,4))
ax1 = fig.add_subplot(131)
palette = sns.cubehelix_palette(5, start = 2)
ax = sns.countplot(x = 'Embarked', data = all_data, palette = palette, order = ['C', 'Q', 'S'], ax = ax1)
plt.title('Number of passengers by Embarked')
plt.ylabel('Number of passengers')
# calculate passengers for each category
labels = (all_data['Embarked'].value_counts())
labels = labels.sort_index()
# add result numbers on barchart
for i, v in enumerate(labels):
   ax.text(i, v+10, str(v), horizontalalignment = 'center', size = 10, color = 'black')

ax2 = fig.add_subplot(132)
surv_by_emb = all_data.groupby('Embarked')['Survived'].value_counts(normalize = True)
surv_by_emb = surv_by_emb.unstack().sort_index()
surv_by_emb.plot(kind='bar', stacked='True', color=["#3f3e6fd1", "#85c6a9"], ax = ax2)
plt.title('Proportion of survived/drowned passengers by Embarked (train data)')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=False)

ax3 = fig.add_subplot(133)
class_by_emb = all_data.groupby('Embarked')['Pclass'].value_counts(normalize = True)
class_by_emb = class_by_emb.unstack().sort_index()
class_by_emb.plot(kind='bar', stacked='True', color = ['#eed4d0', '#cda0aa', '#a2708e'], ax = ax3)
plt.legend(('1st class', '2nd class', '3rd class'), loc=(1.04,0))
plt.title('Proportion of classes by Embarked')
_ = plt.xticks(rotation=False)

plt.tight_layout()

  • Most passengers (914) embarked in Southampton. Southampton also has the biggest proportion of drowned passengers.
  • 270 passengers embarked in Cherbourg, and more than 50% of them survived (in the training dataset).
  • 123 passengers embarked in Queenstown, the vast majority of them 3rd-class passengers.


sns.catplot(x="Embarked", y="Fare", kind="violin", inner=None,
            data=all_data, height = 6, palette = palette, order = ['C', 'Q', 'S'])
plt.title('Distribution of Fare by Embarked')
plt.tight_layout()

# Descriptive statistics:
pd.DataFrame(all_data.groupby('Embarked')['Fare'].describe())

  • The fare distribution is widest among passengers who embarked in Cherbourg. This makes sense: many first-class passengers boarded the ship here, but the share of third-class passengers is also quite significant.

  • The fare varies least among passengers who boarded in Queenstown, and their average fare is also the lowest. I think this is because their route was supposed to be the shortest and almost all of them were third-class passengers.


Let's check the NA values of the Embarked variable:

train_data[train_data['Embarked'].isna()]

These two passengers traveled together (same ticket number). To impute the missing values, we can use the mode of Embarked among passengers with the same Pclass and the closest Fare values.
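One possible way to implement that idea is sketched below. It is not the imputation used in the second part of this work: the helper impute_embarked and its fare_window parameter are hypothetical, "closest fare" is interpreted as "within a fixed window around the passenger's fare", and the rows are invented toy data.

```python
import pandas as pd

# Invented toy rows; the last passenger has no port of embarkation
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 1],
    'Fare':     [80.0, 78.0, 83.0, 79.0, 8.0, 80.0],
    'Embarked': ['C', 'C', 'S', 'C', 'Q', None],
})

def impute_embarked(data, fare_window=5.0):
    """Fill missing Embarked with the mode among same-class passengers with a similar fare."""
    data = data.copy()
    known = data[data['Embarked'].notna()]
    for idx, row in data[data['Embarked'].isna()].iterrows():
        similar = known[(known['Pclass'] == row['Pclass'])
                        & ((known['Fare'] - row['Fare']).abs() <= fare_window)]
        # Leave the value missing if no comparable passengers are found
        if not similar.empty:
            data.loc[idx, 'Embarked'] = similar['Embarked'].mode().iloc[0]
    return data

imputed = impute_embarked(toy)
print(imputed)
```

The window size is a judgment call: too narrow and no neighbours are found, too wide and the fare stops being informative.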

6.9 Fare

sns.catplot(x="Pclass", y="Fare", kind="swarm", 
          data=all_data, palette=sns.cubehelix_palette(5, start = 3), height = 6)
plt.tight_layout()

We can observe that the distribution of prices for the second and third class is very similar. The distribution of first-class prices is very different: it has a larger spread, and prices are higher on average.


Let's add colors to our points to indicate the surviving status of the passenger (there will be only data from the training part of the dataset):

sns.catplot(x="Pclass", y="Fare",  hue = "Survived", kind="swarm", data=all_data, 
            palette=["#3f3e6fd1", "#85c6a9"], height = 6)
plt.tight_layout()


It looks like the more a passenger paid, the better their chances of survival.

What about zero fares in the first class? Is it a mistake?


all_data[all_data['Fare'] == all_data['Fare'].min()]

Some of these passengers have "Line" tickets; perhaps they were somehow connected to the Titanic without being part of the ship's crew. I don't think we should change these fares, but we can add an additional feature for these passengers.
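Such a feature could be as simple as a binary flag. A minimal sketch, where the column name Zero_fare is my own placeholder and the fares are invented toy values:

```python
import pandas as pd

# Invented toy fares
toy = pd.DataFrame({'Fare': [0.0, 0.0, 7.25, 71.5]})

# 1 for passengers who paid nothing, 0 otherwise; the original Fare is left untouched
toy['Zero_fare'] = (toy['Fare'] == 0).astype(int)
print(toy)
```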


7. Conclusion

We know that there were not enough lifeboats on board the Titanic for all passengers to evacuate. After studying the information about the passengers, we can make some assumptions about who had a better chance of survival in the shipwreck, as well as some general observations about the passengers.


  • There are 891 passengers in the training dataset; 549 (61.6%) of them drowned, and only 342 (38.4%) survived. Yet we know that the lifeboats (16 wooden lifeboats and four collapsibles) could carry 53% of the total passengers.
  • The age of all passengers varies from about 0.17 years to 80 years, with an average of 29.88. The mean age of passengers who survived is 28.23, which is 2.39 less than the mean age of those who drowned (counting only passengers whose survival status we know). It looks like younger people had a slightly better chance of survival.
  • Exploring the titles of passengers, we can see that the biggest proportion of survivors is in the "Mrs" group (married women). More than 80% of the "Mr" group drowned, and nobody in the Reverend group survived.
  • Most passengers don't have cabin numbers. The largest share of passengers with known cabin numbers were located on deck 'C' and had 1st-class tickets. Deck 'C' ranks fifth by percentage of survivors.
  • The highest survival rate (among passengers with known cabin numbers in the training dataset) belongs to passengers from deck 'D'. Deck 'A' was the closest to the deck with lifeboats, but it ranks last by survival rate.
  • The family size on board also seems to influence the chances of survival: there were two large families with sizes 8 and 11, and all their members in the training dataset drowned. The percentage of survivors among passengers with a family size of 2, 3, or 4 is greater than among those traveling alone, and it then decreases as the family size increases.
  • Most of the Titanic's passengers were traveling third class (709). The second class is the smallest in terms of the number of passengers. The first class has the largest number of survivors, and the proportion of survivors within the class is the largest. This comes despite the signals identified earlier: on average, older people are more likely to die, the average age in the first class is higher than in the other classes, and passengers on deck A, which is 100% first class, have a large proportion of drowned passengers.
  • Third-class tickets had the highest number of drowned passengers, and most of the third-class passengers drowned.
  • However, most of the male first-class passengers drowned, while almost all of the female first-class passengers survived. In the third class, half of the females survived.
  • There were more males than females on board overall, and this holds for each ticket class; in the 3rd class, there were more than twice as many males as females.
  • Almost 600 male passengers traveled without family members, compared with only about 200 females, but in usual and big families there were slightly more female passengers.
  • Most passengers (914) embarked in Southampton. Southampton also has the biggest proportion of drowned passengers. 270 passengers embarked in Cherbourg, and more than 50% of them survived (in the training dataset). 123 passengers embarked in Queenstown, the vast majority of them 3rd-class passengers.
  • If we use a naive approach and consider all the parameters separately, then young female first-class passengers with the title Mrs, who had a moderate number of relatives on board, paid a large amount for their ticket, and boarded in Cherbourg, had a better chance of survival. Of course, the variables are related to each other, and survival was influenced not by the title, ticket, or age alone but by a combination of factors that are, to some extent, interrelated.
  • Clearly, no algorithm can predict survival with 100 percent accuracy based on factors such as a passenger's location on the ship or age, since the human factor and the unpredictability of the emergency were involved in the rescue process.


You can find the second part of my work with the Titanic dataset on Kaggle, which contains the following:


  • Missing data imputation
  • Feature generation
  • Models implementation and tuning: Logistic Regression, Random Forest, XGBoost
  • Comparing models and submission