The journey on the unsinkable — what AI can learn from the disaster
Have you ever thought of being one of the passengers on the Titanic in 1912? Would you survive? Would your children survive? What would happen if you were on the upper deck? We are often told that women and children were given priority for the lifeboats, but I was never satisfied with this short, qualitative answer. With the advent of machine learning, we can dig deeper. We can spot patterns and infer what would have happened to you and your family had you all been onboard.
Curiosity is the wick in the candle of learning.
— William Arthur Ward
Mmm… borderline for me…
For data scientists, the Titanic Kaggle dataset is arguably one of the most widely used datasets in the field of machine learning, along with the MNIST handwritten digits, the Iris flowers, and so on. For some non-data scientists, machine learning is a black box that does magic; but some sceptics who ‘learn from the past’ (and have inadvertently human-trained a classification model in their heads) foresee that
AI is the new dot-com.
One of my professors argued that machine learning-driven time-series forecasting models would not perform with flying colours, as they are just looking in the rearview mirror.
However, the omnipresence of machine learning models proves their value to our society, and it is undeniable that they have already brought us a lot of convenience — Google auto-completes email sentences, which really saves me a lot of time. I hope I can share my experience, help spread the knowledge, and demonstrate how new technologies can achieve things that we could not in the last century.
AI is the new electricity.
— Andrew Ng
PS: machine learning is a subset of artificial intelligence, in case you are confused.
What is Kaggle?
Kaggle is a platform for data scientists to share data, exchange thoughts, and compete in predictive analytics problems. There are many high-quality datasets that are freely accessible on Kaggle. Some Kaggle competitions even offer prize money, and they attract a lot of famous machine learning practitioners. People often think that Kaggle is not for beginners, or that it has a very steep learning curve.
They are not wrong. But Kaggle does offer challenges for people who are ‘getting started’. As a (junior) data scientist, I could not resist searching for interesting datasets to start my journey on Kaggle. And I bumped into the Titanic.
Kaggle is an Airbnb for data scientists — this is where they spend their nights and weekends.
— Zeeshan-ul-hassan Usmani
Here is an overview of the technical bit. I found the dataset on Kaggle and decided to analyze it with Python. I used it to develop and train an ensemble of classifiers with scikit-learn that predicts one’s chance of survival. I then saved the model with pickle and deployed it as a web app on localhost using Flask. Finally, I leveraged the AWS free tier (available for 12 months) to host it in the cloud.
There are many tutorials online that focus on how to code and develop a machine learning model in Python and other languages. Therefore, I will explain my work in a relatively qualitative manner with graphs and results, instead of bombarding you with code. If you really want to go through my code, it is available on GitHub.
I decided to use Python since it is the most popular programming language for machine learning with numerous libraries. And I don’t know R. And no one uses MATLAB for machine learning. Instead of my local machine, I went for Google Colab so that I could work cross-platform without any hassle.
Sit tight, here we go!
Data Inspection
Let’s import the data into a DataFrame. It consists of passenger ID, survival, ticket class, name, sex, age, number of siblings and spouses onboard, number of parents and children onboard, ticket number, passenger fare, cabin number, and port of embarkation.
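A minimal sketch of the import, assuming the standard train.csv from the Kaggle Titanic competition page:
import pandas as pd
df = pd.read_csv('train.csv')   # Kaggle Titanic training set
df.info()                       # column types and non-null counts per variable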
What immediately came to my mind were the following points:
- PassengerId is the key (unique),
- Survived is the target that we would like to infer,
- Name might not help, but the titles in them might,
- Ticket is a mess, and
- there are missing data labelled as NaN.
I decided to drop the variable Ticket for now, for simplicity. It could be holding useful information, but extracting it would require extensive feature engineering. We should start with the easiest and take it from there.
On the other hand, let’s take a closer look at the missing data. There are a few missing entries in the variables Embarked and Fare, which should be inferable from other variables. Around 20% of passenger ages were not recorded. This might pose a problem, since Age is likely to be one of the key predictors in the dataset: ‘women and children first’ was a code of conduct back then, and reports suggest that they were indeed saved first. More than 77% of the entries in Cabin are missing, which is unlikely to be very helpful, so let’s drop it for now.
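A sketch of the corresponding clean-up, continuing with the df from above:
df = df.drop(columns=['Ticket', 'Cabin'])   # dropped for now, as discussed
print(df.isnull().mean())                   # fraction of missing entries per column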
A pair plot (not shown) is usually my go-to at the beginning of a data visualization task, as it tends to be helpful and has a high information-to-lines-of-code ratio. A single line of seaborn.pairplot() gives you n² plots (technically n(n+1)/2 distinct plots), where n is the number of variables. It gives you a basic understanding of the relationship between every pair of variables, and of the distribution of each variable itself. Let’s dive into the different variables.
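The one-liner in question, again assuming df (pairplot chokes on NaNs, hence the dropna here):
import seaborn as sns
sns.pairplot(df.dropna(), hue='Survived')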
We first inspect the relationship of the target variable Survived with each predictor, one by one. With a simple count plot, seaborn.countplot(), we find that most people belonged to the third class, which isn’t surprising; and in general they had a lower probability of survival. Even with this single predictor, given everything else unknown, we could infer that a first-class passenger would be more likely to survive, while a third-class passenger would not. Meanwhile, women and children were more likely to survive, which aligns with the aforementioned ‘women and children first’. First-class young female passengers would now be the ones with the highest chance of survival, if we only examine the variables Pclass, Sex, and Age.
Nevertheless, the density plot seaborn.kdeplot() of passenger fare might be harder to interpret. Both the ‘survived’ and ‘not survived’ classes span a wide range of fares, with the ‘not survived’ class having a smaller mean and variance. Note the funny tail in the ‘survived’ class, which corresponds to three people who got their first-class tickets at $512 each (no idea what currency the dataset refers to). They all got onboard at the port of Cherbourg, and all of them survived.
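Sketches of both plots, assuming df and seaborn >= 0.11 for the keyword-based API:
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(data=df, x='Pclass', hue='Survived')                 # survival counts per class
plt.show()
sns.kdeplot(data=df, x='Fare', hue='Survived', common_norm=False)  # fare density per class
plt.show()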
On the other hand, the port of embarkation seems to also play a role in determining who would survive. Most people embarked at the port of Southampton — the first stop of the journey, and they had the lowest survival rate. Maybe they were assigned to cabins further away from exits, or spending more time on a cruise would make people relaxed or tired. No one knows. Or maybe it’s just indirectly caused by some third variable — say maybe there were fewer women/children/first-class passengers that got onboard at the first port. This plot does not provide such information. Further investigation is required, and is left as an exercise for the reader. (Nani?)
If you are a fan of tables instead of plots, we can also summarize the data with pandas.DataFrame.groupby() and take the mean for each class. However, I don’t think there is a clear pattern shown in the resulting table.
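For example, grouping by ticket class and averaging the target gives the per-class survival rate:
df.groupby('Pclass')['Survived'].mean()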
The correlation matrix generated by seaborn.heatmap() illustrates the strength of correlation between any two variables. As you can see, Sex has the highest magnitude of correlation with Survived, whereas, guess what, Fare and Pclass are highly correlated. SibSp and Parch do not seem to play a big role in predicting one’s survival chance, although instinct suggests otherwise. Family members onboard might be your helping hands in escaping from the sinking ship, or they could also be a burden (not from my personal experience, in case you are reading). More on that later in Feature Engineering.
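A sketch of the heatmap call; note that Sex must already be mapped to 0/1 for it to show up in the matrix, and numeric_only needs pandas >= 1.5:
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')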
Missing Data Imputation
We found earlier in Data Inspection that there were missing data entries. For instance, we seem to have no clue how much a 60-year-old Thomas Storey paid for his ticket. Instinct tells us that ticket fare hugely depends on ticket class and port of embarkation (sex might also have been a factor in the early 20th century); we can cross-check this against the correlation matrix above. Therefore, we will just take the mean (or median, if you want) of the third-class fares at Southampton. This is just an educated guess and is probably wrong, but it is good enough. Bear in mind that it is impossible to have noiseless data, and machine learning models are (to different extents) robust against noise. Digging through historical archives is not worth it.
There were also two women for whom we had no idea where they boarded the ship. This should be strongly correlated with ticket class and fare. As they both paid 80 dollars for a first-class seat, I would bet my money on Cherbourg (C in the plots).
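A sketch of these two imputations, continuing with df:
median_fare = df[(df['Pclass'] == 3) & (df['Embarked'] == 'S')]['Fare'].median()
df['Fare'] = df['Fare'].fillna(median_fare)   # third-class Southampton median
df['Embarked'] = df['Embarked'].fillna('C')   # the educated guess above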
If there are only a few missing entries in a particular variable, we can use the tricks above to make educated guesses, essentially taking the maximum-likelihood value. Nonetheless, it would be really dangerous to do the same thing when more data are missing, as in Age, where about 20% of the entries are absent. We can no longer make educated guesses by inspection. Since we dropped the variable Cabin and all other missing entries are filled in, we can leverage all the other variables to infer the missing Age values with a random forest regressor, using the 80% of passengers with recorded ages as ‘training’ data to infer the remaining 20%.
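A sketch of that imputation, assuming all remaining predictors are already numeric:
from sklearn.ensemble import RandomForestRegressor
predictors = [c for c in df.columns if c not in ('Age', 'Survived', 'PassengerId')]
known = df[df['Age'].notnull()]
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(known[predictors], known['Age'])
df.loc[df['Age'].isnull(), 'Age'] = rf.predict(df.loc[df['Age'].isnull(), predictors])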
Feature Engineering
As suggested in Data Inspection, passengers’ names would probably not be helpful in our case, since they are all distinct and, you know what, being called Eden wouldn’t make me less likely to survive. But we could extract the titles from their names.
While most of them had the titles ‘Mr’, ‘Mrs’, or ‘Miss’, there were quite a number of less frequent titles — ‘Dr’, ‘The Reverend’, ‘Colonel’, etc. Some of them appeared only once, such as ‘Lady’, ‘Doña’, and ‘Captain’. Their rare appearance would not help much in model training: in order to find patterns with data science, you need data, and a single data point carries no pattern whatsoever. Let’s just lump all of these relatively rare titles together as ‘Rare’.
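A sketch of the extraction; the frequency threshold of 10 is an arbitrary choice of mine, and accented titles like ‘Doña’ may need extra handling:
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
counts = df['Title'].value_counts()
df['Title'] = df['Title'].replace(counts[counts < 10].index.tolist(), 'Rare')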
Categorical data require extra care before model training. Classifiers simply cannot process string inputs like ‘Mr’, ‘Southampton’, etc. While we could map them to integers, say (‘Mr’, ‘Miss’, ‘Mrs’, ‘Rare’) → (1, 2, 3, 4), there should be no concept of ordering amongst titles. Being a Dr does not make you superior. In order not to mislead machines and accidentally construct a sexist AI, we should one-hot encode them. They become:
( (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1) )
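With pandas this is a one-liner; a sketch assuming Title and Embarked are the remaining categorical columns:
df = pd.get_dummies(df, columns=['Title', 'Embarked'])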
On the other hand, I decided to add two more variables — FamilySize and IsAlone. FamilySize = SibSp + Parch + 1 makes more sense, since the whole family would have stayed together on the cruise. You wouldn’t have a moment with your partner but abandon your parents, would you? Besides, being alone might be one of the crucial factors, hence IsAlone: you could be more likely to make reckless decisions, or you could be more flexible without a family to take care of. Experiments (adding one variable at a time) suggested that their addition did improve the overall predictability.
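A sketch of both engineered features (IsAlone is my name for the second one):
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1          # count the passenger too
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)       # 1 if travelling alone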
Code is available on GitHub.
Before picking and tuning any classifiers, it is VERY important to standardize the data. Variables measured at various scales would screw things up: Fare covers a range between 0 and 512, whereas Sex is binary (in our dataset) — 0 or 1. You wouldn’t want to weigh Fare more than Sex simply because of its larger scale.
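A sketch with scikit-learn’s StandardScaler, which rescales each feature to zero mean and unit variance:
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(df.drop(columns=['Survived', 'PassengerId']))
y = df['Survived']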
I had some thoughts (and guesses) on which classifiers would perform well enough, mostly from experience. I personally prefer Random Forest, as it usually guarantees good-enough results. But let’s just try out all the classifiers we know — SVM, KNN, AdaBoost, you name it — all tuned by grid search. XGBoost eventually stood out with an 87% test accuracy, but that does not mean it would perform the best on unseen data.
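The grid search itself is mechanical; a sketch with a random forest and an illustrative (not the actual) parameter grid:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 300], 'max_depth': [4, 6, 8]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)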
To increase the robustness of our classifier, an ensemble of classifiers with different natures was trained and final results were obtained by majority voting. It is vital to embed models with different strengths into the ensemble, otherwise there is no point building an ensemble model at the expense of computation time.
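A minimal sketch of majority voting with scikit-learn’s VotingClassifier; the member models here are placeholders, not the exact tuned ensemble:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
ensemble = VotingClassifier(estimators=[
    ('rf', RandomForestClassifier()),
    ('svm', SVC()),
    ('knn', KNeighborsClassifier()),
], voting='hard')   # 'hard' means majority voting
ensemble.fit(X, y)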
Finally, I submitted it to Kaggle and achieved around 80% accuracy. Not bad. There is always room for improvement. For instance, there is surely some useful information hidden in Ticket, but we dropped it for simplicity. We could also create more features, e.g. a binary class Underage that is 1 if Age < 18 and 0 otherwise.
But we will move on for now.
Saving and Loading Model
I am not satisfied with just a trained machine learning model. I want it to be accessible by everyone (sorry for people who don’t have internet access). Therefore, we have to save the model and deploy it elsewhere, and this can be done with the pickle library. The parameters 'wb' and 'rb' in the open() function represent write access and read-only access in binary mode, respectively.
pickle.dump(<model>, open(<file_path>, 'wb'))
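Loading it back on the server side mirrors the dump (same placeholder path as above):
import pickle
model = pickle.load(open(<file_path>, 'rb'))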
Web App Deployment
Flask is an easy-to-use web framework in Python. I had only a little prior experience in building websites (HTML in primary school, GitHub Pages last year, Wix doesn’t count), and I found it straightforward and simple. The simplest thing you can do is:
from flask import Flask
app = Flask(__name__)

@app.route('/')
def home():
    return "<h1>Write something here.</h1>"
And voilà! You can browse it on your localhost.
What else do we need? We want people to fill in a form, collect the data required, and pass it to the machine learning model (and not sell it, Mark). The model then produces a prediction, and we redirect users to a page showing it; a sketch of this wiring follows the form definition below.
We will use WTForms to build a form in Python, and a single form is defined by a class, which looks like the following:
from wtforms import Form, TextField, validators, SubmitField, DecimalField, IntegerField, SelectField

class PassengerForm(Form):   # hypothetical class name; one field per input
    sex = SelectField('Sex:', choices=[('1', 'Male'), ('0', 'Female')])
    fare = DecimalField('Passenger Fare:', default=33, places=1,
                        validators=[validators.NumberRange(min=0, max=512,
                                    message='Fare must be between 0 and 512')])
    submit = SubmitField('Predict')
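A hypothetical sketch of the wiring mentioned above; the template names, feature order, and helper names are assumptions, not the exact app:
from flask import Flask, render_template, request
app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def index():
    form = PassengerForm(request.form)
    if request.method == 'POST' and form.validate():
        # feature order must match the order used during training
        features = [[float(form.sex.data), float(form.fare.data)]]
        survived = model.predict(features)[0]
        return render_template('result.html', survived=survived)
    return render_template('index.html', form=form)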
I found an HTML template from Will Koehrsen and built on top of it.
You look at the people who have been there before you, the people who have dealt with this before, the people who have built the spaghetti code, and you thank them very much for making it open source, and you give them credit, and you take what they have made and you put it in your program, and you never ever look at it again.
— Tom Scott
No way am I delving into CSS spaghetti.
Now the webpage can be viewed on my localhost, and everything works fine. The last step is hosting it online. Unfortunately, I had only used one hosting service before, GitHub Pages, which only hosts static sites. There are three major cloud hosting services right now — AWS, GCP, and Azure. AWS is by far the most popular, so I went for its 12-month free tier. It has lots of tutorials and documentation (and customer service), and it is quite easy to follow.
I connected to the Linux server instance with my private key, migrated my repository to the server, ran my script, and it worked! The only problem was that the system timed out when it was idle for too long. After some googling, it turned out this can be fixed by starting a screen session and running the scripts inside it.
[ec2-user@ip-<SOME_IP> ~]$ screen
When you want to check the status next time, use screen -r to resume your screen session. You can detach from the session with Ctrl+A followed by D.
And now, it is running 24/7 non-stop! Check it out at http://18.104.22.168:8888.
AI is not false hope. It is backed by sophisticated statistical theories. It is hard work. It is a key to revolutionizing industries.
To new learners:
Data science is not just about training and tuning neural networks. It requires curiosity to drive us, knowledge to kick-start, energy to clean data, experience (i.e. trial and error) to engineer features, patience to train models, maturity to handle failure, and wisdom to explain.
To technology enthusiasts:
AI is no magic.
There is nothing new under the sun.
To future me:
Doing data science is one thing, writing it is a different story. — current me
If you want to start a machine learning project now, check out the article below for starters:
Start training your neural networks with free GPUs today (towardsdatascience.com)
If you want to check out what else we can do using Python, these articles might help:
Exploring data visualization tools in Python (towardsdatascience.com)
Exploring climate-related data manipulation tools in Python (towardsdatascience.com)
Originally published at edenau.github.io.