The journey on the unsinkable — what AI can learn from the disaster

Have you ever thought of being one of the passengers on the Titanic in 1912? Would you survive? Would your children survive? What would happen if you were on the upper deck? We are often told that women and children were given priority to the lifeboats, but I was not satisfied with this short, qualitative answer. With the advent of machine learning, we are able to dig deeper. We can spot patterns and infer what would have happened to you and your family had you all been onboard.

Curiosity is the wick in the candle of learning. — William Arthur Ward

TL;DR: I developed a machine learning model that does exactly that, and deployed it as a Web App. Check out http://18.212.193.40:8888/ and complete the quiz to find out your chance of survival.

Mmm… borderline for me…

Table of Contents
Introduction
Data
Machine Learning
Web App Deployment
The Takeaway

Originally published at edenau.github.io.

Introduction

Motivation

For data scientists, the Titanic Kaggle dataset is arguably one of the most widely used datasets in the field of machine learning, along with the MNIST hand-written digits, the Iris flowers, and others. For some non-data scientists, machine learning is a black box that does magic; and some sceptics who 'learn from the past' (and have inadvertently human-trained a classification model in their heads) foresee that AI is the new dot-com.

Photo by Marc Sendra Martorell on Unsplash

One of my professors argued that a machine learning-driven time-series forecasting model would not perform with flying colours, as it is just looking in the rearview mirror.

However, the omnipresence of machine learning models proves their value to our society, and the fact that they have already brought a lot of convenience to us is undeniable: Google auto-completes email sentences, which saves me a lot of time. I hope I can share my experience, help spread the knowledge, and demonstrate how new technologies can achieve things that we could not in the last century.

AI is the new electricity. — Andrew Ng

PS: machine learning is a subset of artificial intelligence, in case you are confused.

What is Kaggle?

Kaggle is a platform for data scientists to share data, exchange thoughts, and compete in predictive analytics problems. There are many high-quality datasets that are freely accessible on Kaggle. Some Kaggle competitions even offer prize money, which attracts a lot of famous machine learning practitioners.

People often think that Kaggle is not for beginners, or that it has a very steep learning curve. They are not wrong. But they do offer challenges for people who are 'getting started'. As a (junior) data scientist, I could not resist searching for interesting datasets to start my journey on Kaggle. And I bumped into the Titanic.

Kaggle is an Airbnb for data scientists — this is where they spend their nights and weekends. — Zeeshan-ul-hassan Usmani

Overview

Here goes the overview of the technical bit. I found the Titanic dataset on Kaggle and decided to analyze it with Python. I used it to develop and train an ensemble of classifiers in scikit-learn that predicts one's chance of survival. I then saved the model with pickle and deployed it as a Web App on localhost using Flask. Finally, I leveraged the AWS free tier (available for 12 months) to cloud-host it.

There are many tutorials online that focus on how to code and develop a machine learning model in Python and other languages.
Therefore, I will only explain my work in a relatively qualitative manner with graphs and results, instead of bombarding you with code. If you really want to go through my code, it is available on GitHub.

Tools

I decided to use Python since it is the most popular programming language for machine learning and has numerous libraries. And I don't know R. And no one uses MATLAB for machine learning. Instead of my local machine, I went for Google Colab so that I could work cross-platform without any hassle.

Sit tight, here we go!

Data

Data Inspection

Let's import the data into a DataFrame. It consists of passenger ID, survival, ticket class, name, sex, age, number of siblings and spouses onboard, number of parents and children onboard, ticket number, passenger fare, cabin number, and port of embarkation.

First 5 rows of data

What immediately came to mind were the following points:
- PassengerId is key (unique),
- Survived is the target that we would like to infer,
- Name might not help, but the titles in the names might,
- Ticket is a mess, and
- there are missing data labelled as NaN.

I decided to drop the Ticket variable for now, for simplicity. It could possibly hold useful information, but extracting it would require extensive feature engineering. We should start with the easiest, and take it from there.

The ratio of missing data

On the other hand, let's take a closer look at the missing data. There are a few missing entries in the Embarked and Fare variables, which should be inferable from the other variables. Around 20% of passenger ages were not recorded. This might pose a problem, since Age is likely to be one of the key predictors in the dataset: 'women and children first' was a code of conduct back then, and reports suggested that they were indeed saved first. More than 77% of the Cabin entries are missing, so the variable is unlikely to be very helpful; let's drop it for now.

Data Visualization

A pair plot (not shown) is usually my go-to at the beginning of a data visualization task, as it is usually helpful and has a high information-to-lines-of-code ratio. A single line of seaborn.pairplot() gives you n² plots (technically n(n+1)/2 distinct plots), where n is the number of variables. It gives you a basic understanding of the relationship between every pair of variables, and of the distribution of each variable itself. Let's dive into the different variables.

We first inspect the relationship of the target variable Survived with each predictor, one by one. From a simple count plot, seaborn.countplot(), we found that most people belonged to the third class, which wasn't surprising, and that in general they had a lower probability of survival. Even with this single predictor, given everything else unknown, we could infer that a first-class passenger would be more likely to survive, while this would be unlikely for a third-class passenger.

Meanwhile, women and children were more likely to survive, which aligned with the aforementioned code of 'women and children first'. First-class young female passengers would now be the ones with the highest chance of survival, if we only examine the variables Pclass, Sex, and Age.

Nevertheless, the density plot of passenger fare, seaborn.kdeplot(), might be harder to interpret. For both the 'survived' and 'not survived' classes, the fares span a wide range, with the 'not survived' class having a smaller mean and variance. Note that there is a funny tail in the 'survived' class, which corresponds to three people who got their first-class tickets at $512 each (no idea what currency the dataset was referring to). They all got onboard at the port of Cherbourg, and all of them survived.
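For concreteness, here is a minimal sketch of the loading, inspection, and plotting steps described above. It assumes the Kaggle train.csv file sits in the working directory; the file name and plotting choices are my assumptions, not necessarily the exact code behind the figures.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Import the data into a DataFrame and peek at the first 5 rows
df = pd.read_csv('train.csv')
print(df.head())

# Ratio of missing data in each variable
print(df.isnull().mean().sort_values(ascending=False))

# Pair plot: one line gives n(n+1)/2 distinct plots of the numeric variables
sns.pairplot(df.select_dtypes('number').dropna(), hue='Survived')
plt.show()

# Count plot of ticket class, split by survival
sns.countplot(x='Pclass', hue='Survived', data=df)
plt.show()

# Density plots of passenger fare for the two classes
sns.kdeplot(df.loc[df['Survived'] == 1, 'Fare'], label='Survived')
sns.kdeplot(df.loc[df['Survived'] == 0, 'Fare'], label='Not survived')
plt.legend()
plt.show()
```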
On the other hand, the port of embarkation also seems to play a role in determining who would survive. Most people embarked at the port of Southampton, the first stop of the journey, and they had the lowest survival rate. Maybe they were assigned to cabins further away from the exits, or maybe spending more time on a cruise makes people relaxed or tired. No one knows. Or maybe it is indirectly caused by some third variable: say, fewer women, children, or first-class passengers got onboard at the first port. This plot does not provide such information. Further investigation is required, and is left as an exercise for the reader. (Nani?)

If you are a fan of tables instead of plots, we can also summarize the data with pandas.DataFrame.groupby() and take the mean for each class. However, I don't think there is a clear pattern in the table of Parch below.

The correlation matrix generated by seaborn.heatmap() illustrates the strength of correlation between any two variables. As you can see, Sex has the highest magnitude of correlation with Survived, whereas, guess what, Fare and Pclass are highly correlated. SibSp and Parch do not seem to play a big role in predicting one's chance of survival, although instinct suggests otherwise. Family members onboard might be your helping hands in escaping the sinking ship, or they could be a burden (not from my personal experience, in case you are reading this). More on that later in Feature Engineering.

Missing Data Imputation

We found earlier, in Data Inspection, that there were missing data entries. For instance, we seem to have no clue how much a 60-year-old Thomas Storey paid for his ticket. Instinct tells us that ticket fare hugely depends on ticket class and port of embarkation (sex might also have been a factor in the early 20th century), and we can cross-check this against the correlation matrix above. Therefore, we will just take the mean (or median, if you prefer) of third-class fares at Southampton. This is just an educated guess and is probably wrong, but it is good enough. Bear in mind that it is impossible to have noiseless data, and machine learning models are (to different extents) robust against noise. Digging into historical archives is not worth it.

There were also two women for whom we had no idea where they boarded the ship. The port of embarkation should be strongly correlated with ticket class and fare, and as they both paid 80 dollars for a first-class seat, I would bet my money on Cherbourg (C in the plots).

If there are only a few missing entries in a particular variable, we can use the tricks above to make educated guesses, essentially taking the maximum likelihood value. Nonetheless, it would be really dangerous to do the same thing with much more missing data, as in Age, where about 20% of the entries are missing. We can no longer make educated guesses by inspection. Since we dropped the Cabin variable, and all other missing entries are filled in, we can leverage all the other variables to infer the missing Age values with a random forest regressor. We have 80% of the data as 'training' data to infer the remaining 20%.
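As a rough sketch of the imputation just described: the Fare and Embarked fills follow the educated guesses in the text, while the predictor list and the regressor's settings are illustrative assumptions on my part.

```python
from sklearn.ensemble import RandomForestRegressor

# Educated guesses for the few missing Fare/Embarked entries
df['Fare'] = df['Fare'].fillna(
    df.loc[(df['Pclass'] == 3) & (df['Embarked'] == 'S'), 'Fare'].median())
df['Embarked'] = df['Embarked'].fillna('C')  # the two first-class women

# Encode Sex numerically so the regressor can use it
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Rows with known ages (~80%) act as training data for the unknown ~20%
predictors = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare']
known = df[df['Age'].notnull()]
unknown = df[df['Age'].isnull()]

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(known[predictors], known['Age'])
df.loc[df['Age'].isnull(), 'Age'] = rf.predict(unknown[predictors])
```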
Machine Learning

Feature Engineering

As suggested in Data Inspection, passengers' names would probably not be helpful in our case, since they are all distinct and, you know what, being called Eden wouldn't make me less likely to survive. But we could extract the titles from their names.

While most passengers had the titles 'Mr', 'Mrs', or 'Miss', there were quite a number of less frequent titles ('Dr', 'The Reverend', 'Colonel', etc.), and some appeared only once, such as 'Lady', 'Doña', and 'Captain'. Their rare appearance would not help much in model training. In order to find patterns with data science, you need data; a single data point has no patterns whatsoever. Let's just categorize all those relatively rare titles as 'Rare'.

Categorical data require extra care before model training. Classifiers simply cannot process string inputs like 'Mr', 'Southampton', etc. While we could map them to integers, say ('Mr', 'Miss', 'Mrs', 'Rare') → (1, 2, 3, 4), there should be no concept of ordering amongst titles. Being a Dr does not make you superior. In order not to mislead machines and accidentally construct a sexist AI, we should one-hot encode them. They become ((1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)).

On the other hand, I decided to add two more variables: FamilySize and IsAlone, where FamilySize = SibSp + Parch + 1. Adding FamilySize makes sense since the whole family would have stayed together on the cruise. You wouldn't spend a moment with your partner but abandon your parents, would you? Besides, being alone might be one of the crucial factors: you could be more likely to make reckless decisions, or you could be more flexible without a family to take care of. Experiments (adding variables one at a time) suggested that their addition did improve the overall predictability.

Model Evaluation

Code is available on GitHub.

Before picking and tuning any classifiers, it is VERY important to standardize the data. Variables measured at different scales would screw things up: Fare covers a range between 0 and 512, whereas Sex is binary (in our dataset), 0 or 1. You wouldn't want to weigh Fare more than Sex.

I have some thoughts (and guesses) on which classifiers would perform well, often from experience. I personally prefer Random Forest, as it usually guarantees good enough results. But let's just try out all the classifiers we know (SVM, KNN, AdaBoost, you name it), all tuned by grid search. XGBoost eventually stood out with an 87% test accuracy, but that does not mean it would perform the best on unseen data.

To increase the robustness of our classifier, an ensemble of classifiers with different natures was trained, and the final results were obtained by majority voting. It is vital to embed models with different strengths in the ensemble; otherwise, there is no point in building an ensemble model at the expense of computation time. A sketch of this pipeline follows below.
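Here is that sketch, continuing from the df of the earlier snippets. The rare-title cutoff, the choice of base models, and the parameter grid are illustrative assumptions; they are not the exact tuned settings behind the 87% figure.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Extract titles from names and group the rare ones together
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
rare = df['Title'].value_counts()[lambda s: s < 10].index
df['Title'] = df['Title'].replace(list(rare), 'Rare')

# Two engineered variables: FamilySize and IsAlone
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

# One-hot encode the categorical variables; drop the unused columns
X = pd.get_dummies(
    df.drop(columns=['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'],
            errors='ignore'),
    columns=['Title', 'Embarked'])
y = df['Survived']

# Standardize so that Fare (0-512) does not outweigh the binary variables
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Tune base models by grid search (shown here for the SVM only)
svm = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)

# Majority voting over classifiers with different natures
ensemble = VotingClassifier(
    estimators=[('svm', svm),
                ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
                ('knn', KNeighborsClassifier())],
    voting='hard')
ensemble.fit(X_train, y_train)
print('Test accuracy:', ensemble.score(X_test, y_test))
```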
Finally, I submitted it to Kaggle and achieved around 80% accuracy. Not bad.

There is always room for improvement. For instance, there is surely some useful information hidden in Cabin and Ticket, but we dropped them for simplicity. We could also create more features, e.g. a binary class Underage that is 1 if Age < 18 and 0 otherwise. But we will move on for now.

Saving and Loading Model

I am not satisfied with just a trained machine learning model. I want it to be accessible by everyone (sorry, people who don't have internet access). Therefore, we have to save the model and deploy it elsewhere, and this can be done with the pickle library. The parameters 'wb' and 'rb' in the open() function represent write access and read-only access in binary mode, respectively.

```
pickle.dump(<model>, open(<file_path>, 'wb'))
pickle.load(open(<file_path>, 'rb'))
```

Photo by Jonathan Pielmayer on Unsplash

Web App Deployment

Flask is an easy-to-use web framework in Python. I had only a little prior experience in building websites (HTML in primary school, GitHub pages last year; Wix doesn't count), and I found it straightforward and simple. The simplest thing you can do is:

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "<h1>Write something here.</h1>"

app.run(host='0.0.0.0', port=60000)
```

And voilà! You can browse it on your localhost.

What else do we need? We want people to fill in a form that collects the required data and passes it to the machine learning model (and we won't sell the data, Mark). The model produces an output, and we redirect users to the page that shows it. We will use WTForms to build the form in Python. A single form is defined by a class, which looks like the following:

```python
from wtforms import Form, TextField, validators, SubmitField, DecimalField, IntegerField, SelectField

class ReusableForm(Form):
    sex = SelectField('Sex:',
                      choices=[('1', 'Male'), ('0', 'Female')],
                      validators=[validators.InputRequired()])
    fare = DecimalField('Passenger Fare:',
                        default=33, places=1,
                        validators=[validators.InputRequired(),
                                    validators.NumberRange(min=0, max=512,
                                        message='Fare must be between 0 and 512')])
    submit = SubmitField('Predict')
```

I found an HTML template from Will Koehrsen and built on top of it.

You look at the people who have been there before you, the people who have dealt with this before, the people who have built the spaghetti code, and you thank them very much for making it open source, and you give them credit, and you take what they have made and you put it in your program, and you never ever look at it again. — Tom Scott

No way am I delving into CSS spaghetti.
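To show how these pieces might connect, here is a hypothetical sketch of a prediction route that reuses the ReusableForm class above: it validates the submitted form, feeds the answers to the unpickled model, and renders the result. The file name, template names, and two-feature input are placeholders for illustration, not the app's actual code.

```python
import pickle
from flask import Flask, request, render_template

app = Flask(__name__)
model = pickle.load(open('model.pkl', 'rb'))  # hypothetical file path

@app.route('/', methods=['GET', 'POST'])
def predict():
    form = ReusableForm(request.form)  # the WTForms class defined above
    if request.method == 'POST' and form.validate():
        # Features must be ordered as the model was trained (assumed here)
        features = [[int(form.sex.data), float(form.fare.data)]]
        survived = model.predict(features)[0]
        return render_template('result.html', survived=survived)  # hypothetical template
    return render_template('form.html', form=form)                # hypothetical template

app.run(host='0.0.0.0', port=60000)
```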
Cloud Hosting

Now that the webpage can be viewed via my localhost and everything works fine, the last step is hosting it online. Unfortunately, I had only used one site hosting service before, GitHub Pages, which only hosts static sites. There are 3 major cloud hosting services now: AWS, GCP, and Azure. AWS is by far the most popular one, so I went for its 12-month free tier. They have lots of tutorials and documentation (and customer service), and it is quite easy to follow.

Photo by Nathan Anderson on Unsplash

I connected to the Linux server instance with my private key, migrated my repository to the server, ran my script, and it worked! The only problem was a system timeout when it was idle for too long. After some googling, it turns out this can be fixed by starting a screen session and running the scripts inside it.

```
[ec2-user@ip-<SOME_IP> ~]$ screen
```

When you want to check the status next time, use screen -r to resume your screen session. You can detach from the session with Ctrl+A followed by D.

And now, it is running 24/7 non-stop! Check it out at http://18.212.193.40:8888.

The Takeaway

Endnotes

To sceptics: AI is not false hope. It is backed by sophisticated statistical theory. It is hard work. It is a key to revolutionizing industries.

To new learners: Data science is not just about training and tuning neural networks. It requires curiosity to drive us, knowledge to kick-start, energy to clean data, experience (i.e. trial and error) to engineer features, patience to train models, maturity to handle failure, and wisdom to explain.

To technology enthusiasts: AI is no magic. There is nothing new under the sun.

Photo by Antonio Lainez on Unsplash

To future me: Doing data science is one thing, writing about it is a different story. — current me

Related Articles

If you want to start a machine learning project now, check out this article for starters:

Quick guide to run your Python scripts on Google Colaboratory
Start training your neural networks with free GPUs today (towardsdatascience.com)

If you want to check out what else we can do using Python, these articles might help:

Visualizing bike mobility in London using interactive maps and animations
Exploring data visualization tools in Python (towardsdatascience.com)

Handling NetCDF files using XArray for absolute beginners
Exploring climate-related data manipulation tools in Python (towardsdatascience.com)

Remarks

If you want to contribute in any way, feel free to drop me an email or find me on Twitter. As you can see, this project would definitely benefit from getting a domain name!

Originally published at edenau.github.io.