Let’s talk about the one project you need to build: a project that will give you full-stack data science experience and impress interviewers, if your goal is to jumpstart your career in data science.
I’ll break down the components of what a good data science project includes and exactly what an interviewer is looking for in a project and why they’re looking for it.
One piece of advice before I get into the components of a good project: there are 2 things to stay away from when you are trying to find or build a project.
1. Avoid any analysis with the Titanic or Iris dataset.
a. It’s been done to death.
b. I don’t care about your survival classifier.
2. Avoid Kaggle competitions.
a. I think Kaggle is a great place to start when you’re just beginning. But it’s too commonplace and ordinary.
So unless you can get ranked in the top 10, just move on from Kaggle.
So with that out of the way, let’s talk about the one data science project that’ll impress anyone. Again, I’ll break down the components of exactly what an interviewer is looking for in a data science project and why they’re looking for it.
But as a summary, what an interviewer is looking for is a data scientist with real-world skills -- both in analytics/coding and in using modern technologies.
This helps you get closer to becoming a full-stack (or fully independent) data scientist.
So here’s a quick breakdown of the components of a good project:
1. Working with real data
2. Working with modern technologies
a. APIs
b. Databases in the cloud
3. Building models
4. Making an impact / validation
a. Application frameworks
Component #1 is working with real data, specifically data that gets updated in real time. Working with data that real users produce, as they produce it, helps prove to the interviewer that you know how to work with relevant and timely data, rather than analyzing data that was produced in 1912, like the Titanic dataset.
So having said that, you’re probably asking, how do you get this data? This is a perfect segue into component #2.
Working with Modern Technologies
Use the modern technologies that the industry actually uses. So how are you going to get that real-life data that is updated in real time? You can use APIs to collect that data.
Almost all apps and platforms these days rely on APIs to collect and pass information.
Learning how to use APIs to get the data that you need for your analysis shows the interviewer that you have relevant skills to do the job.
Some popular examples are the APIs offered by Twitter, Netflix, and Amazon. A good API for data analysis will include:
This is something you’d be using on the job often. So as an interviewer, I’d start to see you as an experienced data scientist, not an absolute beginner.
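To make the API step concrete, here is a minimal sketch of collecting paginated JSON with nothing but Python’s standard library. The endpoint `https://api.example.com/posts`, its `?page=N` parameter, and the function names `fetch_json` and `collect_pages` are all assumptions made up for illustration; a real API will have its own URL scheme, authentication, and rate limits.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, opener=urlopen, timeout=10):
    """Request a URL and decode the JSON body of the response."""
    request = Request(url, headers={"Accept": "application/json"})
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_pages(base_url, max_pages=3, opener=urlopen):
    """Walk a hypothetical ?page=N endpoint until a page comes back empty."""
    rows = []
    for page in range(1, max_pages + 1):
        batch = fetch_json(f"{base_url}?page={page}", opener=opener)
        if not batch:  # an empty list means there is nothing left to collect
            break
        rows.extend(batch)
    return rows


# Hypothetical usage (the URL is a placeholder, not a real service):
# posts = collect_pages("https://api.example.com/posts", max_pages=5)
```

The `opener` argument is there so you can swap in a fake response while testing, which keeps your project from depending on the network every time you run it.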
Let’s talk about the 2nd modern technology to work with -- databases in the cloud. Once you collect the data from the API and maybe after you clean the data, you probably want to store it in a database. Why?
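As a rough sketch of the storage step, the example below writes cleaned records to a table and reads a summary back. It uses Python’s built-in `sqlite3` module as a local stand-in for a cloud database, which is my assumption for the sake of a runnable example; the `posts` table, its columns, and the helper names are hypothetical. With a hosted database you would swap the connection for that provider’s driver, and the SQL would look much the same.

```python
import sqlite3


def save_posts(conn, posts):
    """Create the table if needed, then insert or update each record by id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, author TEXT, likes INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO posts (id, author, likes) "
        "VALUES (:id, :author, :likes)",
        posts,
    )
    conn.commit()


def likes_by_author(conn):
    """Return (author, total likes) pairs, sorted by author."""
    cursor = conn.execute(
        "SELECT author, SUM(likes) FROM posts GROUP BY author ORDER BY author"
    )
    return cursor.fetchall()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
save_posts(conn, [
    {"id": 1, "author": "ana", "likes": 3},
    {"id": 2, "author": "bo", "likes": 5},
])
# Re-running with a refreshed record updates it instead of duplicating it.
save_posts(conn, [{"id": 1, "author": "ana", "likes": 4}])
summary = likes_by_author(conn)
```

Keying each record on `id` means you can re-run the collection job as the source data updates without piling up duplicate rows.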
This gets us to the part of a data science project you probably thought was most important -- building models. It’s definitely really important to learn how to implement a model -- whether regression or some type of machine learning model.
And that’s kind of why I said Kaggle is a great place to start: it can give you experience in building ML models. So if you just don’t have a lot of experience building models in general, you can start with Kaggle.
But while getting experience in building models is important, there’s another aspect that’s even more important: the decisions you make while building your model, and why you made them.
Here are some questions you’ll need to answer when implementing your model. You’ll need to be able to explain your answers eloquently in an interview. Otherwise, no matter how good your model is, no one will be able to trust it:
* Why did you pick your model? Why that model? What are you trying to accomplish with this model that you couldn’t do with others?
* How did you clean the data? Why did you clean it in that way?
* What type of validation tests did you perform on the data to prepare it for the model?
* Tell me about the assumptions of your model. How did you validate them?
* How did you optimize your model? What were the trade-off decisions you made?
* How did you implement your test/control?
* Tell me about how the underlying math in the model works.
What you don’t see in this line of questions is how your model performed. I don’t really care too much about that. I care about your thought process, how you made decisions, and whether you understand the underlying theory of the model.
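One way to have concrete answers ready for questions like these is to make each decision visible in your code. The sketch below is only an illustration under my own assumptions: it uses a made-up toy dataset, a simple least-squares line written by hand, a 75/25 hold-out split, and a "predict the training mean" baseline. None of these choices come from a specific project; the point is that each one is a decision you can name and defend.

```python
import random
from statistics import mean


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed, then hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def rmse(rows, predict):
    """Root mean squared error of a prediction function on (x, y) rows."""
    return mean((y - predict(x)) ** 2 for x, y in rows) ** 0.5


# Toy data that follows y = 2x + 1 exactly (made up for illustration).
rows = [(x, 2.0 * x + 1.0) for x in range(20)]
train, test = split(rows)
slope, intercept = fit_line(train)

# Decision to defend: is the model better than always guessing the mean?
baseline = mean(y for _, y in train)
model_error = rmse(test, lambda x: slope * x + intercept)
baseline_error = rmse(test, lambda x: baseline)
```

Writing the line fit by hand like this is also a decent rehearsal for the "underlying math" question, since the slope is just the covariance of x and y divided by the variance of x.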
Lastly, how do you know if you’ve built a great project? Your project should make an impact. You should have some validation from others.
I understand you’re building and coding to improve your skills. But the job of a data scientist is to help others by turning data into insights and recommendations that make an impact on the business. How do you know if your insights and recommendations are valuable if you’re building by yourself and showing nobody? You need to show your work to others and build something they find valuable.
There are 3 ways to do this.
The main point is to show that what you built is valuable and that people care about your work. Show the impact of your work. Interviewers and your teammates would be really impressed. Guaranteed.
So here are the components again:
1. Working with real data
2. Working with modern technologies
a. APIs
b. Databases in the cloud
3. Building models
4. Making an impact / validation
c. Application frameworks
You’re probably thinking that this is a lot of work and includes so many different skills that it’s going to take you years to master. And the answer is, yes, it’s supposed to take you years to master.
The great part of these components is that you can work on them independently of each other. For example, you can learn how to grab data from an API separately from learning how to work with databases. Master one component at a time and eventually you’ll master them all.
The secret to all this though? You don’t need to do multiple projects to master these skills. This is basically just one project. You’re building a data science infrastructure and learning the data science process.
Once you build the infrastructure, from connecting to an API to pushing data to a database to building a model to producing nice visuals, you can use the exact same framework for other analyses, and you’ll probably just need to slightly revise your code at each step.
You can use the same code to connect to a new API and grab a new dataset. Use similar code and techniques to clean your data, and so on and so forth. Once you have the infrastructure built end-to-end, you can start working with other datasets and build other types of models using the same framework.
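Here is one way that reusable framework could look in code. This is a hedged sketch, not a prescribed design: the `Pipeline` class, its four step names, and the toy functions plugged into it are all hypothetical. The idea is simply that each stage is a swappable function, so a new dataset means replacing a step instead of rewriting the whole project.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Pipeline:
    """Four swappable stages: collect, clean, store, analyze."""
    collect: Callable[[], Iterable[dict]]
    clean: Callable[[Iterable[dict]], List[dict]]
    store: Callable[[List[dict]], None]
    analyze: Callable[[List[dict]], dict]

    def run(self):
        raw = self.collect()
        rows = self.clean(raw)
        self.store(rows)
        return self.analyze(rows)


# Toy stand-ins for the real steps (an API call, a database write, a model).
def collect_toy():
    return [{"likes": "3"}, {"likes": None}, {"likes": "5"}]


def drop_missing(rows):
    # Drop records with no value and convert the rest to integers.
    return [{"likes": int(r["likes"])} for r in rows if r["likes"] is not None]


def average_likes(rows):
    return {"mean_likes": sum(r["likes"] for r in rows) / len(rows)}


saved = []  # stands in for a database table
pipeline = Pipeline(collect_toy, drop_missing, saved.extend, average_likes)
result = pipeline.run()
```

To analyze a different dataset, you would pass a different `collect` function and maybe adjust `clean`, while `run` stays exactly the same.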
So keep iterating and improving and providing something of value to others, not just yourself.
That’s it for me. Hope this gives you some ideas for your next data science project. This type of project is definitely something that would impress an interviewer, if your goal is to get your first data science job.