Data Science From Scratch by @sangeet-aggarwal

May 4th 2020

Data Science, often called the sexiest job of the 21st century, has become a dream job for many of us. But for some, it looks like a challenging maze and they don’t know where to start. If you are one of them, then continue reading.

In this post, I’ll discuss how you can start your journey of Data Science from scratch. I’ll explain the following steps in detail.

- Learn the basics of programming with Python
- Learn basic Statistics and Mathematics
- Learn Python for Data Analysis
- Learn Machine Learning
- Practice with projects

If you are from an IT background, you are probably familiar with programming in Python, in which case you can skip this step. But if you have not yet been exposed to the fun of coding, you should start learning Python. It’s one of the easiest programming languages to learn and is widely used for development as well as data analytics.

To begin with, you can search for free online tutorials that will help you understand the basics of Python. I’m listing a few links where you can learn Python on your own in a short period of time. You can try these out and choose for yourself.

- learnpython.org
- Google’s Python Class
- Estudy free Python course (Video Tutorials)
- Codecademy (With online editor to code)

The list is not exhaustive and you can find many more resources on the web that can help you start learning the basics of Python. You can also find many YouTube channels that have Python tutorials for beginners.

Once you are familiar with the syntax and other basics of programming, you can continue learning the intermediate and advanced levels of Python. To be good at data science, I recommend that you complete at least the intermediate level, so that you are familiar with data structures and file handling in Python.
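To give you a feel for what the intermediate level covers, here is a minimal sketch that uses only Python's standard library. The student record and file name are made up for illustration; the point is simply to show the built-in data structures and a basic write-then-read file round trip.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Core data structures (all values here are made up for illustration).
scores = [82, 95, 67, 95, 74]                 # list: ordered and mutable
student = {"name": "Asha", "scores": scores}  # dict: key -> value lookup
unique_scores = set(scores)                   # set: duplicates removed
score_range = (min(scores), max(scores))      # tuple: fixed-size record

# Counter is a dict subclass that tallies how often each value appears.
most_common_score, times_seen = Counter(scores).most_common(1)[0]

# File handling: write the record to disk as JSON, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "student.json"  # hypothetical file name
    path.write_text(json.dumps(student), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded["name"], sorted(unique_scores), score_range)
print(most_common_score, times_seen)
```

If you can read this snippet and predict what it prints, you are probably ready for the next step.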

Let’s move on to the next step.

Data Science is the skill of analyzing data and drawing useful, actionable insights from it. For that, you must have knowledge of basic Statistics and Mathematics. Now I’m not asking you to be a great statistician, but you should know the basics to understand important things like the distribution of data and how algorithms work. Having said that, let’s see what you need to learn.

First of all, go through your high school statistics to refresh the fundamentals. For that, I recommend Khan Academy’s High School Stats series (optional if you are already thorough and comfortable with it).

After brushing up on your high school concepts, you can start reading any of the following books:

- An Introduction to Statistical Learning (with R) (highly recommended)
- Think Stats (with Python)

The above links will take you directly to the PDF versions of these books. You can also purchase physical copies at your convenience. After reading one of these books, you will also be familiar with the fundamentals of Data Analysis, which will help you in the next step.
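As a small taste of why the distribution of data matters, here is a sketch using Python's built-in `statistics` module. The visit counts are invented for illustration, with one deliberate outlier.

```python
import statistics

# Hypothetical sample: daily website visits over ten days (made-up numbers).
visits = [120, 135, 128, 150, 142, 138, 500, 131, 127, 129]

mean_visits = statistics.mean(visits)
median_visits = statistics.median(visits)
spread = statistics.stdev(visits)  # sample standard deviation

# The single outlier (500) pulls the mean well above the median,
# which is a hint that the distribution is skewed.
print(f"mean={mean_visits:.1f} median={median_visits} stdev={spread:.1f}")
```

Comparing the mean with the median like this is one of the quickest checks for skew, and it is the kind of intuition the books above will help you build.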

Having said that, let’s move on to our first attempt at data analysis.

This is where it gets interesting. Now that you know the basics of Python programming and the required Statistics, it’s time to finally get your hands dirty.

If you want to learn without paying anything, just make an account on Udacity and sign up for their free course, Intro to Data Analysis. This course will introduce you to useful Python libraries such as **Pandas** and **NumPy** that are needed for Data Analysis. You can learn at your own pace and easily finish the course in a few weeks.

There are many other courses on Udacity for you to explore. You can also find Nanodegree programs offered by Udacity, for which you generally have to pay. If you are comfortable paying for learning, there are many good platforms such as Coursera, Dataquest, DataCamp, etc.

By the end of this step, you should be familiar with some important libraries of Python and data structures like **Series**, **Arrays**, and **DataFrames**. You should also be able to perform tasks like data wrangling, drawing conclusions, vectorized operations, grouping data, and combining data from multiple files.
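To make these terms concrete, here is a minimal sketch, assuming you have `pandas` and `numpy` installed. The two small DataFrames are made-up stand-ins for data that would normally be loaded from separate files.

```python
import numpy as np
import pandas as pd

# Made-up sales data standing in for two separate files.
january = pd.DataFrame(
    {"region": ["North", "South"], "units": [10, 20], "price": [2.5, 3.0]}
)
february = pd.DataFrame(
    {"region": ["North", "South"], "units": [15, 5], "price": [2.5, 3.0]}
)

# Combining data from multiple sources into one DataFrame.
sales = pd.concat([january, february], ignore_index=True)

# Vectorized operation: whole columns are multiplied at once, with no loop.
sales["revenue"] = sales["units"] * sales["price"]

# Grouping: total revenue per region, returned as a Series.
revenue_by_region = sales.groupby("region")["revenue"].sum()

# A NumPy array supports the same kind of element-wise arithmetic.
units = np.array(sales["units"])
doubled = units * 2

print(revenue_by_region)
print(doubled)
```

With real data you would typically replace the hand-built DataFrames with calls such as `pd.read_csv`, but the combining, vectorizing, and grouping steps stay the same.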

Although you are now ready for the next step, there is still one thing left to learn before moving on: **Data Visualization**, the final key to bridging the gap between Analytics and Machine Learning.

Data Visualization is an important part of Data Analytics, as it helps you draw conclusions and see patterns in the data. Therefore it is imperative to learn how to visualize data. The simplest way to do so is to go through Kaggle’s Data Visualization course. After this, you will be familiar with an important Python library: **Seaborn**.

Great! You have come more than halfway to learning Data Science. Let’s move on to the next step, which is Machine Learning.

Machine Learning, as the name suggests, is the process by which a machine (computer) learns on its own. It is the study of computer algorithms that improve automatically through experience. You build models, mostly using predefined algorithms, depending on the kind of data and business problem you are facing. These models are trained on given data and are then used to draw conclusions about new data.
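The train-then-predict idea can be sketched in plain Python. This is only an illustration under simple assumptions: the data is made up, the helper names `fit_line` and `predict` are my own, and the model is a straight line fitted by ordinary least squares. In practice you would normally rely on a library such as scikit-learn rather than writing this by hand.

```python
# Made-up training data: hours studied (feature) and exam score (label).
train_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
train_scores = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


# "Training": learn the parameters from the given data.
model = fit_line(train_hours, train_scores)

# "Prediction": use the trained model on data it has not seen.
new_prediction = predict(model, 6.0)
print(model, new_prediction)
```

The same two-phase pattern, fit on known data and then predict on new data, carries over to the more powerful algorithms you will meet in the courses below.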

The simplest way to go about learning Machine Learning would be to go through the following courses on Kaggle in the given order:

- Intro to Machine Learning
- Intermediate Machine Learning
- Feature Engineering (to improve your models)

Although there are many other ways to learn Machine Learning, I have mentioned the easiest one for which you don’t have to pay. If money is not a constraint for you, you can explore various courses on DataCamp, Coursera, Udacity, and other related platforms.

By the end of this step, you will understand the difference between **Supervised Machine Learning** and **Unsupervised Machine Learning**. You will also know important techniques such as **Regression** and **Classification**, and algorithms such as **Decision Trees** and **Random Forests**.

Awesome! You just cracked the maze and joined the club of Data Science. Now all you have to do is to get better and climb up the ladder.

If you are still reading this blog, you really have what it takes to become a successful Data Scientist. Once you have acquired all this knowledge, you must retain and enhance it by practicing as much as you can. To do so, you can find projects to work on and business problems to solve.

One of the best ways to stay in practice is to participate in Kaggle contests. Kaggle gives you the problem to be solved and the data required to work on it. If it’s a contest, you can submit your results and get a rank on the leaderboard based on your score.

You can also work on personal projects to build a portfolio of your own. You can try the following sources to explore datasets:

To practice, I recommend that you download and install Anaconda on your local machine. It is a great toolkit for your Data Science projects. Among the tools in Anaconda you will find **Jupyter Notebook**, which is a great way to build Python projects and showcase them in your portfolio.

I am sure that following the guidelines in this blog will help you achieve the goal of learning data science. There’s a lot to learn and even more to explore in this field.

Stay tuned.

Previously published at https://towardsdatascience.com/data-science-from-scratch-4343d63c1c66