Data Science As A Career: 12 Steps From Beginner to Pro by @raevskymichail

February 1st 2021

For those looking to build a career in Data Science from scratch, here is a guide for you! This article walks through the steps you can take to advance your Data Science career and collects links to useful resources along the way.

The field of data science is developing vigorously. But data science **is not only neural networks**: it is also classical statistics and machine learning algorithms (which are easier to interpret for business processes), and more broadly everything related to the analysis, processing, and presentation of information in digital form.

It cannot yet be said that there is a clear division of labor in Data Science: this is still a non-specialized profession. A rough analogy: just as there were once pure **Computer Scientists** (computer scientists and programmers) who understood everything related to computers, there are now **Data Scientists** who deal with everything related to data. The first sign of a move toward specialization is visible in online education.

One way or another, a data scientist works at the intersection of several areas:

▶️ **Mathematics** (including linear algebra, machine learning algorithms)

▶️ **Programming** (e.g. Python, R; SQL is usually a minimum requirement)

▶️ **Business problems** (yes, apart from Computer Science, you should understand what business processes are and how you can improve them)

Depending on your role in the team, you will spend more time on some of these areas than on others. When choosing a direction to develop in, start from your own interests: learning will require significant resources, and without love for your work you will **quickly burn out**. A mathematical base is necessary, but it is likely that your own tasks will mostly come down to using existing tools and knowledge rather than inventing something new. As K. V. Vorontsov said in one interview:

We need 50-100-500 times more people who know how to use ready-made algorithms. It seems that the problem of how to teach Computer Science, and the question of "more math or more engineering", has the following answer: you need both, but you have to teach mathematics to a carefully selected group of people who see themselves as creators, designers of new methods.

If you want to truly understand machine learning algorithms, you first need to understand linear algebra, multivariable calculus, probability theory, and mathematical statistics:

- Linear Algebra for Data Science in R **(4 hours of lessons)**
- Introduction to Calculus **(48 hours)**
- Foundations of Probability in Python **(5 hours)**
- Basics of Statistics, part 2, part 3 **(total 43 hours)**
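
To see why these subjects matter in practice, consider fitting a straight line to data. Calculus says where the squared error reaches its minimum, and statistics supplies the resulting formula. The sketch below is only an illustration with made-up numbers, not material from the courses above; the function name `fit_line` is hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    Setting the derivative of the squared error to zero (calculus)
    gives slope = cov(x, y) / var(x) (statistics).
    """
    x_bar, y_bar = mean(xs), mean(ys)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

Library implementations do the same thing for many variables at once, which is where linear algebra comes in.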

If you want more illustrations and visualizations, I highly recommend taking a look at the wonderful channel **3Blue1Brown**. It has YouTube playlists on linear algebra, calculus, and differential equations.

By the way, the Khan Academy YouTube channel has a detailed course of 175 videos on multivariable calculus. When watching video lectures, do not forget that you can fast-forward. To engage motor memory and work through the material more deeply, take notes.

Besides mathematics, you need to be able to program. Python or R is usually chosen as the main language for data analysis. There are many good courses for both languages, including some with an emphasis on data analysis:

- Datacamp - Python Programming Track
- Datacamp - R Programming Track
- Stepik - Analyzing Data in R, part 2

Newcomers to Data Science often ask which language to choose as their main one: **R, created specifically for data processing, or general-purpose Python**. This is a hotly debated topic. I personally started with R (it is more popular in computational biology), but now that I know both languages I highly recommend **starting with Python**, since moving from Python to R is smoother than going the other way.

**In short:** if you are planning a career in Data Science, I recommend you master both languages. Knowing R concepts and libraries will keep you one step ahead of Python-only users, and vice versa. Here's how data analyst Irina Goloshchapova writes about it:

"By combining the most powerful and stable R and Python libraries, you can in some cases improve the efficiency of calculations or avoid reinventing the wheel when implementing statistical models. Secondly, it increases the speed and convenience of project execution if different people on your team (or you yourself) know different languages well. A reasonable combination of existing R and Python programming skills can help."

But if you want to take an easier (though still not simple) path, Python alone is enough: you will find more courses and more answers to all sorts of questions about it on Stack Overflow.

One of the most popular tools for sharing data analysis results is the Jupyter notebook.

Jupyter Notebook and JupyterLab let you combine code, Markdown text, LaTeX formulas, testing, and profiling in a single document. Alternatively, you can collaborate on notebooks using Google Colab or JupyterHub.

Learn to use Git as soon as possible. Along the way you will have to choose between a variety of models and architectural solutions, and version control is very useful here.

Plus, there are many great Data Science projects on GitHub. Remember that contributing to open source is one of the easiest ways to gain the teamwork experience you need and to contribute to a common cause.

You will naturally come across other popular tools as you progress through the courses. For example, in Python you need NumPy for fast processing of data arrays, pandas data frames are commonly used for tabular data, Matplotlib or Plotly for visualization, and ready-made classes for popular machine learning models are imported from scikit-learn.
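
As a rough illustration of how these libraries fit together, here is a minimal sketch that assumes NumPy, pandas, and scikit-learn are installed. The data is made up (points on the line y = 2x + 1), and plotting is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# NumPy: fast array math on made-up data following y = 2x + 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# pandas: the same data as a labeled table
df = pd.DataFrame({"x": x, "y": y})
print(df.describe())

# scikit-learn: a ready-made model class with a fit/predict interface
model = LinearRegression()
model.fit(df[["x"]], df["y"])
print(model.coef_[0], model.intercept_)  # close to 2 and 1
```

The same pattern (arrays, then a data frame, then a model object with `fit`) carries over to most classical machine learning tasks.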

Few courses focus on this, but in practice data is usually stored in databases, SQL or NoSQL. For further work, you will need to learn how to communicate with them:

- Datacamp - Introduction to Databases in Python
- Datacamp - Introduction to Relational Databases in SQL
- Stepik - Hadoop. A system for processing large amounts of data
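
To give a feel for what talking to a database from Python looks like, here is a small sketch using the standard-library `sqlite3` module. The `orders` table and its rows are invented for illustration; a production database would use a different driver but a very similar pattern.

```python
import sqlite3

# In-memory database; the table and rows are invented for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)],
)

# Aggregate in SQL, then pull only the summarized rows into Python
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()
print(rows)  # [('alice', 12.5), ('bob', 5.0)]
```

Letting the database do the grouping means only the summary travels to your analysis code, which matters once tables no longer fit in memory.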

For deep learning, you need to get familiar with a framework such as TensorFlow or PyTorch. There are others; we compared them in the article "Write your first Generative Adversarial Network Model on PyTorch".

**Courses:**

- Andrew Ng's Machine Learning Course on Coursera is one of the most popular MOOCs out there. It is worth taking if only because other advanced courses often refer to it. However, it uses Octave / Matlab rather than the more standard Python and R.
- Leskovec et al. Mining of Massive Datasets. The materials are broken down by chapter: PDFs, exercises, presentations, videos.
- Courses on DataCamp
- Harvard Data Science Course (edX)
- Probabilistic Programming and Bayesian Methods for Hackers
- Dive into Deep Learning: Free Interactive Book with Code, Math and Discussion http://d2l.ai

**Textbooks:**

- Hastie et al. The Elements of Statistical Learning
- Hal Daumé III. A Course in Machine Learning
- Shalev-Shwartz and Ben-David. Understanding Machine Learning: From Theory to Algorithms
- David Barber. Bayesian Reasoning and Machine Learning
- Tom Mitchell. Machine Learning
- Devroye et al. A Probabilistic Theory of Pattern Recognition
- Neatly designed editions with easy copying of R in Action: Data Analysis and Graphics with R and Machine Learning in Action
- Cheat Sheet on Key Concepts and Machine Learning Algorithms

A lot of interesting things can be learned from the English-language news aggregators from the world of data science:

Register on Kaggle. Not only is it the most famous machine learning competition platform with cash prizes, it is also a large community with a registry of datasets, Jupyter notebooks, mini-courses, and discussions. Mentioning your Kaggle ranking on your resume can earn you extra credit at an interview.

Data science is an incredibly broad interdisciplinary field, and special skills are required to solve specific problems. Once you are familiar with Kaggle, it will become clearer where the gaps in your in-demand knowledge are.

Also, pay attention to the following courses:

- Introduction to Deep Learning in Python.
- Deep Learning for NLP in Python.
- Introduction to Natural Language Processing.
- Probabilistic Graphical Models Specialization.
- Data Structures course.
- Computer graphics: the basics (useful for working with models that process images).

YouTube channels also come in handy:

On the YouTube channel of the Computer Science Center, courses in special sections are conveniently organized into playlists:

- Machine Learning (second part)
- Image and Video Analysis (second part)
- Introduction to Natural language processing
- Data analysis in Python in examples and tasks (continued)
- Data analysis in R
- Technologies for storing and processing large amounts of data
- Mathematical statistics.

Don't stop learning. Browse the top posts and sidebars of **subreddits** on topics related to machine learning:

- /r/analyzit
- /r/bigdata
- /r/computervision
- /r/datacleaning
- /r/datagangsta
- /r/dataisbeautiful
- /r/dataisugly
- /r/datascience
- /r/datasets
- /r/dataviz
- /r/JupyterNotebooks
- /r/LanguageTechnology
- /r/learnmachinelearning
- /r/MachineLearning
- /r/opendata
- /r/rstats
- /r/probabilitytheory
- /r/pystats
- /r/SampleSize
- /r/semanticweb
- /r/statistics
- /r/textdatamining

Use your new knowledge of Data Science to benefit yourself and others. Create something that will make others say "wow"! Lots of project ideas are listed in

You can also start not from a project idea but from an interesting dataset. Here is a list of popular registries:

- Open data registry at data.gov.ru
- Google public datasets
- Kaggle datasets (40 thousand)
- Reddit /r/datasets subreddit
- UCI Machine Learning Repository
- Aggregate list of open-source datasets awesome-public-datasets
- List of large public datasets
- List of quality Webhose.io datasets
- Datasets of the IEEE Society
- Wolfram Data Drop Accumulator
- Statistics database on finance, sports, geography, industry

Lots of discussions with project ideas can be found on Quora:

What Data Science Problems Can Be Solved Over the Weekend by One Programmer? I am studying Machine Learning and Statistics and am looking for something socially significant using public datasets and APIs

How do I start building a recommendation system? What tools/technologies/algorithms are best used to build an engine? How to check the effectiveness of the recommendations?

Create a public repository on GitHub for each project. Polish the results and share them on your blog and with the community. Contribute to side projects, and post your ideas and thoughts. All this will help you build a portfolio and get to know people working on related tasks.

The main languages of data science are not Python or R, but English and the language of mathematics.

Preprints of articles are published on the **arXiv website**. The most useful sections for data scientists:

It is simply impossible to keep track of all publications. The subreddits listed above will help you pick out the most important texts (since the author became the head of the AI department at Tesla, the site has been breaking more often, but it's still the best tool). There is also a list of articles with comments, as well as webinar recordings on the Kaggle YouTube channel that break down scientific articles related to data science algorithms.

Data Science is an in-demand and highly competitive profession. But community members turn even interview results into data. There are many lists of questions to help you prepare for a data scientist interview:

- Data Science Interview Questions
- How Do I Prepare for a Data Science Interview
- How to Prepare for Statistics Questions
- What Types of A/B Testing Questions to Expect in an Interview

This year it is more difficult, but we hope that summer schools and internships will return soon:

- Which companies offer Data Science internships for students
- What tips to follow if I want to apply for an internship in Data Science or Software Engineering
- When is the Best Time to Apply for Summer Data Science Internships

Be sure to use your data mining skills to analyze the job market: find out which skills appear most often in job postings so you can hone them as much as possible. Estimate how much income you can expect, taking into account local expenses, rent, and the cost of moving to another city.

Share your project with the Data Science community, or find one to join there. Prepare a talk and speak at a local meetup. Start a blog where you share your findings, your own ideas, and your repositories.
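
A first pass at this kind of job-market analysis can be very simple. The sketch below counts how many postings mention each skill; the postings, the skill list, and the function name `skill_counts` are all hypothetical, and real listings would need more careful text cleaning (multi-word skills such as "machine learning" are not handled here).

```python
import re
from collections import Counter

def skill_counts(postings, skills):
    """Count how many postings mention each skill (at most once per posting)."""
    wanted = {s.lower() for s in skills}
    counts = Counter()
    for text in postings:
        # Split into lowercase tokens so "R" does not match inside "PyTorch"
        tokens = set(re.findall(r"[a-z0-9+#]+", text.lower()))
        counts.update(tokens & wanted)
    return counts

# Hypothetical postings and skill vocabulary
postings = [
    "Data Scientist: Python, SQL, machine learning",
    "Analyst: SQL, Excel, Tableau",
    "ML Engineer: Python, PyTorch, SQL",
]
skills = ["python", "sql", "r", "pytorch", "tableau"]

counts = skill_counts(postings, skills)
print(counts.most_common(2))  # [('sql', 3), ('python', 2)]
```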

Last but not least, enjoy how your skills help make the world a better place!

*If you found this article helpful, share the article on Facebook so your friends can benefit from it too.*

*Also published at https://dev.to/mikhailraevskiy/data-scientist-12-steps-from-beginner-to-pro-3fh6*
