Bioinformatician at Oncobox Inc, Research Associate at MIPT
For those looking to build a career in Data Science from scratch, here is a guide for you! This article explains the steps you can take to advance your Data Science career and offers a scattering of links to useful resources.
The field of data science is developing rapidly. But data science is not only neural networks: it also covers classical statistics and machine learning algorithms (which are easier to interpret for business processes) and, more broadly, everything related to the analysis, processing, and presentation of information in digital form.
There is not yet a clear division of labor in Data Science: it is a non-specialized profession. A rough analogy: just as there were once pure computer scientists who understood everything related to computers, there are now data scientists who deal with everything related to data. The first sign of a move towards specialization is the sphere of online education.
One way or another, a data scientist works at the intersection of several areas:
▶️ Mathematics (including linear algebra, machine learning algorithms)
▶️ Programming (e.g. Python, R, SQL is usually a minimum requirement)
▶️ Business problems (yes, apart from Computer Science, you should understand what business processes are and how you can improve them)
Depending on your role in the team, you will have to do more of some of these things than others. When choosing a direction for your development, start from your own interests: learning will require significant resources, and without love for your work, you will quickly burn out. A mathematical foundation is necessary, but your day-to-day tasks will most likely come down to applying existing tools and knowledge rather than inventing something new. As K. V. Vorontsov said in one interview:
People who know how to use ready-made algorithms need 50–100–500 times more. It seems that the problem of how to teach Computer Science and the problem of “more math or more engineering” has the following answer: you need both, but you have to teach mathematics to a carefully selected multitude of people who have realized themselves as creators, designers of new methods
If you want to truly understand machine learning algorithms, you first need to understand linear algebra, multivariable calculus, probability theory, and mathematical statistics:
If you want more illustrations and visualizations, I highly recommend taking a look at the wonderful channel 3Blue1Brown. Here are some YouTube playlists for linear algebra, calculus, and differential equations.
By the way, the Khan Academy YouTube channel has a detailed 175-video course on multivariable calculus. When watching video lectures, do not forget that you can fast-forward. To engage motor memory and work through the material more deeply, take notes.
Besides mathematics, you need to be able to program. Data analysts usually choose Python or R as their main language. There are many good courses in both languages, including some with an emphasis on data analysis:
Newcomers to Data Science often ask which main language to choose: R, created specifically for data processing, or general-purpose Python. This is a hotly debated topic. I personally started with R (it is more popular in computational biology), but now I know both languages and highly recommend starting with Python, since the transition from Python to R is smoother than the reverse.
In short: if you are planning a career in Data Science, I recommend you master both languages. Knowing R concepts and libraries will keep you one step ahead of Python-only users, and vice versa. Here is what data analyst Irina Goloshchapova writes about it:
"By combining the most powerful and stable R and Python libraries, you can in some cases improve the efficiency of calculations or avoid reinventing the wheel when implementing statistical models. Secondly, this is an increase in the speed and convenience of project execution, if different people in your team (or you yourself) have good knowledge of different languages. A reasonable combination of existing R and Python programming skills can help."
But if you want to take an easier (though still not easy) route, Python alone is enough: you will find more courses on it, and answers to all sorts of questions on Stack Overflow.
One of the most popular tools for sharing data analysis results is Jupyter notebooks:
Jupyter Notebooks and the JupyterLab platform allow you to combine code, text in Markdown, formulas in LaTeX, testing, and profiling in a single document. Alternatively, you can collaborate on notebooks using Google Colab or JupyterHub.
Learn to use Git as soon as possible. In the process, you will have to choose between a variety of models and architectural solutions — version control is very useful here.
Plus, there are many great Data Science projects on GitHub. Remember that open source is one of the easiest ways to gain the necessary teamwork experience and contribute to a common cause.
You will naturally come across other popular tools as you progress through the courses. For example, in Python, NumPy is essential for fast array processing, pandas data frames are usually used for tabular data, Matplotlib or Plotly for visualization, and ready-made classes for popular machine learning models are imported from scikit-learn.
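To give a feel for how the first two of these libraries fit together, here is a minimal sketch. The measurements and column names are made up for illustration, and only NumPy and pandas are shown; plotting and model fitting are left out.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented for this example only.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])
weights_kg = np.array([50.0, 60.0, 70.0, 80.0])

# NumPy applies arithmetic to whole arrays at once, with no explicit loop.
bmi = weights_kg / (heights_cm / 100.0) ** 2

# A pandas DataFrame holds the same data as a labelled table.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "height_cm": heights_cm,
    "bmi": bmi,
})

# Split the rows by group and average each numeric column.
group_means = df.groupby("group")[["height_cm", "bmi"]].mean()
print(group_means)
```

The same pattern (vectorized computation in NumPy, then grouping and summarizing in pandas) covers a large share of everyday exploratory work.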
Few courses focus on this, but in practice, data is usually stored in databases, either SQL or NoSQL. For further work, you will need to learn how to communicate with them.
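As a minimal sketch of talking to a SQL database from Python, the example below uses the standard-library `sqlite3` module with an in-memory database. The table name, columns, and rows are assumptions made up for illustration; a production database would normally need its own driver and connection settings.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")

# Hypothetical rows; the ? placeholders let the driver handle quoting.
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)

# Aggregate in SQL so that only the summary leaves the database.
query = """
    SELECT sensor, COUNT(*), AVG(value)
    FROM measurements
    GROUP BY sensor
    ORDER BY sensor
"""
summary = conn.execute(query).fetchall()
conn.close()

print(summary)  # [('a', 2, 2.0), ('b', 1, 10.0)]
```

Pushing the aggregation into the query, rather than fetching every row and summarizing in Python, is usually the better habit once tables grow large.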
For deep learning, you need to get familiar with frameworks — TensorFlow or PyTorch. There are others — we compared them in the article “Write your first Generative Adversarial Network Model on PyTorch”.
A lot of interesting things can be learned from English-language data science news aggregators:
Register on Kaggle. Not only is it the most famous machine learning competition platform with cash prizes, but it is also a large community with a registry of datasets, Jupyter notebooks, mini-courses, and discussions. Listing your Kaggle ranking on your resume can earn you extra credit in an interview.
Data science is an incredibly broad interdisciplinary field, and special skills are required to solve specific problems. After familiarizing yourself with Kaggle, you will see more clearly where you have gaps in in-demand knowledge.
Also, pay attention to the following courses:
YouTube channels also come in handy:
On the YouTube channel of the Computer Science Center, courses in special sections are conveniently organized into playlists:
Don’t stop learning. Browse the top posts and sidebars of subreddits related to machine learning:
You can start not from a project idea, but from an interesting dataset. Here is a list of popular registries:
Lots of discussions with project ideas can be found on Quora:
What Data Science Problems Can Be Solved Over the Weekend by One Programmer? I am studying Machine Learning and Statistics and am looking for something socially significant using public datasets and APIs
How do I start building a recommendation system? What tools/technologies/algorithms are best used to build an engine? How to check the effectiveness of the recommendations?
Create a public repository on GitHub for each project. Polish the results and share them on your blog and with the community. Contribute to side projects, and post your ideas and thoughts. All this will help you build a portfolio and get to know people working on related tasks.
The main languages of data science are not Python or R, but English and the language of mathematics.
Preprints of articles are published on the arXiv website. The most useful sections for data scientists:
It is simply impossible to keep track of all publications. The subreddits listed above will help you pick out the most important texts (since the author became the head of the AI department at Tesla, the site began to break more often, but it’s still the best tool). There is also an annotated list of articles, as well as webinar recordings on the Kaggle YouTube channel that walk through scientific papers on data science algorithms.
Data science is an in-demand and highly competitive profession. But even the results of interviews are turned into data by community members. There are many lists of questions to prepare for a data scientist interview:
This year it is more difficult, but we hope that summer schools and internships will return soon:
Be sure to use your data mining skills to analyze the job market: work out which skills appear most often in job postings so you can hone them as much as possible. Estimate how much income you can expect, taking into account local living expenses, rent, and the cost of moving to another city.
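The skill-counting part can be sketched in a few lines. Everything here is an assumption for illustration: the postings are invented, the skill list is arbitrary, and `count_skills` is a hypothetical helper rather than part of any library. Real postings would come from whichever job board or export you have access to.

```python
import re
from collections import Counter

# Hypothetical job postings, invented for this example only.
postings = [
    "Data Scientist: Python, SQL, machine learning",
    "Data Analyst: SQL, Excel, Python",
    "ML Engineer: Python, PyTorch, Git",
]

# Skills to look for; this list is also an assumption.
skills = ["python", "sql", "r", "git", "pytorch", "excel"]


def count_skills(texts, skills):
    """Count how many postings mention each skill as a whole word."""
    counts = Counter()
    for text in texts:
        # Lower-case the text and split it into a set of word-like tokens,
        # so that "r" does not match every word containing the letter r.
        words = set(re.findall(r"[a-z0-9+#]+", text.lower()))
        counts.update(skill for skill in skills if skill in words)
    return counts


skill_counts = count_skills(postings, skills)
print(skill_counts.most_common(2))  # [('python', 3), ('sql', 2)]
```

Matching whole tokens instead of substrings is the main design choice: a naive `"r" in text` check would count almost every posting as an R job.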
Share your project with the Data Science community, or find one there to join. Prepare a talk and speak at a local meetup. Start a blog where you will share your finds, your own ideas, and repositories.
Last but not least, enjoy how your skills help make the world a better place!