Data science is the underlying force driving recent advances in artificial intelligence (AI) and machine learning (ML). This has led to enormous growth in ML libraries and made established programming languages like Python more popular than ever before.
It makes sense to discuss them together (even though they're not interchangeable) because there's significant overlap. Broadly speaking, data science is about producing insights, AI is about producing actions, and ML is focused on making predictions.
To better understand how data science fits into AI and ML, dive into the machine learning engineering stack below and see how each tool is used.
As part of our research for Springboard's AI/Machine Learning Career Track (the first online machine learning course with a job guarantee), we've curated a selection of those tools and the resources required.
CometML is a newer product that aims to do for machine learning what GitHub did for code. GitHub is celebrated for its flexibility in organizing workflows and maintaining version control for projects with multiple developers working on the same codebase. Similarly, CometML allows data scientists and developers to efficiently track, compare, and collaborate on machine learning experiments. As you train your model, CometML tracks and graphs the results. It also tracks code changes and imports them.
Developers can integrate CometML into most machine learning libraries, including PyTorch, Keras, and Scikit-learn, so there's no disruption to existing workflows. You simply have a new supplementary service that helps you get greater insight into your experiments. You can deploy CometML in any Jupyter notebook with two lines of code.
To get started, you just set up a free account, add the Comet tracking code to your machine learning app of choice, and run your experiments as normal.
Comet works with GitHub and other Git service providers. Once you've finished a project, you can generate a pull request straight to your GitHub repository.
Dask-ML was developed to provide advanced parallelism for analytics while boosting performance at scale for tools like Pandas and NumPy workflows. It also enables the execution of advanced computations by exposing low-level APIs to its internal task scheduler.
For machine learning projects, Dask-ML is a useful tool for overcoming long training times and large datasets. You can scale algorithms simply by replacing NumPy arrays with Dask-ML arrays. Dask-ML leverages Dask workflows to prepare the data, which can then be handed over to tools like TensorFlow for training.
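As a rough sketch (assuming `dask` is installed), a Dask array behaves like a NumPy array but splits the work into chunks that can be processed in parallel:

```python
import dask.array as da

# A 4000x4000 array split into 1000x1000 chunks, processed in parallel
x = da.random.random((4000, 4000), chunks=(1000, 1000))

# Operations build a lazy task graph; compute() triggers the actual work
result = x.mean().compute()
```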
Check out these Dask-ML resources:
Docker’s role in the machine learning stack is to simplify the installation process. It can be a godsend for data scientists who spend significant time trying to resolve configuration problems. The primary idea behind Docker is simple: if it works in a Docker container, it will work on any machine.
This open-source software platform makes it much easier to develop, deploy, and run applications inside containers on popular operating systems. Thanks to its robust ecosystem of allied tools, you can write a Dockerfile that builds a Docker image containing most of the libraries and tools you will need for any given project.
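A Dockerfile for a typical ML project might look something like this sketch (the requirements file and script name are placeholders):

```dockerfile
# Start from an official Python base image
FROM python:3.10-slim

WORKDIR /app

# Install the project's libraries into the image
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the project code and define the default command
COPY . .
CMD ["python", "train.py"]
```

Anyone who builds this image gets the same environment, which is exactly the "works on any machine" guarantee described above.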
Docker Hub is to Docker images what GitHub is to code. This platform makes it much easier to share your Docker images and help someone else with their project. However, Docker’s real potential is only realized when machine learning is added into the mix.
For example, you can use an app with a container to search through millions of profile pictures on social media platforms using facial recognition. In this scenario, Docker streamlines the work, makes it scalable, and allows businesses to focus on their goals.
Docker has had a major impact on ML because it makes application environments reproducible, portable, and easy to scale.
If you have never used Docker before, it’s best to start with Docker Orientation.
GitHub is a Git repository hosting service and development platform where both business and open-source communities can host and manage projects, review code, and develop software. Supported by over 31 million developers, GitHub provides a highly user-friendly, web-based graphical interface that makes managing development projects easier.
There are several collaboration features, like basic task management tools, wikis, and access control. Whether you’re a beginner or an established pro, this platform boasts a wealth of resources for your benefit.
An introduction to Git and how to use it to commit/save your files for collaboration is available on freeCodeCamp.
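The basic commit workflow looks like this sketch (the user name and email are placeholders, set inline so the commit succeeds on a fresh machine):

```shell
# Create a repository and commit a first file
mkdir demo-repo && cd demo-repo
git init
echo "print('hello')" > model.py
git add model.py
git -c user.name="Demo" -c user.email="demo@example.com" commit -m "Add model script"
```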
Some of the ML resources that can help you on your next project are:
- Awesome Data Science Repository
- GPT-2 — OpenAI’s Ground-Breaking Language Model
Hadoop is an Apache project: a software library and framework that enables distributed processing of large datasets across clusters of computers using simple programming models.
In fact, Hadoop can scale from a single computer to thousands of commodity machines, each offering computing power and local storage. The Hadoop framework is made up of the following modules:
- Hadoop Distributed File System (HDFS)
- Hadoop Common
- Hadoop YARN
- Hadoop MapReduce
You can also further extend the power and reach of Hadoop with the following related projects:
Hadoop is ideal for companies that need to process large, complex datasets quickly by leveraging ML. Machine learning can be implemented in Hadoop’s MapReduce to rapidly identify patterns and surface deeper insights.
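The map/reduce pattern itself can be sketched in plain Python, no cluster required. Word counting is the classic example: the mapper emits (key, value) pairs, and the reducer aggregates values per key.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line
    for word in line.lower().split():
        yield word, 1

def reducer(pairs):
    # Reduce phase: sum the values for each key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data is big", "data science at scale"]
pairs = [pair for line in lines for pair in mapper(line)]
word_counts = reducer(pairs)
```

Hadoop runs the same two phases, but distributes the map and reduce work across many machines.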
To achieve this, you have to run the ML library Mahout on top of Hadoop. To learn more about using Mahout on Hadoop, check out the following resources:
Keras, written in Python, is a high-level neural network API. Authored by François Chollet, an AI researcher and software engineer at Google and the founder of Wysp, Keras is designed to be highly user-friendly and fast.
With Keras, you can easily run experiments on top of CNTK, TensorFlow, or Theano. It’s ideal for projects that demand a deep learning library that accommodates rapid prototyping through modularity and extensibility. It also supports recurrent and convolutional networks, and seamlessly runs on both CPUs and GPUs.
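A minimal sketch of the Keras API, assuming the TensorFlow-bundled version of Keras is installed:

```python
from tensorflow import keras

# A small feed-forward network for binary classification on 10 features
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1, activation="sigmoid"),
])

# compile() wires up the optimizer and loss before training
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```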
Luigi is a Python framework that is used internally at Spotify. With this tool, you can build complex pipelines of batch jobs and effectively manage dependency resolution, visualization, workflows, and more.
It was built to address the challenges of long-running batch processes where failures are inevitable. Luigi makes it easier to automate and manage long-running tasks like dumping data to or from databases, running Hadoop jobs, and executing ML algorithms.
Tools like Cascading, Hive, and Pig handle the lower-level aspects of data processing effectively, but Luigi helps you chain them together. For example, you can stitch together a Hadoop job in Java, a database table dump, a Hive query, and a Spark job in Python. Because Luigi takes care of workflow management, you can focus on the tasks themselves and their dependencies.
Helpful Luigi resources include:
- Production ready Data-Science with Python and Luigi
- Machine Learning Pipeline using Luigi and Scikit-Learn
For data scientists coding with Python, Pandas is an important tool that’s often the backbone of many big data projects. In fact, for anyone thinking of a career in data science or machine learning, it will be critical to learn Pandas because it’s key to cleaning, transforming, and analyzing data.
It makes it easy to get acquainted with your data by extracting it from a CSV file into a DataFrame, or table. It can also perform calculations, create visualizations, and clean the data before storage. That cleaning step is critical for ML and natural language processing.
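For instance, a typical load-and-clean pass might look like this sketch (the column names are made up for illustration):

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk: one duplicate row and one missing score
raw_csv = io.StringIO("name,score\nAda,91\nGrace,\nAda,91\n")

df = pd.read_csv(raw_csv)
df = df.drop_duplicates().dropna()  # basic cleaning before analysis or storage

mean_score = df["score"].mean()
```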
Check out this Pandas tutorial for more.
PyTorch is written in Python and is the successor to the Torch library (which was written in Lua). Developed by Facebook and used by major players like Salesforce, Twitter, and the University of Oxford, PyTorch provides maximum flexibility and speed for deep learning platforms.
It can also serve as a replacement for NumPy, as it makes better use of the power of GPUs. Installation is straightforward, with builds tailored to system properties like your operating system and package manager. PyTorch can be installed within an IDE like PyCharm or from the command line.
PyTorch expresses computations in a straightforward, imperative style and includes a considerable number of pretrained models and modular components that are easy to combine.
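A small sketch of PyTorch's NumPy-like tensor API with automatic differentiation (assuming `torch` is installed):

```python
import torch

# A 2x2 tensor that tracks gradients, much like a NumPy array with autograd
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)

y = (x ** 2).sum()
y.backward()  # populates x.grad with dy/dx = 2x
```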
To get an in-depth understanding of PyTorch in ML, listen to this interview with Soumith Chintala, research engineer at the Facebook AI Research lab. There are more PyTorch tutorials on GitHub.
The Apache project Spark can be described as an open-source, general-purpose, distributed data processing engine. It’s a highly flexible tool that can be leveraged to access data in a variety of sources, such as Amazon S3, Cassandra, HDFS, and OpenStack.
Compared with Hadoop, Spark’s in-memory processing can be up to 100 times faster, and it runs about 10 times faster on disk. You can use it to process data on a standalone local machine, or even build models when the input datasets are much larger than your computer’s memory. In fact, what makes Spark ideal for ML is its in-memory processing, which can deliver near real-time analytics.
Spark also comes with an interactive mode, so users get immediate feedback on their queries and actions. While it’s also good at batch processing, it excels at machine learning, interactive queries, streaming workloads, and real-time data processing.
If you’re already familiar with Hadoop, you can easily add Spark to your arsenal, as it’s highly compatible (and is even listed as a module on Hadoop’s project page).
Spark is user-friendly because it comes with the following APIs:
- Spark SQL (which is very similar to SQL 92)
This comprehensive guide from Microsoft demonstrates how to train and create ML models with Spark.
Scikit-learn is an open-source ML library for Python featuring algorithms such as k-nearest neighbors, random forests, and support vector machines. It builds on Python’s numeric and scientific libraries, such as NumPy and SciPy.
By far the cleanest and easiest ML library, Scikit-learn accommodates a wide selection of supervised and unsupervised algorithms. Designed with an engineering mindset, this tool is highly user-friendly, powerful, and flexible for running end-to-end ML research projects.
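The library's fit/predict pattern is consistent across algorithms; here is a quick sketch using the bundled Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every estimator follows the same fit/predict/score interface
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

Swapping in a different algorithm usually means changing only the estimator line.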
To learn more about how Scikit-learn is used in ML, you can go through the following resources:
TensorFlow is an open-source programming library developed to help construct and train neural networks that mimic human perception, thinking, and learning. Some of Google’s leading products, notably Google Translate, utilize TensorFlow.
It works by applying optimization strategies that make numerical computations easier and faster while boosting overall performance. However, TensorFlow can prove far more challenging than Keras or PyTorch and requires a good deal of boilerplate code.
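A tiny sketch of TensorFlow's numerical core (assuming TensorFlow 2.x is installed):

```python
import tensorflow as tf

# Tensors behave like typed multi-dimensional arrays
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Operations run eagerly in TensorFlow 2.x and return tensors
b = tf.matmul(a, a)
result = b.numpy()
```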
- Deep Learning with TensorFlow for Beginners
- Getting Started with TensorFlow: A Machine Learning Tutorial
- TensorFlow Tutorials
Clearly, the machine learning engineer’s toolbox is robust. The wealth of technology that’s available is quite significant and potentially overwhelming. However, if you’re a Python developer, it’ll be fairly easy to pick up the main components of the ML stack. If you feel like you need some help doing so, Springboard offers a curated curriculum, job guarantee, and unlimited calls with machine learning experts, including your own personal mentor through the AI/Machine Learning Career Track.