Introduction to the machine learning stack

Data science is the underlying force driving recent advances in artificial intelligence (AI) and machine learning (ML). This has led to enormous growth in ML libraries and made established programming languages like Python more popular than ever. It makes sense to discuss these fields together (even though they're not interchangeable) because there's significant overlap: broadly speaking, data science is about producing insights, AI is about producing actions, and ML is focused on making predictions.

To better understand the inner workings of data science in AI and ML, dive into the machine learning engineering stack listed below. As part of our research for Springboard's AI/Machine Learning Career Track (the first online machine learning course with a job guarantee), we've curated a selection of these tools and the resources you'll need to learn them.

CometML

CometML is a newer product that aims to do for machine learning what GitHub did for code. GitHub is celebrated for its flexibility in organizing workflows and maintaining version control for projects with multiple developers working on the same codebase. Similarly, CometML allows data scientists and developers to efficiently track, compare, and collaborate on machine learning experiments. As you train your model, CometML tracks and graphs the results. It also tracks and imports code changes.

Developers can integrate CometML into most machine learning libraries, including PyTorch, Keras, and scikit-learn, so there's no disruption to existing workflows. You simply gain a supplementary service that gives you greater insight into your experiments. You can deploy CometML in any Jupyter notebook with two lines of code. To get started, you just set up a free account, add the Comet tracking code to your machine learning app of choice, and run your experiments as normal.
Comet works with GitHub and other Git service providers. Once you've finished a project, you can generate a pull request straight to your GitHub repository. For more, check out Comet's documentation, this cheat sheet for users, and this Data Council presentation by Comet.ml CEO Gideon Mendels.

Dask-ML

Dask-ML was developed to provide advanced parallelism for analytics while boosting performance at scale for tools like Pandas and NumPy workflows. Dask also enables the execution of advanced computations by exposing low-level APIs to its internal task scheduler.

For machine learning projects, Dask-ML is a useful tool for overcoming long training times and large datasets. You can scale algorithms more easily by replacing NumPy arrays with Dask arrays. Dask workflows prepare the data, which can then be quickly handed over to tools like TensorFlow running alongside Dask-ML.

Check out these Dask-ML resources:

Dask questions on Stack Overflow
Scalable Machine Learning with Dask
Why Dask?

Docker

Docker's role in the machine learning stack is to simplify the installation process. For data scientists who spend a significant amount of time trying to resolve configuration problems, this can be a godsend. The primary idea behind Docker is simple: if it works in a Docker container, it will work on any machine. This open-source software platform makes it much easier to develop, deploy, and manage applications using containers on popular operating systems. Because of its robust ecosystem of allied tools, you can write a Dockerfile that builds a Docker image, which contains most of the libraries and tools that you will need for any given project.

DockerHub is to Docker what GitHub is to Git. This platform makes it much easier to share your Docker images and help someone else with their project. However, Docker's real potential is only realized when machine learning is added into the mix.
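A Dockerfile for a Python ML project might look like the sketch below. The base image, file names, and entry point are all illustrative assumptions, not a prescribed setup.

```dockerfile
# Hypothetical image for a Python ML project; base image and
# file names are illustrative.
FROM python:3.10-slim

WORKDIR /app

# Install the project's libraries inside the container so the
# environment is identical on every machine.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the project code and define the default command.
COPY . .
CMD ["python", "train.py"]
```

Anyone who builds this image gets the same Python version and library set, which is exactly the "works in a container, works anywhere" guarantee described above.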
For example, you can use an app with a container to search through millions of profile pictures on social media platforms using facial recognition. In this scenario, Docker streamlines the work, makes it scalable, and allows businesses to focus on their goals. We can say that ML was revolutionized by Docker because it allowed for the creation of effective application architecture. If you have never used Docker before, it's best to start with Docker Orientation.

GitHub

GitHub is a Git repository hosting service and development platform where both business and open-source communities can host and manage projects, review code, and develop software. Supported by over 31 million developers, GitHub provides a highly user-friendly, web-based graphical interface that makes managing development projects easier.

There are several collaboration features, like basic task management tools, wikis, and access control. Whether you're a beginner or an established pro, this platform boasts a wealth of resources for your benefit. An introduction to Git and how to use it to commit/save your files for collaboration is available on freeCodeCamp.

Some of the ML resources that can help you on your next project are:

Awesome Data Science Repository
GPT-2 — OpenAI's Ground-Breaking Language Model
Iterative/DVC
OpenML
SC-FEGAN

Hadoop

Hadoop is an Apache project that can be described as a software library and framework enabling the distributed processing of large datasets across multiple computers using simple programming models. In fact, Hadoop can be scaled from a single computer to thousands of commodity systems that offer computing power and local storage.
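To see the idea behind MapReduce, the programming model at the heart of Hadoop, here is a minimal plain-Python sketch of the map, shuffle, and reduce steps. Hadoop's contribution is distributing these same steps across many machines; the word-count task and toy documents here are the classic illustrative example.

```python
from collections import defaultdict

documents = ["the cat sat", "the cat ran", "a dog ran"]

# Map step: each document independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle step: group values by key (Hadoop does this across the cluster).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce step: combine each key's values into a final result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}

print(word_counts["the"])  # 2
```

Because map and reduce operate on independent keys, each step can run in parallel on different machines, which is what lets Hadoop scale from one computer to thousands.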
The Hadoop framework is made up of the following modules:

Hadoop Distributed File System (HDFS)
Hadoop Common
Hadoop YARN
Hadoop MapReduce

You can further extend the power and reach of Hadoop with the following tools:

Ambari
Avro
Cassandra
Flume
Hive
Oozie
Pig
Sqoop

Hadoop is ideal for companies that want to rapidly process large, complex datasets by leveraging ML. Machine learning can be implemented in Hadoop's MapReduce to quickly identify patterns and profound insights. To achieve this, you have to run the ML library Mahout on top of Hadoop. To learn more about using Mahout on Hadoop, check out the following resources:

Building Mahout from Source
Apache Mahout on GitHub
Apache Mahout Tutorial for Beginners

Keras

Keras, written in Python, is a high-level neural network API. Authored by François Chollet, an AI researcher and software engineer at Google and the founder of Wysp, Keras is designed to be highly user-friendly and fast. With Keras, you can easily run experiments on top of CNTK, TensorFlow, or Theano. It's ideal for projects that demand a deep learning library that accommodates rapid prototyping through modularity and extensibility. It also supports recurrent and convolutional networks, and runs seamlessly on both CPUs and GPUs. To get a better idea of how this works, check out this interview with Chollet. You can also find more Keras resources on GitHub.

Luigi

Luigi is a Python framework that is used internally at Spotify. With this tool, you can build complex pipelines of batch jobs and effectively manage dependency resolution, visualization, workflows, and more. It was built to address the challenges associated with long-running batch processes where failures are inevitable. Luigi makes it easier to automate and manage long-running tasks like data dumps to or from databases, Hadoop jobs, and running ML algorithms.
Tools like Cascading, Hive, and Pig can manage the lower-level aspects of data processing effectively, but Luigi helps you chain them together. For example, you can stitch together a Hadoop job in Java, a table dump from a database, a Hive query, or a Spark job in Python. As Luigi takes care of workflow management, you can focus on the tasks and their dependencies.

Helpful Luigi resources include:

Production ready Data-Science with Python and Luigi
Machine Learning Pipeline using Luigi and Scikit-Learn

Pandas

For data scientists coding with Python, Pandas is an important tool that's often the backbone of many big data projects. In fact, for anyone considering a career in data science or machine learning, learning Pandas is critical because it's key to cleaning, transforming, and analyzing data. Pandas makes it easy to get acquainted with your data by extracting information from a CSV file into a DataFrame, a table-like structure. It can also perform calculations, create visualizations, and clean the data before storage; the latter is critical for ML and natural language processing. Check out this Pandas tutorial for more.

PyTorch

PyTorch is written in Python and is the successor of the Torch library (which was written in Lua). Developed by Facebook and used by major players like Salesforce, Twitter, and the University of Oxford, PyTorch provides maximum flexibility and speed for deep learning platforms. It's also a replacement for NumPy, as it makes better use of the power of GPUs. Installation is pretty easy and depends on system properties like your operating system and package manager; PyTorch can be installed within an IDE like PyCharm or from the command prompt.

PyTorch is good at displaying procedures in a straightforward manner and includes a considerable number of pre-trained models and modular parts that are easy to combine. To get an in-depth understanding of its role in ML, listen to this interview with Soumith Chintala, research engineer at the Facebook AI Research Lab.
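The extract-and-clean Pandas workflow described above can be sketched as follows; the inline string is a stand-in for a real CSV file on disk, and the column names are invented for the example.

```python
import io
import pandas as pd

# A small in-memory stand-in for a CSV file on disk.
csv_file = io.StringIO(
    "city,price\n"
    "Austin,300\n"
    "Austin,\n"      # a missing value to clean out
    "Boston,500\n"
)

df = pd.read_csv(csv_file)   # extract the CSV into a DataFrame
df = df.dropna()             # clean: drop rows with missing prices
mean_prices = df.groupby("city")["price"].mean()

print(mean_prices["Austin"])  # 300.0
```

The same three moves, extract, clean, and aggregate, carry over directly to real files: only the `io.StringIO` stand-in would be replaced by a file path.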
There are more PyTorch tutorials on GitHub.

Spark

Apache Spark can be described as an open-source, general-purpose, distributed data processing engine. It's a highly flexible tool that can access data from a variety of sources, such as Amazon S3, Cassandra, HDFS, and OpenStack. Compared with Hadoop, Spark's in-memory processing can be up to 100 times faster, and it runs about 10 times faster on disk. You can use it to process your data on a standalone local machine, or even build models when the input datasets are much larger than your computer's memory.

In fact, what makes Spark perfect for ML is its in-memory processing, which is capable of delivering near real-time analytics. Spark also comes with an interactive mode, so users can get immediate feedback on their queries and actions. While it's also good at batch processing, it transcends the competition in machine learning, interactive queries, streaming workloads, and real-time data processing. If you're already familiar with Hadoop, you can easily add Spark to your arsenal, as it's highly compatible (and is even listed as a module on Hadoop's project page).

Spark is user-friendly because it comes with the following APIs:

Java
Python
Scala
Spark SQL (which is very similar to SQL 92)

This comprehensive guide from Microsoft demonstrates how to train and create ML models with Spark.

scikit-learn

scikit-learn is an open-source ML library for Python that features algorithms such as k-nearest neighbors, random forests, and support vector machines. It also builds on numeric and scientific libraries for Python like NumPy and SciPy. By far one of the cleanest and easiest ML libraries, scikit-learn accommodates a wide selection of supervised and unsupervised algorithms.
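As a quick illustration of how little code a supervised model takes in scikit-learn, here is a sketch that trains a random forest on the library's bundled iris dataset; the classifier choice and parameters are arbitrary, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small bundled dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a supervised model and evaluate it on the held-out data.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Every scikit-learn estimator follows this same fit/score pattern, so swapping the random forest for k-nearest neighbors or a support vector machine changes only the constructor line.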
Designed with an engineering mindset, this tool is highly user-friendly, powerful, and flexible for running end-to-end ML research projects. To learn more about how scikit-learn is used in ML, you can go through the following resources:

An Introduction to Scikit-Learn
Learning Model Building in Scikit-learn

TensorFlow

The open-source programming library TensorFlow was developed to help ML algorithms construct and train frameworks and neural networks that mimic human perception, thinking, and learning. Some of Google's leading products, notably Google Translate, utilize TensorFlow. It works by using various optimization strategies to make the calculation of numerical expressions less demanding while boosting overall performance. However, TensorFlow can prove to be far more challenging than Keras or PyTorch and requires a great deal of boilerplate coding.

TensorFlow resources:

Deep Learning with TensorFlow for Beginners
Getting Started with TensorFlow: A Machine Learning Tutorial
TensorFlow Tutorials

Clearly, the machine learning engineer's toolbox is robust. The wealth of technology that's available is significant and potentially overwhelming. However, if you're a Python developer, it'll be fairly easy to pick up the main components of the ML stack. If you feel like you need some help doing so, Springboard offers a curated curriculum, a job guarantee, and unlimited calls with machine learning experts, including your own personal mentor, through the AI/Machine Learning Career Track.