The Ultimate Toolbox Of ML Startups

Setting up a good tool stack for your Machine Learning team is important if you want to work efficiently and focus on delivering results. If you work at a startup, you know that setting up an environment that can grow with your team, the needs of your users and the rapidly evolving ML landscape is especially important.

We wondered: “What are the best tools, libraries and frameworks that ML startups use to tackle this challenge?” To answer that question we asked 41 Machine Learning startups from all over the world.

The result? A ton of great advice that we grouped into five topics: methodology, software development setup, Machine Learning frameworks, MLOps and the unexpected.

Read on to figure out what will work for your machine learning team.

Good methodology is the key

Tools are only as strong as the methodology that employs them. If you run around training models on some randomly acquired data and deploy whatever model you can get your hands on, sooner or later there will be trouble. Kai Mildenberger from psyML says that:

“To us, the careful versioning of all the training and testing data is probably the most essential tool/methodology. We expect that to remain one of the most key elements in our toolbox, even as all of the techniques and mathematical models iterate forever. A second aspect might be to be extremely hypothesis driven. We use that as the single most important methodology to develop models.”

I think having a strong understanding of what you want to use your tools for (and whether you actually need them) is the very first step. That said, it is important to know what is out there and what people in similar situations use successfully. Let’s dive right into that!

Software development tooling is the backbone of ML teams

The development environment is the foundation of every team’s workflow, so it was very interesting to learn what tools companies around the world consider the best in this area.

ML teams use various tools as an IDE.
Many teams like SimpleReport and Hypergiant use Jupyter Notebooks and Jupyter Lab with its ecosystem of NB Extensions.

“Jupyter Notebook is very useful for quick experiments and visualization, especially when exchanging ideas between multiple team members. Because we use Tensorflow, Google Colab is a natural extension to share our code more easily,” says Wenxi Chen from Juji.

Various flavours of Jupyter have been mentioned as well. Deepnote (a hosted Jupyter Notebook solution) is “loved for their ML stuff” by the team of Intersect Labs, while Google Colab “is a natural extension to share our code more easily” for the Juji team.

Others choose more standard software development IDEs. Among those, Pycharm, touted by Or Izchak from Hotelmize as “the best Python IDE”, and Visual Studio Code, used by Scanta for its “ease of connectivity with Azure and many ML-based extensions provided”, were mentioned the most.

For teams that use the R language, like SimpleReport, RStudio was a clear winner when it comes to the IDE of choice. As Kenton White from Advanced Symbolics mentions:

“We mostly use R + RStudio for analysis and model building. The workhorse for our AI modeling is VARX for time series forecasts.”

When it comes to code versioning, Github is a clear favourite. As Daniel Hanchen from Umbra AI mentions:

“Github (now free for all teams!!) with its super robust version control system and easy repository sharing functionality is super useful for most ML teams.”

Among the most popular languages we have Python, R and, interestingly, Clojure (mentioned by Wenxi Chen from Juji).

As for the environment/infrastructure setup, notable mentions from ML startups are:

“AWS as the platform for deployment” (SimpleReport)

“Anaconda serves as our goto tool for running ML experiments due to its *live code* feature wherein it can be used to combine software code, computational output, explanatory text, and multimedia resources in a single document.” (Scanta)

“Redis dominates as an in-memory data structure store due to its support for different kinds of abstract data structures, such as strings, lists, maps, sets, sorted sets, HyperLogLogs, bitmaps, streams, and spatial indexes.” (Scanta)

“Snowflake and Amazon S3 for data storage.” (Hypergiant)

“Spark-pyspark – very simple api for distributing job to work on big data.” (Hotelmize)

Sooo many Machine Learning Frameworks

An integrated development environment is crucial, but one needs a good ML framework on top of that to transform the vision into a project. The range of tools pointed out by the startups is quite diverse here.

For playing with tabular data, Pandas was mentioned the most. An additional benefit of using Pandas, mentioned by Nemo D’Qrill, the CEO of Sigma Polaris, is:

“I’d say that Pandas is probably one of the most valuable tools, in particular when working in collaboration with external developers on various projects. Having all data files in the form of data frames, across teams and individual developers, makes for a much smoother collaboration and unnecessary hassle.”

An interesting library mentioned by a software developer from Hotelmize was dovpanda, a Python extension library for Pandas that gives you insights on your Pandas code and data while you work with Pandas.

When it comes to visualization, matplotlib is used the most, by the likes of Trustium, Hotelmize, Hypergiant and others.

Plotly was also a common choice. As developers from Wordnerds explain, they use it “for great visualisations to make data understandable and look good”. Dash, a tool for building interactive dashboards on top of Plotly charts, was recommended by Theodoros Giannakopoulos from Behavioral Signals for ML teams that need to present their analytical results in a nice, user-friendly manner.

For more standard machine learning problems most teams, like Wordnerds, Sensitrust or Behavioral Signals, use Scikit-Learn. The ML team from iSchoolConnect explains why it is such a great tool:

“It is one of the most popular toolkits used by machine learning researchers, engineers, and developers. The ease with which you can get what you want is amazing! From feature engineering to interpretability, scikit-learn provides you with every functionality.”

Truth be told, Pandas and Sklearn are really the workhorses of ML teams all over the world. As Michael Phillips, Data Scientist from Numerai, says:

“Modern Python libraries like Pandas and Scikit-learn have 99% of the tools that an ML team needs to excel. Though simple, these tools have extraordinary power in the hands of an experienced data scientist”

In my opinion, while this may be true for the general ML team population, in ML startups a lot of work goes into state-of-the-art methods, which usually means deep learning models.

When it comes to general deep learning frameworks we had many different opinions.

Many teams like Wordnerds and Behavioral Signals choose PyTorch. The team of ML experts from iSchoolConnect tells us why so many ML practitioners and researchers choose PyTorch:

“If you want to go deep into the waters, PyTorch is the right tool for you! Initially, it will take time to get accustomed to it but once you get comfortable with it there is nothing like it! The library is even optimized for quickly training and evaluating your ML-models.”

But it is still Tensorflow and Keras that are leading in popularity. Most teams, like Strayos and Repetere, choose them as their ML development framework.
Cedar Milazzo from Trustium said:

“Tensorflow, of course. Especially with 2.0! Eager execution was what TF really needed and now it’s here. I should note that when I say ‘tensorflow’ I mean ‘tensorflow + keras’ since keras is now built into TF.”

It’s also important to mention that you don’t have to choose one framework and exclude others. For example, Melodia’s Founder, Omid Aryan, said that:

“The tools that have been most beneficial to us are TensorFlow, PyTorch, and Python’s old scikit-learn tools.”

There are some popular frameworks for more specialized applications.

In Natural Language Processing we’ve heard:

“Huggingface: it’s the most advanced and highest performance NLP library ever created. It’s the first of its kind in that researchers are directly contributing to a highly scalable NLP library. It separates itself from other similar tools by having production level tools available a few months after a newer model is published,” says Ben Lamm, the CEO of Hypergiant.

“Spacy is a very cool natural language toolkit. NLTK is by far the most popular and I certainly use it, but spacy does lots of things NLTK can’t do so well, such as stemming and dependency parsing,” mentions Cedar Milazzo, the CEO of Trustium.

“Gensim is good for word vectors and document vectors too, and I believe it isn’t so popular,” adds Cedar Milazzo.

In Computer Vision:

“OpenCV is indispensable for computer vision work” for Hypergiant. Their CEO says: “It’s a classic CV ensemble of methods from the 1960s until 2014 that are useful pre and post processing and can work well in scenarios where a neural network would be overkill.”

It’s also worth noting that not every team implements deep learning models themselves. As Iuliia Gribanova and Lance Seidman from Munchron say, there are now API services where you can outsource some (or all) of the work:

“Google ML kit is currently one of the best easy-to-entry tools that lets mobile developers easily embed ML API services like face recognition, image labeling, and other items that Google offers into an Android or iOS App. But additionally, you can also bring in your own TF (TensorFlow) lite models to run experiments and then bring them into production using Google’s ML Kit.”

I think it’s important to mention that you can’t always choose the latest and greatest libraries; sometimes the tool stack gets handed to you when you join the team. As Naureen Mahmood from Meshcapade shared:

“In the past, some important autodiff libraries that have made it possible for us to run multiple joint optimizations, and in doing so helped us build some of the core tech we still use today, are Chumpy & OpenDR. Now there are fancier and faster ones out there, like Pytorch and TensorFlow.”

When it comes to model deployment, Patricia Thaine from Private AI mentions “tflite, flask, tfjs and coreml” as their frameworks of choice. She also says that visualizing models is very important to them and that they use Netron for that.

But there are tools that go beyond frameworks and can help ML teams deliver real value quickly. This is where MLOps comes in.

MLOps starts to be more important for machine learning startups

You may be wondering what MLOps is or why you should care.

The term alludes to DevOps and describes tools used for the operationalization of machine learning activities.

Jean-Christophe Petkovich, CTO at Acerta, provided us with an extremely thorough explanation of how their ML team approaches MLOps. It was so good that I decided to share it (almost) in full:

“I think most of the interesting tools that are going to see broader adoption in 2020 are centered around MLOps. There was a big push to build those tools last year, and this year we’re going to find out who the winners will be.
For me, MLflow seems to be in the lead for tracking experiments, artifacts, and outcomes. A lot of what we’ve built internally for this purpose are extensions to the functionality of MLflow to incorporate more data tracking similar to how DVC tracks data.

The other big names in MLOps are Kubeflow, Airflow and TFX with Apache Beam, all tools designed for capturing data science workflows and pipelines end-to-end.

There are several ingredients for a complete MLOps system:

You need to be able to build model artifacts that contain all the information needed to preprocess your data and generate a result.

Once you can build model artifacts, you have to be able to track the code that builds them, and the data they were trained and tested on.

You need to keep track of how all three of these things, the models, their code, and their data, are related.

Once you can track all these things, you can also mark them ready for staging, and production, and run them through a CI/CD process.

Finally, to actually deploy them at the end of that process, you need some way to spin up a service based on that model artifact.

When it comes to tracking, MLflow is our pick. It’s tried-and-true at Acerta, as several of our employees already used it as part of their personal workflows, and now it’s the de facto tracking tool for our data scientists.

For tracking data pipelines or workflows themselves, we are currently developing against Kubeflow since we’re already on Kubernetes making deployment a breeze, and our internal model pipelining infrastructure meshes well with the Kubeflow component concept.

On top of all of this MLOps development, there’s a shift toward building feature stores (basically specialized data lakes for storing preprocessed data in various forms), but I haven’t seen any serious contenders that really stand out yet.
These are all tools that need to be in place. I know a lot of places are doing their own home-baked solutions to this problem, but I think this year we’re going to see a lot more standardization around machine learning applications.”

Emily Kruger from Kaskada, which incidentally is a startup building a feature store solution, adds:

“The most useful tools from our perspective are feature stores, automated deployment pipelines, and experimentation platforms. All these tools address challenges with MLOps, which is an important emerging space for data teams, especially those running ML models in production and at scale.”

Ok, so in light of this, what are other teams using to solve those problems?

Some teams prefer end-to-end platforms, others create everything in-house. Many teams are somewhere in between, with a mix of some specific tools and home-grown solutions.

In terms of larger platforms, two names that were mentioned often were:

Amazon SageMaker, which according to the ML team from VCV “has a variety of tools for distributed collaboration” and which SimpleReport chooses as their platform for deployment.

Azure, which as the Scanta team tells us “serves as a way to build, train, and deploy our Machine Learning applications as well as it helps in adding intelligence in our applications via their Language, Vision, and Speech recognition support. Azure has been our choice of IaaS due to rapid deployments and low-cost Virtual Machines.”

This is also where experiment tracking tools come in, and we see ML startups use various options:

Strayos uses Comet ML “for model collaboration and results sharing”.

Hotelmize and others are going with tensorboard, which “is the best tool to visualize your model behavior, specially for neural network models.”

“MLflow seems to be in the lead for tracking experiments, artifacts, and outcomes,” as Jean-Christophe Petkovich, CTO at Acerta, mentioned before.

Other teams like Repetere try to keep it simple and say that “Our tooling is very simple, we use tensorflow and s3 to version model artifacts for analysis”.

Typically, experiment tracking tools keep track of metrics and hyperparameters, but as James Kaplan from MeetKai points out:

“The most useful types of ML tools for us are anything that helps with dealing with model regressions caused by everything except the model architecture. Most of these are tools we have built ourselves, but I assume there are many existing options out there. We like to look at confusion matrices that can be visually diff’d under scenarios such as:

new data added to the training set (and the provenance of said data)

quantization configurations

pruning/distillation

We have found that being able to track performance across new data additions is far more important than being able to just track performance across hyper parameters of the model itself. This is especially so when datasets grow/change far faster than model configurations”

Speaking of pruning/distillation, Malte Pietsch, Co-Founder of deepset, explains that:

“We see an increasing need for tools that help us profile & optimize models in terms of speed and hardware utilization. With the growing size of NLP models, it becomes increasingly important to make training and inference more efficient. While we are still looking for the ideal tooling here, we found pytest-benchmark, NVIDIA’s Nsight Systems and kernprof quite helpful.”

Another interesting tool for benchmarking training/inference, MLPerf, is suggested by Anton Lokhmotov from Dividiti.

Experimenting with models is undoubtedly very important, but putting models in front of end-users is where the magic happens (for most of us).
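
Before turning to deployment, here is one way the confusion-matrix comparison that James Kaplan describes could look in code. This is only a minimal sketch, assuming the same held-out evaluation set is scored by a model before and after a change such as new training data or quantization. The labels, predictions and helper names (confusion_counts, diff_confusion) are hypothetical and are not MeetKai’s actual tooling.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def diff_confusion(before, after):
    """Return {(true, pred): change} for every cell whose count changed."""
    cells = set(before) | set(after)
    return {
        cell: after[cell] - before[cell]
        for cell in cells
        if after[cell] != before[cell]
    }


# Hypothetical evaluation labels and predictions from two model versions.
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
pred_before = ["cat", "cat", "dog", "cat", "bird", "bird"]
pred_after = ["cat", "dog", "dog", "dog", "bird", "cat"]

before = confusion_counts(y_true, pred_before)
after = confusion_counts(y_true, pred_after)
delta = diff_confusion(before, after)

# Print only the cells that moved, so regressions stand out.
for (true_label, pred_label), change in sorted(delta.items()):
    kind = "correct" if true_label == pred_label else "error"
    print(f"true={true_label:<5} pred={pred_label:<5} {change:+d} ({kind})")
```

Storing such a diff next to each run makes it easier to see which classes got worse after a data or compression change, even when the overall accuracy barely moves.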
On that front Rosa Lin from Tolstoy mentioned using streamlit.io, which is a “great tool for building ML model web apps easily.”

A valuable word of warning when it comes to using ML-focused solutions comes from Gianvito Pio, Co-Founder of Sensitrust:

“There are also tools like KNIME and Orange that allow you to design an entire pipeline in a drag-and-drop fashion, as well as AutoML tools (see AutoWEKA, auto-sklearn and JADBio) that will automatically select the most appropriate model for a specific task. However, in my opinion, a strong expertise in the Machine Learning and AI areas is still necessary. Even the ‘best, automated’ tool can be misused, without a good background in the field.”

Unexpected

Ok, when I started working on this, some answers like PyTorch, Pandas or Jupyter Lab were what I expected. But one answer we received was really out-of-the-box.

It put all the other things in perspective and made me think that perhaps we should take a step back and take a look at the larger picture.

Christopher Penn from Trust Insights suggested that ML teams should use a rather interesting “tool”:

“Wetware – the hardware and software combination that sits between your ears – is the most important, most useful, most powerful machine learning tool you have. Far, FAR too many people are hoping AI is a magic wand that solves everything with little to no human input. The reverse is true; AI requires more management and scrutiny than ever, because we lack so much visibility into complex models. Interpretability and explainability are the greatest challenges we face right now, in the wake of massive scandals about bias and discrimination. And AI vendors make this worse by focusing on post hoc explanations of models instead of building the expensive but worthwhile interpretations and checkpoints into models.
So, wetware – the human in the loop – is the most useful tool in 2020 and for the foreseeable future.”

Our perspective

Since we are building tools for ML teams and some of our customers are AI startups, I think it makes sense to give you our perspective. So we see:

A lot of teams use the Jupyter ecosystem for exploration and Pycharm/VSCode for development;

For deep learning people are using everything: Tensorflow, Keras and Pytorch. Notably, we see more and more people using high-level PyTorch training libraries like Lightning, Ignite, Catalyst, fastai and Skorch;

For visual exploration people are using matplotlib, plotly, altair and hiplot (hyperparameter visualizations);

For running hyperparameter sweeps and general run orchestration some teams like YNAP choose AWS SageMaker;

For experiment tracking we see open-source packages like TensorBoard, MLflow and Sacred (Neptune integrates with all of them);

… and since those are our customers, naturally they use neptune-notebooks for tracking explorations in jupyter notebooks and neptune for experiment tracking and organization of their machine learning projects.

This article was originally written by Jakub Czakon and posted on the Neptune blog where you can find more in-depth articles for machine learning practitioners.