ChatGPT reached one million users in just five days, compared to the two and a half months it took Instagram and the two to three years it took Netflix, Airbnb, and Twitter. Understandably, many teams are now figuring out how to incorporate AI/ML advances into their apps. Microsoft has announced plans to integrate ChatGPT across its products, and last week it launched Bing and Edge with ChatGPT built in. Salesforce has also recently released Einstein. Reportedly, half of the current YCombinator batch consists of startups working with ChatGPT. In short, AI is innovating at a remarkable pace. With all that buzz, are you considering building AI/ML into your product?
If you are beginning your journey to deploy a production-ready ML solution that can scale efficiently to thousands of users, here are a few topics to consider:
Beginners often underestimate the effort required to handle data when working with real-world datasets. On platforms like Kaggle, Coursera, and HackerRank, the datasets are usually clean and the features hardly need preprocessing. In contrast, with real-world data it can take days just to transfer and pull together all your raw data before you can even start. You may need to break the data into smaller chunks and stitch the results back together later. A dropped connection or a process that dies halfway through a large dataset can turn into a huge time sink.
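The chunk-and-stitch pattern can be sketched in plain Python. The schema and the `process_in_chunks` helper below are invented for illustration, but the shape, fixed-size chunks, independent transforms, and a final merge, carries over to real pipelines:

```python
import csv
import io

def process_in_chunks(rows, chunk_size, transform):
    """Accumulate rows into fixed-size chunks, transform each chunk
    independently, then stitch the per-chunk results back together."""
    results, chunk = [], []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            results.extend(transform(chunk))
            chunk = []
    if chunk:  # final partial chunk
        results.extend(transform(chunk))
    return results

# Toy CSV standing in for a large raw dump (hypothetical schema).
raw = "user_id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n"
reader = csv.DictReader(io.StringIO(raw))
cleaned = process_in_chunks(
    reader,
    chunk_size=2,
    transform=lambda c: [
        {"user_id": int(r["user_id"]), "amount": float(r["amount"])} for r in c
    ],
)
```

In a real pipeline each chunk's output would typically be written to disk as it completes, so an interrupted run can resume from the last finished chunk instead of starting over.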
Expect to spend more time sourcing, curating, and transforming your data than tinkering with your model. Ensuring your data is correctly labeled, optimizing its quality, and using the right features can have a greater impact on your results than improving your model. In fact, data sourcing and handling is so vital and challenging that it has become a whole industry. Large organizations hire hundreds of people in low-wage economies to label their data, while the companies focus on retaining customer relationships.
Real-world data can be highly sparse, especially if it's not data you gathered from users over time. Sourcing feature-rich data is essential for building an effective model and is a massive challenge: startups providing high-quality training data, such as Scale AI, are valued at billions. For an established consumer product with millions of users and data points, data can be a valuable moat in the market, and building a data pipeline from existing proprietary data is comparatively straightforward.
However, for fledgling startups, data sources and available volume may be limited, a situation commonly known in the industry as the cold start problem. Without a considerable amount of data to begin with, your first job is to allocate resources to discovering and building a data pipeline. Be prepared to hunt for trustworthy data sources, such as open repositories or APIs relevant to your problem.
Limited data need not be a curse, depending on the industry. Sometimes, even with a small amount of data, AI models can be surprisingly effective. Andrew Ng, founder of Landing AI and co-founder of Coursera and Google Brain, says that in markets without the scale of consumer internet software, "good data beats big data".
Building an ML model and working on the algorithms is only part of a much bigger process. To deploy your model at scale, you need the technical infrastructure to integrate ML into your application. Depending on the size of your dataset and your infrastructure, a model run can take a long time. To serve users in real time, you can't afford to make them wait while you run the model on the fly. Instead, you may need to precompute the results and provision the storage infrastructure to serve them through an API on demand.
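As a rough illustration of that precompute-then-serve split (the model, the in-memory store, and the `recommend` handler below are stand-ins, not a real serving stack), the expensive work moves out of the request path entirely:

```python
def expensive_model(user_id):
    """Stand-in for a slow ML inference call."""
    return [f"item-{(user_id * 7 + k) % 100}" for k in range(3)]

# Offline batch job: compute results for every known user ahead of time.
# In production this would be written to a key-value store, not a dict.
precomputed = {uid: expensive_model(uid) for uid in range(1000)}

def recommend(user_id, fallback=("popular-1", "popular-2")):
    """API handler: a cheap O(1) lookup at request time, with a
    generic fallback for users the batch job has not seen yet."""
    return precomputed.get(user_id, list(fallback))
```

The trade-off is staleness: results are only as fresh as the last batch run, which is why precomputation is usually paired with a scheduled refresh.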
Root Mean Square Deviation (RMSD), the Jaccard index, and other scoring approaches are useful for comparing the performance of different algorithms on standalone problems. However, optimizing for RMSD or accuracy alone is insufficient when you're dealing with the ensemble of algorithms a real-world product requires.
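For reference, both scores are only a few lines of Python. This is a minimal sketch; in practice you would typically reach for scikit-learn's implementations:

```python
import math

def rmsd(y_true, y_pred):
    """Root Mean Square Deviation between targets and predictions."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

def jaccard(a, b):
    """Jaccard index: size of the intersection over the union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

rmsd([1, 2, 3], [1, 2, 3])     # 0.0 for a perfect fit
jaccard([1, 2, 3], [2, 3, 4])  # 2 shared / 4 total = 0.5
```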
The success of your ML solution must be measured using product metrics that demonstrate real-world evidence. For instance, Facebook and Instagram feed algorithms optimize for user engagement—the amount of time a user spends scrolling through a video or image. YouTube recommendations optimize for the number of minutes watched before leaving the recommendations. It is crucial for you and your team to track the right metric to optimize for, rather than spending your energy on model improvements that may not quite move the needle for your business.
You may have built the most accurate ML model for a dataset, but if it doesn't solve a customer problem, it's futile; your model becomes a technology in search of a problem, which is never the right approach. You need a compelling reason why AI/ML is the right building block for solving a customer problem. Customers care less about the technology than about whether it solves their problem. Andrew Ng recommends against building an AI-first business: companies should be mission-led and customer-led, rather than technology-led.
In many cases, AI can only solve part of the problem. For instance, Stitchfix uses AI to curate personalized style suggestions, making it easier for a human curator to find the right fix for a customer. Even though ML first screens the style recommendations from millions down to a few hundred, a human curator makes the final call on what goes into the box for the customer to try. AI/ML simply cannot weigh emotional or elusive signals, such as context.
Given the pace of AI/ML innovation and the available opportunities, many find working on ML an appealing career path. However, if you hire people without experience handling real-world datasets, you may get burned. Experienced data engineers, even if they haven't worked on cutting-edge algorithms the way researchers have, often prove a better fit for bringing solutions to market quickly. Researchers tend to be great at data analysis but may lack the skills needed to handle large datasets and build production-ready solutions.
For small teams, frequent experimentation and shipping at a rapid pace are crucial for success. The more experiments you ship, the better your ability to analyze the results and hence improve your ML model. Having inexperienced members holds you back from shipping rapidly and also hampers team morale. To build a strong ML team, you need a balance of hackers who can quickly ship models and experienced data engineers who know how to analyze and play with massive real-world datasets.
ML tools and libraries such as TensorFlow, scikit-learn, and PyTorch have abstracted away the need to hand-implement the efficient algorithms in wide use today. What matters instead is a strong understanding of the fundamentals and of how different algorithms fit a particular dataset. Beyond familiarity with the tools, what sets a good data scientist apart is the ability to slice and dice data: the intuitive foresight to unearth patterns that would otherwise be lost in the sheer size of the data. A trained eye can spot these patterns in minutes, while an untrained one may be lost for weeks.
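As a toy illustration of that kind of slicing (the events and the `click_rate_by` helper are invented for the example), grouping along a single dimension can surface a weak segment that the aggregate number hides:

```python
from collections import defaultdict

events = [
    {"country": "US", "clicked": 1}, {"country": "US", "clicked": 0},
    {"country": "BR", "clicked": 0}, {"country": "BR", "clicked": 0},
    {"country": "US", "clicked": 1},
]

def click_rate_by(events, key):
    """Slice events along one dimension and compute a per-slice rate."""
    totals = defaultdict(lambda: [0, 0])  # key -> [clicks, rows]
    for e in events:
        totals[e[key]][0] += e["clicked"]
        totals[e[key]][1] += 1
    return {k: clicks / n for k, (clicks, n) in totals.items()}

rates = click_rate_by(events, "country")
# The overall rate (2/5 = 0.4) looks unremarkable; the "BR" slice (0.0)
# stands out and points at where to dig next.
```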
Each of the above topics merits a thorough exploration. If you’re curious to learn more, consider reading how Instagram generates its feed, and how Stitchfix builds its algorithms. Reading engineering blogs is an excellent way to gain insight into how ML is deployed at scale.
Finally, here are a couple of questions to consider as you’re getting started with implementing AI/ML in production: