How to build better cars instead of drilling for more oil
---------------------------------------------------------

#### When was the last time you bought crude oil?

“[Data is the new oil](https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data)” is by now a common phrase asserting the importance of data as a key resource. But Clive Humby actually coined it in the context of a crucial [observation](https://ana.blogs.com/maestros/2006/11/data_is_the_new.html) that is often overlooked: most people don’t go around shopping for crude oil. In his words,

> Data is just like crude. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity.

Raw data, like a clickstream, a sequence of financial transactions, or a dump of medical records, is very hard to use. Like oil, it needs to be refined and transformed into more readily usable forms, such as clean, organized data tables.

Now, you don’t buy gasoline for its decorative value. You buy it to power the engine in your car, so you can get to where you need to go, and _that_ is what generates value.

#### Machine Learning and AI are the new combustion engine

Data works in the same manner. What actually generates value is a _product._ In this post we focus on _data products:_ products that use data to generate value.

Data products need an engine that consumes refined data and powers value creation. This engine can be as straightforward as a simple way to display important aspects of the data, so humans can make more informed decisions. We call this “analytics”.
The engine can also be more sophisticated: predictions made by a machine learning model, or a neural network that identifies objects in an image.

> Machine learning and AI are the new combustion engine, and data products are the new cars.

Taken together, these components form the data product’s value chain:

![Value chain for a data product](https://hackernoon.com/hn-images/1*xHJVyniBIM4w4UY7aGPEOw.png)

Value chain for a data product. Icons by Ale Estrada, Ayub Irawan, BomSymbols, Hadi Davodpour for the Noun Project.

Sometimes, parts of this chain can be outsourced. For example, many companies successfully sell ‘analytics’ or ‘insights’. These are essentially data refineries: their product is refined data or sometimes even the engine. Other products then use that output to generate value in the market. The business model and strategy of a data refinery are very different from those of data products, which are my focus here.

#### Product/Data Fit: Data strategy for data products

This post focuses on data strategy for data products and how to find [product/data fit](https://hackernoon.com/the-challenge-of-product-data-fit-92543078551b). It’s all about figuring out how the pieces in this chain fit together to optimize value creation.

This process is anchored by understanding how the product uses data to create business value. This guides you as you go up and down the chain to answer questions like:

* _What are the most efficient engines to optimize value creation?_
* _How much and what type of refined data do the engines need?_
* _How do you generate (or acquire) and then refine the raw data?_

One way to think about these questions is by understanding the return on investment (ROI) on the engine. For simplicity, I’ll focus on the case where the engine is a machine learning model.

The _investment_ in the model includes the cost, in time and dollars, of acquiring and storing the data.
It also includes the time and cost of refining the data and training the model.

The _return_ on the model depends on two components:

* The _accuracy_ of the model
* The _business value_ generated from a correct prediction (in dollars, clicks, or another quantifiable metric), and the business cost of a wrong or inaccurate prediction

#### Data strategy is about building better cars

The key to data strategy is to focus on increasing the return, not the investment. This sounds obvious, but is often lost in the hype around data and AI.

Some people focus exclusively on the amount of data. They are the ones who always complain that “We need _more data!_” or brag about how “We are generating _so much data!_”

But these phrases are often markers of a poor data strategy. They emphasize the investment instead of the return. The real goal is to build better cars with more efficient engines, not to accumulate more crude oil.

> The point of data strategy is to build better cars with more efficient engines, not to accumulate more crude oil.

Another common distraction is to focus too much on the engine. You wouldn’t use a jet engine to power a scooter. Likewise, for most early-stage data products, sophisticated machine learning and AI are overkill. In the vast majority of cases, it’s better to invest in figuring out how your product generates value in the market than in tinkering with the inner workings of a neural net.

#### Match the engine to your data

How do you increase the returns from your model? One way is to improve model accuracy. But that will also increase the investment: you will need more data or more efficient methods. So the key here is to keep the ROI positive by matching the engine with the amount of data that you have.

One example is the evolution of a recommender system:

* Start by recommending the most popular items to all users.
This doesn’t require user-level data, and the recommendation is based on simple summary statistics, so the investment is very small.
* As you collect more granular data you can make suggestions like “users who bought X also bought Y”. This requires enough data for each user, but the methods are still very simple.
* A mature recommendation engine will take into account the full shopping history of a user in addition to other features of users and items, often using a method called [collaborative filtering](https://en.m.wikipedia.org/wiki/Collaborative_filtering).

As the amount of data increases, the engine graduates from simple summary statistics to full-blown machine learning. The model gets more and more accurate, but you never invest more than is appropriate for the amount of data that you have at each stage.

#### Data is subject to the law of diminishing returns…

Eventually, it gets hard to scale the ROI of a model on accuracy alone. The reason is that data is subject to diminishing returns.

Suppose you want to predict the results of an election in a state with 1,000,000 voters who are choosing between two candidates, Daisy and Minnie. You survey 200 random voters and 53% of them say they are voting for Daisy. [It turns out](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval) that you can be 80% sure that Daisy will in fact win. But if you want to be 90% sure, you’ll need more than double that, about 450 voters (assuming the same 53% share holds in the larger sample). To get to 95% you’ll need about 750, and for 99%, _another_ 750 on top of that.

Real election polls are obviously much more complex, and so is any realistic data problem you are likely to run into. But the principle remains the same. As you try to make your predictions more and more accurate, the amount of data you need to collect grows disproportionately: each additional point of confidence costs more data than the last.

#### … as are machine learning and AI

Can you solve this problem by using a more powerful engine, like deep learning? Not so much.
Sophisticated methods typically require much larger amounts of data, and they are also subject to diminishing returns.

![Sample images from the MNIST dataset](https://hackernoon.com/hn-images/1*Ft2rLuO82eItlvJn5HOi9A.png)

Sample images from the MNIST dataset. Source: [https://en.wikipedia.org/wiki/MNIST\_database](https://en.wikipedia.org/wiki/MNIST_database)

[MNIST](https://en.wikipedia.org/wiki/MNIST_database) is a dataset consisting of images of handwritten digits. It is widely used as a toy dataset in image recognition, where the goal is to correctly identify the digit in each image.

One of the simplest algorithms you could use for this purpose is [multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression). Despite its simplicity, it correctly identifies [about 92.5%](http://deeplearning.net/tutorial/logreg.html) of the digits. A simple neural net is a reasonable next step, and it can quickly get you to about [99.3%](http://deeplearning.net/tutorial/lenet.html) accuracy. That is impressive, but note that it’s only about 7 percentage points better than a much simpler model. Further improvements are even harder to achieve: a state-of-the-art deep learning model, using methods fresh out of research, can improve accuracy by roughly another [0.5 percentage points](https://arxiv.org/abs/1202.2745).

MNIST is a toy example. Any realistic problem is going to be much more difficult and you should expect lower accuracy from the models. Sometimes, improving performance by 0.1% makes a big difference, and it makes sense to use the really sophisticated stuff. Regardless, both data and methods are subject to very strong diminishing returns.

#### Ask more valuable questions

Because of the diminishing returns on data, eventually it gets difficult to increase the ROI of a machine learning model just by improving its accuracy. How else can you increase it?

Accuracy is just one part of the return on data. The other is the _business value of a prediction_.
One way of thinking about it is to imagine that your model is 100% accurate. What would be the impact on your business? This is entirely about the _question_ that the model is addressing, not the quality of the solution. So the way to increase the ROI is to ask more valuable questions.

Here is an example. Daisy is running for office, and she is sending volunteers to knock on doors and increase turnout. However, the number of volunteers is limited, so she wants to build a model to target only the voters who are likely to vote for her and not for her opponent Minnie. This is called “response modeling”.

There is a more valuable model that Daisy could build: predict which voters will vote for her if visited by a volunteer, but would stay home otherwise. Voters who are predicted to vote even if they are not visited won’t be targeted, so the volunteers only visit those voters for whom a visit makes a difference. This is called “[uplift modeling](https://en.wikipedia.org/wiki/Uplift_modelling)”.

Accurate uplift modeling requires much more data than traditional response modeling. So if Daisy doesn’t have enough data, she should start by building response models and improve them as data accumulates. But eventually she should shift to uplift models, even if they are _less accurate_ than the response models, because overall they are _more valuable_.

#### Achieving product/data fit

Let’s summarize how you can improve the ROI on your model:

* By increasing the accuracy of the model, while ensuring that the investment in the engine matches the amount of data
* By increasing the business value of a prediction, especially if it can make up for a less accurate model

This is how you find product/data fit: iterate to simultaneously increase the value of your data, your models, and the questions they are tackling.

Let’s see how it plays out in a more realistic situation.
Many healthcare startups experiment with [clinical decision support](https://www.ahrq.gov/professionals/prevention-chronic-care/decision/clinical/index.html) (CDS) systems, products that are intended to assist clinicians in making complex decisions in a data-driven manner.

Some CDS products focus on providing treatment suggestions, but they often encounter challenges with market adoption. One reason is that a single wrong suggestion can critically undermine the trust in the system. In terms of the ROI on the model, the cost of a wrong suggestion is exceedingly high. This means that the models making the suggestions must be extremely accurate, which in turn requires very high investment. It is probably better to defer the building of a suggestion engine until the company has secured access to enough data, as well as the trust of the users.

A successful strategy for building a CDS will focus first on the areas where accuracy is less critical. One way of doing that is by _showing the data_ in a way that makes intuitive sense and provides useful insights to the clinician. This is a very common theme in data product development, and I will expand on it in a future post.

#### Bottom Line

* Data strategy is about building better products, not accumulating more data or using more sophisticated methods.
* To do that, you must understand how your product uses data to generate value, and focus on increasing the ROI on your models.
* One way to do that is by improving the accuracy of the models, but you will quickly run into diminishing returns.
* Another way is by finding more valuable questions that your data can answer, which can result in better ROI even with less accurate models.
* This is how you find product/data fit: iterate to simultaneously increase the value of your data, your models, and the questions they are tackling.
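
#### Addendum: checking the polling numbers

For readers who want to see where the Daisy and Minnie sample sizes come from, here is a minimal sketch. It rests on assumptions that are mine rather than stated in the post: a normal approximation to the binomial, a one-sided question ("is Daisy's true share above 50%?"), and a sample share that stays at 53% as the sample grows. The function names `confidence_of_lead` and `voters_needed` are made up for illustration.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def confidence_of_lead(share: float, n: int) -> float:
    """Approximate confidence that the true share is above 50%,
    given a sample share `share` observed among `n` respondents.
    Assumes a normal approximation and a one-sided question."""
    std_err = math.sqrt(share * (1 - share) / n)
    return STANDARD_NORMAL.cdf((share - 0.5) / std_err)


def voters_needed(share: float, confidence: float) -> int:
    """Smallest sample size at which a sample share of `share`
    gives at least `confidence` that the true share exceeds 50%
    (same assumptions as above)."""
    z = STANDARD_NORMAL.inv_cdf(confidence)
    return math.ceil((z / (share - 0.5)) ** 2 * share * (1 - share))


if __name__ == "__main__":
    print(f"200 voters at 53%: {confidence_of_lead(0.53, 200):.0%} confident")
    for level in (0.90, 0.95, 0.99):
        print(f"{level:.0%} confident: {voters_needed(0.53, level)} voters")
```

Under these assumptions the output lands close to the figures quoted in the post: roughly 80% confidence at 200 voters, then about 450, 750 and 1,500 voters for 90%, 95% and 99%. The required sample scales with the square of the normal quantile, which is why the last few points of confidence are the most expensive.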