The quest for a useful definition

Behold my pithiest attempt: “Data science is the discipline of making data useful.” Feel free to flee now or stick around for a tour of its three subfields.

The three subfields: statistics, machine learning, and data-mining / analytics.

The term no one really defined

If you poke around in the early history of the term data science, you see two themes coming together. Allow me to paraphrase for your amusement:

Big(ger) data means more tinkering with computers.
Statisticians can’t code their way out of a paper bag.

And thus, data science is born. The way I first heard the job defined is “A data scientist is a statistician who can code.” I’ll be full of opinions on that in a moment, but first, why don’t we examine data science itself?

Figure: Twitter definitions circa 2014.

I love how the 2003 launch of the Journal of Data Science goes right for the narrowest possible scope: “By ‘Data Science’ we mean almost everything that has something to do with data.” So… everything, then? It’s hard to think of something that has nothing to do with information. (I should stop thinking about this before my head explodes.)

Since then, we’ve seen a multitude of opinions, from Conway’s well-traveled Venn diagram (below) to Mason and Wiggins’ classic post.

Figure: Drew Conway’s definition of data science. My personal taste runs more towards the definition on Wikipedia.

Wikipedia has one that’s very close to what I teach my students:

Data science is a ‘concept to unify statistics, data analysis, machine learning and their related methods’ in order to ‘understand and analyze actual phenomena’ with data.

That’s a mouthful, so let me see if I can make it short and sweet:

“Data science is the discipline of making data useful.”

What you’re thinking around about now might be, “Nice try, Cassie. It’s cute, but it’s an egregiously lossy reduction. How does the word ‘useful’ capture all of that jargon stuff?” Well, okay, let’s argue it out with pictures.
Here’s a map for data science for you, perfectly faithful to the Wikipedia definition.

What are these things and how do you know where you are on the map?

If you’re about to try breaking them down by standard toolkits, slow down. The difference between a statistician and a machine learning engineer is not that one uses R and the other uses Python. The SQL vs R vs Python taxonomy is ill-advised for so many reasons, not least of which is that software evolves. (As of recently, you can even do ML in SQL.) Wouldn’t you prefer a breakdown that’ll last? In fact, just go ahead and unread this entire paragraph.

Perhaps worse is the favorite way novices split the space. Yup, you guessed it: by the algorithm (surprise! it’s how university courses are structured). Pretty please, don’t taxonomize by histograms vs t-tests vs neural networks. Frankly, if you’re clever and you have a point to make, you can use the same algorithm for any part of data science. It might look like Frankenstein’s monster, but I assure you it can be forced to do your bidding.

Enough with the dramatic buildup! Here’s the taxonomy I propose:

None-One-Many

What on earth is this? Why, decisions, of course! (Under incomplete information. When all the facts you need are visible to you, you can use descriptive analytics for making as many decisions as you please. Just look at the facts and you’re done.)

I’d promised you we were going to talk about making data useful. To me, the idea of usefulness is tightly coupled with influencing real-world actions. If I believe in Santa Claus, it doesn’t particularly matter unless it might influence my behavior in some way. Then, depending on the potential consequences of that behavior, it might start to matter an awful lot. It’s through our actions, our decisions, that we affect the world around us (and invite it to affect us right back).
So here’s the new decision-oriented picture for you, complete with the three main ways to make your data useful.

Data-mining / Analytics

If you don’t know what decisions you want to make yet, the best you can do is go out there in search of inspiration. That’s called data-mining or analytics or descriptive analytics or exploratory data analysis (EDA) or knowledge discovery (KD), depending on which crowd you hung out with during your impressionable years.

Golden rule of analytics: only make conclusions about what you can see.

Unless you know how you intend to frame your decision-making, start here. The great news is that this one is easy. Think of your dataset as a bunch of negatives you found in a darkroom. Data-mining is about working the equipment to expose all the images as quickly as possible so you can see whether there’s anything inspiring on them. As with photos, remember not to take what you see too seriously. You didn’t take the photos, so you don’t know much about what’s off-screen. The golden rule of data-mining is: stick to what is here. Only make conclusions about what you can see, never about what you can’t see (for that you need statistics and a lot more expertise).

Other than that, you can do no wrong. Speed wins, so start practicing.

Expertise in data-mining is judged by the speed with which you can examine your data. It helps not to snooze past the interesting nuggets.

The darkroom’s intimidating at first, but there’s not that much to it. Just learn to work the equipment. Here’s a tutorial in R and here’s one in Python to get you started. You can call yourself a data analyst as soon as you start having fun and you can call yourself an expert analyst when you’re able to expose photos (and all the other kinds of datasets) with lightning speed.

Statistical inference

Inspiration is cheap, but rigor is expensive.

If you want to leap beyond the data, you’re going to need specialist training.
As someone with undergrad and graduate majors in statistics, I may be just a tad biased here, but in my opinion statistical inference (statistics for short) is the most difficult and philosophy-laden of the three areas. Getting good at it takes the most time.

If you intend to make high-quality, risk-controlled, important decisions that rely on conclusions about the world beyond the data available to you, you’re going to have to bring statistical skills onto your team. A great example is that moment when your finger is hovering over the launch button for an AI system and it occurs to you that you need to check it works before releasing it (always a good idea, seriously). Step away from the button and call in the statistician.

Statistics is the science of changing your mind (under uncertainty).

If you want to learn more, I’ve written this 8-minute super-summary of statistics for your enjoyment.

Machine learning

Machine learning is essentially making thing-labeling recipes using examples instead of instructions. I’ve written a few posts about it, including whether it’s different from AI, how to get started with it, why businesses fail at it, and the first couple of articles in a series of plain-language takes on the jargon nitty gritties (start here). Oh, and if you want to share them with non-English-speaking friends, a bunch of them are translated here.

Data engineering

What about data engineering, the work that delivers data to the data science team in the first place? Since it’s a sophisticated field in its own right, I prefer to shield it from data science’s hegemonic aspirations. Besides, it’s much closer in species to software engineering than to statistics.

The difference between data engineering and data science is a difference of before and after.

Feel free to see the difference as data engineering versus data science: before versus after. Most of the technical work leading up to the birthing of the data (before) may comfortably be called “data engineering” and everything we do once some data have arrived (after) is “data science”.

Decision intelligence

DI is all about decisions, including decision-making at scale with data, which makes it an engineering discipline. It augments the applied aspects of data science with ideas from the social and managerial sciences.

Decision intelligence adds components from the social and managerial sciences.

In other words, it’s a superset of those bits of data science not concerned with researchy things like creating fundamental methodologies for general-purpose use.

Still hungry? Here’s a breakdown of the roles in a data science project to entertain you while I go clack on my keyboard.
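
P.S. If you'd like to see the three subfields side by side, here is a minimal Python sketch of my own (standard library only). Everything in it is an assumption made up for illustration: the numbers are invented, and the function names `describe`, `permutation_p_value`, and `learn_threshold` are hypothetical, not from any tutorial linked above. It treats analytics as summarizing what is here, statistics as asking whether a difference would survive beyond this data (using a simple one-sided permutation test, one of many possible choices), and machine learning as deriving a labeling recipe from examples rather than writing the rule by hand.

```python
import random
import statistics

# Hypothetical numbers, invented purely for illustration:
# minutes per visit under an old and a new page design.
old = [4.1, 3.8, 5.0, 4.4, 3.9, 4.7, 4.2, 4.0]
new = [4.9, 5.3, 4.6, 5.8, 5.1, 4.8, 5.5, 5.0]


def describe(xs):
    """Analytics: summarize what is here and claim nothing beyond it."""
    return {"n": len(xs), "mean": statistics.mean(xs), "min": min(xs), "max": max(xs)}


def permutation_p_value(a, b, n_shuffles=5000, seed=0):
    """Statistics: how often does shuffling the group labels produce a gap
    in means at least as large as the one observed (one-sided)?"""
    rng = random.Random(seed)
    observed = statistics.mean(b) - statistics.mean(a)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        gap = statistics.mean(pooled[len(a):]) - statistics.mean(pooled[:len(a)])
        if gap >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible zero.
    return (hits + 1) / (n_shuffles + 1)


def learn_threshold(examples):
    """Machine learning: derive a labeling recipe (a cutoff) from labeled
    examples instead of writing the cutoff by hand."""
    values = sorted(v for v, _ in examples)
    # Candidate cutoffs are the midpoints between neighboring values.
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]

    def accuracy(cutoff):
        correct = sum(("new" if v > cutoff else "old") == label for v, label in examples)
        return correct / len(examples)

    return max(candidates, key=accuracy)


if __name__ == "__main__":
    print(describe(old), describe(new))
    print("p-value:", permutation_p_value(old, new))
    labeled = [(v, "old") for v in old] + [(v, "new") for v in new]
    print("learned cutoff:", learn_threshold(labeled))
```

Notice that the same sixteen numbers feed all three functions. What changes is the kind of decision you are trying to support, not the data or the tool.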