I’ve been writing code since I was an awkward middle schooler. I’ll never forget creating my AOL homepage with a seizure-inducing repeating background, under construction gif, and faux visitor counter. I begged my dad to drop me off at school early the next day so I could try and access the page from the library computer.
Now that I’m an awkward adult, there have been a couple more magical moments in front of a computer screen. My most recent? A year ago, I solved my first (trivial) machine learning problem. I was blown away at the ability to connect something so human to bits and bytes.
Since that machine learning awakening I’ve been diving in hard. I’m not an expert by any means, but I’d love to give you a smoother onramp than my own.
There are three things I hope to leave you with:
Love means different things to different people, and lots of smart people have defined it in many great ways. The ecosystem around machine learning can generate the same confusion. Thankfully, I think if you master just five topics, you’ll be in the 90th percentile of conversational ML:
I’ll start with the most concrete definition: machine learning.
Machine learning is the study of computer algorithms that improve automatically through experience.
Professor and Former Chair of the Machine Learning Department at Carnegie Mellon University, Tom M. Mitchell
The better the data you feed into a machine learning algorithm, the better the algorithm will perform. We’re not modifying machine learning algorithms to improve our results: we’re modifying the data.
Machine learning isn’t new: in 1952, Arthur Samuel wrote the first computer learning program. It played checkers. So, why do you hear so much about machine learning today?
It goes back to data. We’re able to store a lot of data very cheaply today. Our computers can process this data very efficiently. This is making our ML models better and more widespread.
A data scientist is an expert at extracting nuggets of knowledge from a lot of information. And, they can do this very quickly.
A data scientist will use machine learning, but it’s only one of the tools in their tool set.
Like love, smart people define Artificial Intelligence (AI) in different ways.
AI is akin to building a rocket ship. You need a huge engine and a lot of fuel.
Andrew Ng, Co-Founder Google Brain
Some folks tie AI tightly to machine learning. This is understandable in today’s climate: much of the innovation in AI is driven by increasingly powerful machine learning models. In fact, it’s not uncommon to hear folks use AI and ML interchangeably.
Then there’s a more broad definition:
Artificial intelligence is the science and engineering of making computers behave in ways that, until recently, we thought required human intelligence.
Andrew Moore, Dean of the School of Computer Science at Carnegie Mellon University
AI?
By this definition, wouldn’t a calculator be AI at the time it was introduced? Adding numbers was certainly something that we thought required human intelligence.
Today, a calculator would not be considered AI but a self-driving car is. In thirty years, it’s likely a self-driving car will be as commonplace as a pocket calculator.
Which definition is correct?
I don’t know! Just be aware that some folks will go broad and others will align AI more tightly with the ML-fueled AI boom of today.
Software 1.0 is code we write. Software 2.0 is code written by the optimization based on an evaluation criterion (such as “classify this training data correctly”).
Andrej Karpathy, Director of AI @ Tesla
My background is in Software 1.0, where the hard work is in maintaining a growing nest of algorithms. In Software 2.0, the work shifts from the algorithms (which we no longer write ourselves) to the data we feed in for training and evaluation.
While I agree on the difference between these two styles of software, I find the naming of Software 2.0 unfortunate. Software 2.0 is being applied to new problems — detecting cancer, driving cars, identifying sentiment — not replacing old work.
Are you baking bread or building ovens? — Photo Credit: lazy fri13th
Imagine hiring a chef to build you an oven or an electrical engineer to bake bread for you…that’s the kind of mistake I see…over and over.
Cassie Kozyrkov, Chief Decision Intelligence Engineer, Google
For years, companies have preferred to hire folks with PhDs in machine-learning-related fields to solve problems with machine learning. Today, many problems can be solved by open-source ML algorithms. The challenge — as always in ML — is in the data.
Having a post-grad degree in an ML-related field is still a great asset. However, if you’re more interested in applying ML than learning how models work, you probably don’t need to go back to school.
A sample of the dataset
Classifying handwritten digits is one of the most famous “hello world” problems in machine learning. With solid accuracy, you can solve this problem in just a few lines of code. It’s magical.
There are many Kaggle kernels that solve this problem. I’m going to skip the plumbing (importing libraries) and get right into the meat of the problem.
We’re given a collection of 70k handwritten digits and their associated labels. Each digit is actually an array of 784 integers. Each integer is a grayscale value from 0–255: the higher the number, the darker the pixel. This array can be arranged into a grid 28 pixels wide and 28 pixels tall:
Each instance of our dataset is an array of 784 values. The higher the value, the darker the pixel in the image.
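If you’d like to poke at the data yourself, here’s a minimal sketch of loading it, assuming scikit-learn’s built-in copy of MNIST (the variable names here are mine, purely for illustration):

```python
from sklearn.datasets import fetch_openml

# 70,000 digits: X holds one row of 784 grayscale values per digit,
# y holds the matching labels ("0" through "9").
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

print(X.shape)               # (70000, 784)
print(X[0].reshape(28, 28))  # the same 784 values arranged as a 28x28 grid
```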
The first step in every ML problem is splitting the entire dataset into a training and test set. We only train the model on the training set.
Why would we exclude data when an ML model gets better with more data?
If we trained our model on all 70k handwritten digits, we’d still need a way to evaluate its accuracy on data it hasn’t seen. Think about how much work (and time) it would take to digitize a fresh batch of handwritten digits! By not fitting our model to the test set, we get pretend real-world data to evaluate against immediately.
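Here’s a sketch of the split, assuming the X and y arrays loaded above and the conventional MNIST arrangement (first 60,000 digits for training, last 10,000 held out for testing):

```python
# Hold out the last 10,000 digits; the model never sees them during training.
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
```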
Now that we have our data properly split, it’s time to train a model! A good model to start with is a Random Forest Classifier. This is a fairly simple model that produces solid results with little tuning.
Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
Niklas Donges in The Random Forest Algorithm.
We can initialize and train our model in just two lines:
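With scikit-learn, those two lines look roughly like this (a sketch; note the slice down to the first 1,000 training instances, which will matter in a moment):

```python
from sklearn.ensemble import RandomForestClassifier

# Initialize the classifier and fit it on a slice of the training set.
model = RandomForestClassifier()
model.fit(X_train[:1000], y_train[:1000])
```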
…and it worked.
We can see how the model performs against the test set by measuring its accuracy (the percentage of time the model correctly classified a 3 as a 3, a 4 as 4, etc.):
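Something like this, assuming scikit-learn’s accuracy_score:

```python
from sklearn.metrics import accuracy_score

# Fraction of test digits the model labeled correctly.
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))
```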
For two lines of code, we’re about 80% accurate. If I wrote a Software 1.0-style algorithm to do this, I doubt I’d get to this accuracy and it’d take me a lot longer! But, we can do better!
The astute among you might have noticed I only trained the model on 1,000 instances of the training set. Remember how ML algorithms get better with more data? Let’s use all 60,000 training instances and get a new accuracy score:
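Same sketch as before, just without the 1,000-instance slice (fitting on the full training set will take noticeably longer):

```python
# Refit on the full 60,000-instance training set and re-measure accuracy.
model = RandomForestClassifier()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))
```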
Right now, we’re at peak hype in this article: two lines of code, 95% accuracy!
Why was this so effortless? What’s the usual answer for anything ML? The data! If I asked you to solve this problem without a dataset, you’d need a lot of time to put that dataset together. You might start with 5k images, notice the model doesn’t classify 3s well, add more images, hit a new issue, and so on. Training, fitting, and evaluating ML models is pretty easy today. Data is tedious and hard.
If I haven’t scared you off, there are a couple of resources I recommend to get you going:
The breadth of use cases for machine learning is so large it can be difficult to choose where to begin. I hope the above is enough to help you focus and get started!
Oh — and it’s all about the data.