January 6th 2019

Machine learning is a field within computer science. However, it is fundamentally different from traditional computational fields. In traditional fields, algorithms are explicit sets of instructions to be executed by computers. An algorithm’s instructions will not change over time unless a developer rewrites them. Machine learning algorithms, however, are designed to change over time based on their inputs, weights, and outputs.

Modern examples of machine learning include facial recognition, character recognition (which allows handwritten or printed text to be converted into machine-encoded text), recommendation engines, stock market algorithms, and self-driving cars.

Machine learning is capable of solving many different categories of problems, and different problems require different learning algorithms. Learning algorithms differ fundamentally based on whether the training data is ‘labeled’ with desired outputs, and how you want the algorithm to ‘learn.’ Two of the most common machine learning methods are **supervised** and **unsupervised** learning.

In **supervised learning**, a pool of input values, each labeled with its desired output, is passed to the algorithm. The goal is for the algorithm to compare its actual output with the desired output and adjust its model accordingly. Once the algorithm is accurate enough, you can use it to automatically classify thousands of unlabeled inputs. A common example of supervised learning would be a stock market algorithm that uses historical data and trends to forecast fluctuations.

In **unsupervised learning**, no inputs are labeled, and it is up to the algorithm to find commonalities among the different inputs and group them accordingly. Unsupervised learning methods could be particularly valuable to digital marketers or startup founders who want to discover their consumer archetype (the kind of person who is most likely to purchase their product). If you fed a large set of transaction data into an unsupervised machine learning algorithm, you could explore questions like: who is more likely to purchase a certain product (men or women)? What is the most common age of consumers? And, of the people who bought this product, what other products did they purchase?

One advantage of unsupervised learning is that most data in the natural world is ‘unlabeled.’ Therefore, one could argue that unsupervised learning methods are more widely usable, because not every input requires a corresponding “correct” output. For the same reason, there is more potential to make new discoveries using unsupervised learning methods: you can throw massive amounts of seemingly unrelated data at an unsupervised machine learning algorithm and see if any interesting relationships are discovered.

Identifying **relationships between variables** is a fundamental component of machine learning algorithms. Two important statistical concepts for understanding machine learning are correlation and regression.

**Correlation** measures the extent to which two (or more) variables fluctuate together (whatis.techtarget.com). A “positive correlation” represents a situation in which the variables move in parallel. For example, at the moment, there is a positive correlation between bitcoin and the rest of the crypto-currency market. As the price of bitcoin rises, other crypto-currencies seem to move in tandem. A “negative correlation” represents a situation in which one variable increases while the other decreases. For example, there is a negative correlation between the unemployment rate and the level of consumer spending in the economy. As consumers spend more, companies need to produce more, which forces them to hire more workers, which lowers unemployment.
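To make the idea concrete, here is a minimal sketch that computes the Pearson correlation coefficient, one common way to measure correlation. The function name `pearson` and all of the numbers are invented for illustration; they are not real market or economic data. A result near +1 indicates a strong positive correlation, and a result near -1 a strong negative one.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance term: how the two variables deviate from their means together.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up illustrative numbers, not real data.
bitcoin = [100, 120, 130, 150, 170]
altcoin = [10, 13, 13, 16, 18]
unemployment = [9.0, 8.0, 7.5, 6.0, 5.0]
spending = [50, 55, 58, 66, 70]

r_pos = pearson(bitcoin, altcoin)        # variables rise together
r_neg = pearson(unemployment, spending)  # one falls as the other rises
print(round(r_pos, 3), round(r_neg, 3))
```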

**Regression goes beyond correlation because it adds predictive capabilities.** Regression is used to “examine the relationship between one dependent and one independent variable” (www.statpac.com). After performing a regression analysis, one should be able to predict the dependent variable if the independent variable is known. For example, imagine a doctor trying to decide what dosage to give a patient. In this example, the dependent variable is the dosage (because it depends on the patient) and the patient’s weight is the independent variable. The doctor knows the patient’s weight and needs to determine the appropriate dosage. By performing a regression, the doctor could develop a formula that, when given a patient’s weight, estimates the correct dosage.

Let’s examine a few common machine learning algorithms:

**k-nearest neighbor (k-NN):**

k-NN is a basic machine learning algorithm used for regression and **classification**. The k represents a small positive integer. In this example, we will use k-NN for classification. This means the output of the algorithm will be class membership. The algorithm will “assign a new object to the class most common among its k nearest neighbors” (www.digitalocean.com). In this example we will also assume that we’ve been asked to perform our classification based on k = 1. This tells us that the object should be assigned to the same class as its nearest neighbor.

Imagine that our robot captures the following image:

A second later, our robot captures this image:

Our robot recognizes an unknown image and needs to classify it as either a star or diamond. Because we are performing a k = 1 classification, our robot will classify the new object as belonging to the same class as its nearest neighbor. In this case, its nearest neighbor is a star, so the object gets classified as a star.

If we had performed a k = 3 classification, our robot would have examined the new object’s three closest neighbors (two stars and one diamond). Again, in this case, our robot would have classified the new object as a star.

This might seem like an arbitrary and ineffective method of “learning.” But imagine scenarios where certain objects consistently appear on certain parts of the screen. In those cases, a simple k-NN algorithm might be the fastest way to train your machine to correctly identify new objects.
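The star-and-diamond example can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name `knn_classify` and the screen coordinates are assumptions made up for this example, chosen so that the unknown object’s nearest neighbor is a star and its three nearest neighbors are two stars and one diamond.

```python
from collections import Counter
from math import dist

def knn_classify(point, labeled_points, k):
    """Assign `point` the most common label among its k nearest neighbors."""
    # Sort the labeled points by Euclidean distance to the new point, keep k.
    nearest = sorted(labeled_points, key=lambda item: dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical screen coordinates for shapes the robot has already seen.
seen = [
    ((1.0, 1.0), "star"),
    ((2.0, 1.5), "star"),
    ((1.5, 2.5), "star"),
    ((4.5, 2.0), "diamond"),
    ((6.0, 5.0), "diamond"),
    ((7.0, 6.0), "diamond"),
]
unknown = (3.0, 2.0)

label_k1 = knn_classify(unknown, seen, k=1)
label_k3 = knn_classify(unknown, seen, k=3)
print(label_k1, label_k3)
```

Note that k-NN does no real “training”: it simply stores the labeled examples and compares new objects against them at prediction time.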

**Decision Tree Learning:**

Decision trees are used to create predictive models. The goal of decision tree learning is to generate a target value based on certain inputs. Here’s a simple decision tree. Imagine a man is using it to decide whether or not he should play golf:

The data’s attributes (barometric pressure, overcast, rain, etc.) are tested at the tree’s branch points, while the outcomes you want based on those conditions are represented by leaves. If it is raining, and it is raining heavily, you shouldn’t go play golf. When building a decision tree, you must answer the following questions: which attributes (branches) should you include? How should you split those attributes? And when should the tree end?
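As a rough sketch of how such a tree is used once it exists, the snippet below stores a hand-built golf tree as nested dictionaries and walks it from the root to a leaf. The attribute names (`outlook`, `rain_intensity`) and the structure are assumptions invented for illustration; a decision tree *learning* algorithm would choose the attributes and splits from data rather than having them written by hand.

```python
# A hand-built tree: inner nodes test an attribute, leaves hold the decision.
# The attributes and values here are invented for illustration.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": "play",
        "overcast": "play",
        "rain": {
            "attribute": "rain_intensity",
            "branches": {"light": "play", "heavy": "don't play"},
        },
    },
}

def decide(node, sample):
    """Walk from the root to a leaf, following the branch matching the sample."""
    while isinstance(node, dict):
        value = sample[node["attribute"]]
        node = node["branches"][value]
    return node

heavy_rain = decide(tree, {"outlook": "rain", "rain_intensity": "heavy"})
sunny_day = decide(tree, {"outlook": "sunny"})
print(heavy_rain, "/", sunny_day)
```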

**Deep Learning:**

Deep learning refers more to a general class or structure of machine learning algorithms than it does to any particular algorithm. Deep learning algorithms are loosely inspired by the human brain and are implemented as “neural networks.” Neural networks are layered, and the output of one layer gets fed into the next layer of the network. Neural networks are also “weighted,” meaning that some inputs and outputs have greater influence than others. Deep learning algorithms can be either supervised or unsupervised. Deep learning has become the approach with the most potential in the artificial intelligence space (www.digitalocean.com). Some deep learning algorithms have outperformed humans in certain cognitive tasks.
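To show what “layered” and “weighted” mean in practice, here is a minimal sketch of a forward pass through a tiny two-layer network. The weights are arbitrary numbers chosen for illustration (a real network would learn them from data), and the sigmoid activation is just one common choice, assumed here for simplicity.

```python
from math import exp

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Arbitrary example weights; a real network would learn these during training.
hidden_weights = [[0.5, -0.2], [0.8, 0.1], [-0.3, 0.9]]  # 3 neurons, 2 inputs each
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[1.0, -1.0, 0.5]]                      # 1 neuron, 3 inputs
output_biases = [0.0]

inputs = [1.0, 2.0]
hidden = layer(inputs, hidden_weights, hidden_biases)  # output of layer 1...
output = layer(hidden, output_weights, output_biases)  # ...is the input to layer 2
print(hidden, output)
```

Larger weights give an input more influence on a neuron’s output, which is the sense in which the network is “weighted.”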

**Linear Regression:**

Linear regression algorithms work to model target values based on the linear relationship between an independent and a dependent variable. We develop our linear regression by computing the line y = ax + b for which the difference between the predicted and actual values is as small as possible. In the above formula, y is the dependent variable (output), a is the slope, x is the independent variable (input), and b is the y-intercept.
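As a sketch of how a and b can be computed, the snippet below uses ordinary least squares, a standard way of minimizing the squared difference between predicted and actual values. It borrows the weight-and-dosage scenario from earlier in the article; the function name `fit_line` and every number are made up for illustration and carry no medical meaning.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data: patient weight (independent) and dosage (dependent).
weights_kg = [50, 60, 70, 80, 90]
doses_mg = [102, 118, 141, 159, 180]

a, b = fit_line(weights_kg, doses_mg)
predicted = a * 75 + b  # estimated dosage for a 75 kg patient
print(a, b, predicted)
```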

The leftmost image illustrates a positive linear regression (the independent and dependent variables increase together), the middle image illustrates a negative linear regression (the independent and dependent variables move in opposite directions), and the rightmost image illustrates an exponential relationship between the independent and dependent variables.

**Logistic Regression:**

Logistic regression is a supervised classification algorithm. To review, this means the values passed into the algorithm are labeled with their desired outputs. The goal is to compare the actual output with the desired output and adjust the algorithm accordingly. Logistic regression is useful when the dependent variable is binary (e.g., win or lose): based on the independent variables, it estimates the likelihood of a certain event occurring (e.g., that you will win the game). A well-fitted logistic regression assigns high probabilities to events that do occur and low probabilities to those that do not.
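Here is a minimal sketch of that compare-and-adjust loop for a single independent variable, fitted with plain gradient descent. The data (hours practiced versus whether the game was won), the function name `train_logistic`, and the learning rate and epoch count are all assumptions chosen for illustration.

```python
from math import exp

def sigmoid(z):
    """Map a real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Error = predicted probability minus the desired (labeled) output.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        # Adjust the model in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data: hours practiced vs. game outcome (1 = win, 0 = lose).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
won = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = train_logistic(hours, won)
p_low = sigmoid(w * 1.0 + b)   # estimated chance of winning after 1 hour
p_high = sigmoid(w * 4.0 + b)  # estimated chance of winning after 4 hours
print(p_low, p_high)
```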

**Support Vector Machines:**

A Support Vector Machine (SVM) is another supervised classification algorithm that, given training data, draws a boundary between the two categories. This sounds strange at first, so here’s an example scenario. Imagine that we want our algorithm to differentiate circles and squares. We will do this by building an algorithm that draws a line separating circles and squares.

The algorithm works to optimize the boundary line by ensuring that the closest data points from each of the opposing classes are as far away from the boundary, and therefore from each other, as possible.

These closest data points are the “extremes” in our sample and define the “support vectors” (illustrated by the dotted lines above). Our boundary line must remain in between the support vectors at all times. Otherwise we risk a misclassification. A support vector machine could be used to determine whether an image contains a cat or a dog. In this example, our support vector for the cat category would be defined by a cat that looks like a dog.
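The snippet below does not train an SVM; it is only a sketch of the quantity an SVM tries to maximize. It measures the margin (the distance from a boundary line to the closest data point) for two hand-picked candidate lines. The points, the labels, and both candidate lines are invented for illustration.

```python
from math import hypot

def margin(w, b, points):
    """Smallest distance from any labeled point to the line w.x + b = 0.

    The result is negative if any point falls on the wrong side of the line.
    """
    norm = hypot(*w)
    return min(
        label * (w[0] * x + w[1] * y + b) / norm
        for (x, y), label in points
    )

# Hypothetical 2-D data: +1 = circle, -1 = square.
points = [
    ((1.0, 1.0), -1), ((2.0, 1.0), -1), ((1.0, 2.0), -1),
    ((4.0, 4.0), +1), ((5.0, 4.0), +1), ((4.0, 5.0), +1),
]

# Two candidate boundaries that both separate the classes correctly.
margin_a = margin((1.0, 1.0), -5.5, points)  # line x + y = 5.5, midway between classes
margin_b = margin((1.0, 0.0), -2.5, points)  # line x = 2.5, hugging the squares
print(margin_a, margin_b)
```

Both lines classify every point correctly, but the first leaves more room on either side, so an SVM would prefer it.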

**Random Forests:**

Random forests are a popular supervised ‘ensemble’ learning algorithm. Random forests can be used for both classification and regression. The word ensemble means the algorithm takes many ‘weak learners’ and has them work together to form one strong predictor. In a random forest, the weak learners are decision trees, and the forest represents the decision trees merged together to form a more accurate model (strong predictor). So how do you create a random forest? Many shallow decision trees are created from random samples of the data. Individual decision trees may hold biases because their input data is not guaranteed to accurately represent the entire sample space. Therefore, by aggregating the predictions of multiple decision trees, random forests look to improve prediction accuracy by ‘washing out’ biases.

Imagine that you’re buying a new car. You’re overwhelmed by the number of options on the market, so you decide to consult your friends. Your friends will probably ask you what kind of features you’re looking for in a car. We could model this feedback as a decision tree. Each feature (attribute) your friends ask you to consider represents a branch, and each branch is split by the possible responses to that feature. For example, if your friends asked you whether or not you want to drive on the beach, that branch would be split into ‘four-wheel drive’ or ‘no four-wheel drive.’ After consulting many friends and considering many different features, you can generate your strong predictor (identify the car most suitable for you) by including the features most frequently recommended to you in the car you buy.
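The “many weak learners, one vote” idea can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: each tree is a one-split “stump,” the data has a single made-up feature, and all function names are invented here. With only one feature this amounts to bagging (training each tree on a random resample of the data); a full random forest also randomizes which features each tree is allowed to split on.

```python
import random
from collections import Counter

def train_stump(samples):
    """Pick the single threshold on x that misclassifies the fewest samples."""
    best = None
    for threshold in sorted({x for x, _ in samples}):
        for left_label, right_label in ((0, 1), (1, 0)):
            errors = sum(
                (left_label if x <= threshold else right_label) != y
                for x, y in samples
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label, right_label)
    return best[1:]  # (threshold, left_label, right_label)

def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label

def train_forest(samples, n_trees, rng):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    return [
        train_stump([rng.choice(samples) for _ in samples])
        for _ in range(n_trees)
    ]

def predict_forest(forest, x):
    """Majority vote across all stumps."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical 1-D data: (feature value, class), with some noise in the middle.
samples = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1), (7, 1), (8, 1)]
forest = train_forest(samples, n_trees=51, rng=random.Random(0))

low = predict_forest(forest, 1.5)
high = predict_forest(forest, 7.5)
print(low, high)
```

Any single stump may be skewed by the particular resample it saw, but the majority vote tends to smooth those quirks out, which mirrors asking many friends instead of one.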

**K-Means Clustering:**

K-Means Clustering is an unsupervised machine learning clustering algorithm. ‘K’ is input by the user and refers to the desired number of clusters in the data set. Clustering is defined as “grouping a set of data examples so that examples in one group (or one cluster) are more similar (according to some criteria) than those in other groups” (The SAS Data Science Blog). The algorithm considers a sample of data points and separates the sample into K clusters as well as it can. The burden is completely on the data scientist to select the appropriate number of clusters. Compared with many other unsupervised algorithms, K-Means is generally fast. K-Means could be used to identify different customer segments in a marketplace and to classify new customers as belonging to one of these segments. By gathering further data on how different customer segments behave, businesses can make better-informed predictions regarding future earnings, potential for growth, customer demographics, average customer lifetime, etc.
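A common way to implement K-Means is to alternate two steps: assign every point to its nearest cluster center, then move each center to the mean of the points assigned to it. The sketch below follows that approach with K = 2. The customer data (age, monthly spend), the function name `k_means`, and the choice of starting centers are all assumptions made for illustration; real implementations usually pick starting centers more carefully.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    # `clusters` holds the assignments from the final iteration.
    return centroids, clusters

# Hypothetical customers as (age, monthly spend).
customers = [(20, 30), (22, 35), (25, 28), (50, 90), (55, 100), (52, 95)]
start = [customers[0], customers[3]]  # K = 2 starting centers, chosen by hand

centroids, clusters = k_means(customers, start)
print(centroids)
print(clusters)
```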

**So what machine learning algorithm should I use?**

This is the big question. First, you can narrow it down by determining **whether you need a supervised or unsupervised learning algorithm.** Remember, if you have desired outputs (labels) for each of your input values, then you’ll want to use a supervised learning algorithm. If you want the machine to learn how to classify data for you, then you’ll want to use an unsupervised algorithm. The most important factors to consider when choosing an algorithm (according to The SAS Data Science Blog) are:

- **The size, quality, and nature of data.**
- **The available computational time.**
- **The urgency of the task.**
- **What you want to do with the data.**
- **Accuracy, training time, ease of use.**

Here is a machine learning cheat sheet that offers some direction regarding which algorithms you should try first given your circumstances.

This chart is to be used only as a “rule of thumb.” Sometimes no pathway will perfectly apply to your circumstances. This is why some data scientists say “the only way to guarantee the best algorithm is to try all of them.”