The Hundred-Page Machine Learning Book [Review] by @DataGeneralist

November 17th 2020

Data analytics generalist. I publish notes, lessons, and tools for data analytics and investing.

I first ordered The Hundred-Page Machine Learning Book back in May and am only just now finishing it up. In COVID-time, that was about 10 years ago. As you might have inferred, this book is NOT a quick read. What it lacks in easy reading, it makes up for in efficiency. This book swallows up the heavyweight mathematics textbooks and spits out a slim product no thicker than my smartphone. From page one all the way to page 136, Andriy Burkov, the author, does not waste a single word in distilling the most practical concepts in machine learning. You read that right. It is MORE than 100 pages! Sounds like the book has some bias. Get it? Now get ready for my hundred-page book review. Just kidding.

This book caters to a wide range of data science enthusiasts. However, your experience reading it will vary drastically depending on your prior knowledge of mathematics and data science, as well as how quickly you want to zoom through it. Here is my breakdown of what to expect based on your background.

TL;DR: Scroll down to the Audience Summary Table.

**The Bold Novice**

The Bold Novice is someone who has read a handful of data-driven articles from mainstream sources like Forbes and has some interest in this hot new subject. They have heard of machine learning, but never really excelled at statistics. I would not recommend this book for this individual. At a minimum, you should be comfortable with calculus, statistics, probability, vectors, and matrices, as well as familiar with data science concepts. If you don’t meet these standards, then I would highly recommend choosing a less technical data science book to start with. I list some on my Resources page, such as Applied AI: A Handbook for Business Leaders.

**The Undergrad or Entry Level Data Professional**

Next up are the quant-heavy undergraduates or early career data professionals. You have completed several relevant college-level courses (e.g. Math, Data Science) and/or have some familiarity with data science concepts through your first job. This book will be challenging, but it is a worthy investment. The mathematical notation and introduction to many new concepts will seem daunting. However, I would advise you to take it slow and power through the book. This resource is a treasure trove of learnings and introductions to a wide range of machine learning concepts. You will, and should, read most pages twice before proceeding to the next subject. Lastly, do not hesitate to pause and google certain notation or concepts whenever needed. When in doubt, assume any unknown symbol is part of the Greek alphabet.

**The Grad Student or Mid-Career Data Professional**

This individual has completed graduate level statistics or data science courses and/or has more than a few years of job experience as a data professional. This is where I would classify myself. While reading this book, there were sections that I could breeze right through, like reading a children’s novel. As soon as my confidence level grew, Andriy quickly put me in my place. Each new concept builds off of your understanding of introductory machine learning and statistical concepts. If the foundation of your knowledge in this area has any weak points, they will be exposed. The advanced mathematical notation really challenges your ability to translate the formulas into plain English. The diagrams and examples of common applications are life savers.

**The Data Scientist**

This individual is not just any data professional, but rather has direct experience building and testing machine learning and statistical models to solve business problems. Most data scientists will spend the vast majority of their time cleaning data. When they are lucky enough to be building models, most of the time it is regression, decision trees, clustering, and random forests. Because most of their time is spent on a handful of models, I think this book would be a great resource to expand their knowledge. The Hundred-Page ML Book discusses a wide range of machine learning models, including an entire chapter devoted to neural networks and deep learning. It also includes an introduction to new research areas, such as zero-shot learning. Here you train a model to assign labels to objects, but the goal is to predict labels that never appeared in the training data. Woah.

Because this book is already a summary of the broader topic of machine learning, it was difficult to take notes. At times, you wanted to highlight everything. My suggestion is to capture specific areas that you want to dive into more detail at a later time. These chapter snapshots are from the point of view of a mid-career data professional with a solid understanding of most data science concepts. You will see that my snapshots are more expansive in the later sections since I already have a firm grasp of the fundamentals. As a bonus, I added some commentary for you.

Note: “Learning” is an abbreviation for machine learning.

**Chapter 1: Introduction**

**Key Notes**

- Two essential differences between learning algorithms are speed of model building and prediction processing time
- PAC Learning (Probably Approximately Correct) theory
- Close relationship between model error, size of the training set, the form of the mathematical equation that defines the model, and the time it takes to build the model

**Key Notes**

arg max f(a), taken over the elements a of a set A, returns the element of A that maximizes f(a)

**Commentary**

The breadth of mathematical topics covered is interesting. It ranges from number lines, which you learned in elementary school, all the way to college level topics, such as linear algebra. There is a lot of mathematical notation that you might not be familiar with so do not be afraid to Google.

I don’t recall learning the argMax function expressed this way. An argument is an input that you pass into a function to get an output. Therefore, argMax returns the input that produces the maximum output, rather than the maximum output itself. This has applications in math and computer science.
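To see the difference between max and arg max, here is a minimal Python sketch. The set `A` and the function `f` are toy examples of my own, not taken from the book.

```python
# Hypothetical example: a small set A and a function f (my own choices, not from the book).
A = [-3, -1, 0, 2, 4]


def f(a):
    """A downward-opening parabola that peaks at a = 2."""
    return -((a - 2) ** 2)


# max gives the largest OUTPUT of f over the set A.
max_value = max(f(a) for a in A)

# arg max gives the INPUT from A that produces that largest output.
arg_max = max(A, key=f)
```

Here `max_value` is an output of `f`, while `arg_max` is one of the elements of `A`.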

In another connection to programming, the assignment operator, “<-”, is used. It will look familiar to our R programmers.

The visualizations for the pmf and pdf are incredibly important. The visual presentation of this information makes it much easier to understand what is going on in any probability/stats class.

Capital pi notation (Π), mentioned in this book, has a common application in maximum likelihood estimation, where you estimate the parameters of a probability distribution by multiplying together the probabilities of the individual observations. Andriy makes this connection in Chapter 3.
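As a rough illustration of that product, here is a small Python sketch that estimates the probability of heads from coin flips. The flips, the candidate grid, and the function names are all my own assumptions for illustration, not the book's example.

```python
import math

# Hypothetical data: 10 coin flips (1 = heads), made up for illustration.
flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]


def likelihood(p, data):
    """Capital pi in action: multiply the probability of every observation."""
    return math.prod(p if x == 1 else 1 - p for x in data)


def log_likelihood(p, data):
    """Taking logs turns the product into a sum (capital sigma)."""
    return sum(math.log(p if x == 1 else 1 - p) for x in data)


# Try a grid of candidate parameters and keep the one with the highest likelihood.
candidates = [i / 100 for i in range(1, 100)]
p_hat = max(candidates, key=lambda p: likelihood(p, flips))
```

With 7 heads in 10 flips, the grid search lands on the sample proportion of heads. In practice people usually maximize the log-likelihood instead, since a long product of small probabilities can underflow.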

Image Source: A simple linear regression model from an R tutorial I published.

**Commentary**

Most of the fundamental algorithms (e.g. regression, decision trees, etc.) will look familiar to anyone who has already taken a machine learning course or consumes material on these subjects often.

I think the higher dimension topics are tough to comprehend in all aspects of mathematics and programming. In this chapter, you see this with high-dimensional kernels.

**Key Notes**

- Building blocks of each learning algorithm:
  - A loss function
  - Optimization criterion based on the loss function
  - Optimization routine leveraging training data to find a solution to the optimization criterion
- Some learning algorithms were developed intuitively and later explained with optimization criteria (e.g. Decision trees, KNN)
- Gradient descent is an iterative routine that finds a (local) minimum when the optimization criterion has no closed-form solution
- All algorithms implemented in scikit-learn require numerical features
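The three building blocks in the list above can be sketched in a few lines of plain Python. This is a minimal sketch under my own assumptions: the toy data, the learning rate, and the number of epochs are made up. Linear regression actually has a closed-form solution, so gradient descent is used here only to show the mechanics.

```python
# Hypothetical training data generated from y = 2x + 1 (my own toy example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def mse(w, b):
    """Optimization criterion: the average squared-error loss over the training data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(learning_rate=0.05, epochs=2000):
    """Optimization routine: repeatedly step against the gradient of the criterion."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = gradient_descent()
```

The learning rate and the number of epochs are themselves hyperparameters, which is why a bad choice can make the routine crawl or diverge.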

**Commentary**

What does closed-form mean? A closed-form expression can be evaluated in a finite number of standard operations, so the solution can be written down directly instead of being approached step by step.

There is an important distinction here between machine learning in Python and R. Andriy mentions that algorithms implemented in scikit-learn expect numerical features. I do not think this is the case for most of the popular ML libraries in R.
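One common way to turn a categorical feature into numerical ones is one-hot encoding. Here is a plain-Python sketch of the idea; the `one_hot_encode` helper and the color data are hypothetical names of mine, and this is not scikit-learn's own API.

```python
def one_hot_encode(values):
    """Map each categorical value to a list of 0/1 indicator features."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, encoded


# Hypothetical categorical feature (my own example).
colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot_encode(colors)
```

Each row of `encoded` has exactly one 1, in the column that matches the original category.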

**Key Notes**

- Model Performance Assessment: The two most frequently used metrics to assess the model are precision and recall. Typically an increase in one will decrease the other, so you have to decide which is more important to optimize
- Cost-sensitive accuracy: assign a cost to both types of mistakes (FN, FP), and weight the FN and FP counts by these costs when calculating accuracy
- Hyperparameter Tuning: Grid search trains multiple models on different combinations of the hyperparameters, with the candidate values spread out on a logarithmic scale (0.1, 1, 10, 100, 1000, etc.)
- More efficient techniques: random search and Bayesian hyperparameter optimization
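The assessment metrics in the list above are simple arithmetic on the confusion matrix counts. The sketch below uses made-up labels and a made-up cost of 5 for false negatives; the function names are my own.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = fp = fn = tn = 0
    for truth, prediction in zip(y_true, y_pred):
        if truth == 1 and prediction == 1:
            tp += 1
        elif truth == 0 and prediction == 1:
            fp += 1
        elif truth == 1 and prediction == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Of everything predicted positive, how much really was positive?"""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything that really was positive, how much did we catch?"""
    return tp / (tp + fn)


def cost_sensitive_accuracy(tp, fp, fn, tn, fp_cost=1.0, fn_cost=1.0):
    """Accuracy where each type of mistake is multiplied by its assumed cost."""
    return (tp + tn) / (tp + tn + fp_cost * fp + fn_cost * fn)


# Hypothetical labels and predictions (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
plain_accuracy = cost_sensitive_accuracy(tp, fp, fn, tn)
weighted_accuracy = cost_sensitive_accuracy(tp, fp, fn, tn, fn_cost=5.0)
```

With both costs set to 1 you get ordinary accuracy back. Raising the false negative cost drags the score down for a model that misses positives.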

**Commentary**

People often say that data science is both an art and a science. The entire hyperparameter tuning concept seems to enter the realm of art. All of these techniques for optimizing the hyperparameters seem like you are just guessing. Is there an instinct that gets built over time to improve your hyperparameter tuning?
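For what it is worth, grid search itself is quite mechanical. Here is a minimal sketch that tunes two hyperparameters of a toy gradient descent model on a logarithmic grid. The data, the grid values, and the helper names are my own assumptions, not the book's.

```python
import itertools

# Hypothetical toy data, roughly y = 2x (my own numbers).
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
validation = [(5.0, 10.1), (6.0, 11.9)]


def train_model(learning_rate, epochs):
    """Fit y = w * x by gradient descent; both arguments are hyperparameters."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
        w -= learning_rate * grad
    return w


def validation_error(w):
    """Mean squared error on data the model was not trained on."""
    return sum((w * x - y) ** 2 for x, y in validation) / len(validation)


# Candidate values are spread out on a logarithmic scale.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "epochs": [10, 100, 1000],
}

results = []
for learning_rate, epochs in itertools.product(grid["learning_rate"], grid["epochs"]):
    w = train_model(learning_rate, epochs)
    results.append((validation_error(w), learning_rate, epochs))

# Keep the combination with the lowest validation error.
best_error, best_learning_rate, best_epochs = min(results)
```

Every combination is trained and scored, which is why grid search gets expensive quickly as you add hyperparameters.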

**Key Notes**

- Softmax regression, a generalization of logistic regression to multiclass classification, is a standard unit in a neural network
- Modern deep learning includes hundreds of layers, but many business problems can be solved with 2-3 layers in between input and output
- Convolutional neural network (CNN)- special kind of feed-forward neural network (FFNN) that significantly reduces the number of parameters in a deep neural network with many units without losing too much in the quality of the model. Applications often in image and text processing.
- Each pixel of an image is a feature (100 x 100 = 10,000 features)
- Apply different filters, account for bias, moving window approach
- Recurrent Neural Networks (RNN)- used to label, classify, or generate sequences. They are often used in text processing or speech processing.
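To make the softmax bullet in the list above concrete, here is a short sketch in plain Python. The scores are made up, and the max-subtraction step is a common numerical stability trick that I added, not something the review covers.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the largest score avoids overflow and does not change the result.
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def logistic(z):
    """The standard logistic (sigmoid) function used in logistic regression."""
    return 1 / (1 + math.exp(-z))


# Hypothetical raw scores for three classes.
probabilities = softmax([2.0, 1.0, 0.1])

# With two classes and the second score fixed at 0, softmax reduces to the logistic function.
two_class = softmax([1.5, 0.0])[0]
```

The two-class case is what makes softmax a generalization of logistic regression.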

**Commentary**

A neural network is a nested function. Nested functions are common all throughout mathematics and programming. In calculus, you have a nested function when differentiating a function with the chain rule. When writing formulas in Excel, you can nest functions. In cell E1 below, you have an IF function nested inside another IF function to check multiple conditions. The first one checks whether C1 equals “Red” and the second one checks whether D1 is greater than 10.

Image Caption: IF function nested inside another IF function in Excel
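The same nesting idea can be written out for a tiny neural network, where the output layer is a function applied to the result of the hidden layer. This is only a sketch: the weights are picked by hand rather than trained, and the `layer` helper is a hypothetical name of mine.

```python
import math


def layer(weights, biases, activation):
    """Build a function that maps an input vector to an output vector."""

    def apply(x):
        return [
            activation(sum(w_i * x_i for w_i, x_i in zip(row, x)) + bias)
            for row, bias in zip(weights, biases)
        ]

    return apply


def identity(z):
    return z


# Hypothetical weights chosen by hand, not trained.
f1 = layer([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1], math.tanh)  # hidden layer with 2 units
f2 = layer([[1.0, -1.0]], [0.2], identity)  # output layer with 1 unit


def network(x):
    """The whole network is one function nested inside another."""
    return f2(f1(x))


output = network([1.0, 2.0])
```

Adding more layers just means nesting more functions, for example `f3(f2(f1(x)))`.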

Inputs for neural networks are feature vectors. Each vector organizes the values of the different features, and it is then passed through the different layer functions.

What are activation functions and why are they popular? My understanding is that they squash the output into a range that is easier to interpret and/or provide a distribution of numbers with properties that are easy to work with. A common example seems to be tanh(x).
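Three activation functions that come up often are the logistic function, tanh, and ReLU. A quick sketch (with input values I picked myself) shows how the first two squash any number into a fixed range, while ReLU simply zeroes out negatives.

```python
import math


def logistic(z):
    """Squashes any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))


def tanh(z):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Zero for negative inputs, unchanged for positive inputs."""
    return max(0.0, z)


# Hypothetical inputs to compare the three functions side by side.
inputs = [-5.0, -1.0, 0.0, 1.0, 5.0]
table = [(z, logistic(z), tanh(z), relu(z)) for z in inputs]
```

Each row of `table` holds an input followed by the three activations, which makes the different output ranges easy to eyeball.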

**Key Notes**

- Ensemble learning is an approach to boost the performance of simple learning algorithms. It trains a large number of low-accuracy (weak) models and combines their predictions to obtain a high-accuracy meta-model
- Ensemble Learning Types: boosting, bagging
- Random forest is one of the most widely used ensemble learning algorithms. It uses multiple samples of the original dataset, reducing variance of the final model
- Gradient boosting is an effective ensemble learning algorithm based on the idea of boosting
- The goal of semi-supervised learning is to leverage a large number of the unlabeled examples to improve the model performance without asking for additional labeled examples
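Why would combining weak models help at all? A bit of arithmetic gives some intuition. The sketch below computes how often a majority vote is right, under the idealized assumption (mine, not the book's) that every model is correct 60% of the time and that the models err independently. Real models trained on overlapping data are correlated, so actual gains are smaller.

```python
from math import comb


def majority_vote_accuracy(n_models, p_correct):
    """Probability that more than half of n independent models are right."""
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(n_models // 2 + 1, n_models + 1)
    )


# Hypothetical ensemble sizes (odd, to avoid tied votes), each model 60% accurate.
single = majority_vote_accuracy(1, 0.6)
small_ensemble = majority_vote_accuracy(3, 0.6)
large_ensemble = majority_vote_accuracy(101, 0.6)
```

Under these assumptions the vote becomes more reliable as the number of models grows, which is the intuition behind combining many weak learners.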

**Key Notes**

- Decision trees, random forest, gradient boosting are less sensitive to imbalanced datasets than SVM
- Combining algorithms can boost performance if the models are uncorrelated or have different features
- When training neural networks, it is suggested to use modern architecture if you don’t have enough clean, normalized training data.
- Transfer learning is when you pick an existing model trained on some dataset and adapt it to predict examples from another dataset, different from the one the model was built on.
- Transfer learning has useful applications in neural networks
- Example: Find a trained model for visuals online. Remove several last layers (quantity of layers is a hyperparameter). Add your own prediction layers. Train the model

**Commentary**

Boosting performance via uncorrelated models has some practical application in finance. In portfolio theory, uncorrelated assets can reduce variance and boost performance.

There was an extremely useful section in this chapter on algorithmic efficiency. Andriy stresses the importance of using libraries that are optimized for performance, as well as choosing your data structures wisely.

**Key Notes**

- The k-means algorithm favors clusters shaped like hyperspheres
- HDBSCAN does not favor any particular shape; however, it has slower performance. Because it can build clusters of varying densities, it is recommended by the author.
- Ensemble algorithms and neural networks are good with high dimensional examples, including millions of features
- Dimensionality reduction techniques are used less often nowadays than in the past
- Dimensionality reduction techniques are useful for building interpretable visualizations and models. They remove redundant or highly correlated features and reduce noise in the data.
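The k-means bullet in the list above is easier to appreciate once you see how little the algorithm does. Here is a bare-bones sketch with made-up 2D points and hand-picked starting centroids (both my own assumptions). Because every point is assigned purely by its distance to the nearest centroid, round blobs are the natural shape it finds.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Plain k-means loop with caller-supplied starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [squared_distance(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical points forming two loose blobs.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Note that the number of clusters is fixed by how many starting centroids you pass in, so choosing k is left entirely to you.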

**Commentary**

The author claims that none of the techniques for selecting k, the number of clusters in k-means, is optimal.

Apparently, if the choice for the number of clusters, k, is reasonable, then most points will belong to the same cluster in the training and test set. I thought this was a really cool and useful finding!

There was a fantastic link in the book that pointed towards a video that explains principal component analysis.

**Key Notes**

- Metric Learning: Two most frequently used metrics for similarity between feature vectors are Euclidean distance and cosine similarity
- You can create a metric for your dataset
- Learning to Rank: LambdaMART is a good ranking algorithm that takes a list-wise approach. It optimizes the model directly based on some metric, such as mean average precision
- Typically, supervised learning optimizes the cost function instead of the metric because the metric is not differentiable
- Learning to Recommend: Content-based vs. collaborative filtering approach
- Most real world applications use a combination of both
- Word Embeddings are feature vectors that represent words
- Algorithm to learn word embeddings: word2vec, including its skip-gram version
- Goal: To convert a one-hot encoding of a word into a word embedding
- One-hot encoding: for a 10,000-word dictionary, a 10,000-dimensional vector with all zeros except one dimension that contains a one for that word
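The last few bullets in the list above can be shrunk down to a toy example. In this sketch the vocabulary has 5 words instead of 10,000, and the 2-dimensional embedding values are invented by me. A real algorithm such as word2vec would learn them from text.

```python
# Hypothetical 5-word vocabulary; the real thing would have thousands of words.
vocabulary = ["cat", "dog", "car", "road", "tree"]


def one_hot(word):
    """All zeros except a single one in the position of the given word."""
    return [1 if entry == word else 0 for entry in vocabulary]


# Made-up 2-dimensional embeddings, one row per vocabulary word.
embedding_matrix = [
    [0.9, 0.1],  # cat
    [0.8, 0.2],  # dog
    [0.1, 0.9],  # car
    [0.2, 0.8],  # road
    [0.5, 0.5],  # tree
]


def embed(word):
    """Multiply the one-hot row vector by the embedding matrix."""
    vector = one_hot(word)
    dimensions = len(embedding_matrix[0])
    return [
        sum(vector[i] * embedding_matrix[i][d] for i in range(len(vocabulary)))
        for d in range(dimensions)
    ]


car_one_hot = one_hot("car")
car_embedding = embed("car")
```

Since the one-hot vector has a single 1, the multiplication just picks out the matching row of the matrix.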

**Commentary**

It was interesting to see Andriy comment that the choice of metrics for similarity between feature vectors is somewhat arbitrary. Often in mathematics, students will take formulas as gospel, but here you see the need to question them.
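A quick way to see why the choice matters: Euclidean distance and cosine similarity can disagree about which vector is "closest". The vectors below are my own made-up example.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (larger means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors.
query = [1.0, 1.0]
same_direction_far = [10.0, 10.0]
different_direction_near = [1.0, 0.0]

# Euclidean distance prefers the nearby vector.
euclid_far = euclidean_distance(query, same_direction_far)
euclid_near = euclidean_distance(query, different_direction_near)

# Cosine similarity prefers the vector pointing in the same direction.
cosine_far = cosine_similarity(query, same_direction_far)
cosine_near = cosine_similarity(query, different_direction_near)
```

Neither answer is wrong. They just encode different ideas of what similar means, which is why the metric is a choice and not a given.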

In this chapter, there was an incredible diagram that visualizes a sparse matrix which organizes information on users, movies, and ratings. This was important to help understand how information is stored prior to fitting a factorization machine model. If I had one request, it would be more diagrams throughout the book. The use of subscript and superscript notation, as well as letters to represent unknown variables, can be tough to translate and visualize in your head.

**Key Notes**

Topics Not Covered in this Book include the following:

- Topic Modeling
- Generalized Linear Models
- Probabilistic Graphical Models
- Markov Chain Monte Carlo
- Generative Adversarial Networks
- Genetic Algorithms
- Reinforcement Learning

**Commentary**

I’m curious why Andriy states that readers who understand the book can now be a “modern data analyst” or “machine learning engineer”, but fails to mention a “data scientist”. I think it is further evidence that the “data scientist” position will subdivide into various specialties. I wrote a similar prediction in a previous article.

While my chapter snapshots provide a brief overview of the book, I highly recommend purchasing The Hundred-Page Machine Learning Book yourself. Each reader will have their own “key notes” and “commentary” based on their specific experiences. A cool design element by Andriy that might be overlooked by most was his use of QR codes in each chapter. Each one points you towards a link with more details on that subject.

Although this book is thin, it is still a textbook. Treat it as a reference guide that touches the wide breadth of concepts across machine learning. If you don’t know the appropriate lingo for any aspect of machine learning, it is likely called out somewhere in here. However, if you need to dive deep into real life applications or programming examples for a specific machine learning model, then this is not the book for you. Nevertheless, I found Andriy’s concise nature a breath of fresh air. Nobody enjoys lugging around 20-pound textbooks that expand upon unimportant topics to no end. I look forward to reading his next book, Machine Learning Engineering.

*Image Sources: Greek Letters*

*Also published **here**.*
