Data analytics generalist. I publish notes, lessons, and tools for data analytics and investing.
I first ordered The Hundred-Page Machine Learning Book back in May and am only just now finishing it up. In COVID-time, that was about 10 years ago. As you might have inferred, this book is NOT a quick read. What it lacks in easy reading, it makes up for in efficiency. This book swallows up the heavyweight mathematics textbooks and spits out a slim product no thicker than my smartphone. From page one all the way to page 136, Andriy Burkov, the author, does not waste a single word in distilling the most practical concepts in machine learning. You read that right. It is MORE than 100 pages! Sounds like the book has some bias. Get it? Now get ready for my hundred-page book review. Just kidding.
This book caters to a wide range of data science enthusiasts. However, your experience reading it will differ drastically depending on your prior knowledge of mathematics and data science, as well as how quickly you want to zoom through it. Here is my breakdown of what to expect based on your background.
TL;DR: Scroll down to the Audience Summary Table.
The Bold Novice
The Bold Novice is someone who has read a handful of data-driven articles from mainstream sources like Forbes and has some interest in this hot new subject. They have heard of machine learning, but never really excelled at statistics. I would not recommend this book to this individual. At a minimum, you should be comfortable with calculus, statistics, probability, vectors, and matrices, as well as familiar with data science concepts. If you don’t meet these standards, then I would highly recommend choosing a less technical data science book to start with. I list some on my Resources page, such as Applied AI: A Handbook for Business Leaders.
The Undergrad or Entry Level Data Professional
Next up are the quant-heavy undergraduates or early career data professionals. You have completed several relevant college-level courses (e.g., math, data science) and/or have some familiarity with data science concepts through your first job. This book will be challenging, but it is a worthy investment. The mathematical notation and the introduction to many new concepts will seem daunting. However, I would advise you to take it slow and power through the book. This resource is a treasure trove of lessons and introductions to a wide range of machine learning concepts. You will, and should, read most pages twice before proceeding to the next subject. Lastly, do not hesitate to pause and google certain notation or concepts whenever needed. When in doubt, assume any unknown symbol is part of the Greek alphabet.
The Grad Student or Mid-Career Data Professional
This individual has completed graduate level statistics or data science courses and/or has more than a few years of job experience as a data professional. This is where I would classify myself. While reading this book, there were sections that I could breeze right through, like reading a children’s novel. As soon as my confidence level grew, Andriy quickly put me in my place. Each new concept builds off of your understanding of introductory machine learning and statistical concepts. If the foundation of your knowledge in this area has any weak points, they will be exposed. The advanced mathematical notation really challenges your ability to translate the formulas into plain English. The diagrams and examples of common applications are life savers.
The Data Scientist
This individual is not just any data professional, but rather has direct experience building and testing machine learning and statistical models to solve business problems. Most data scientists will spend the vast majority of their time cleaning data. When they are lucky enough to be building models, most of the time it is regression, decision trees, clustering, and random forests (Source). Because most of their time is spent on a handful of models, I think this book would be a great resource to expand their knowledge. The Hundred Page ML Book discusses a wide range of machine learning models, including an entire chapter devoted to neural networks and deep learning. It also includes an introduction to new research areas, such as zero-shot learning. This is where you want to train a model to assign labels to objects; however, the goal is to predict labels that did not have any training data. Woah.
Because this book is already a summary of the broader topic of machine learning, it was difficult to take notes. At times, I wanted to highlight everything. My suggestion is to capture the specific areas that you want to dive into in more detail later. These chapter snapshots are from the point of view of a mid-career data professional with a solid understanding of most data science concepts. You will see that my snapshots are more expansive in the later sections since I already have a firm grasp of the fundamentals. As a bonus, I added some commentary for you.
Note: “Learning” is an abbreviation for machine learning.
Chapter 1: Introduction
Two essential differences between learning algorithms are the speed of model building and the prediction processing time
PAC (Probably Approximately Correct) learning theory
Close relationship between model error, size of the training set, the form of the mathematical equation that defines the model, and the time it takes to build the model
arg max f(a), taken over all a in A, returns the element a of the set A that maximizes f(a)
The breadth of mathematical topics covered is interesting. It ranges from number lines, which you learned in elementary school, all the way to college level topics, such as linear algebra. There is a lot of mathematical notation that you might not be familiar with so do not be afraid to Google.
I don’t recall learning the arg max operator expressed this way. An argument is an input into a function that produces an output. Therefore, arg max returns the input that produces the maximum output, not the maximum output itself. This has applications in math and computer science.
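A minimal Python sketch of arg max, using a made-up function f and set A that are not from the book. It shows that arg max returns the input producing the largest output, which is not the same thing as the largest output.

```python
def f(a):
    # Hypothetical function: a downward parabola that peaks at a = 3
    return -(a - 3) ** 2 + 10

A = [0, 1, 2, 3, 4, 5]

# arg max: the element of A for which f is largest
arg_max = max(A, key=f)

# max: the largest value that f takes over A
max_value = max(f(a) for a in A)

print(arg_max, max_value)
```

Python's built-in max with a key function behaves like arg max over a finite set, which is why no extra library is needed here.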
In another connection to programming, the assignment operator, “<-”, is used. It will look familiar to R programmers.
The visualizations for the pmf and pdf are incredibly important. The visual presentation of this information makes it much easier to understand what is going on in any probability/stats class.
Pi capital notation mentioned in this book has a common application when using the maximum likelihood estimation to estimate parameters for probability distributions. Andriy makes this connection in Chapter 3.
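To see how the capital pi (product) notation shows up in maximum likelihood estimation, here is a rough sketch with hypothetical coin-flip data. The likelihood is the product of the probability of each observation. The grid search over candidate values is only an illustration of the idea, not how the book or any library does it.

```python
import math

# Hypothetical data: 1 = heads, 0 = tails (7 heads in 10 flips)
flips = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

def likelihood(p, data):
    # Capital pi notation in code: multiply one factor per observation,
    # p for a head and (1 - p) for a tail
    return math.prod(p if x == 1 else 1 - p for x in data)

# Try candidate values 0.01, 0.02, ..., 0.99 and keep the most likely one
candidates = [i / 100 for i in range(1, 100)]
p_hat = max(candidates, key=lambda p: likelihood(p, flips))

print(p_hat, sum(flips) / len(flips))
```

For this simple model, the estimate matches the share of heads in the data, which is the known maximum likelihood answer for a coin flip.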
Image Source: A simple linear regression model from an R tutorial I published.
Most of the fundamental algorithms (i.e. regression, decision trees, etc.) will look familiar to anyone who has already taken a machine learning course or consumes material on these subjects often.
I think the higher dimension topics are tough to comprehend in all aspects of mathematics and programming. In this chapter, you see this with high dimension kernels.
What does a closed-form function mean? An expression that can be evaluated in a finite number of standard operations.
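One familiar closed-form example is the least-squares fit of a straight line, where the slope and intercept come straight from a formula with no iterative search. The sketch below uses made-up data and a hypothetical helper name, fit_line. It is not taken from the book.

```python
def fit_line(xs, ys):
    # Closed-form least-squares solution for y = slope * x + intercept
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical data that lies exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Contrast this with models that have no such formula and must be fit by an iterative procedure such as gradient descent.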
Important distinction between machine learning in Python and R. Andriy mentions that algorithms implemented in scikit-learn expect numerical features. I do not think this is the case for most of the popular ML libraries in R.
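When a library expects numerical features, a categorical column is commonly converted with one-hot encoding first. Below is a bare-bones sketch in plain Python with a made-up color column and a hypothetical helper name, one_hot. In practice you would likely reach for a library encoder, such as scikit-learn's OneHotEncoder, rather than writing this by hand.

```python
def one_hot(values):
    # Sorting the distinct categories gives a stable column order
    categories = sorted(set(values))
    # One row per value, with a 1 in the column of its category
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical categorical feature
colors = ["red", "green", "red", "blue"]

categories, encoded = one_hot(colors)
print(categories)
for row in encoded:
    print(row)
```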
People often say that data science is both an art and a science. The entire hyperparameter tuning concept seems to enter the realm of art. All of these techniques for optimizing the hyperparameters seem like you are just guessing. Is there an instinct that gets built over time to improve your hyperparameter tuning?
A neural network is a nested function. Nested functions are common all throughout mathematics and programming. In calculus, you have a nested function when differentiating a function with the chain rule. When writing formulas in Excel, you can nest functions. In cell E1 below, you have an IF function nested inside another IF function to check multiple conditions. The first one checks whether C1 equals “Red” and the second one checks whether D1 is greater than 10.
Image Caption: IF function nested inside another IF function in Excel
Inputs for neural networks are vectors of parameters. Vectors organize the values of different parameters and then they are inserted into different functions.
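The nested-function view can be sketched in a few lines of Python. The weights, layer sizes, and helper name below are hypothetical and chosen only for illustration. Each layer is a function that takes a vector and returns a vector, and the network is one layer applied to the output of another.

```python
import math

def layer(weights, biases, activation):
    # Build a function that maps an input vector to an output vector
    def apply(x):
        return [
            activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)
        ]
    return apply

# Hypothetical weights: a hidden layer with two units, then one linear output
f1 = layer([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], math.tanh)
f2 = layer([[1.0, 1.0]], [0.0], lambda z: z)

def network(x):
    # The nesting: f2 is applied to the output of f1
    return f2(f1(x))

y = network([1.0, 1.0])
print(y)
```

Training a network means adjusting the numbers inside each layer. The nested structure itself stays fixed.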
What are activation functions and why are they popular? My understanding is that they squash the output into a range that is easier to interpret and/or provide a distribution of numbers with properties that are easy to work with. They also introduce the non-linearity that lets a network model more than a straight line. A common example seems to be tanh(x), the hyperbolic tangent.
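A small sketch of three widely used activation functions: the hyperbolic tangent (tanh), the logistic sigmoid, and ReLU. The input values are arbitrary. The point is to see the output ranges: tanh stays between -1 and 1, the sigmoid stays between 0 and 1, and ReLU zeroes out negative inputs.

```python
import math

def sigmoid(z):
    # Logistic function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Rectified linear unit: passes positives through, clips negatives to 0
    return max(0.0, z)

# Arbitrary inputs to compare the three functions side by side
for z in [-5.0, 0.0, 5.0]:
    print(z, round(math.tanh(z), 4), round(sigmoid(z), 4), relu(z))
```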
Boosting performance via uncorrelated models has some practical application in finance. In portfolio theory, uncorrelated assets can reduce variance and boost performance.
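The variance argument can be checked with a toy calculation. The numbers are hypothetical: four assets (or models) with equal variance and equal weights, compared under zero correlation and perfect correlation. This is a simplified sketch of the portfolio-theory intuition, not a claim about how the book derives it.

```python
def portfolio_variance(weights, cov):
    # Variance of a weighted sum: sum over i, j of w_i * cov[i][j] * w_j
    n = len(weights)
    return sum(
        weights[i] * cov[i][j] * weights[j]
        for i in range(n)
        for j in range(n)
    )

n = 4
var = 0.04  # hypothetical variance of each individual asset or model
weights = [1 / n] * n

# Zero correlation: covariance is nonzero only on the diagonal
uncorrelated = [[var if i == j else 0.0 for j in range(n)] for i in range(n)]
# Perfect correlation: every covariance equals the variance
perfectly_correlated = [[var for _ in range(n)] for _ in range(n)]

v_uncorr = portfolio_variance(weights, uncorrelated)
v_corr = portfolio_variance(weights, perfectly_correlated)
print(v_uncorr, v_corr)
```

With zero correlation, the combined variance drops to the individual variance divided by n. With perfect correlation, combining gives no reduction at all.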
The section on algorithmic efficiency in this chapter was extremely useful. Andriy stresses the importance of using libraries that are optimized for performance, as well as choosing your data structures wisely.
The author claims that none of the techniques for selecting k, the number of clusters in k-means, is optimal.
Apparently, if the choice for the number of clusters, k, is reasonable, then most points will belong to the same cluster in the training and test set. I thought this was a really cool and useful finding!
There was a fantastic link in the book that pointed towards this video, which explains principal component analysis.
It was interesting to see Andriy comment that the choice of metrics for similarity between feature vectors is somewhat arbitrary. Often in mathematics, students will take formulas as gospel, but here you see the need to question them.
In this chapter, there was an incredible diagram that visualizes a sparse matrix which organizes information on users, movies, and ratings. This was important to help understand how information is stored prior to fitting a factorization machine model. If I had one request, it would be more diagrams throughout the book. The use of subscript and superscript notation, as well as letters to represent unknown variables, can be tough to translate and visualize in your head.
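For readers who think better in code than in diagrams, here is a rough sketch of a sparse user-by-movie ratings table. The users, movies, and ratings are made up, and the layout is a simplified illustration rather than the exact matrix shown in the book. Only the ratings that exist are stored, and every other cell is treated as missing.

```python
# Hypothetical ratings: (user, movie) -> rating; unrated pairs are not stored
ratings = {
    ("ann", "Alien"): 5,
    ("ann", "Up"): 3,
    ("bob", "Up"): 4,
}

users = sorted({u for u, _ in ratings})
movies = sorted({m for _, m in ratings})

# Expanding to a dense grid shows which cells are empty (None)
dense = [[ratings.get((u, m)) for m in movies] for u in users]

stored = len(ratings)
total_cells = len(users) * len(movies)
print(dense, stored, total_cells)
```

With real data, most users rate only a tiny share of the movies, so storing just the known entries saves a great deal of memory.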
Topics Not Covered in this Book include the following:
I’m curious why Andriy states that readers who understand the book can now be a “modern data analyst” or “machine learning engineer”, but fails to mention a “data scientist”. I think it is further evidence that the “data scientist” position will subdivide into various specialties. I wrote a similar prediction in a previous article.
While my chapter snapshots provide a brief overview of the book, I highly recommend purchasing The Hundred-Page Machine Learning Book yourself. Each reader will have their own “key notes” and “commentary” based on their specific experiences. A cool design element by Andriy that might be overlooked by most was his use of QR codes in each chapter. Each one points you towards a link with more details on that subject.
Although this book is thin, it is still a textbook. Treat it as a reference guide that touches on the wide breadth of concepts across machine learning. If you don’t know the appropriate lingo for any aspect of machine learning, it is likely called out somewhere in here. However, if you need to dive deep into real-life applications or programming examples for a specific machine learning model, then this is not the book for you. Nevertheless, I found Andriy’s concise nature a breath of fresh air. Nobody enjoys lugging around 20-pound textbooks that expand upon unimportant topics to no end. I look forward to reading his next book, Machine Learning Engineering.
Image Sources: Greek Letters
Also published here.