Learning From Machine Learning: Why Green Tests Are Not Good News

by Oleksandr Kaleniuk, April 24th, 2023

Too Long; Didn't Read

Making all your tests green is essentially overfitting. Instead of patching your code to make it look like it works, you should measure how often it fails. Then you can employ techniques such as ensemble learning, but for humans, to improve its success rate.

I work in additive manufacturing, so my mind revolves around geometric algorithms that support 3D printing in every way possible.


So, let’s say we want an algorithm that predicts whether a model can be printed with a particular printer or not.


`is_printable(model, printer_parameters) -> bool`
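

In Python, that interface might look something like the sketch below. The `PrinterParameters` fields are made up purely for illustration – real slicer settings would differ – and the body is left unimplemented because building it is the whole point.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class PrinterParameters:
    # Hypothetical fields, purely for illustration.
    nozzle_diameter_mm: float
    max_overhang_deg: float

def is_printable(model: Any, printer_parameters: PrinterParameters) -> bool:
    """Predict whether the model can be printed with the given printer parameters."""
    raise NotImplementedError  # this is the algorithm we want to build
```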


If we choose a machine learning approach, we first gather data – a set of models, each supplied with printer parameters, that have either printed successfully or failed to print for one reason or another. Then we split the data: we need data to train our model, but we also need data to validate how well the model is trained. Then we train the model, and train it again, and keep training until the validation set shows that further training no longer improves it.


To see whether we’re making progress, we measure the error rate of our model on both the training set and the validation set. Usually, the error rate on the training set goes down predictably with each iteration but, surprisingly, the error rate on the validation set goes down only up to a point – and then it starts climbing again.
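

Sketched out in Python, the whole loop looks roughly like this. The `train_step` and `error_rate` callables stand in for whatever your ML framework actually provides, and the 80/20 split and the patience of three iterations are arbitrary choices, not recommendations:

```python
import random

def train_with_validation(examples, train_step, error_rate, patience=3):
    """Schematic training loop: split the data, train, and stop once the
    validation error has not improved for `patience` iterations."""
    random.shuffle(examples)
    split = int(len(examples) * 0.8)                   # e.g. an 80/20 split
    train_set, validation_set = examples[:split], examples[split:]

    best_validation_error = float("inf")
    epochs_without_improvement = 0
    while epochs_without_improvement < patience:
        train_step(train_set)                          # one more pass over the training data
        train_error = error_rate(train_set)            # usually keeps going down
        validation_error = error_rate(validation_set)  # goes down... then back up
        print(f"train: {train_error:.3f}  validation: {validation_error:.3f}")

        if validation_error < best_validation_error:
            best_validation_error = validation_error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1            # no progress: likely overfitting
```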


So, does learning more make our model worse?..


Well, yes. This effect is called overfitting – when we teach our model over and over on the same training data, at some point we start teaching it not to classify input, any input it may face, but to classify our training set specifically. Consider this: it is ridiculously simple to build a model that has a zero error rate on the training set. A big, long nested `if` will do: “if the model is the n-th model from the training set and the parameters are the n-th parameters, return the n-th result”. That’s it.


It’s also easy to build a model with a matching error rate on training and validation data: just return a random result regardless of the input. The predictive value of both models is zero, so, naturally, we want something in between. We want something that accumulates the knowledge diluted in the training set in order to classify input not from the training set, and not even from the validation set, but from the real world. The validation set only approximates real-world usage.
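

To make both degenerate extremes concrete, here is what they might look like as deliberately useless Python classifiers (assuming the training set is a list of hashable `(inputs, result)` pairs):

```python
import random

class MemorizingModel:
    """The 'big nested if': zero error on the training set, useless anywhere else."""
    def __init__(self, training_set):
        # training_set: a list of (inputs, result) pairs, e.g. ((model, parameters), was_printed)
        self.lookup = dict(training_set)

    def predict(self, inputs):
        # Perfect on anything it has memorized, a coin flip on everything else.
        return self.lookup.get(inputs, random.choice([True, False]))

class RandomModel:
    """Ignores the input entirely: the same (bad) error rate on training and validation data."""
    def predict(self, inputs):
        return random.choice([True, False])
```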


In machine learning, the validation set shows if we’re there yet.


And now, let’s say we want to build the `is_printable` algorithm by hand.


First, we gather data. Depending on the process we choose, we might gather it in the form of real models with printer parameters, a good old requirement specification, or even anecdotal evidence from the coffee point. Either way, before we start coding, we usually turn all that data into a set of test cases. Sometimes we even segregate the tests into several groups, for instance: unit tests for disjointed pieces of code, integration tests that cover the whole algorithm, and validation tests that show the algorithm meets expectations even though they are not meticulously designed to cover one aspect of the algorithm or another.


Then we write the code. Pretty much like a machine, we accumulate the knowledge diluted in our data to classify the input in all our cases. We don’t consider our algorithm ready until we get all our tests green. And that’s exactly what we’re doing wrong.


When we try to make all our validation tests green, we effectively start treating them as training data – and we give up the one measure that could tell us whether we’re overfitting or not.
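

One way to keep that measure – a sketch, not a prescription – is to never turn the validation cases into must-pass tests at all, and instead track the success rate over them:

```python
def validation_success_rate(classifier, validation_cases):
    """Measure, don't green: how often does the classifier get a held-out,
    real-world case right? validation_cases is a list of (inputs, expected) pairs."""
    correct = sum(1 for inputs, expected in validation_cases
                  if classifier(inputs) == expected)
    return correct / len(validation_cases)

# Hypothetical usage: a nightly job reports the rate; nobody patches the code
# case by case just to push it to 1.0.
# rate = validation_success_rate(is_printable, held_out_cases)
# print(f"validation success rate: {rate:.4f}")
```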


A few years ago, I was working on an automation workflow for medical imagery. My team’s job was to make a classifier that tells whether an image has been written properly into its files and can be relied upon. There is a huge standard, DICOM, that regulates all of that, so, technically, we didn’t have to do any exploratory work; we just had to make sure the images followed the standard.


We had a large validation test base, about 10K images. And we worked hard to make sure that all the images were classified correctly. Unfortunately, since DICOM is a huge standard, nobody really reads it from cover to cover, and almost none of the images in the wild are 100% correct. There are always some minor issues, and more often than not you have to find a reliable workaround to ignore them and still have your data read.


We did that for all the cases but two. So almost all the cases were importable, and the two that were not were detected as such. It seemed that we had a reader with a ~0.9998 success rate and a predictor with a full 1.0.


Then the thing got into production. In a year or so we asked our product manager how it was going. “Great!” she said, “the customers are very happy! Almost 30% of the cases are getting processed automatically. Huge success!”


That’s... not the number we were hoping to see. I mean, we’re happy that the customers are happy but how come ~0.9998 turned out to be ~0.3?


Well, there never was a 0.9998 to begin with. We had basically mistreated the validation data as training data and “overfitted” our algorithm to pass the tests. It’s just that instead of a machine, we were training ourselves.


As for the 0.3, since we didn’t have a validation test set (we had used it all up for training, remember), we never saw this number, although it was there all along.


So the first lesson is to use your validation data for validation, and not for overfitting your algorithms – whether they are machine-learned or manually crafted. But that’s not all we can learn.


One more thing to learn: ensemble

Of course, a 0.3 success rate for an image reader is not something we’re striving for. And we can’t just measure our rate, say “It is what it is” and move on. We need ways to make our algorithms better.


The conventional way to do so is to react to users’ complaints. When someone posts a bug, we turn it into a regression test, fix the bug, make sure that the test passes, and add it to our test base. Reacting to our users’ needs is a good thing but the whole process is still just overfitting with extra steps.


The problem with patching our algorithms all the time is the time. Time is a limited resource, and every new patch makes the algorithm larger and harder to work with and, consequently, makes the lead time for the next patch longer. This process of patching the patches is clearly not sustainable. At some point, the average patching time for the algorithm exceeds the average programmer’s tenure, and that effectively makes the algorithm untouchable.


So what can we do instead? Once again, the answer comes from machine learning, and it’s called an ensemble. The idea is: if a single machine learning model doesn’t work very well, build several different ones and combine their outputs.


Or, put in the context of medical imagery, if a DICOM reader doesn’t work, run another DICOM reader. This is what real-world users do anyway so why not do that automatically?


If you have a reader with a 0.3 success rate, that’s a bad reader. But if you have ten of them, each with a 0.3 success rate, and they are completely independent – so when one fails, the others still have the same 0.3 chance of success – then the total success rate is 1 - (1 - 0.3)^10 ≈ 0.97.
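

The arithmetic, spelled out (and it only holds if the readers really do fail independently, which in practice is an approximation at best):

```python
p_single = 0.3                              # success rate of one reader
n_readers = 10
p_all_fail = (1 - p_single) ** n_readers    # 0.7 ** 10 ≈ 0.028
p_ensemble = 1 - p_all_fail                 # ≈ 0.97
print(f"{p_ensemble:.3f}")                  # -> 0.972
```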


Just like that, if an `is_printable` classifier misclassifies way too often, write a different one. And then train a support vector machine, too. When you have three different classifiers, you can already pick a binary result by a simple majority vote.
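

A minimal sketch of such a vote – the three classifier names in the usage comment are placeholders, not real functions:

```python
def is_printable_ensemble(model, printer_parameters, classifiers):
    """Majority vote over an odd number of independently built binary classifiers."""
    votes = sum(1 for classify in classifiers if classify(model, printer_parameters))
    return votes * 2 > len(classifiers)

# Hypothetical usage with three independently written/trained classifiers:
# verdict = is_printable_ensemble(
#     model, printer_parameters,
#     classifiers=[is_printable_geometric, is_printable_heuristic, is_printable_svm],
# )
```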


I particularly like this ensemble-of-manual-programming approach because it puts ingenuity back into software engineering. Instead of spending our brainpower trying to figure out how to fix a 30-year-old piece of code, we spend it trying to figure out how to solve a real-world problem better. And in an ensemble, having several solutions to a single problem – a 30-year-old piece of code included – becomes an asset and not just a liability.


The featured image for this piece was generated with Kandinsky 2.

Prompt: Illustrate a machine learning model.