TL;DR: When you come across an ML problem, don’t try to be a hero and dive right into solving it. Process and understand the problem, review your dataset, set a realistic goal, and then go about actually solving it. Chances are that you will save a lot of resources (most importantly, time) if you plan your execution properly.
Whenever we come across an ML problem, we are prone to just pick a machine learning model and train it. We don’t plan the problem-solving process, and we end up spending months of effort on something that could have been done much more efficiently. If you have been there, you have landed in the right place. Read on to understand how to avoid such mistakes.
“It is indispensable to understand the problem at hand. It never is like: here is some data, here is a neural network and bam!”
Being at a company which is gradually making its way from a startup to a corporation, you get to learn and experience a lot of things. You also get to grow along the way. The processes mature and so does your approach towards different problems.
Applied Machine Learning is a relatively new field. Every problem is different from the previous one. It usually lacks the repetitiveness that some other fields typically have.
Random applied ML meme from https://twitter.com/ai_memes
The bad thing about that is that it is very difficult to formalize a process to be followed whenever a new problem is given. The good thing is that it is always new and interesting and you don’t get bored.
So, how do we generally go about problem-solving? If you are like most of us engineers (who have cough-less-cough-respect-cough for processes), it is very easy to jump right into coming up with a solution and trying it. This is true for any kind of problem we encounter, not just engineering problems. We don’t often realize the significance of sitting down to process the situation, understanding the problem, and then formalizing a process for solving it. As a result, it is only when we are halfway through the solution (or the implementation, in the case of programming) that we realize we did not consider the possibility of A and B while solving X.
If you look up steps for problem-solving online, here is an example of what you generally find. It is usually this same procedure with a little bit of rephrasing.
courtesy: https://www.humorthatworks.com/wp-content/uploads/2012/10/Problem-Solving-Cycle.jpg
In my opinion, when you’re developing software, you are essentially solving a problem. The Software Development Life Cycle (SDLC) is therefore just a more specific version of the problem-solving steps.
courtesy: https://3.imimg.com/data3/BV/OL/MY-12221668/software-development-life-cycle-500x500.jpg
Since solving machine learning problems is a little different from developing other software, the SDLC doesn’t translate exactly to our use case. What we lack in Applied Machine Learning is a version of the problem-solving steps specific to our domain.
Having worked in Machine Learning for a few years, on a variety of Computer Vision and Natural Language Processing problems, I have found the following process to work in most situations:
courtesy: Me (Thank you, very much!)
These steps are essentially a combination of the typical problem-solving steps and the Software Development Life Cycle.
Understand the Problem
This is pretty straightforward. You need to be able to define and understand the problem you’re trying to solve. You should have a higher-level view of the real-world problem you want to help solve. If done well, this part gets you really motivated about everything that follows.
Review Dataset
This step is ignored so often and is so underrated that I cannot emphasize its importance enough. You need to look at your dataset carefully, not only to analyze whether you have enough data, whether you need more, and if and how you can augment it, but also to judge the quality of the data.
The quality of the data fundamentally determines whether you will be able to solve the problem at all. For instance, say you’re working on a project where the problem is to classify images of certain rarely found lab equipment. You think: okay, I have so many of these images, I’ll spin up a GPU, feed the images to a neural network and see what happens. When you test the neural network, you find that it doesn’t have the slightest idea of what is what. What happened? Aren’t neural networks supposed to magically work? No, they are not! Your data had blurry and dark images. Some pictures didn’t even have the whole piece of equipment in the frame. You didn’t review your dataset, and you spent all your resources without checking whether the dataset was even capable of training a model.
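To make the review less of a manual chore, a first pass over the images can be automated. The sketch below is only an illustration, not a complete quality audit: it assumes each image is already loaded as a grayscale grid of 0-255 values (rows of numbers), and it flags images whose mean brightness is low or whose Laplacian variance (a common rough proxy for sharpness) is low. The function names and both thresholds are assumptions chosen for the example; you would tune them on your own data. It will not catch cropped or partially visible equipment, so it does not replace actually looking at the pictures.

```python
from statistics import mean, pvariance


def brightness(img):
    """Mean pixel intensity of a grayscale image given as rows of 0-255 values."""
    return mean(p for row in img for p in row)


def laplacian_variance(img):
    """Variance of a 4-neighbour Laplacian over interior pixels.

    Low values mean few edges, which often (not always) indicates blur.
    """
    h, w = len(img), len(img[0])
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def review(images, min_brightness=40, min_sharpness=100):
    """Return {name: [issues]} for images that look too dark or too blurry.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    report = {}
    for name, img in images.items():
        issues = []
        if brightness(img) < min_brightness:
            issues.append("dark")
        if laplacian_variance(img) < min_sharpness:
            issues.append("blurry")
        if issues:
            report[name] = issues
    return report


# Tiny synthetic 6x6 images, used only to exercise the checks.
checkerboard = [[255 if (x + y) % 2 else 0 for x in range(6)] for y in range(6)]
flat_grey = [[128] * 6 for _ in range(6)]
flat_dark = [[10] * 6 for _ in range(6)]

flagged = review({"sharp": checkerboard, "flat": flat_grey, "dark": flat_dark})
print(flagged)
```

The share of flagged images gives you a quick, if crude, idea of how much of the dataset is usable before you spend any GPU time on it.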
Set an end goal
To rephrase: set a realistic engineering end goal. On the business side, we make big promises about the revolutionary product we want to build. But only after reviewing the dataset can you have a real perspective on what is actually possible. That is why it is important to first get a general idea of what you are working with and only then set an end goal.
List alternate solutions
This step can be approached differently in different situations. If you have seen a similar problem in the past, you will have a more accurate idea of what kind of solutions can help in the situation.
If you’re working on a relatively new problem, this step really puts your R&D gears to use. You will need to research which solutions are suitable given the problem, the dataset, and how much time you have to produce your predictions, classifications, etc. (For instance, YOLO is fast but relatively less accurate for similar-looking objects, while RetinaNet is slower but more accurate.)
Select a solution
You can choose a subset of your dataset and train each candidate model on it for a small number of iterations to see how well it is learning.
Based on this, you are in a very good position to select which solution is better and use that for implementation.
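The quick comparison described above can be sketched as a small harness. This is a minimal illustration under stated assumptions: the `partial_fit`/`predict` interface, the `pilot_compare` function, and the two toy models are all hypothetical stand-ins made up for the example, and the data is synthetic. In a real project the candidates would be your actual models, and the score would be the evaluation metric you picked for your goal.

```python
import random


def pilot_compare(candidates, train, val, subset_size, iterations, seed=0):
    """Train each candidate briefly on the same random subset; return val accuracy.

    `candidates` maps a name to a factory returning an object with
    `partial_fit(examples)` and `predict(x)`. That interface is an assumption
    of this sketch, not any particular library's API.
    """
    subset = random.Random(seed).sample(train, min(subset_size, len(train)))
    scores = {}
    for name, make_model in candidates.items():
        model = make_model()
        for _ in range(iterations):  # deliberately few passes: this is a pilot run
            model.partial_fit(subset)
        correct = sum(model.predict(x) == y for x, y in val)
        scores[name] = correct / len(val)
    return scores


class MajorityClass:
    """Toy baseline: always predicts the most common label seen so far."""

    def __init__(self):
        self.counts = {}

    def partial_fit(self, examples):
        for _, y in examples:
            self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        return max(self.counts, key=self.counts.get)


class MidpointThreshold:
    """Toy model: splits one numeric feature at the midpoint of the class means."""

    def __init__(self):
        self.sums = {0: 0.0, 1: 0.0}
        self.counts = {0: 0, 1: 0}

    def partial_fit(self, examples):
        for x, y in examples:
            self.sums[y] += x
            self.counts[y] += 1

    def predict(self, x):
        means = [self.sums[c] / self.counts[c] for c in (0, 1)]
        return int(x > (means[0] + means[1]) / 2)


# Synthetic data: one feature, label is 1 when the feature is at least 50.
train = [(x, int(x >= 50)) for x in range(100)]
val = [(x + 0.5, int(x + 0.5 >= 50)) for x in range(0, 100, 5)]

scores = pilot_compare(
    {"majority": MajorityClass, "threshold": MidpointThreshold},
    train, val, subset_size=40, iterations=3,
)
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Using the same subset and the same validation data for every candidate keeps the comparison fair, and including a trivial baseline tells you whether the "real" candidates are learning anything at all.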
Implementation
This is pretty straightforward. Implement your solution, and implement it well. Follow your language’s standards as much as you can. Make sure your code is readable. Use OOP if you can. Reuse things where you can, and don’t reinvent the wheel.
Evaluation
Always select an evaluation metric relevant to your goal. Benchmark your solution on the basis of your selected evaluation metric. Just testing it on a bunch of test data is not enough.
It is vital to stick to the acceptance criteria you set initially. They stop you from iterating 100 times for minimal improvements.
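Here is a rough sketch of both points, assuming a binary classification task. The first function computes precision, recall and F1 by hand, and the example shows why the metric matters: on imbalanced data, a model that never predicts the positive class can still score a high accuracy. The second function, `should_stop`, is a hypothetical helper that turns the acceptance criteria into an explicit stopping rule. The target and the minimum gain are placeholder values for illustration; yours come from the end goal you set earlier.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def should_stop(f1_history, target_f1, min_gain=0.005):
    """Return a reason to stop iterating, or None to keep going.

    `target_f1` is the acceptance criterion; `min_gain` is an illustrative
    threshold for "this iteration barely helped".
    """
    if f1_history[-1] >= target_f1:
        return "target met"
    if len(f1_history) >= 2 and f1_history[-1] - f1_history[-2] < min_gain:
        return "gains too small"
    return None


# Imbalanced toy labels: 2 positives out of 10, and a model that always says 0.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy}, recall={recall}, f1={f1}")

print(should_stop([0.70, 0.78], target_f1=0.85))   # still improving, keep going
print(should_stop([0.78, 0.781], target_f1=0.85))  # improvement has stalled
print(should_stop([0.80, 0.86], target_f1=0.85))   # acceptance criterion reached
```

If finding the positive class is the whole point of the project, the accuracy number here is misleading, while recall and F1 make the failure obvious.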
Remember: When coming across an ML problem, don’t try to be a hero and dive right into solving it. Process and understand the problem, review your dataset, set a realistic goal and then go about actually solving the problem. Chances are that you will end up saving a lot of resources (most importantly time) if you plan your execution properly.