Machine Learning (ML) applications have emerged as a new standard, providing a fresh approach to processing vast amounts of data and producing valuable insights through data mining. However, the field currently lacks sufficient research and established best practices for project management and effective execution of data-driven projects, including those involving Machine Learning algorithms.
The iterative nature of ML projects entails multiple iterations at each step until the desired level of performance is attained. This concept of iteration applies not only to specific components of the project but also to the project versions themselves: after the initial version is released and feedback is collected, goals must be reevaluated, leading to another round of iterations and improvements.
Furthermore, ML tasks are prone to unexpected deviations, making it challenging to predict the exact number of iterations needed or establish clear goals and expectations from the outset. ML engineers must regularly validate whether the training data accurately reflects the observed conditions or if any biases are present.
Therefore, the standard project management approaches often employed for software development (a natural reference point, since code is an essential component of such a project) are not directly applicable here: there is no guarantee that the intended result will be attained, or that the project and its tasks will be completed on time.
Thus, in this article, I'd like to look into how to approach and structure ML projects while keeping all of the aforementioned considerations in mind.
Before diving into a project, a few factors should be considered.
First, the project and its practicality should be assessed: the effort required to obtain and label data, the cost of an incorrect prediction, the available computational resources, constraints such as prediction latency, and so on. Then, a single primary optimization metric for model evaluation should be defined, alongside secondary requirements such as memory footprint, prediction latency, and coverage. Finally, it is necessary to create a modular-structured codebase by isolating components such as data processing, model definition and training, and experiment management.
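To make this concrete, here is a minimal sketch of that kind of model selection in Python: one optimizing metric (accuracy) plus satisficing constraints on latency and memory. The candidate models and all the numbers are hypothetical.

```python
# A minimal sketch of model selection with one optimizing metric
# (accuracy) and satisficing constraints (latency, memory).
# The candidate models and their numbers are hypothetical.

candidates = [
    {"name": "small_net", "accuracy": 0.88, "latency_ms": 12, "memory_mb": 40},
    {"name": "large_net", "accuracy": 0.93, "latency_ms": 95, "memory_mb": 480},
    {"name": "gbm",       "accuracy": 0.91, "latency_ms": 30, "memory_mb": 120},
]

MAX_LATENCY_MS = 50   # hard product constraint
MAX_MEMORY_MB = 200   # hard deployment constraint

# Keep only models that satisfy every constraint...
feasible = [m for m in candidates
            if m["latency_ms"] <= MAX_LATENCY_MS and m["memory_mb"] <= MAX_MEMORY_MB]

# ...then maximize the single primary metric among them.
best = max(feasible, key=lambda m: m["accuracy"])
print(best["name"])  # -> "gbm"
```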
In most cases, data has to be labeled manually. It makes no difference whether labeling is done alone or by a team: either way, documenting the process is essential. However, given the vast volume of unlabeled data, full manual labeling is not always affordable; in this situation, active learning can be used to determine which subset of data is most worth labeling. If you are lucky, the accessible data may already contain information that can serve as a noisy approximation of the ground truth. In a health-related project, for example, historical medical records can be leveraged to extract relevant features and construct heuristics that aid in the initial diagnosis of certain conditions.
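One common flavor of active learning is uncertainty sampling: label the examples the current model is least confident about. Below is a minimal sketch with scikit-learn on synthetic data; the labeling budget and the toy labels are assumptions for illustration.

```python
# A sketch of pool-based active learning via uncertainty sampling:
# send the examples the current model is least confident about to annotators.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(100, 5))
y_labeled = (X_labeled[:, 0] > 0).astype(int)   # toy labels for illustration
X_pool = rng.normal(size=(10_000, 5))           # large unlabeled pool

model = LogisticRegression().fit(X_labeled, y_labeled)

# Confidence = probability of the predicted class; low confidence
# means the example sits near the decision boundary.
proba = model.predict_proba(X_pool)
confidence = proba.max(axis=1)

BUDGET = 50                                     # labeling budget per round
query_idx = np.argsort(confidence)[:BUDGET]     # least confident first
print("Send these pool indices to the annotators:", query_idx[:5])
```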
Next, it is time to establish both a minimum expected performance level (using, for example, KNN or Random Forest as simple baselines) and a target performance level. My advice is to take a progressive approach, starting with simpler models and gradually introducing complexity. Literature research and peer consultation may help in finding relevant model architectures and selecting a proven approach for the baseline model, ensuring that its performance on a widely used dataset aligns with reported outcomes. Finally, investigate model scalability by plotting the relationship between increasing dataset size and the performance of the identified baseline models.
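Here is a sketch of what this could look like with scikit-learn: two simple baselines and a learning curve that plots cross-validated performance against training set size. The toy dataset stands in for your own data and metric.

```python
# A sketch of baseline models plus a dataset-size vs. performance plot
# (learning curve) on a toy dataset; swap in your own data and metric.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for name, model in [("KNN", KNeighborsClassifier()),
                    ("RandomForest", RandomForestClassifier(random_state=0))]:
    sizes, _, val_scores = learning_curve(
        model, X, y, train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0], cv=5)
    plt.plot(sizes, val_scores.mean(axis=1), marker="o", label=name)

plt.xlabel("training set size")
plt.ylabel("cross-validated accuracy")
plt.legend()
plt.show()
```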
By this point, you should have an accurate grasp of which architectures and techniques are suitable for the problem at hand, and it is time to optimize and maximize the performance of the chosen model. Now we need to build an adaptable data pipeline. To identify the next steps, use the bias-variance tradeoff: high bias (underfitting) calls for a more expressive model or better features, while high variance (overfitting) calls for more data or regularization. Then optimize the model with hyperparameter tuning and address prevailing failure modes with targeted data collection, thoroughly evaluating the errors of the current model and categorizing them so that subsequent data acquisition achieves comprehensive coverage.
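As a rough illustration, the diagnosis can be as simple as comparing train and validation error, followed by a randomized hyperparameter search. The thresholds below are illustrative, not canonical.

```python
# A sketch of using train vs. validation error to pick the next step,
# followed by randomized hyperparameter search. Thresholds are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
train_err = 1 - model.score(X_tr, y_tr)
val_err = 1 - model.score(X_val, y_val)

if train_err > 0.10:                      # high bias: underfitting
    print("Try a bigger model, better features, or longer training.")
elif val_err - train_err > 0.05:          # high variance: overfitting
    print("Try more data, regularization, or a simpler model.")

# Randomized search over a small, hypothetical parameter grid.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 300, 500], "max_depth": [None, 8, 16]},
    n_iter=5, cv=3, random_state=0).fit(X_tr, y_tr)
print(search.best_params_)
```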
After that, debug the model and discover its failure modes by meticulously categorizing incorrect predictions and observations, then decide on the model refinement actions most likely to boost performance and overall predictive accuracy.
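A minimal sketch of such error analysis: bucket the misclassified examples by some metadata field to see which slice fails most often, and therefore where to collect data next. The per-example "source" tags and labels below are made up.

```python
# A sketch of error analysis: bucket misclassified examples by a
# hypothetical metadata field to see where the model fails most.
from collections import Counter

# y_true/y_pred and the per-example "source" tags are invented for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
source = ["web", "mobile", "web", "web", "mobile", "scanner", "web", "scanner"]

errors = Counter(src for yt, yp, src in zip(y_true, y_pred, source) if yt != yp)
totals = Counter(source)

# The error rate per slice points at where to collect more data next.
for src in totals:
    print(f"{src}: {errors[src]}/{totals[src]} misclassified")
```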
During the testing phase, thoroughly evaluate the model's performance on the test distribution and understand how the train and test set distributions differ. The model assessment metric should then be reassessed to ensure that it drives desirable downstream user behavior. Finally, run tests to check the input data pipeline, validate the model's inference process, evaluate its performance on validation data, and consider scenarios that may arise in production by developing tests that ensure new models still behave as expected.
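Two cheap checks of this kind are sketched below: a per-feature train-vs-test comparison using a two-sample Kolmogorov-Smirnov test, and a pytest-style regression test asserting prediction stability. The significance threshold and perturbation size are illustrative choices.

```python
# A sketch of two cheap pre-release checks. The drift threshold and the
# tiny-perturbation invariance are illustrative choices, not standards.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=5, random_state=0)
X_train, X_test = X[:400], X[400:]
model = LogisticRegression().fit(X_train, y[:400])

# 1) Compare train vs. test distributions feature by feature
#    with a two-sample Kolmogorov-Smirnov test.
for i in range(X.shape[1]):
    _, p_value = ks_2samp(X_train[:, i], X_test[:, i])
    if p_value < 0.01:
        print(f"Feature {i}: train/test distributions differ, investigate.")

# 2) A pytest-style regression test: predictions should be stable
#    under a numerically tiny input perturbation.
def test_prediction_stable_under_tiny_perturbation():
    x = X_test[:1]
    assert (model.predict(x) == model.predict(x + 1e-9)).all()
```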
To ensure a smooth deployment, consider releasing the new model to only a handful of users at first to allow for thorough testing and troubleshooting; once verified, proceed with a gradual rollout to all users. One of the most important steps here is to maintain the ability to roll back to a previous version in case of issues. Beyond that, regularly monitor the live data and model prediction distributions to spot any deviations that may require adjustments.
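Such a canary rollout can be as simple as deterministically bucketing users by a hash, with a rollback switch. The fraction, version names, and function below are hypothetical.

```python
# A sketch of a canary rollout: route a small, sticky fraction of users
# to the new model, with an instant rollback switch. Names are hypothetical.
import hashlib

CANARY_FRACTION = 0.05   # start with ~5% of users
ROLLBACK = False         # flip to True to send everyone back to the old model

def pick_model_version(user_id: str) -> str:
    """Deterministically assign a user to 'v2' (canary) or 'v1' (stable).
    Hashing keeps the assignment sticky across requests."""
    if ROLLBACK:
        return "v1"
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < CANARY_FRACTION * 100 else "v1"

print(pick_model_version("user-42"))
```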
As mentioned at the very beginning, even if the first model version appears to be complete, it actually is not: unforeseen changes, such as shifts introduced by periodic retraining, may arise and negatively influence the system. Implementing a permission-based system in which external components request access and declare their intended use of the model may help to limit that risk. Another concern is the potential for stagnation, which can be mitigated by retraining the model on a regular basis so that performance remains up to date. If the initial owner transfers model ownership, it is critical to discuss future maintenance with the new team, ensuring they understand the requirements for properly managing and maintaining the model.
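One possible shape for such a retraining trigger, sketched with hypothetical thresholds, timestamps, and pipeline hook: retrain when the model exceeds a freshness window or its live accuracy drops below a floor.

```python
# A sketch of guarding against stagnation: retrain when the model is
# older than a freshness window or when live accuracy drifts below a
# floor. The thresholds and timestamps are hypothetical.
from datetime import datetime, timedelta, timezone

MAX_MODEL_AGE = timedelta(days=30)
MIN_LIVE_ACCURACY = 0.85

def needs_retraining(trained_at: datetime, live_accuracy: float) -> bool:
    age = datetime.now(timezone.utc) - trained_at
    return age > MAX_MODEL_AGE or live_accuracy < MIN_LIVE_ACCURACY

trained_at = datetime(2024, 1, 1, tzinfo=timezone.utc)  # example timestamp
if needs_retraining(trained_at, live_accuracy=0.82):
    print("Trigger the retraining pipeline.")
```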
As you can see, ML projects are exploratory in nature and carry a significant risk of failure, while a return on investment is rarely guaranteed in the early stages of adoption. Furthermore, they call for a distinctive and oftentimes custom-tailored strategy that incorporates components from Scrum, Kanban, CRISP-DM, or TDSP. The process described above can be used as a guideline for organizing such a project. However, as previously noted, given that ML projects are already quite specific, every individual case is even more distinct and must be considered on its own. Therefore, this structure cannot serve as a universal solution and will likely need substantial changes. Still, it may be a good starting point for exploring how such projects could, but do not have to, be structured.
The lead image for this article was generated by HackerNoon's AI Image Generator via the prompt "machine learning"