The CI/CD Model Development Process

Continuous Integration (CI) and Continuous Delivery (CD) are staples of a modern software development workflow that enable developers to release their code rapidly, reproducibly, and reliably.

At Modzy, we take the best of traditional continuous integration practices and augment them to fit the needs of a modern data science team. In this way, we empower our agile team of researchers and engineers to continually add to and update a growing portfolio of over 100 machine learning models.

Figure 1: Overview of the continuous integration process to deploy a model into production using Modzy.

To account for the diverse backgrounds of our data scientists, we define a unifying set of requirements for the development of all machine learning models while keeping the process accessible to developers. Figure 1 outlines the process that allows data scientists to integrate and deploy their models with confidence. In addition to traditional code review, a series of automated checks is applied and enforced before code can be merged.

We run several levels of checks against all our data science repositories to standardize the model development process, so that every model release is reliable and traceable.
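To make the idea of enforced pre-merge checks concrete, here is a minimal sketch of a merge gate. It is an illustration only, not a description of Modzy's actual pipeline: the function names `run_merge_gate` and `can_merge` and the check names are hypothetical, and each check is reduced to a callable that returns True or False.

```python
from typing import Callable, Dict


def run_merge_gate(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check and record whether it passed."""
    return {name: bool(check()) for name, check in checks.items()}


def can_merge(results: Dict[str, bool]) -> bool:
    """Allow the merge only if every check passed."""
    return all(results.values())


# Hypothetical checks; real ones would inspect the repository and build artifacts.
results = run_merge_gate({
    "license_compliance": lambda: True,
    "model_versioning": lambda: True,
    "container_scan": lambda: False,
})
failed = sorted(name for name, passed in results.items() if not passed)
```

Running every check before deciding, rather than stopping at the first failure, gives the developer the full list of problems in a single pass.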

Ensuring License Compliance

During the process of developing models, we rely on third-party software libraries and open-source software implementations. From the data science perspective, we also rely on publicly available data sources and open-source datasets to train and enhance the performance of our models. Mandatory license checks ensure that we comply with the legal parameters associated with the software and datasets we use.
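A license check of this kind can be sketched as a comparison between declared licenses and an approved list. The example below is a simplified assumption, not our actual tooling: the allowlist, the `find_license_violations` helper, and the dependency and dataset names are all hypothetical, and in practice the approved list would come from legal review.

```python
from typing import Dict, List, Set

# Hypothetical allowlist of license identifiers; a real policy is set by legal review.
ALLOWED_LICENSES: Set[str] = {"MIT", "BSD-3-Clause", "Apache-2.0"}


def find_license_violations(
    declared: Dict[str, str],
    allowed: Set[str] = ALLOWED_LICENSES,
) -> List[str]:
    """Return the sorted names of libraries or datasets whose license is not allowed."""
    return sorted(name for name, lic in declared.items() if lic not in allowed)


# Illustrative inventory of libraries and datasets with their declared licenses.
declared = {
    "example-array-lib": "BSD-3-Clause",
    "example-copyleft-lib": "GPL-3.0",
    "example-open-dataset": "CC-BY-4.0",
}
violations = find_license_violations(declared)
```

A CI job built on this idea would fail the build whenever `violations` is non-empty, so a non-compliant dependency or dataset cannot be merged unnoticed.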

Model Versioning

In traditional software applications, development is often a linear process in which each new version supplants the previous one. In that setting, overwriting old artifacts or deploying only the latest version to production is often sufficient.

In data science model development, however, different versions of a model may offer trade-offs in speed, accuracy, or intended use case. To support this, we use semantic versioning together with model identifiers to maintain and deploy different versions or lineages of models over time.
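The pairing of a model identifier with a semantic version can be sketched as a small registry that keeps every version instead of overwriting the previous one. This is an illustrative assumption, not the Modzy implementation: the `ModelRegistry` class, the model identifier, and the artifact names are hypothetical, and versions are assumed to use the plain MAJOR.MINOR.PATCH form without pre-release tags.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse a plain 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


class ModelRegistry:
    """Keeps every released version of every model side by side."""

    def __init__(self) -> None:
        self._artifacts: Dict[str, Dict[Version, str]] = {}

    def register(self, model_id: str, version: str, artifact: str) -> None:
        versions = self._artifacts.setdefault(model_id, {})
        key = parse_version(version)
        if key in versions:
            # Released versions are immutable so that each release stays traceable.
            raise ValueError(f"{model_id} {version} is already registered")
        versions[key] = artifact

    def get(self, model_id: str, version: str) -> str:
        return self._artifacts[model_id][parse_version(version)]

    def latest(self, model_id: str) -> str:
        versions = self._artifacts[model_id]
        return versions[max(versions)]


registry = ModelRegistry()
registry.register("example-model", "1.2.0", "image-a")
registry.register("example-model", "1.10.0", "image-b")
registry.register("example-model", "2.0.0", "image-c")
```

Comparing versions as integer tuples rather than strings matters: 1.10.0 sorts after 1.2.0 numerically, whereas a string comparison would order them the other way. Older versions stay retrievable through `get`, which is what allows a faster or more accurate variant to remain deployed alongside the newest release.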

Container Security

Model security is paramount. Users count on our container security to ensure the protection of the intellectual property associated with all machine learning models deployed through the Modzy platform.

For this reason, we’ve developed an assortment of custom, secure Modzy base images for our models. We scan each Docker image during the continuous integration process to ensure that it contains no known Common Vulnerabilities and Exposures (CVEs) that an attacker could exploit at the operating system or network level.
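The gating step that follows an image scan can be sketched as a filter over the scanner's findings. The example below assumes a hypothetical report format (a list of records with `id` and `severity` fields) rather than the output of any particular scanner, and the identifiers shown are placeholders, not real CVE entries. The default threshold blocks on any finding, matching the goal of shipping images with no known CVEs; raising the threshold is shown only as an option.

```python
from typing import Dict, List

# Severity levels from least to most severe (an assumed, simplified scale).
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def blocking_findings(findings: List[Dict[str, str]], threshold: str = "LOW") -> List[str]:
    """Return the IDs of findings whose severity is at or above the threshold."""
    limit = SEVERITY_ORDER.index(threshold)
    return [
        finding["id"]
        for finding in findings
        if SEVERITY_ORDER.index(finding["severity"]) >= limit
    ]


# Placeholder report; the IDs are illustrative and do not refer to real CVEs.
report = [
    {"id": "PLACEHOLDER-0001", "severity": "LOW"},
    {"id": "PLACEHOLDER-0002", "severity": "CRITICAL"},
]
blocked_strict = blocking_findings(report)
blocked_high_only = blocking_findings(report, threshold="HIGH")
```

A CI job would fail the build whenever the returned list is non-empty, which prevents an image with known vulnerabilities from being published.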

Key Takeaways

It is crucial to establish a CI/CD pipeline to maintain and produce high-quality, reproducible, and secure models, while keeping pace with continually increasing demand. By following a consistent, repeatable process and adapting proven techniques from the software development space for data science, organizations can move past the challenges of deploying and managing machine learning models in production systems.