How To Productionalize ML By Developing Pipelines From The Beginning

Today, Machine Learning powers the top 1% of the most valuable organizations in the world (FB, ALPH, AMZ, N, etc.). However, 99% of enterprises struggle to productionalize ML, even when they possess hyper-specific datasets and exceptional data science departments.

Going one layer deeper into how ML propagates through an organization reveals the problem in more detail. The graphic below shows an admittedly simplified representation of a typical setup for machine learning:

There are three stages to the above process:

Experimenting & PoCs:

Technologies: Jupyter notebooks, Python scripts, experiment tracking tools, data exploration tools.

Persona: Data scientists.

Description: Quick and scientific experiments define this phase. The team wants to increase their understanding of the data and machine learning objective as rapidly as possible.

Conversion:

Technologies: ETL pipelining tools such as Airflow.

Persona: Data Engineers.

Description: Converting finalized experiments into automated, repeatable processes is the aim of this phase. Sometimes this starts before the next phase, sometimes after, but the essence is the same: take the code from the data scientists and try to put it into some form of automated framework.

Productionalization & Maintenance:

Technologies: Flask/FastAPI, Kubernetes, Docker, Cortex, Seldon.

Persona: ML Engineers.

Description: This is the phase that starts at the deployment of the model, and spans monitoring, retraining, and maintenance. The core focus of this phase is to keep the model healthy and serving at any scale, all the while accounting for drift.

Each of these stages requires different skills, tooling, and organization. It is therefore only natural that an ML team can run into many potholes along the way. Inevitably, things that matter downstream are not accounted for in the earlier stages. For example, if training happens in isolation from the deployment strategy, it is never going to translate well to production scenarios, leading to inconsistencies, silent failures, and eventually failed model deployments.

The Solution

Looking at the multi-phase process above, it seems like a no-brainer to simply reduce the number of steps involved and thereby eliminate the friction between them. However, given the different requirements and skill sets for each phase, this is easier said than done. Data scientists are trained to iterate and experiment, not to care diligently about production concepts such as reproducibility. Therefore, what is required is a framework that is flexible but enforces production standards from the get-go.

A very natural mechanism to achieve this is a framework that exposes an automated, standardized way to run ML pipelines in a controlled environment. ML is inherently a process that can be broken down into individual, concrete steps (e.g. preprocessing, training, evaluating), so a pipeline is a good fit here. Critically, by standardizing the development of these pipelines at the early stages, organizations can break the cycle of destroying and recreating ML models across multiple tools and steps, and shorten the path from research to deployment.
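To make the idea of "ML as a sequence of concrete steps" tangible, here is a minimal sketch in plain Python. The `Pipeline` class, its `step` decorator and the toy regression steps are all hypothetical names invented for illustration; they are not taken from any particular framework, and a real system would add tracking, caching and orchestration on top.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# A step is a function that reads from a shared context and returns new entries.
StepFn = Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class Pipeline:
    """Hypothetical minimal pipeline: an ordered list of named steps."""
    name: str
    steps: List[Tuple[str, StepFn]] = field(default_factory=list)

    def step(self, fn: StepFn) -> StepFn:
        # Register the function as the next step; usable as a decorator.
        self.steps.append((fn.__name__, fn))
        return fn

    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
        # Execute steps in registration order, merging each step's outputs.
        context = dict(context)
        for _step_name, fn in self.steps:
            context.update(fn(context))
        return context


pipeline = Pipeline(name="toy_regression")


@pipeline.step
def preprocess(ctx):
    # Split the (x, y) pairs into a train half and an eval half.
    data = ctx["data"]
    half = len(data) // 2
    return {"train": data[:half], "eval": data[half:]}


@pipeline.step
def train(ctx):
    # Fit y = w * x by least squares through the origin.
    num = sum(x * y for x, y in ctx["train"])
    den = sum(x * x for x, _ in ctx["train"])
    return {"w": num / den}


@pipeline.step
def evaluate(ctx):
    # Mean squared error on the eval split.
    errors = [(ctx["w"] * x - y) ** 2 for x, y in ctx["eval"]]
    return {"mse": sum(errors) / len(errors)}


result = pipeline.run({"data": [(1, 2), (2, 4), (3, 6), (4, 8)]})
print(result["w"], result["mse"])
```

Because every step has the same shape (context in, new entries out), the same definition can be run in a notebook during experimentation and handed to a scheduler later without being rewritten.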

Furthermore, if a data scientist owns these pipelines from training through deployment, a large portion of the technical debt issues mentioned above immediately dissolves. They can test their models in a production or near-production environment while training them, and bring their tooling and codebases closer to production.

If an organization can incentivize their data scientists to buy into such a framework, then they have won half the battle of productionalization. However, the devil is really in the details — how do you give data scientists the flexibility they need for experimentation in a framework that is robust enough to be taken all the way to production?

It's hard for data scientists to write ML pipelines

So why don't data scientists just write these pipelines themselves? The simple answer is that they are not equipped to do so. In my opinion, the current tooling landscape is split into ML tools for ML people and Ops tools for Ops people, and neither camp ticks all the boxes I mentioned in the last section.

As a concrete example, here is a snippet to create a simple training pipeline with Kubeflow, a tool normally used to create ML pipelines:

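import os

from kfp import dsl, gcp  # Kubeflow Pipelines v1 SDK

# The dataproc_*_op, confusion_matrix_op and roc_op component factories, as well
# as _PYSRC_PREFIX, are defined elsewhere in the original sample (not shown here).

@dsl.pipeline(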
    name='XGBoost Trainer',
    description='A trainer that does end-to-end distributed training for XGBoost models.'
)
def xgb_train_pipeline(
    output='gs://your-gcs-bucket',
    project='your-gcp-project',
    cluster_name='xgb-%s' % dsl.RUN_ID_PLACEHOLDER,
    region='us-central1',
    train_data='gs://ml-pipeline-playground/sfpd/train.csv',
    eval_data='gs://ml-pipeline-playground/sfpd/eval.csv',
    schema='gs://ml-pipeline-playground/sfpd/schema.json',
    target='resolution',
    rounds=200,
    workers=2,
    true_label='ACTION',
):
    output_template = str(output) + '/' + dsl.RUN_ID_PLACEHOLDER + '/data'

    # The current GCP PySpark/Spark ops do not provide outputs as return values,
    # so we need to pass the URIs around as strings instead.
    analyze_output = output_template
    transform_output_train = os.path.join(output_template, 'train', 'part-*')
    transform_output_eval = os.path.join(output_template, 'eval', 'part-*')
    train_output = os.path.join(output_template, 'train_output')
    predict_output = os.path.join(output_template, 'predict_output')

    with dsl.ExitHandler(exit_op=dataproc_delete_cluster_op(
        project_id=project,
        region=region,
        name=cluster_name
    )):
        _create_cluster_op = dataproc_create_cluster_op(
            project_id=project,
            region=region,
            name=cluster_name,
            initialization_actions=[
              os.path.join(_PYSRC_PREFIX,
                           'initialization_actions.sh'),
            ],
            image_version='1.2'
        )

        _analyze_op = dataproc_analyze_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            schema=schema,
            train_data=train_data,
            output=output_template
        ).after(_create_cluster_op).set_display_name('Analyzer')

        _transform_op = dataproc_transform_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            train_data=train_data,
            eval_data=eval_data,
            target=target,
            analysis=analyze_output,
            output=output_template
        ).after(_analyze_op).set_display_name('Transformer')

        _train_op = dataproc_train_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            train_data=transform_output_train,
            eval_data=transform_output_eval,
            target=target,
            analysis=analyze_output,
            workers=workers,
            rounds=rounds,
            output=train_output
        ).after(_transform_op).set_display_name('Trainer')

        _predict_op = dataproc_predict_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            data=transform_output_eval,
            model=train_output,
            target=target,
            analysis=analyze_output,
            output=predict_output
        ).after(_train_op).set_display_name('Predictor')

        _cm_op = confusion_matrix_op(
            predictions=os.path.join(predict_output, 'part-*.csv'),
            output_dir=output_template
        ).after(_predict_op)

        _roc_op = roc_op(
            predictions_dir=os.path.join(predict_output, 'part-*.csv'),
            true_class=true_label,
            true_score_column=true_label,
            output_dir=output_template
        ).after(_predict_op)

    dsl.get_pipeline_conf().add_op_transformer(
        gcp.use_gcp_secret('user-gcp-sa'))

Glancing at the above, one can spot a few things that would make it hard for a data scientist to produce and own this pipeline independently:

Data is not versioned.
There is ops-centric language like clusters, Kubernetes secrets and DSL exit handlers.
Docker image and Bash know-how are required.
Arbitrary data formats and no schema control.
Untracked metadata and configuration.
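Two of the gaps listed above, unversioned data and missing schema control, are cheap to close once the framework takes responsibility for them. The following is a rough sketch using only the standard library: the `SCHEMA` mapping, the column names and both helper functions are assumptions made up for this example, and a content hash is only one of several ways to version data.

```python
import csv
import hashlib
import io

# Hypothetical schema: column name -> parser that raises on bad values.
SCHEMA = {"age": int, "income": float, "label": str}


def data_version(raw: bytes) -> str:
    """Content hash used as a version identifier for the dataset."""
    return hashlib.sha256(raw).hexdigest()[:12]


def validate(raw: bytes, schema=SCHEMA):
    """Parse CSV bytes and check the header and every value against the schema."""
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    if reader.fieldnames != list(schema):
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        try:
            rows.append({col: parse(row[col]) for col, parse in schema.items()})
        except (TypeError, ValueError) as exc:
            raise ValueError(f"line {line_no}: {exc}") from exc
    return rows


raw = b"age,income,label\n31,52000.5,yes\n45,61000,no\n"
rows = validate(raw)
version = data_version(raw)
print(version, rows)
```

Any change to the bytes produces a different version string, and a malformed row fails loudly at ingestion rather than silently inside a training job.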

Invariably, a data scientist at this point either needs to learn how to work around the above, or a more engineering-driven team has to take over, which leads us right back to the original problems.

Higher-order abstractions democratize engineering paradigms

To get data scientists to really create production-ready artifacts, they need an Ops (read: pipelines) tool for ML people, one that offers higher-order abstractions pitched at the right level for a data scientist.

To understand why abstractions are important, we can cast an eye towards how web development has matured from raw JavaScript scripts (the Jupyter notebooks of web development) to the powerful React/Angular/Vue-based stacks of today. The success of these modern frameworks comes from providing higher-order abstractions that are easier for a larger audience to consume and digest. They did not change the fundamentals of how the underlying web technology works. They simply repackaged it in a way that is understandable and accessible to more people. Specifically, by making components first-class citizens, these frameworks have ushered in a new way of breaking down, using, and building the HTML and JavaScript that power the modern web. ML pipelining tools, however, have not yet seen an equivalent movement to find the right order of abstraction.

To expedite such a movement, some like-minded individuals and I decided to create ZenML, an open-source MLOps framework for creating iterative, reproducible pipelines.

ZenML is an exercise in finding the right layer of abstraction for ML. Here, we treat pipelines as first-class citizens. This means that data scientists are exposed to pipelines directly in the framework, but not in the same manner as the data pipelines from the ETL space (Prefect, Airflow et al.). Pipelines are treated as experiments, meaning they can be compared and analyzed directly. Only when it is time to move to productionalization are they converted to more 'classical' data pipelines.

Within pipelines are steps, which are expressed in ML language familiar to the data scientist: for example, there is a TokenizerStep, a TrainerStep, an EvaluatorStep and so on. These paradigms are far more understandable than plugging scripts into some form of orchestrator wrapper.
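To give a feel for what ML-flavoured step abstractions can look like, here is a small, self-contained sketch. The step names are borrowed from the paragraph above, but the method signatures and the toy implementations are purely illustrative assumptions and should not be read as ZenML's actual interfaces.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, List


class TokenizerStep(ABC):
    """Hypothetical interface: turns raw text into tokens."""
    @abstractmethod
    def tokenize(self, text: str) -> List[str]: ...


class TrainerStep(ABC):
    """Hypothetical interface: fits a model on tokenized documents."""
    @abstractmethod
    def fit(self, docs: List[List[str]], labels: List[str]) -> Dict[str, Counter]: ...


class EvaluatorStep(ABC):
    """Hypothetical interface: scores a model on labelled documents."""
    @abstractmethod
    def evaluate(self, model, docs: List[List[str]], labels: List[str]) -> float: ...


class WhitespaceTokenizer(TokenizerStep):
    def tokenize(self, text):
        return text.lower().split()


class WordVoteTrainer(TrainerStep):
    def fit(self, docs, labels):
        # Count how often each token appears under each label.
        counts: Dict[str, Counter] = {}
        for tokens, label in zip(docs, labels):
            counts.setdefault(label, Counter()).update(tokens)
        return counts


def predict(model, tokens):
    # Pick the label whose training tokens overlap most with the input.
    return max(sorted(model), key=lambda label: sum(model[label][t] for t in tokens))


class AccuracyEvaluator(EvaluatorStep):
    def evaluate(self, model, docs, labels):
        hits = sum(predict(model, d) == y for d, y in zip(docs, labels))
        return hits / len(labels)


tokenizer = WhitespaceTokenizer()
texts = ["great movie", "terrible plot", "great acting", "terrible movie"]
labels = ["pos", "neg", "pos", "neg"]
docs = [tokenizer.tokenize(t) for t in texts]
model = WordVoteTrainer().fit(docs, labels)
accuracy = AccuracyEvaluator().evaluate(model, docs, labels)
print(accuracy)
```

The point of the design is that a data scientist only fills in `tokenize`, `fit` or `evaluate`; how and where those methods get executed is the framework's concern.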

Each pipeline run tracks its metadata and parameters and can be compared to other runs. The data for each pipeline is automatically versioned and tracked as it flows through. Each run is linked to a git commit and compiled into an easy-to-read YAML file, which can optionally be compiled to other DSLs, such as those of Airflow or Kubeflow Pipelines. This is necessary to satisfy other stakeholders in the value chain, such as the data engineers and ML engineers.
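The bookkeeping described above can be sketched roughly as follows. This is not how ZenML implements it: `RunRecord`, `record_run` and `diff_runs` are hypothetical names, and the record is serialized to JSON rather than YAML only so that the example stays within the standard library.

```python
import hashlib
import json
import subprocess
from dataclasses import asdict, dataclass
from typing import Any, Dict, Tuple


def current_git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a git repository."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of everything needed to compare two runs."""
    pipeline: str
    params: Dict[str, Any]
    data_version: str
    git_commit: str
    metrics: Dict[str, float]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def record_run(pipeline: str, params, raw_data: bytes, metrics) -> RunRecord:
    # The data version is a content hash, so identical bytes share a version.
    version = hashlib.sha256(raw_data).hexdigest()[:12]
    return RunRecord(pipeline, dict(params), version, current_git_commit(), dict(metrics))


def diff_runs(a: RunRecord, b: RunRecord) -> Dict[str, Tuple[Any, Any]]:
    """Return the params and metrics that differ between two runs."""
    changed = {}
    for section in ("params", "metrics"):
        left, right = getattr(a, section), getattr(b, section)
        for key in sorted(set(left) | set(right)):
            if left.get(key) != right.get(key):
                changed[f"{section}.{key}"] = (left.get(key), right.get(key))
    return changed


data = b"x,y\n1,2\n2,4\n"
run_a = record_run("xgb", {"rounds": 100, "workers": 2}, data, {"auc": 0.81})
run_b = record_run("xgb", {"rounds": 200, "workers": 2}, data, {"auc": 0.84})
print(diff_runs(run_a, run_b))
print(run_b.to_json())
```

With records like these, "which change improved the metric?" becomes a dictionary diff rather than an archaeology exercise across notebooks.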

Additionally, the interfaces exposed for individual steps are mostly designed to be easy to extend in an idempotent, and therefore distributable, manner. The data scientist can therefore scale out to different processing backends (like Dataflow or Spark) when dealing with larger datasets.
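Why idempotency matters for distribution can be shown with a toy example. In the sketch below, a thread pool stands in for a real processing backend such as Dataflow or Spark, and `clip_partition` and `run_partitioned` are hypothetical helpers. Because the step is a pure function of its partition, partitions can be processed in any order, and a partition that is retried after a failure yields the same result.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Partition = List[float]


def clip_partition(partition: Partition, low: float = 0.0, high: float = 1.0) -> Partition:
    """Idempotent step: clipping already-clipped values changes nothing."""
    return [min(max(v, low), high) for v in partition]


def run_partitioned(
    step: Callable[[Partition], Partition],
    partitions: List[Partition],
    max_workers: int = 4,
) -> List[Partition]:
    # Executor.map preserves input order, so results line up with partitions.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(step, partitions))


partitions = [[-0.5, 0.2], [0.9, 1.7], [3.0, 0.0]]
once = run_partitioned(clip_partition, partitions)
twice = run_partitioned(clip_partition, once)  # simulates a retried or re-run step

# Processing the data in one piece gives the same values as processing it in parts.
flat_input = [v for part in partitions for v in part]
flat_output = [v for part in once for v in part]
print(once == twice, flat_output == clip_partition(flat_input))
```

A step that depended on hidden state (a global counter, the wall clock, a mutable file) would not survive this treatment, which is why it helps to have the step interface nudge authors towards pure functions.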

Of course, ZenML is not the only mechanism for this: many companies build their own home-grown abstraction frameworks to solve their specific needs. Often these are built on top of some of the other tools I have mentioned above. Regardless of how you get there, the goal should be clear: get the data scientists as close to production as possible with as little friction as possible, incentivizing them to increase their ownership of the models after deployment.

This is a win-win for every persona involved, and ultimately a big win for any organization that aims to make it to the top 1% using ML as a core driver for their business growth.

If you like the thoughts here, we’d love to hear your feedback on ZenML. It is open-source and we are looking for early adopters and contributors! And if you find it is the right order of abstraction for you/your data scientists, then let us know as well via our Slack — looking forward to hearing from you!

Also published at https://towardsdatascience.com/why-ml-should-be-written-as-pipelines-from-the-get-go-b2d95003f998