Today, Machine Learning powers the top 1% of the most valuable organizations in the world (FB, ALPH, AMZ, N, etc.). However, 99% of enterprises struggle to productionalize ML, even when they possess hyper-specific datasets and exceptional data science departments.

Going one layer further into how ML propagates through an organization reveals the problem in more depth. The graphic below shows an admittedly simplified representation of a typical setup for machine learning.

There are three stages to this process:

Experimenting & PoCs:
- Technologies: Jupyter notebooks, Python scripts, experiment tracking tools, data exploration tools.
- Persona: Data scientists.
- Description: Quick and scientific experiments define this phase. The team wants to increase their understanding of the data and the machine learning objective as rapidly as possible.

Conversion:
- Technologies: ETL pipelining tools such as Airflow.
- Persona: Data engineers.
- Description: Converting finalized experiments into automated, repeatable processes is the aim of this phase. Sometimes this starts before the next phase, sometimes after, but the essence is the same: take the code from the data scientists and try to put it into some form of automated framework.

Productionalization & Maintenance:
- Technologies: Flask/FastAPI, Kubernetes, Docker, Cortex, Seldon.
- Persona: ML engineers.
- Description: This is the phase that starts at the deployment of the model and spans monitoring, retraining, and maintenance. The core focus of this phase is to keep the model healthy and serving at any scale, all the while accounting for drift.

Each of these stages requires different skills, tooling, and organization. Therefore, it is only natural that there are many potholes that an ML team can run into along the way. Inevitably, things that are important downstream are not accounted for in the earlier stages. For example, if training happens in isolation from the deployment strategy, it is never going to translate well to production scenarios, leading to inconsistencies, silent failures, and eventually failed model deployments.

The Solution

Looking at the multi-phase process above, it seems like a no-brainer to simply reduce the steps involved and thereby eliminate the friction that exists between them. However, given the different requirements and skill sets for each phase, this is easier said than done. Data scientists are not trained or equipped to attend diligently to production concerns such as reproducibility; they are trained to iterate and experiment. Therefore, what is required is a framework that is flexible but enforces production standards from the get-go.

A very natural mechanism to achieve this is a framework that exposes an automated, standardized way to run ML pipelines in a controlled environment. ML is inherently a process that can be broken down into individual, concrete steps (e.g. preprocessing, training, evaluating), so a pipeline is a good solution here. Critically, by standardizing the development of these pipelines at the early stages, organizations can escape the cycle of destroying and recreating ML models across multiple tools and steps, and shorten the path from research to deployment.

Furthermore, if a data scientist has ownership of these pipelines from training through deployment, a large portion of the technical debt issues mentioned above immediately dissolve. They can test their models in a near-production or production environment whilst training them, and bring their tooling and codebases closer to production.

If an organization can incentivize their data scientists to buy into such a framework, then they have won half the battle of productionalization.
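To make the idea of "individual, concrete steps" tangible, here is a minimal sketch of a pipeline as an ordered list of named steps that pass a shared state along. It is plain Python with no framework behind it; the function names, the `state` dictionary, and the toy least-squares "training" are all assumptions made for illustration, not part of any tool discussed in this article.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Step = Callable[[State], State]


def preprocess(state: State) -> State:
    """Scale the raw feature values into the [0, 1] range."""
    xs = state["raw_x"]
    lo, hi = min(xs), max(xs)
    return {**state, "x": [(v - lo) / (hi - lo) for v in xs]}


def train(state: State) -> State:
    """Fit y = w * x by least squares through the origin (a stand-in for real training)."""
    num = sum(x * y for x, y in zip(state["x"], state["y"]))
    den = sum(x * x for x in state["x"])
    return {**state, "w": num / den}


def evaluate(state: State) -> State:
    """Compute the mean squared error of the fitted model on the same data."""
    errors = [(state["w"] * x - y) ** 2 for x, y in zip(state["x"], state["y"])]
    return {**state, "mse": sum(errors) / len(errors)}


def run_pipeline(steps: List[Tuple[str, Step]], state: State) -> State:
    """Run the named steps in order, threading the state through each one."""
    for name, step in steps:
        state = step(state)
        print(f"finished step: {name}")
    return state


# The pipeline definition is just data: the same list could be run in a
# notebook during experimentation or handed to a scheduler later.
PIPELINE = [("preprocess", preprocess), ("train", train), ("evaluate", evaluate)]

result = run_pipeline(PIPELINE, {"raw_x": [0.0, 5.0, 10.0], "y": [0.0, 1.0, 2.0]})
```

Because each step only reads from and returns a state dictionary, a step can be swapped or re-run without touching the others, which is the property a standardized pipeline framework builds on.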
However, the devil is really in the details: how do you give data scientists the flexibility they need for experimentation in a framework that is robust enough to be taken all the way to production?

It's hard for data scientists to write ML pipelines

So why don't data scientists just write these pipelines themselves? The simple answer is that they are not equipped to do so. In my opinion, the tooling landscape is currently too split between ML tools for ML people and Ops tools for Ops people, with neither really ticking all the boxes I mentioned in the last section.

As a concrete example, here is a snippet to create a simple training pipeline with Kubeflow, a tool normally used to create ML pipelines:

```python
import os

from kfp import dsl, gcp

# Excerpt: the dataproc_*_op, confusion_matrix_op, and roc_op factories and
# _PYSRC_PREFIX are assumed to be defined elsewhere and are not shown here.


@dsl.pipeline(
    name='XGBoost Trainer',
    description='A trainer that does end-to-end distributed training for XGBoost models.'
)
def xgb_train_pipeline(
    output='gs://your-gcs-bucket',
    project='your-gcp-project',
    cluster_name='xgb-%s' % dsl.RUN_ID_PLACEHOLDER,
    region='us-central1',
    train_data='gs://ml-pipeline-playground/sfpd/train.csv',
    eval_data='gs://ml-pipeline-playground/sfpd/eval.csv',
    schema='gs://ml-pipeline-playground/sfpd/schema.json',
    target='resolution',
    rounds=200,
    workers=2,
    true_label='ACTION',
):
    output_template = str(output) + '/' + dsl.RUN_ID_PLACEHOLDER + '/data'

    # Current GCP pyspark/spark op do not provide outputs as return values, instead,
    # we need to use strings to pass the uri around.
    analyze_output = output_template
    transform_output_train = os.path.join(output_template, 'train', 'part-*')
    transform_output_eval = os.path.join(output_template, 'eval', 'part-*')
    train_output = os.path.join(output_template, 'train_output')
    predict_output = os.path.join(output_template, 'predict_output')

    # The exit handler deletes the Dataproc cluster even if a step fails.
    with dsl.ExitHandler(exit_op=dataproc_delete_cluster_op(
        project_id=project,
        region=region,
        name=cluster_name
    )):
        _create_cluster_op = dataproc_create_cluster_op(
            project_id=project,
            region=region,
            name=cluster_name,
            initialization_actions=[
              os.path.join(_PYSRC_PREFIX, 'initialization_actions.sh'),
            ],
            image_version='1.2'
        )

        _analyze_op = dataproc_analyze_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            schema=schema,
            train_data=train_data,
            output=output_template
        ).after(_create_cluster_op).set_display_name('Analyzer')

        _transform_op = dataproc_transform_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            train_data=train_data,
            eval_data=eval_data,
            target=target,
            analysis=analyze_output,
            output=output_template
        ).after(_analyze_op).set_display_name('Transformer')

        _train_op = dataproc_train_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            train_data=transform_output_train,
            eval_data=transform_output_eval,
            target=target,
            analysis=analyze_output,
            workers=workers,
            rounds=rounds,
            output=train_output
        ).after(_transform_op).set_display_name('Trainer')

        _predict_op = dataproc_predict_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            data=transform_output_eval,
            model=train_output,
            target=target,
            analysis=analyze_output,
            output=predict_output
        ).after(_train_op).set_display_name('Predictor')

        _cm_op = confusion_matrix_op(
            predictions=os.path.join(predict_output, 'part-*.csv'),
            output_dir=output_template
        ).after(_predict_op)

        _roc_op = roc_op(
            predictions_dir=os.path.join(predict_output, 'part-*.csv'),
            true_class=true_label,
            true_score_column=true_label,
            output_dir=output_template
        ).after(_predict_op)

    # Attach the GCP service-account secret to every op in the pipeline.
    dsl.get_pipeline_conf().add_op_transformer(
        gcp.use_gcp_secret('user-gcp-sa'))
```

Glancing at the above, one can spot a few things that would make it hard for a data scientist to produce and own this pipeline independently:

- Data is not versioned.
- There is ops-centric language like clusters, Kubernetes secrets, and DSL exit handlers.
- Docker image and Bash know-how is required.
- Data formats are arbitrary and there is no schema control.
- Metadata and configuration are untracked.

Invariably, a data scientist at this point either needs to learn how to work around the above, or a more engineering-driven team has to take over, which leads us right back to the original problems described above.

Higher-order abstractions democratize engineering paradigms

In order to get data scientists to really create production-ready artifacts, they require an Ops (read: pipelines) tool for ML people, one that offers higher-order abstractions at the right level for a data scientist.

To understand why abstractions are important, we can cast an eye towards how web development has matured from raw JavaScript scripts (the Jupyter notebooks of web development) to the powerful React/Angular/Vue-based web development stacks of today.

The success of these modern frameworks has come from providing higher-order abstractions that are easier to consume and digest for a larger audience. They did not change the fundamentals of how the underlying web technology worked.
They simply repurposed it in a way that is understandable and accessible to a larger audience. Specifically, by providing components as first-class citizens, these frameworks have ushered in a new way of breaking down, using, and building the HTML and JavaScript that power the modern web. However, ML pipelining tools have not had an equivalent movement to figure out the right order of abstraction to achieve a similar effect.

In order to expedite such a movement, some like-minded individuals and I decided to create ZenML, an open-source MLOps framework to create iterative, reproducible pipelines.

ZenML is an exercise in finding the right layer of abstraction for ML. Here, we treat pipelines as first-class citizens. This means that data scientists are exposed to pipelines directly in the framework, but not in the same manner as the data pipelines from the ETL space (Prefect, Airflow, et al.). Pipelines are treated as experiments, meaning they can be compared and analyzed directly. Only when it is time to flip over to productionalization are they converted to more 'classical' data pipelines.

Within pipelines are steps, which are expressed in ML language familiar to the data scientist. For example, there is a TokenizerStep, a TrainerStep, an EvaluatorStep, and so on. These paradigms are far more understandable than plugging scripts into some form of orchestrator wrapper.

Each pipeline run tracks its metadata and parameters, and can be compared to other runs. The data for each pipeline is automatically versioned and tracked as it flows through. Each run is linked to git commits and compiled into an easy-to-read YAML file, which can optionally be compiled to other DSLs, such as those of Airflow or Kubeflow Pipelines. This is necessary to satisfy other stakeholders in the value chain, such as the data engineers and ML engineers.
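The properties described above (steps named in ML terms, tracked parameters, versioned data, comparable runs) can be illustrated with a small, self-contained sketch. This is not ZenML's actual API: the step classes borrow only their names from the text, and `RunRecord`, `run_pipeline`, `diff_params`, the toy keyword "model", and the hash-based data version are all hypothetical stand-ins.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Dict, List


class TokenizerStep:
    """Splits each text into whitespace tokens, optionally lowercased."""
    def __init__(self, lowercase: bool = True):
        self.params = {"lowercase": lowercase}

    def run(self, data: Dict[str, Any]) -> Dict[str, Any]:
        texts = [t.lower() if self.params["lowercase"] else t for t in data["texts"]]
        return {**data, "tokens": [t.split() for t in texts]}


class TrainerStep:
    """'Trains' a toy model: tokens seen at least min_count times in positive examples."""
    def __init__(self, min_count: int = 1):
        self.params = {"min_count": min_count}

    def run(self, data: Dict[str, Any]) -> Dict[str, Any]:
        counts: Dict[str, int] = {}
        for tokens, label in zip(data["tokens"], data["labels"]):
            if label == 1:
                for tok in tokens:
                    counts[tok] = counts.get(tok, 0) + 1
        vocab = sorted(t for t, c in counts.items() if c >= self.params["min_count"])
        return {**data, "model": vocab}


class EvaluatorStep:
    """Scores accuracy, predicting 1 when any token is in the positive vocabulary."""
    def __init__(self):
        self.params: Dict[str, Any] = {}

    def run(self, data: Dict[str, Any]) -> Dict[str, Any]:
        model = set(data["model"])
        preds = [int(any(t in model for t in toks)) for toks in data["tokens"]]
        correct = sum(p == y for p, y in zip(preds, data["labels"]))
        return {**data, "accuracy": correct / len(preds)}


@dataclass
class RunRecord:
    data_version: str                 # short hash of the input data
    params: Dict[str, Dict[str, Any]] # parameters per step, keyed by step class name
    metrics: Dict[str, float]


def run_pipeline(steps: List[Any], data: Dict[str, Any]) -> RunRecord:
    # Version the input data by hashing its canonical JSON form.
    digest = hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()[:12]
    params = {type(s).__name__: dict(s.params) for s in steps}
    for step in steps:
        data = step.run(data)
    return RunRecord(digest, params, {"accuracy": data["accuracy"]})


def diff_params(a: RunRecord, b: RunRecord) -> Dict[str, tuple]:
    """Return, per step, the parameter sets that differ between two runs."""
    return {name: (a.params.get(name), b.params.get(name))
            for name in a.params if a.params.get(name) != b.params.get(name)}


DATA = {"texts": ["Great movie", "great plot", "Bad movie", "awful"],
        "labels": [1, 1, 0, 0]}

run_a = run_pipeline([TokenizerStep(), TrainerStep(min_count=1), EvaluatorStep()], DATA)
run_b = run_pipeline([TokenizerStep(), TrainerStep(min_count=2), EvaluatorStep()], DATA)
changed = diff_params(run_a, run_b)
```

Here the two runs share a data version and differ only in the `TrainerStep` parameters, so comparing them pins any change in accuracy on that one setting. That is the kind of question a pipeline-as-experiment abstraction is meant to make easy to answer.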
Additionally, the interfaces exposed for individual steps are mostly set up to be easy to extend in an idempotent, and therefore distributable, manner. The data scientist can therefore scale out with different processing backends (like Dataflow or Spark) when dealing with larger datasets.

Of course, ZenML is not the only way to achieve this. Many companies build their own home-grown abstraction frameworks to solve their specific needs. Often these are built on top of some of the other tools I have mentioned above. Regardless of how you get there, the goal should be clear: get the data scientists as close to production as possible with as little friction as possible, incentivizing them to increase their ownership of the models after deployment.

This is a win-win for every persona involved, and ultimately a big win for any organization that aims to make it to the top 1% using ML as a core driver of its business growth.

If you like the thoughts here, we'd love to hear your feedback on ZenML. It is open-source and we are looking for early adopters and contributors! And if you find it is the right order of abstraction for you or your data scientists, then let us know via our Slack. Looking forward to hearing from you!

Also published at https://towardsdatascience.com/why-ml-should-be-written-as-pipelines-from-the-get-go-b2d95003f998

How To Productionalize ML By Development Of Pipelines Since The Beginning
