Modzy's Approach to Processing Engines and AI Infrastructure Management

by @modzy, Hacker Noon

Modzy is a software platform for organizations and developers to responsibly deploy, monitor, and get value from AI, at scale.

Our approach to infrastructure management for artificial intelligence (AI) relies on processing engines to support a wide range of deployment needs. Infrastructure management is one component of managing AI for enterprise scale.

As organizations aspire to integrate AI and machine learning (ML) technologies into their daily workflows, they must first embrace and establish an effective ModelOps lifecycle. Forrester has a widely used definition for ModelOps: the tools, technology, and practices that enable cross-functional AI teams to efficiently deploy, monitor, retrain, and govern AI models in production systems.

For an AI model to succeed in the deployment portion of ModelOps, organizations must ensure they can replicate the model development environment in a production-ready system. Doing so, however, presents a challenge: effectively and efficiently managing infrastructure. Several factors, including scalability, latency, and security trade-offs, complicate the process further.

What is a Processing Engine?

All Modzy deployments require the deployment of Kubernetes, which is an open-source container orchestration system and the core underlying technology that makes Modzy work. More specifically, Modzy uses Kubernetes to manage models, where each model is served as a microservice in a container. This paradigm allows Modzy to scale horizontally and provide customers with an efficient way of running many inference jobs to one, or many, models in parallel.

Kubernetes utilizes nodes, physical or virtual machines within a Kubernetes cluster that vary based on the user's choice of infrastructure, to run different application instances. Modzy uses these nodes to run inference jobs through model containers. For example, deploying Modzy onto existing AWS infrastructure would result in a Kubernetes cluster with EC2 compute instances as nodes. At Modzy, we call these nodes "processing engines," and the total number of processing engines allocated to a Modzy deployment determines the account's overall parallel processing capacity (see Figure 1).


Figure 1. The top right portion of the "Models" page within a Modzy instance shows the total number of processing engines an account can access, which determines the account's overall parallel processing capacity.

Managing Processing Engines with Modzy

When a Modzy user submits an inference job to a model, the platform performs a few actions behind the scenes:

  • Spins up a processing engine and deploys a single instance of the model container
  • Executes inference on that processing engine and returns the model predictions to the user
  • Shuts down the processing engine
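The lifecycle above can be sketched as a minimal Python simulation. The class and method names here are illustrative stand-ins, not Modzy's actual internals:

```python
# Minimal sketch of the processing-engine lifecycle: spin up a
# single model container, run inference, shut the engine down.
class ProcessingEngine:
    def __init__(self, model_name):
        self.model_name = model_name
        self.running = False

    def spin_up(self):
        # In Modzy, this step schedules a model container onto a Kubernetes node.
        self.running = True

    def infer(self, payload):
        assert self.running, "engine must be running before inference"
        # Placeholder standing in for the model's real prediction logic.
        return {"model": self.model_name, "input": payload, "prediction": "ok"}

    def shut_down(self):
        self.running = False


def run_job(model_name, payload):
    """Execute one inference job end to end, as in the steps above."""
    engine = ProcessingEngine(model_name)
    engine.spin_up()
    try:
        result = engine.infer(payload)
    finally:
        engine.shut_down()  # engines scale back down after the job completes
    return result
```

The `try/finally` mirrors the guarantee that engines are released even if inference fails, which is what keeps idle capacity (and cost) at zero.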

Modzy manages all job requests in a queue that executes sequentially based on available processing engines and job management protocols. Organizations can modify a job's priority in the queue with account API keys and other measures; however, a job can only execute when a processing engine is available.
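A queue of this shape can be sketched with Python's standard-library `heapq`. The priority values and names below are assumptions for illustration, not Modzy's queueing protocol:

```python
import heapq
import itertools


class JobQueue:
    """Sketch of a priority job queue: lower priority number runs first;
    ties execute in submission (FIFO) order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserving FIFO order

    def submit(self, job_name, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), job_name))

    def next_job(self):
        """Pop the next job; called only when a processing engine is free."""
        if not self._heap:
            return None
        _, _, job_name = heapq.heappop(self._heap)
        return job_name


queue = JobQueue()
queue.submit("batch-report")               # default priority
queue.submit("urgent-inference", priority=1)  # jumps the queue
queue.submit("nightly-job")                # runs after earlier default jobs
```

Here "urgent-inference" would run first despite being submitted second, but it still waits until an engine is actually available.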

By default, Modzy automatically scales and manages an account's available processing engines to optimize infrastructure costs (i.e., auto-scaling). Specifically, the minimum number of processing engines for a given model is set to zero (see Figure 2). When a user submits an inference job, Modzy spins up a processing engine, executes inference, and automatically shuts the engine down when the job completes.


Figure 2. By default, the minimum number of processing engines for each model is set to zero.

While auto-scaling minimizes infrastructure costs, it comes with a latency trade-off: each inference job waits the few minutes it takes to spin up a processing engine.
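The trade-off can be made concrete with rough numbers. The spin-up time and hourly rate below are illustrative assumptions, not Modzy figures:

```python
def added_latency_per_job(min_engines, spin_up_minutes=3.0):
    """With the minimum set to zero, every job pays a cold-start penalty
    (assumed here to be ~3 minutes); with a warm engine, it does not."""
    return spin_up_minutes if min_engines == 0 else 0.0


def idle_cost_per_day(min_engines, hourly_rate=0.50):
    """Cost of keeping engines warm around the clock, at a hypothetical
    $0.50/hour compute rate."""
    return min_engines * hourly_rate * 24


# Auto-scaled default: zero idle cost, ~3 minutes of extra latency per job.
# One warm engine: no cold-start latency, ~$12/day of standing cost.
```

Which side of the trade-off wins depends on job frequency: a model hit every few minutes quickly justifies a warm engine, while an occasionally used model does not.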

To decrease these latencies, Modzy gives end users the ability to manually upscale and downscale processing engines as needed. Figure 3 shows a model with one processing engine scaled up. In this scenario, a user can submit subsequent jobs to the model without the spin-up latency. While this may expedite the integration of AI inference jobs into an application, the organization incurs infrastructure costs until the minimum processing engines field is manually set back to zero.


Figure 3. Minimum processing engines manually upscaled to one.

Giving organizations the power to choose how they optimize their infrastructure cost versus inference runtime trade-offs makes Modzy a viable choice for all customers. Deploying AI to a production-ready environment can be challenging, particularly because of the complexities related to infrastructure optimization. Modzy turns this challenge into an easy and seamless process for the user.
