As builders of an MLOps platform, we often get asked whether serverless is the right compute architecture for deploying models. The cost savings touted by serverless seem as appealing for ML workloads as they are for traditional workloads.
However, the hardware and resource requirements peculiar to ML models can be an impediment to using serverless architectures. To provide the best solution to our customers, we ran extensive benchmarking to compare serverless to traditional computing for inference workloads. In particular, we evaluated inference workloads on different systems including AWS Lambda, Google Cloud Run, and Verta.
This post talks about how to get started with deploying models on AWS Lambda, along with the pros and cons of using this system for inference.
AWS Lambda is AWS’s serverless offering and arguably the most popular cloud-based serverless framework. Specifically, AWS Lambda is a compute service that runs code on demand (i.e., in response to events) and fully manages the provisioning and management of the compute resources required to run your code. And, as with other serverless offerings, it charges you only for the time Lambda is in use.
The main appeal of AWS Lambda is its ease of use on multiple levels: first, the developer doesn’t have to worry about infrastructure or resources and can focus only on business logic; second, the developer doesn’t need to maintain infrastructure, so upgrades, patches, and scaling are fully taken care of; and finally, as mentioned above, using Lambda can be cheaper in terms of total cost of ownership (TCO).
In this blog, we will describe how to run ML models on AWS Lambda. In particular, we will demonstrate how to run DistilBERT on Lambda.
Note that this blog post focuses on the ML-specific aspects of deploying to AWS Lambda. For a primer on how to deploy to Lambda in general, check out these tutorials: Hello World with console, Hello world with AWS SAM.
First, go to the AWS Console and perform the setup for Lambda. In our case, we chose to trigger the Lambda via an HTTP request to API Gateway.
API Gateway provides an HTTP POST endpoint that passes the request body to the actual Lambda function. The logs and metrics from the gateway and the Lambda are stored in AWS CloudWatch.
For this example, we use the DistilBERT question-answering model from HuggingFace. Our inference function performs the following actions:
import json

def predict_answer(event, context):
    try:
        # API Gateway passes the request body through as a JSON string
        body = json.loads(event['body'])
        answer = model.predict(body['question'], body['context'])  # model is a wrapper class I have defined
        return {
            "statusCode": 200,
            "headers": {
                "Content-Type": "application/json",
                "Access-Control-Allow-Origin": "*",
                "Access-Control-Allow-Credentials": True
            },
            "body": json.dumps({"answer": answer})
        }
    except Exception as e:
        # Surface any failure as a 500 with the error message in the body
        return {
            "statusCode": 500,
            "headers": {
                "Content-Type": "application/json",
                "Access-Control-Allow-Origin": "*",
                "Access-Control-Allow-Credentials": True
            },
            "body": json.dumps({"error": repr(e)})
        }
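Once the function is deployed behind API Gateway (covered below), the endpoint can be exercised with a plain HTTP POST. A quick sanity-check sketch using the requests library follows; the invoke URL is a placeholder for whatever API Gateway assigns to your deployed stage:

import requests

# Placeholder for the invoke URL that API Gateway assigns to the deployed stage
url = 'https://<api-id>.execute-api.<region>.amazonaws.com/default/predict_answer'

payload = {
    'question': 'What does AWS Lambda manage for you?',
    'context': ('AWS Lambda runs code on demand and fully manages the '
                'provisioning and management of compute resources.')
}

response = requests.post(url, json=payload)
print(response.json())  # e.g. {'answer': '...'}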
Note that the model was serialized by the following two lines of code:
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering

DistilBertTokenizer.from_pretrained('distilbert-base-uncased', return_token_type_ids=True).save_pretrained('./model')
DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad').save_pretrained('./model')
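The model object referenced in the handler is a small wrapper class that this post doesn’t show in full. As a rough sketch (not the exact implementation), assuming it loads the tokenizer and model from the ./model directory produced above, it might look something like this:

import torch
from transformers import DistilBertForQuestionAnswering, DistilBertTokenizer

class QAModel:
    def __init__(self, model_dir='./model'):
        self.tokenizer = DistilBertTokenizer.from_pretrained(model_dir)
        self.model = DistilBertForQuestionAnswering.from_pretrained(model_dir)
        self.model.eval()

    def predict(self, question, context):
        # Encode the question/context pair; DistilBERT does not use token_type_ids
        inputs = self.tokenizer.encode_plus(question, context, return_tensors='pt')
        inputs.pop('token_type_ids', None)
        with torch.no_grad():
            start_logits, end_logits = self.model(**inputs)[:2]
        # Pick the most likely start/end positions and decode that token span
        start = torch.argmax(start_logits)
        end = torch.argmax(end_logits) + 1
        answer_ids = inputs['input_ids'][0][start:end].tolist()
        return self.tokenizer.decode(answer_ids, skip_special_tokens=True)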
Now that the code is ready, we upload the Lambda function to an S3 bucket as a deployment package and we are good to go.
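The packaging and upload steps themselves are covered in the linked tutorials; as a hedged sketch, pointing an existing function at a zipped deployment package stored in S3 might look like this with boto3 (the function name and S3 key below are placeholders):

import boto3

lambda_client = boto3.client('lambda')

# Point an existing Lambda function at a deployment package previously uploaded to S3
lambda_client.update_function_code(
    FunctionName='distilbert-qa',         # placeholder function name
    S3Bucket='<BUCKET_NAME>',
    S3Key='bert/deployment-package.zip'   # placeholder key for the zipped package
)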
Our setup above looks almost perfect, but we immediately hit a resource limitation. The DistilBERT model is 253MB, and the PyTorch + HuggingFace libraries and their dependencies (CPU-only builds) are 563MB uncompressed. These are well outside the limits imposed by AWS Lambda (the relevant ones are listed below; the full list is here).
- Deployment package size: 50MB compressed (zipped), 250MB uncompressed
- /tmp directory storage: 512MB
- Memory allocation: up to about 3GB
It looks like we cannot deploy the model on AWS Lambda.
Now that we understand the restrictions on using DistilBERT on Lambda, let’s try to work around them by reducing the size of our model and/or libraries, or by changing the way we load the model into the Lambda.
First, we chose to download the model into memory rather than onto disk when the Lambda boots up. This lets us work within the 3GB memory limit and use memory to simulate disk. For this approach, we download the model on our local machine (via save_pretrained) and upload it to S3 as a gzipped tarball. Then, when the Lambda starts up, we download the model into memory. In Python this is feasible because the model is one big binary and many Python libraries can operate on streams rather than files. The snippet below shows how to do this:
import io
import tarfile

import boto3
import torch
from transformers import AutoConfig, AutoModelForQuestionAnswering

s3 = boto3.client('s3')

def load_model_from_s3(self, model_path: str, s3_bucket: str, file_prefix: str):
    if model_path and s3_bucket and file_prefix:
        # Stream the gzipped tarball from S3 straight into memory
        obj = s3.get_object(Bucket=s3_bucket, Key=file_prefix)
        bytestream = io.BytesIO(obj['Body'].read())
        tar = tarfile.open(fileobj=bytestream, mode="r:gz")
        # The small config.json ships with the deployment package itself
        config = AutoConfig.from_pretrained(f'{model_path}/config.json')
        for member in tar.getmembers():
            if member.name.endswith(".bin"):
                # Load the weights directly from the in-memory archive, never touching disk
                f = tar.extractfile(member)
                state = torch.load(io.BytesIO(f.read()))
                model = AutoModelForQuestionAnswering.from_pretrained(
                    pretrained_model_name_or_path=None, state_dict=state, config=config)
        return model
    else:
        raise KeyError('No S3 Bucket and Key Prefix provided')
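For reference, the gzipped tarball that load_model_from_s3 expects could be produced on the local machine and pushed to S3 with something like the sketch below. The archive only needs the weights file (pytorch_model.bin, as written by save_pretrained); config.json is read from the deployment package. The bucket name and key here are placeholders:

import tarfile
import boto3

# Archive just the serialized weights; config.json ships with the deployment package
with tarfile.open('model.tar.gz', 'w:gz') as tar:
    tar.add('./model/pytorch_model.bin', arcname='pytorch_model.bin')

# Upload the archive so the Lambda can stream it into memory at startup
boto3.client('s3').upload_file('model.tar.gz', '<BUCKET_NAME>', 'bert/model.tar.gz')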
Downloading the model into memory was easy. When we get to the Python libraries, things get trickier. As mentioned before, the libraries required for this model amount to 563MB on disk uncompressed. Most of this size comes from binaries with C extensions for Python. Removing those files and placing them in memory is a non-trivial task, since we’d have to change how Python module loading works.
Instead, we used a technique similar to this tutorial: delete everything you know you won’t need. It takes a bit of trial and error to get this right. In the case of this model, the following was enough to get the package to fit within the 512MB of space on /tmp:
find . -type d -name "tests" -exec rm -rf {} +
find . -type d -name "__pycache__" -exec rm -rf {} +
find . -type d -name "include" -exec rm -rf {} +
rm -rf ./{caffe2,wheel,wheel-*,pkg_resources,boto*,aws*,pip,pip-*,pipenv,setuptools}
rm -rf ./{*.egg-info,*.dist-info}
find . -name \*.pyc -delete
find . -type d -name "test" -exec rm -rf {} +
We then compress the trimmed dependencies; the compressed file comes to 124MB, which unfortunately doesn’t fit within the 50MB compressed limit for the deployment package. It wouldn’t fit within the 250MB uncompressed limit either. So we have to add yet another layer of package manipulation to place the libraries in exactly the right place and use them in the right way.
In this case, we uploaded the compressed artifact to S3 and, at the start of the function, we download it (into memory, since it won’t fit on disk alongside its uncompressed version!), unzip it to /tmp, and alter sys.path to change where Python looks for packages. This is done via the following snippet:
import os
import shutil
import sys
import zipfile
from io import BytesIO

import boto3

venv = '/tmp/venv'
venv_tmp = '/tmp/venv_tmp'

if not os.path.exists(venv):
    # Download the zipped dependencies from S3 into memory (they won't fit on /tmp
    # alongside their uncompressed version)
    s3 = boto3.client('s3', use_ssl=False)
    obj = BytesIO(s3.get_object(Bucket='<BUCKET_NAME>', Key='bert/packages.zip')['Body'].read())
    # Extract to a temporary directory first, then rename into place
    if os.path.exists(venv_tmp):
        shutil.rmtree(venv_tmp)
    with zipfile.ZipFile(obj, 'r') as zip_ref:
        zip_ref.extractall(venv_tmp)
    os.rename(venv_tmp, venv)

# Make the unpacked packages importable
if venv + "/packages" not in sys.path:
    sys.path.append(venv + "/packages")
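For completeness, the packages.zip fetched above could be built from the trimmed dependencies and uploaded with a sketch like the following, assuming the libraries were installed into a local packages/ directory (so the archive unpacks to venv/packages, matching the sys.path entry):

import shutil
import boto3

# Zip the trimmed dependencies; the archive contains a top-level packages/ directory
shutil.make_archive('packages', 'zip', root_dir='.', base_dir='packages')

# Upload the archive so the function can fetch and unpack it on a cold start
boto3.client('s3').upload_file('packages.zip', '<BUCKET_NAME>', 'bert/packages.zip')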
Now that we have everything in the right place (either on disk or in memory), we can actually load the libraries and models and run predictions! It does take a while to spin up a new worker, since a lot of work has to be done by our customized code and can’t be cached by Lambda for reuse in other workers.
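One way to keep that startup cost to a minimum is to do the expensive work once per worker and cache it at module scope, so warm invocations skip it entirely. A generic sketch of the pattern follows; load_model here is just a stand-in for the dependency bootstrapping and S3 streaming shown above:

import json

_model = None  # cached at module scope; survives across warm invocations of this worker

def load_model():
    # Stand-in for the expensive setup above: unzip dependencies to /tmp,
    # stream the weights from S3, and construct the model wrapper.
    ...

def get_model():
    global _model
    if _model is None:  # only true on a cold start
        _model = load_model()
    return _model

def predict_answer(event, context):
    body = json.loads(event['body'])
    answer = get_model().predict(body['question'], body['context'])
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}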
So now we have our model deployed on AWS Lambda. But the big question is: should you?
The complexity of running an ML model on Lambda is directly proportional to the size of the model and its dependencies. A very simple model with dependencies of a few MBs can be deployed on Lambda in a matter of minutes. However, as the size of the model and its dependencies increases, deploying to Lambda becomes progressively harder. It involves hours of trial and error to get the model and dependencies to fit within the constraints imposed by AWS. Moreover, this effort must be repeated for every model that is deployed.
In addition, from a performance standpoint, the optimizations needed to fit within the Lambda resource constraints have side effects during execution. In our case, reading and uncompressing the dependencies to /tmp and reading the compressed model from S3 into memory impose an overhead on the cold-start latency of the Lambda. This overhead comes on top of the typical cold-start overhead for serverless systems that we evaluate in our benchmarking blog (below).
Finally, deploying or making changes to any of the setup above via the AWS Console does not provide any audit trail or versioning capabilities, making updates error-prone and burdensome (though this can be mitigated with an automated CI system).
So our verdict:
Curious about alternatives for AWS Lambda? Check out our blog on Google Cloud Run and, of course, check out Verta!
Previously published at https://blog.verta.ai/blog/how-to-deploy-ml-models-with-aws-lambda