Machine learning, particularly Natural Language Processing (NLP), is transforming the way we build software. Whether you're improving search experiences with embedding models for semantic matching, generating content using powerful text-generation models, or optimizing retrieval with specialized ranking models, NLP capabilities have become crucial building blocks for modern applications. Yet there's a lingering perception that deploying these language models into production requires complex tooling or specialized knowledge, making many developers understandably hesitant to dive in.

This hesitation often stems from the belief that NLP deployment is inherently difficult or overly technical—something reserved for machine learning specialists. But that's simply not the case. Modern frameworks, especially Transformers, have made powerful NLP accessible and surprisingly straightforward to use. In fact, if you've worked with standard backend technologies like Docker, Flask, or cloud services like AWS, you already have the skills needed to easily deploy a Transformer-based NLP model.

In this blog post, we'll gently unravel this myth by demonstrating how approachable and developer-friendly deploying Transformers can be. No deep machine learning expertise required—just familiar tools you probably already use daily.

Of course, the intention here isn't to trivialize the complexities that still exist—optimizing large-scale models, fine-tuning GPU performance, managing massive datasets, or deploying cutting-edge architectures like Mixture-of-Experts (MoEs) still involves specialized knowledge and substantial practice. However, there's an entire universe of valuable, practical ML models that you can deploy right now with minimal friction. This post is intended to lay a solid foundation upon which you can gradually build deeper expertise through continued practice.

You're about to discover how easy it is to wield some of AI's most powerful tools using skills you already have. Let's dive in!

## 🤖 Making Transformers Accessible: From Hugging Face to Your Local API

What exactly is a transformer model? Put simply, Transformers are a powerful family of deep-learning models specifically designed to excel at language tasks. Whether you're implementing semantic search through embeddings, analyzing sentiment, generating natural-sounding text, or ranking content for better retrieval, Transformers power some of the most impactful NLP applications today.

### Enter Hugging Face 🤗: Democratizing Transformers

Thankfully, Hugging Face has made Transformer models accessible, approachable, and developer-friendly. Rather than starting from scratch or managing complex training pipelines, Hugging Face provides a vast selection of ready-to-use Transformer models—making sophisticated NLP capabilities available to anyone comfortable writing a few lines of Python.

By providing easy access to thousands of pre-trained models, Hugging Face significantly lowers the barrier for integrating NLP into your applications. You can easily download models, test their performance, and incorporate them directly into your workflow—no deep ML expertise or expensive hardware required.

### How Easy Is It Really?

Using these transformer models locally doesn't require complicated infrastructure or deep ML expertise. Here's the simple flow:

1. **Pick your model:** Choose one from Hugging Face's vast catalog.
2. **Load the model:** With just a couple lines of Python, you'll download and load the model into memory.
3. **Serve predictions:** Wrap your model in a simple HTTP API with Flask to handle prediction requests.
4. **Scale requests:** Use Gunicorn, a robust WSGI server, to handle concurrent traffic smoothly in production.
5. **Containerize with Docker:** Package your Flask API into a Docker container to ensure it runs consistently anywhere.
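To make the first two steps concrete, here is a minimal sketch (an illustration, not code from the repo) of loading a sentiment-analysis model with the Transformers `pipeline` helper. The model name matches the one we'll deploy later in this post:

```python
# pip install transformers torch
from transformers import pipeline

# Downloads the model from the Hugging Face Hub on first run, then caches it locally
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Run a quick prediction
print(classifier("Deploying transformers is easier than I thought!"))
# Example output (scores will vary): [{'label': 'POSITIVE', 'score': 0.99}]
```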
In the rest of this post, we'll walk through exactly how you can use these tools—Flask, Docker, and Hugging Face transformers—to effortlessly deploy an ML model as a professional-grade API on AWS SageMaker.

## 🐳 Why Docker? (And Why It Matters Here)

Docker plays a central role in simplifying the ML deployment workflow. Here's why it's critical:

- **Consistency:** Docker ensures your application runs the same everywhere—locally, on AWS SageMaker, or any cloud provider. This eliminates the notorious "it works on my machine" problem.
- **Portability:** You build your app once, package it as a Docker container, and then deploy it anywhere without worrying about environment discrepancies.
- **Simplicity & Efficiency:** With Docker, you manage dependencies cleanly, avoid manual setup headaches, and streamline the path from development to production.

For this project, Docker allows you to package your Flask API and transformer model in a single container image that easily deploys to AWS SageMaker, ensuring a frictionless deployment experience. Docker ensures your ML inference app is consistent and robust no matter where you run it.

## 📌 What's Our Goal?

We'll build a straightforward Dockerized API hosting a Hugging Face DistilBERT sentiment analysis model using:

- **Flask** to handle HTTP requests
- **Gunicorn** to robustly handle concurrent requests in production
- **Docker and Docker Compose** to containerize our application
- **AWS SageMaker** for seamless cloud deployment

🚀 **Follow Along on GitHub:** Check out the Docker Transformer Inference repo—run, customize, and deploy your own transformer models effortlessly!
## 💻 Project Structure

Here's the project setup, highlighting how Docker seamlessly packages our Transformer-serving Flask app:

```text
DockerTransformerInference/
├── app/                      # App source code
│   ├── api/
│   │   └── model.py          # Transformer model wrapper (DistilBERT)
│   └── main.py               # Flask API (prediction & health-check endpoints)
│
├── Dockerfile                # Container setup (Python, Flask, Gunicorn, dependencies)
├── docker-compose.yml        # Quick local container setup & testing
├── requirements.txt          # Python dependencies
│
└── sagemaker/                # Scripts for AWS SageMaker deployment & testing
    ├── build_and_push.sh
    ├── deploy_model.py
    └── test_endpoint.py
```

### 📌 Key Files Explained

**🐳 Dockerfile**

- Defines our app environment (Python, Flask, Gunicorn).
- Installs dependencies & sets key environment variables.
- Prepares the app to run consistently everywhere (local, AWS, etc.).

**🚀 docker-compose.yml**

- Quickly spins up our app locally for testing & debugging.
- Maps container port `8080` to your machine for easy access.

**⚙️ app/main.py**

- Contains our Flask API endpoints (`/ping`, `/invocations`), crucial for SageMaker compatibility.

**🧠 app/api/model.py**

- Wraps the Hugging Face DistilBERT model—simple transformer model inference logic.

**🛠️ requirements.txt & SageMaker scripts**

- `requirements.txt`: Lists Python dependencies to ensure reproducibility (see the sketch below).
- SageMaker scripts: Automate image build, deployment, and testing on AWS SageMaker.
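For reference, here's a plausible minimal `requirements.txt` for this stack (purely illustrative: the repo's actual file may pin versions or include additional packages). These are simply the four libraries used throughout this post:

```text
flask
gunicorn
transformers
torch
```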
With this clear and lightweight setup, deploying your transformer model becomes straightforward!

## 🚀 Step-by-Step: Let's Build It!

In this section, we'll walk through the exact steps needed to deploy your transformer-serving API to AWS SageMaker. Along the way, I'll highlight crucial considerations to help you avoid common pitfalls when deploying ML models with Docker and Flask.

### 1. Setting up Your Flask API (Familiar Territory with a Twist)

If you've built Flask APIs before, this will feel straightforward. But SageMaker adds some specific requirements, so let's highlight those clearly.

Your Flask API (`app/main.py`) requires two key endpoints:

- `GET /ping`: A health-check endpoint. AWS SageMaker mandates that this endpoint return an HTTP 200 status quickly.
- `POST /invocations`: Your inference endpoint. It handles incoming requests and sends them to your transformer model for predictions.

Here's how your Flask code looks in practice:

```python
from flask import Flask, request, jsonify

from api.model import TransformerModel

# Flask app setup
app = Flask(__name__)

# Load transformer model (cached for fast inference)
model = TransformerModel("distilbert-base-uncased-finetuned-sst-2-english")


@app.route('/ping', methods=['GET'])
def ping():
    # SageMaker expects HTTP 200 status
    return '', 200


@app.route('/invocations', methods=['POST'])
def predict():
    # Parse input JSON payload (example: {"text": "Great blog post!"})
    data = request.get_json()

    # Guard clause: make sure input data has 'text' field
    if not data or 'text' not in data:
        return jsonify({"error": "Please provide input text."}), 400

    # Run inference using transformer model
    result = model.predict(data['text'])

    # Return inference result as JSON
    return jsonify(result)


if __name__ == "__main__":
    # Ensure app is accessible externally in Docker
    app.run(host='0.0.0.0', port=8080)
```
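Before moving on, a quick local smoke test is worth running. Here's an illustrative sketch (not from the repo) that assumes the API is already running locally on port 8080, for example via the Docker Compose setup shown later:

```python
# pip install requests
import requests

BASE_URL = "http://localhost:8080"  # assumes the API is running locally on port 8080

# Health check: SageMaker will hit this same endpoint after deployment
assert requests.get(f"{BASE_URL}/ping").status_code == 200

# Inference request with the JSON shape the Flask app expects
response = requests.post(
    f"{BASE_URL}/invocations",
    json={"text": "This deployment workflow is surprisingly smooth!"},
)
print(response.json())
# Example output (scores will vary): {"negative": 0.0012, "positive": 0.9988}
```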
### 2. Your Transformer Model Wrapper: Hugging Face Simplifies Everything

If you have never hosted a transformer model yourself, the key insight I want you to walk away with is that Hugging Face dramatically simplifies the process—and you can use the same framework to deploy your own custom transformer models that aren't available on Hugging Face.

The `app/api/model.py` wrapper takes care of loading the model, tokenizing input text, and performing predictions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class TransformerModel:
    def __init__(self, model_name):
        # Load pretrained tokenizer & model directly from the Hugging Face Hub
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)

    def predict(self, text):
        # Tokenize input text (convert words to numeric token IDs)
        inputs = self.tokenizer(text, return_tensors="pt")

        # Run inference (get raw predictions from the transformer model)
        outputs = self.model(**inputs)

        # Convert raw logits into probabilities with softmax
        probs = torch.nn.functional.softmax(outputs.logits, dim=1).detach().numpy()[0]

        # Human-readable labels for sentiment analysis (negative, positive)
        return {
            "negative": float(probs[0]),
            "positive": float(probs[1]),
        }
```

This snippet provides a concise wrapper for sentiment analysis using Hugging Face transformers. It loads a pretrained model and tokenizer, converts input text into numeric tokens, performs inference, and outputs clear, human-readable sentiment probabilities. Let's briefly clarify the two main concepts involved:

**Tokenization**

Transformers can't read plain text directly. Tokenization converts text into numeric tokens (unique IDs) so models can process it.

Example: `"I love Docker!"` → `[1045, 2293, 2035, 999]`

**Softmax**

Transformer models output raw scores (logits) indicating prediction strength. Softmax transforms these logits into probabilities between 0 and 1, making results easy to interpret.

Example: logits `[2.0, 4.0]` → probabilities `[0.12, 0.88]`, meaning an 88% likelihood for the second category.
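To see the softmax step in isolation, here's a tiny PyTorch sketch that reproduces the example numbers above:

```python
import torch
import torch.nn.functional as F

# Raw model outputs (logits) for two classes
logits = torch.tensor([2.0, 4.0])

# Softmax rescales the logits into probabilities that sum to 1
probs = F.softmax(logits, dim=0)

print(probs)  # tensor([0.1192, 0.8808]) -> roughly [0.12, 0.88]
```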
### 3. Dockerizing Your Service: A Known Process, With Some Gotchas

If you're familiar with Docker, containerizing your Flask API is straightforward, but deploying on AWS SageMaker introduces specific considerations.

**Dockerfile Explanation:**

```dockerfile
FROM public.ecr.aws/sam/build-python3.10

# Environment variables important for clean & fast execution
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /app

# Copy dependencies and install them
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

# Copy application code into container
COPY . .

# Critical point for SageMaker: ENTRYPOINT vs CMD
ENTRYPOINT ["gunicorn", "app.main:app", "-b", "0.0.0.0:8080"]
```

**Why `ENTRYPOINT` instead of `CMD`?**

AWS SageMaker uses a command structure like `docker run <image> serve` to launch the container. Defining an explicit `ENTRYPOINT` ensures the container correctly handles this requirement and avoids startup errors.

**Docker Compose (`docker-compose.yml`) for Local Development**

For smooth local testing, this configuration makes life easy:

```yaml
version: '3.8'

services:
  transformer-api:
    build: .
    ports:
      - "8080:8080"
    volumes:
      - .:/app
    restart: always
```

**Important Docker Gotchas for SageMaker Deployment:**

- **Architecture Compatibility:** SageMaker infrastructure runs Linux on AMD64 architecture. When building your Docker image on macOS (especially ARM64), explicitly specify the target platform to avoid runtime errors:

  ```bash
  docker build --platform linux/amd64 -t your-image-name .
  ```

- **Docker Credential Configuration:** Ensure your Docker credentials (`~/.docker/config.json`) correctly specify `"credStore"` (not `"credsStore"`), as misconfiguration will cause authentication issues when pushing images to Amazon ECR.
### 4. AWS SageMaker Deployment

This section outlines a streamlined process for deploying your Docker container onto AWS SageMaker. In this project, I used the AWS CLI and custom Python scripts to demonstrate the basic steps needed for deployment. You can also automate this process using CloudFormation, CDK, or other CI/CD frameworks, but that's probably for another blog post; here we stick to the basics.

**Step 1: Push Docker Container to AWS ECR**

Your image must reside in Amazon ECR before deploying to SageMaker. Use this straightforward script (`build_and_push.sh`):

```bash
aws ecr get-login-password --region us-west-2 | \
  docker login --username AWS --password-stdin YOUR_AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com

docker build --platform linux/amd64 -t transformer-inference .

docker tag transformer-inference:latest YOUR_AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/transformer-inference:latest

docker push YOUR_AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/transformer-inference:latest
```

**Step 2: SageMaker Endpoint Deployment**

Once you've pushed your Docker image to Amazon ECR, you're ready to deploy your model onto AWS SageMaker. The deployment involves three primary steps, handled by the provided deployment script (`deploy_model.py`).

What the deployment script does:

1. **Creates a SageMaker Model:** Connects your Docker container in Amazon ECR with an AWS IAM role, defining the permissions SageMaker needs to run your image.
2. **Defines an Endpoint Configuration:** Specifies AWS hardware resources, including the **instance type** (the type of EC2 instance, e.g., `ml.m5.large`) and the **instance count** (the number of instances, for scaling purposes).
3. **Deploys the Endpoint:** Launches your Docker container on AWS infrastructure and makes it accessible via an endpoint URL.

(A minimal boto3 sketch of these three steps follows below.)
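This isn't the repo's actual `deploy_model.py`; it's just a minimal sketch of the three steps above using boto3, with the image URI, role ARN, and names as placeholders you'd replace:

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Placeholders: substitute your own account ID, role, and names
image_uri = "YOUR_AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/transformer-inference:latest"
role_arn = "arn:aws:iam::YOUR_AWS_ACCOUNT_ID:role/YourSageMakerExecutionRole"
model_name = "docker-transformer-inference"

# 1. Create a SageMaker Model pointing at the ECR image
sm.create_model(
    ModelName=model_name,
    PrimaryContainer={"Image": image_uri},
    ExecutionRoleArn=role_arn,
)

# 2. Define an endpoint configuration (instance type & count)
sm.create_endpoint_config(
    EndpointConfigName=f"{model_name}-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }
    ],
)

# 3. Deploy the endpoint (it takes several minutes to reach InService)
sm.create_endpoint(
    EndpointName=f"{model_name}-endpoint",
    EndpointConfigName=f"{model_name}-config",
)
```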
**How to run the deployment script:**

Navigate to your project directory and run:

```bash
python sagemaker/deploy_model.py --instance-type ml.m5.large
```

**Optional customization parameters:**

You can customize your deployment using additional command-line options:

- `--model-name`: Sets a custom name for your SageMaker model (default: `docker-transformer-inference`).
- `--instance-type`: Selects a specific AWS instance type (default: `ml.m5.large`).
- `--instance-count`: Defines how many instances to run concurrently (default: `1`).
- `--region`: AWS region for deployment (default: your configured AWS CLI region).
- `--role-arn`: Explicitly specifies an existing IAM role for SageMaker execution.

**Example with custom options:**

```bash
python sagemaker/deploy_model.py --instance-type ml.c5.xlarge --instance-count 2 --region us-west-2
```

**Important Considerations:**

- Ensure your IAM role has permissions for SageMaker and Amazon ECR access.
- The deployment will take several minutes; SageMaker health checks (`/ping`) must pass quickly or the deployment will fail.

**Step 3: Testing Your Deployed Endpoint**

After deploying your model, you'll need to confirm the endpoint works correctly. The provided script (`test_endpoint.py`) simplifies this verification process.

What the test script does:

- Uses the SageMaker runtime API (`boto3`) to call your endpoint.
- Sends a JSON payload (e.g., sentiment-analysis text) to the `/invocations` endpoint.
- Receives and prints the model's inference output, such as sentiment classification probabilities.

**How to run the test script:**

From your project directory, execute:

```bash
python sagemaker/test_endpoint.py --endpoint-name docker-transformer-inference-endpoint
```

Replace `docker-transformer-inference-endpoint` if you customized your endpoint name during deployment. (A minimal sketch of the underlying runtime call follows below.)
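This isn't the repo's actual `test_endpoint.py`; it's just a minimal sketch of the kind of SageMaker runtime call the script makes, assuming the default endpoint name above and the `us-west-2` region used earlier:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")

# Same JSON shape the Flask /invocations endpoint expects
payload = {"text": "This is a great product!"}

response = runtime.invoke_endpoint(
    EndpointName="docker-transformer-inference-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)

# The response body is a stream; decode it into the sentiment probabilities
result = json.loads(response["Body"].read().decode("utf-8"))
print(result)  # e.g., {"negative": 0.0009, "positive": 0.9991}
```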
**Alternative Testing Methods:**

If you prefer using the AWS CLI directly, here's how you can invoke the endpoint.

Using a modern AWS CLI (v2, passing the JSON body directly):

```bash
aws sagemaker-runtime invoke-endpoint \
    --endpoint-name docker-transformer-inference-endpoint \
    --content-type application/json \
    --body '{"text": "This is a great product!"}' \
    --cli-binary-format raw-in-base64-out \
    output.json

# To view the prediction results
cat output.json
```

Using the AWS CLI with manual base64 encoding:

```bash
aws sagemaker-runtime invoke-endpoint \
    --endpoint-name docker-transformer-inference-endpoint \
    --content-type application/json \
    --body $(echo '{"text": "This is a great product!"}' | base64) \
    output.json

# To view the prediction results
cat output.json
```

**Important Considerations:**

- Ensure your JSON payload exactly matches the expected format defined in your Flask app (`{"text": "<your-text-here>"}`).
- For straightforward testing, the Python script is recommended, as it handles payload formatting automatically and avoids potential confusion with AWS CLI encoding requirements.

## ✨ Wrapping Up

As you can see, deploying transformers using Docker and Flask is manageable—particularly because you already have these fundamental backend engineering skills. Your familiarity with containerization, backend APIs, and AWS tooling makes deploying ML services much easier than you might initially expect.

🚀 **Code Repo:** docker-transformers-inference

If you enjoyed this post or have questions, let's connect!

Happy ML Deployments! 🚀✨