MLOps

A hands-on, beginner-friendly guide to deploying, monitoring, and optimizing a Machine Learning model in production — from API creation to data drift detection, using the real Home Credit Default Risk dataset from Kaggle.

Last Saturday morning, I was scrolling through freelance gig postings on Fiverr when I stumbled upon something that caught my attention. A small fintech startup was looking for someone to "take our trained credit scoring model and make it production-ready — API, Docker, CI/CD, the whole shebang." The budget was decent, the deadline was two weeks, and I thought: "How hard can it be?"

They pointed me to the Home Credit Default Risk dataset from Kaggle — a real-world dataset with 300,000+ loan applications and 122 features. Home Credit is a company that provides loans to people with little or no credit history. The challenge: predict which applicants are likely to default.

Spoiler: the project was more involved than I expected. But by Sunday evening, I had a working end-to-end MLOps pipeline running, and I learned an incredible amount. This article is the tutorial I wish I had before starting. Whether you're a data science student, a junior ML engineer, or someone curious about what happens after a model is trained, this guide will walk you through every step. I'll explain not just the how but the why behind every decision — because understanding the reasoning is what separates someone who follows a tutorial from someone who can adapt the knowledge to new situations.
What we'll build together:

- A prediction API using FastAPI that serves a credit scoring model trained on real Kaggle data
- Automated tests to make sure our API doesn't break
- A Docker container to package everything for deployment
- A CI/CD pipeline with GitHub Actions to automate testing and deployment
- A data drift analysis to monitor model health over time
- Performance optimizations to speed up inference

Let's dive in.

Table of Contents

1. The Big Picture: What Is MLOps?
2. Setting Up the Project
3. Exploring and Preparing the Home Credit Dataset
4. Training the Credit Scoring Model
5. Creating a Prediction API with FastAPI
6. Writing Automated Tests
7. Containerizing with Docker
8. Building a CI/CD Pipeline
9. Logging Production Data
10. Data Drift Detection
11. Performance Optimization
12. The Final Architecture
13. Key Takeaways

1. The Big Picture: What Is MLOps?

Before we write a single line of code, let's understand the landscape we're operating in. This context is what makes the difference between blindly following steps and actually understanding what you're building.
The "Last Mile" Problem

Here's a reality check that most online courses don't tell you: training a model is only about 20% of the work in a real ML project. The remaining 80% is everything that happens around it — data pipelines, deployment, monitoring, maintenance.

Think about it: you've built a great model in a Jupyter notebook. It achieves 0.85 AUC. Everyone's happy. But then what? The model just sits in your notebook. How does the loan officer at the bank actually use it to make decisions? How do you know it's still accurate 6 months from now? What happens when it breaks at 3 AM on a Friday?

This is what MLOps (Machine Learning Operations) solves. It's the set of practices that takes a model from "it works on my laptop" to "it runs reliably in production, 24/7, and we know when something goes wrong." Think of it as DevOps (the practices software engineers use to deploy and maintain applications), but specifically adapted for the unique challenges of machine learning — where the code and the data can both change and cause failures.

The MLOps lifecycle — training is just one piece of a much larger puzzle. (source: ml-ops.org)
What We'll Cover and Why

| Pillar | What It Means | Why It Matters | Tool We'll Use |
|---|---|---|---|
| Model Serving | Making predictions available via an API | So other systems and people can actually use the model | FastAPI |
| Containerization | Packaging code + dependencies together | So the code runs identically everywhere, not just on your machine | Docker |
| CI/CD | Automating tests and deployment | So broken code never reaches production | GitHub Actions |
| Monitoring | Watching model behavior in production | So you know when the model starts making bad predictions | Evidently AI |
| Optimization | Making inference faster | So users don't wait 10 seconds for a prediction | cProfile, ONNX |
| Version Control | Tracking every change | So you can trace what changed, when, and why | Git + GitHub |

Without containerization, your code works on your machine but breaks on the server because of a different Python version. Without CI/CD, every deployment is a manual, error-prone process where someone eventually forgets to run the tests. Without monitoring, your model silently degrades for months before anyone notices.
Each piece we build solves a real, concrete problem. Let's start.

2. Setting Up the Project

Why project structure matters

Before we write any ML code, we need to organize our workspace. This might seem boring, but good project structure is the foundation of good MLOps. When someone new joins the project (or when future-you comes back in 6 months), they should immediately understand where everything lives.

Think of it like organizing a kitchen: if ingredients, utensils, and recipes are all mixed together in one drawer, cooking is a nightmare. If they're in labeled cabinets, anyone can find what they need.

Here's the structure we'll use:

```
credit-scoring-mlops/
│
├── app/                    # API code lives here
│   ├── __init__.py         # Makes 'app' a Python package
│   ├── main.py             # FastAPI application
│   ├── model_loader.py     # Model loading logic
│   └── schemas.py          # Input/output validation rules
│
├── model/                  # Trained model artifacts
│   └── credit_model.pkl
│
├── tests/                  # All test code
│   ├── __init__.py
│   ├── test_api.py         # Tests for the API
│   └── test_model.py       # Tests for the model
│
├── notebooks/              # Analysis notebooks
│   └── data_drift_analysis.ipynb
│
├── monitoring/             # Logging and monitoring code
│   └── logger.py
│
├── .github/                # CI/CD pipeline configuration
│   └── workflows/
│       └── ci-cd.yml
│
├── Dockerfile              # How to build the Docker container
├── requirements.txt        # Python dependencies with versions
├── .gitignore              # Files Git should ignore
└── README.md               # Documentation for humans
```

The separation between app/ (API code), tests/ (test code), model/ (artifacts), and monitoring/ (observability code) follows a common convention in ML engineering. Each folder has one responsibility. If there's a bug in the API, you look in app/. If a test is failing, you look in tests/. Simple.

Step by step — create the folders

Creating the directory skeleton, so every file we create later has a logical home:

```bash
mkdir credit-scoring-mlops && cd credit-scoring-mlops
mkdir -p app model tests notebooks monitoring .github/workflows
```

The -p flag tells mkdir to create parent directories as needed. Without it, mkdir .github/workflows would fail because .github doesn't exist yet.

Initialize Git

Turning this folder into a Git repository. Git tracks every change you make, creating a history you can go back to if something breaks. It's like having infinite "undo" for your entire project.

```bash
git init
```

This creates a hidden .git/ folder that stores all the tracking information. You'll never need to touch it directly.

Create the .gitignore

Telling Git which files to never track.
Some files should never be in a repository:

- Secrets (API keys, passwords) — if they end up in Git, they're in the history forever, even if you delete them later
- Large data files (CSVs) — Git isn't designed for large binary files; it would make the repo slow to clone
- Generated files (__pycache__/, .pyc) — these are recreated automatically and just add noise

```bash
cat > .gitignore << 'EOF'
# Python bytecode — generated automatically, no need to track
__pycache__/
*.py[cod]

# Virtual environments — each developer creates their own
.venv/
venv/

# Data files — too large for Git (use DVC or Git LFS for these)
*.csv
*.parquet
data/

# Secrets — NEVER commit these. Use environment variables instead.
.env
*.secret

# IDE configuration — specific to each developer's setup
.vscode/
.idea/

# OS-specific files — useless noise
.DS_Store
Thumbs.db

# Log files — generated at runtime
*.log
logs/
EOF
```

First commit

Creating our first "snapshot" of the project. Every commit is a checkpoint. If something goes wrong later, we can always come back to this clean state.
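To see the "checkpoint" idea in action, here's a small throwaway demo you can run in a temporary directory — the repo, the identity, and the notes.txt file are all made up for illustration:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity for the demo
git config user.name demo

echo "version 1" > notes.txt
git add notes.txt && git commit -qm "feat: add notes"

echo "version 2" > notes.txt
git add notes.txt && git commit -qm "fix: update notes"

git log --oneline                 # shows both checkpoints
git checkout HEAD~1 -- notes.txt  # roll notes.txt back to the previous commit
cat notes.txt                     # back to "version 1"
```

`git checkout <commit> -- <path>` restores just that file from an earlier checkpoint without touching the rest of the project — the everyday form of "infinite undo."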
The commit message should describe what this snapshot contains.

```bash
git add .gitignore
git commit -m "Initial commit: project structure and .gitignore"
```

A note on commit messages: Write them as if someone else will read them (they will — future you). "fix stuff" is useless. "feat: add input validation for age field" tells you exactly what changed. A common convention is to prefix with feat: (new feature), fix: (bug fix), docs: (documentation), or refactor: (code cleanup).

3. Exploring and Preparing the Home Credit Dataset

3.1 — About the dataset

Understanding what data we're working with before touching any code. You can't build a good model — or a good API — if you don't deeply understand your data. This step is often rushed, but it's the most important.

The Home Credit Default Risk dataset comes from a real Kaggle competition. Home Credit provides loans to people who have little or no traditional credit history — the "unbanked" population. These are people who might be rejected by traditional banks simply because they don't have enough credit history, not because they're actually risky.

The main file, application_train.csv, contains 307,511 rows (one per loan application) and 122 columns (features about the applicant and the loan).
The target column TARGET is binary:

- 0 = the applicant repaid the loan successfully
- 1 = the applicant had payment difficulties (default)

Download it from Kaggle and place it in a data/ folder:

```bash
# Option 1: Use the Kaggle CLI (you need a Kaggle account + API token)
# pip install kaggle
# kaggle competitions download -c home-credit-default-risk

# Option 2: Download manually from
# https://www.kaggle.com/c/home-credit-default-risk/data
# You only need application_train.csv for this tutorial
```

3.2 — Load and inspect the data

Loading the CSV file and getting a first overview. Before any analysis, you need to know the size, the types of columns, and the general "shape" of your data. This prevents surprises later.

```python
import pandas as pd
import numpy as np

df = pd.read_csv('data/application_train.csv')

print(f"Shape: {df.shape}")
print(f"Columns: {df.shape[1]}")
print(f"Rows: {df.shape[0]:,}")
```

You should see: 307,511 rows, 122 columns. That's a substantial dataset — much bigger than the toy datasets you typically see in tutorials.

Next, we check the distribution of our target variable. This tells us how "balanced" the problem is. If 50% of applicants default, the model has an easy time distinguishing the two groups.
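A quick synthetic illustration of why accuracy alone is misleading on skewed classes (the numbers here are invented for the sketch — a 99/1 split and a "model" that always predicts "no default"):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 99 applicants repaid (0), 1 defaulted (1)
y_true = np.array([0] * 99 + [1])

# A useless "model" that predicts "no default" for everyone
y_pred = np.zeros(100)

print(accuracy_score(y_true, y_pred))  # 0.99 — looks impressive
print(roc_auc_score(y_true, y_pred))   # 0.5 — no better than a coin flip
```

Accuracy rewards the model for parroting the majority class; ROC AUC, which measures how well the model ranks defaulters above non-defaulters, exposes it as useless.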
If only 1% default, it's much harder — the model can just predict "no default" for everyone and be right 99% of the time while being useless.

```python
print(f"\nTarget distribution:")
print(df['TARGET'].value_counts(normalize=True))
```

Important observation: The dataset is heavily imbalanced — about 92% repaid (0) and only 8% defaulted (1). This is completely typical in credit scoring: most people do repay their loans. Our model will need to be evaluated with metrics that account for this imbalance (like ROC AUC, not just accuracy).

3.3 — Understanding the key features

Selecting which features to use from the 122 available. We can't (and shouldn't) use all 122 columns. Many are redundant, many have too many missing values, and our API needs to accept inputs — we don't want to ask users for 122 fields. We pick the most predictive and interpretable features based on domain knowledge and research.

Research on this dataset (including the Kaggle competition results) consistently shows these features matter most:

| Feature | What It Means | Why It Predicts Default |
|---|---|---|
| EXT_SOURCE_1/2/3 | Normalized scores from external credit bureaus (0 to 1) | These are the single strongest predictors. They summarize a person's entire credit reputation from other institutions. |
| DAYS_BIRTH | Client's age in days (stored as a negative number) | Older applicants have more financial stability and default less. |
| DAYS_EMPLOYED | How long they've been at their current job (negative = employed) | Longer employment indicates stability — they're less likely to lose income suddenly. |
| AMT_INCOME_TOTAL | Total annual income | More income means more capacity to make loan payments. |
| AMT_CREDIT | Credit amount of the loan | Larger loans are riskier — more to repay. |
| AMT_ANNUITY | Monthly payment amount | Higher monthly payments create more financial strain. |
| AMT_GOODS_PRICE | Price of the goods being financed | Context for the loan — is the person buying something that costs $5K or $500K? |
| DAYS_ID_PUBLISH | Days since the ID document was issued | A proxy for personal stability. People who change ID documents frequently may be less settled. |
| CODE_GENDER | Gender (M/F) | A demographic factor that has statistical correlations in this dataset. |
| FLAG_OWN_CAR | Owns a car? (Y/N) | Asset ownership indicates financial stability. |
| FLAG_OWN_REALTY | Owns real estate? (Y/N) | Same reasoning — owning a home suggests financial roots. |
| CNT_CHILDREN | Number of children | Family context — more dependents can mean more financial strain. |
| NAME_EDUCATION_TYPE | Education level | Higher education correlates with higher earning potential and financial literacy. |

3.4 — Select features

Extracting only the columns we need from the full 122-column dataframe. Working with 15 carefully chosen features is more manageable than 122, and for our API, each feature will become a field that users need to provide. 15 fields is reasonable; 122 is not.
```python
selected_features = [
    'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
    'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH',
    'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
    'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
    'CNT_CHILDREN', 'NAME_EDUCATION_TYPE',
]

X = df[selected_features].copy()
y = df['TARGET'].copy()
```

We use .copy() to create an independent copy of the data. Without it, modifications to X could also modify the original df, which can cause hard-to-debug issues.

3.5 — Fix the DAYS_EMPLOYED anomaly

Handling a known data quality issue. The DAYS_EMPLOYED column contains the value 365243 for unemployed people. That's roughly 1,000 years of employment — obviously a placeholder, not a real value. If we left it in, it would massively skew our model, because the scaler would treat it as a real data point, compressing all actual employment durations into a tiny range.

```python
# How many rows have this anomalous value?
print(f"Anomalous DAYS_EMPLOYED values: {(df['DAYS_EMPLOYED'] == 365243).sum():,}")
```

You'll see about 55,374 rows — about 18% of the data. That's not a trivial amount. We replace the placeholder with NaN (Not a Number), which means "missing value":

```python
X['DAYS_EMPLOYED'] = X['DAYS_EMPLOYED'].replace(365243, np.nan)
```

3.6 — Convert days to years

Transforming the DAYS columns from "negative days before application" into "positive years." Two reasons.
First, readability: -14,585 days is hard for a human to interpret, but 39.9 years is immediately clear. Second, our API will accept age in years — it would be a terrible user experience to ask someone for their "age in negative days since application date."

```python
# DAYS_BIRTH is negative: -14585 means the person was born 14585 days before the application
# Dividing by -365.25 converts to positive years (365.25 accounts for leap years)
X['AGE_YEARS'] = (-X['DAYS_BIRTH'] / 365.25).round(1)

# DAYS_EMPLOYED works the same way
X['YEARS_EMPLOYED'] = (-X['DAYS_EMPLOYED'] / 365.25).round(1)

# DAYS_ID_PUBLISH: years since the ID document was issued
X['YEARS_ID_PUBLISH'] = (-X['DAYS_ID_PUBLISH'] / 365.25).round(1)
```

Now we can drop the raw DAYS columns since we have the cleaner YEARS versions:

```python
X = X.drop(columns=['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH'])
```

3.7 — Encode categorical variables

Converting text values into numbers. Machine learning models only work with numbers. They can't process the string "F" or "Higher education" directly. We need to convert these into a numeric representation.
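Pandas' .map() does the heavy lifting here, and it's worth seeing how it treats a value that isn't in the mapping dict — a small standalone sketch (the series values are just examples):

```python
import pandas as pd

s = pd.Series(['M', 'F', 'XNA'])      # 'XNA' stands in for an unexpected category
mapped = s.map({'M': 0, 'F': 1})      # unmapped values silently become NaN
print(mapped.tolist())
print(mapped.fillna(0).astype(int).tolist())
```

Unmapped categories don't raise an error — they silently become NaN, which is exactly why a .fillna() step has to follow the mapping when real-world data can contain surprises.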
For simple binary features (Yes/No, Male/Female), we use binary encoding — just 0 and 1:

```python
X['CODE_GENDER'] = X['CODE_GENDER'].map({'M': 0, 'F': 1}).fillna(0).astype(int)
X['FLAG_OWN_CAR'] = X['FLAG_OWN_CAR'].map({'N': 0, 'Y': 1}).astype(int)
X['FLAG_OWN_REALTY'] = X['FLAG_OWN_REALTY'].map({'N': 0, 'Y': 1}).astype(int)
```

What does .fillna(0) do? There are a small number of rows where CODE_GENDER is "XNA" (unknown). The .map() function turns these into NaN (because "XNA" isn't in our mapping dict), and .fillna(0) replaces them with 0. We have to handle every case — production data is messy.

For education, we use ordinal encoding, because there's a natural order from lower to higher education. Ordinal encoding preserves this order (0 < 1 < 2 < 3 < 4), which gives the model useful information. One-hot encoding would treat "Lower secondary" and "Academic degree" as equally different from "Higher education," losing the ordering signal.

```python
education_map = {
    'Lower secondary': 0,
    'Secondary / secondary special': 1,
    'Incomplete higher': 2,
    'Higher education': 3,
    'Academic degree': 4,
}
X['EDUCATION_LEVEL'] = X['NAME_EDUCATION_TYPE'].map(education_map).fillna(1).astype(int)
X = X.drop(columns=['NAME_EDUCATION_TYPE'])
```

3.8 — Handle missing values

Filling in gaps where data is missing. Most ML models can't handle missing values (NaN).
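You can see this directly — a tiny sketch where LogisticRegression (standing in here for any scikit-learn estimator, not the model we train later) refuses to fit data containing a NaN:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A toy feature matrix with one missing value in the first row
X_demo = np.array([[0.5, np.nan], [0.7, 0.2], [0.1, 0.9], [0.3, 0.4]])
y_demo = np.array([0, 1, 0, 1])

try:
    LogisticRegression().fit(X_demo, y_demo)
except ValueError as err:
    print("Refused to fit:", err)
```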
If we feed them a row with a missing EXT_SOURCE_1, they'll either crash or produce garbage. We need to fill these gaps with reasonable substitute values.

Let's first see how much is missing:

print("Missing values before filling:")
missing = X.isnull().sum()
print(missing[missing > 0])

You'll see that EXT_SOURCE_1 has about 56% missing, EXT_SOURCE_3 about 20%, and YEARS_EMPLOYED about 18% (those are the unemployed people we set to NaN earlier).

We fill with the median of each column. The median is robust to outliers: if most people earn $50K but one person earns $50M, the mean income would be pulled way up by that one outlier. The median stays at $50K, which is a much more representative "typical" value. For the same reason, median is the standard choice for imputation in financial data.

X = X.fillna(X.median())
print("\nMissing values after filling:", X.isnull().sum().sum())  # Should be 0

3.9 — Feature engineering

Creating new features by combining existing ones. Raw features tell part of the story, but ratios often tell a much richer story. A $400,000 loan means something very different to someone earning $40,000/year (that's 10 years of income!) versus someone earning $400,000/year (just 1 year). The absolute credit amount is the same, but the relative burden is completely different — and it's the relative burden that actually predicts default.

# Credit-to-income ratio: how many years of income does the loan represent?
# A ratio of 5 means the loan is 5x the annual income — that's a heavy burden
X['CREDIT_INCOME_RATIO'] = X['AMT_CREDIT'] / (X['AMT_INCOME_TOTAL'] + 1)

We add +1 to the denominator to avoid division by zero. It's a tiny number compared to actual incomes (tens of thousands), so it doesn't affect the result meaningfully.

# Annuity-to-income ratio: what fraction of income goes to monthly payments?
# A ratio of 0.3 means 30% of income goes to loan payments — that's stressful
X['ANNUITY_INCOME_RATIO'] = X['AMT_ANNUITY'] / (X['AMT_INCOME_TOTAL'] + 1)

# Credit-to-goods ratio: how much of the goods price is financed?
# A ratio close to 1 means the person is financing nearly 100% of the purchase
# (no down payment), which is a risk signal
X['CREDIT_GOODS_RATIO'] = X['AMT_CREDIT'] / (X['AMT_GOODS_PRICE'] + 1)

3.10 — Check the final feature set

Verifying we have a clean, complete dataset. This is a sanity check before training. If something is wrong here (wrong column count, remaining NaNs, unexpected types), it's much easier to fix now than after the model is trained.
print(f"Final feature set: {X.shape[1]} features, {X.shape[0]:,} rows") print(f"Features: {list(X.columns)}") print(f"Any remaining NaN: {X.isnull().any().any()}") print(f"Final feature set: {X.shape[1]} features, {X.shape[0]:,} rows") print(f"Features: {list(X.columns)}") print(f"Any remaining NaN: {X.isnull().any().any()}") Our final 18 features: EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3, AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, AMT_GOODS_PRICE, CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN_REALTY, CNT_CHILDREN, AGE_YEARS, YEARS_EMPLOYED, YEARS_ID_PUBLISH, EDUCATION_LEVEL, CREDIT_INCOME_RATIO, ANNUITY_INCOME_RATIO, CREDIT_GOODS_RATIO EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3, AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, AMT_GOODS_PRICE, CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN_REALTY, CNT_CHILDREN, AGE_YEARS, YEARS_EMPLOYED, YEARS_ID_PUBLISH, EDUCATION_LEVEL, CREDIT_INCOME_RATIO, ANNUITY_INCOME_RATIO, CREDIT_GOODS_RATIO 18 features: manageable, interpretable, and each one has a clear business meaning. That's important — when a loan officer asks "why did the model reject this person?", you can point to specific features and explain. 
<details> <summary><strong>Click to expand: prepare_data.py (complete copy-paste ready code)</strong></summary>

# prepare_data.py
import pandas as pd
import numpy as np

def load_and_prepare_data(filepath='data/application_train.csv'):
    df = pd.read_csv(filepath)

    selected_features = [
        'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
        'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH',
        'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
        'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
        'CNT_CHILDREN', 'NAME_EDUCATION_TYPE',
    ]
    X = df[selected_features].copy()
    y = df['TARGET'].copy()

    X['DAYS_EMPLOYED'] = X['DAYS_EMPLOYED'].replace(365243, np.nan)

    X['AGE_YEARS'] = (-X['DAYS_BIRTH'] / 365.25).round(1)
    X['YEARS_EMPLOYED'] = (-X['DAYS_EMPLOYED'] / 365.25).round(1)
    X['YEARS_ID_PUBLISH'] = (-X['DAYS_ID_PUBLISH'] / 365.25).round(1)
    X = X.drop(columns=['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH'])

    X['CODE_GENDER'] = X['CODE_GENDER'].map({'M': 0, 'F': 1}).fillna(0).astype(int)
    X['FLAG_OWN_CAR'] = X['FLAG_OWN_CAR'].map({'N': 0, 'Y': 1}).astype(int)
    X['FLAG_OWN_REALTY'] = X['FLAG_OWN_REALTY'].map({'N': 0, 'Y': 1}).astype(int)

    education_map = {
        'Lower secondary': 0,
        'Secondary / secondary special': 1,
        'Incomplete higher': 2,
        'Higher education': 3,
        'Academic degree': 4,
    }
    X['EDUCATION_LEVEL'] = X['NAME_EDUCATION_TYPE'].map(education_map).fillna(1).astype(int)
    X = X.drop(columns=['NAME_EDUCATION_TYPE'])

    X = X.fillna(X.median())

    X['CREDIT_INCOME_RATIO'] = X['AMT_CREDIT'] / (X['AMT_INCOME_TOTAL'] + 1)
    X['ANNUITY_INCOME_RATIO'] = X['AMT_ANNUITY'] / (X['AMT_INCOME_TOTAL'] + 1)
    X['CREDIT_GOODS_RATIO'] = X['AMT_CREDIT'] / (X['AMT_GOODS_PRICE'] + 1)

    return X, y

if __name__ == '__main__':
    X, y = load_and_prepare_data()
    print(f"Features: {X.shape}, Target: {y.shape}")
    print(f"Default rate: {y.mean():.2%}")

</details>

4. Training the Credit Scoring Model

4.1 — Train/test split

Splitting our data into two parts — one for training, one for testing. We need to evaluate our model on data it has never seen during training. If we test on the same data we trained on, the model could just memorize the answers (this is called overfitting), and we'd have no idea how well it actually generalizes to new loan applications.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Let's unpack each argument:

test_size=0.2 — 20% for testing, 80% for training. This is a standard split. More training data = better model, but we need enough test data for reliable evaluation.

random_state=42 — This fixes the random seed so the split is the same every time you run the code. Reproducibility matters in ML — you need to be able to get the same results.

stratify=y — This is crucial for imbalanced datasets. Without it, you might end up with 10% defaults in training but only 5% in testing (random variation). stratify=y ensures both sets have the same proportion of defaults (~8%).

print(f"Training set: {X_train.shape[0]:,} rows")
print(f"Test set: {X_test.shape[0]:,} rows")
print(f"Train default rate: {y_train.mean():.2%}")
print(f"Test default rate: {y_test.mean():.2%}")

Both rates should be approximately 8.07% — confirming that stratification worked.

4.2 — Build a scikit-learn Pipeline

Creating a Pipeline that bundles preprocessing and the model into a single object. This is one of the most important design decisions we'll make, and it has huge implications for deployment.

Without a Pipeline, deployment looks like this:

Load the scaler
Transform the data with the scaler
Load the model
Feed the transformed data to the model

With a Pipeline, deployment looks like this:

Load the pipeline
Feed raw data to it

The Pipeline handles everything internally. This means fewer files to manage, fewer things that can go wrong, and — critically — the scaler and model are guaranteed to be in sync. If you accidentally use a scaler from experiment #3 with a model from experiment #7, you'll get garbage predictions but no error message. A Pipeline prevents this.
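To make the single-artifact benefit concrete, here's a minimal sketch with toy data and a stand-in LogisticRegression (not the model we'll actually train): once fitted, the whole Pipeline serializes and reloads as one object, and raw, unscaled data goes straight in.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: two features on wildly different scales
X_toy = np.array([[1.0, 200_000.0], [2.0, 100_000.0],
                  [3.0, 300_000.0], [4.0, 150_000.0]])
y_toy = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
]).fit(X_toy, y_toy)

# One artifact on disk — the scaler and model can never go out of sync
path = os.path.join(tempfile.mkdtemp(), 'pipe.pkl')
joblib.dump(pipe, path)

loaded = joblib.load(path)
preds = loaded.predict(X_toy)  # raw, unscaled input goes straight in
print(preds)
```

Loading the `.pkl` gives back the fitted scaler and the fitted classifier together — exactly the "load the pipeline, feed raw data to it" workflow described above.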
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GradientBoostingClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        subsample=0.8,
        random_state=42,
    ))
])

What does StandardScaler do? It transforms each feature to have a mean of 0 and a standard deviation of 1. Without scaling, a feature like AMT_CREDIT (values in the hundreds of thousands) would dominate a feature like EXT_SOURCE_1 (values between 0 and 1) simply because its numbers are bigger. Scaling puts all features on equal footing.

Gradient boosting builds many small decision trees sequentially, where each tree tries to correct the mistakes of the previous ones. It's consistently one of the best algorithms for structured/tabular data (like our spreadsheet-style credit data). In the Kaggle competition for this dataset, the top solutions all used gradient boosting variants (LightGBM, XGBoost).

What do the hyperparameters mean?

n_estimators=200 — Build 200 trees. More trees generally means better performance, but with diminishing returns and slower training.

max_depth=4 — Each tree is at most 4 levels deep. Deeper trees can capture more complex patterns but are more likely to overfit.

learning_rate=0.1 — How much each tree contributes. Lower values need more trees but are more robust.

subsample=0.8 — Each tree is trained on a random 80% of the data. This randomness reduces overfitting.

4.3 — Train the model

Fitting the pipeline to our training data. This is where the actual learning happens. The scaler calculates the mean and standard deviation of each feature. The classifier builds 200 decision trees, each trying to better predict defaults.

pipeline.fit(X_train, y_train)

That single line does all the work. Depending on your machine, this takes 1-5 minutes on 246,000 training rows with 18 features and 200 trees.

4.4 — Evaluate

Measuring how well the model performs on data it has never seen. The training score tells you how well the model memorized the training data — it's always optimistic. The test score tells you how well it will actually perform in production.

from sklearn.metrics import classification_report, roc_auc_score

y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred))
print(f"ROC AUC Score: {roc_auc_score(y_test, y_proba):.4f}")

Why ROC AUC instead of plain accuracy? Because of the imbalance. A model that always predicts "no default" would have 92% accuracy — sounds great, right? But it would be completely useless because it never identifies actual defaulters.
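You can verify the accuracy trap yourself in a few lines. This sketch simulates labels with roughly the dataset's 8% default rate and scores a "model" that always predicts the majority class:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulate labels with roughly the dataset's 8% default rate
y = (rng.random(10_000) < 0.08).astype(int)

# A "model" that always predicts the majority class (no default)
y_pred = np.zeros_like(y)

accuracy = (y_pred == y).mean()
# Fraction of actual defaulters the model catches — zero by construction
recall_on_defaults = (y_pred[y == 1] == 1).mean()

print(f"Accuracy: {accuracy:.1%}, recall on defaulters: {recall_on_defaults:.0%}")
```

Roughly 92% accuracy while catching exactly zero defaulters — which is why we rank models by ROC AUC instead.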
ROC AUC measures how well the model ranks defaulters above non-defaulters, regardless of the threshold. An AUC of 0.5 means random guessing; 1.0 means perfect separation. With our features, you should get approximately 0.74-0.76, which is in line with Kaggle competition results using only application_train.csv.

4.5 — Save the model and reference data

Writing the trained pipeline and the training data to disk. Two reasons. (1) The model needs to be loaded by the API to serve predictions. (2) We save the training data as "reference data" — we'll compare future production data against it to detect drift.

import joblib
import os

os.makedirs('model', exist_ok=True)
os.makedirs('data', exist_ok=True)

# Save the trained pipeline (scaler + model together)
joblib.dump(pipeline, 'model/credit_model.pkl')

# Save reference data for drift detection
X_train.to_csv('data/reference_data.csv', index=False)
X_test.to_csv('data/test_data.csv', index=False)

# Save the feature column names — the API needs to know the exact order
joblib.dump(list(X_train.columns), 'model/feature_columns.pkl')

print("Model saved to model/credit_model.pkl")
print(f"Reference data: {X_train.shape[0]:,} rows, {X_train.shape[1]} features")

Why save the feature column names? The model expects features in a specific order (the same order it was trained on).
If the API passes features in a different order, predictions would be wrong without any error message. By saving the column list, we guarantee the API always sends features in the right order.

git add train_model.py prepare_data.py
git commit -m "feat: train credit scoring model on Home Credit dataset"

5. Creating a Prediction API with FastAPI

Now we get to the core of the MLOps work. We have a trained model in a .pkl file. The goal: make it usable by anyone, anywhere, via a simple HTTP request.

5.1 — What is an API and why do we need one?

An API (Application Programming Interface) is an intermediary that sits between clients (users, other applications, mobile apps) and your model. The client sends a request ("here's a loan applicant's data"), the API passes it to the model, and returns the prediction ("12% probability of default, Low risk").

Without an API, every person who wants a prediction would need to:

Install Python
Install the exact same versions of scikit-learn, pandas, numpy
Download the model file
Write Python code to load it and pass data in the correct format

That's clearly not scalable. With an API, they just send an HTTP request (which any programming language can do) and get back a JSON response. A web developer, a mobile app, or even a spreadsheet macro can call your API.

A REST API acts as an intermediary between clients and your model. (source: SmartBear)

There are many Python web frameworks (Flask, Django, Tornado).
We choose FastAPI because:

It auto-generates interactive documentation (Swagger UI) — great for testing
It uses Python type hints for automatic input validation (no manual if/else chains)
It's one of the fastest Python frameworks
It's become the standard choice in ML engineering (FastAPI documentation)

5.2 — The critical rule: load your model ONCE

Loading the model into memory once when the server starts, then reusing it for every request. This is the single most important performance decision in an ML API. Here's what happens if you load the model on every request:

# BAD — loads from disk on EVERY request
@app.post("/predict")
async def predict(data):
    model = joblib.load("model/credit_model.pkl")  # Disk I/O every time!
    return model.predict(data)

If your model file is 50MB and you get 100 requests/second:

You're reading 50MB from disk 100 times per second (5 GB/s of I/O!)
Each request takes extra milliseconds or seconds just for loading
Memory usage spikes because 100 copies of the model exist simultaneously
The server eventually crashes under load
The fix is simple — load once, reuse forever:

# GOOD — load once at startup, reuse for all requests
model = None

def load_model():
    global model
    model = joblib.load("model/credit_model.pkl")  # Once, at startup

@app.post("/predict")
async def predict(data):
    return model.predict(data)  # Uses the already-loaded model in memory

Let's build the proper version. Create app/model_loader.py:

import joblib
import os

We store the model and feature columns as module-level globals. They start as None and get populated at startup:

_model = None
_feature_columns = None

The loading function:

def load_model():
    """Load model and feature list ONCE at startup."""
    global _model, _feature_columns
    model_path = os.environ.get("MODEL_PATH", "model/credit_model.pkl")
    features_path = os.environ.get("FEATURES_PATH", "model/feature_columns.pkl")
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"Model not found at {model_path}")
    _model = joblib.load(model_path)
    _feature_columns = joblib.load(features_path)
    print(f"Model loaded from {model_path} ({len(_feature_columns)} features)")
    return _model

This checks for an environment variable first, and falls back to a default. On your laptop, the default works fine. Inside a Docker container, the model might be at a different path — you can configure this without changing code by setting the environment variable.

Retrieval functions:

def get_model():
    if _model is None:
        raise RuntimeError("Model not loaded! Call load_model() first.")
    return _model

def get_feature_columns():
    if _feature_columns is None:
        raise RuntimeError("Features not loaded!")
    return _feature_columns

5.3 — Input validation with Pydantic

Defining strict rules for what the API accepts as input. In a controlled Jupyter notebook, you know exactly what the data looks like. In production, you have zero control. Someone might send:

Text where a number is expected ("age": "forty")
Negative values where only positive makes sense ("income": -5000)
Missing required fields
Values that are technically valid but make no business sense ("age": 200)

Pydantic lets us define a schema — a set of rules — for our input data. FastAPI uses this schema to automatically validate every incoming request. If the data doesn't match, the API returns a detailed error message explaining exactly what's wrong. No manual if/else chains needed.
Create app/schemas.py:

from pydantic import BaseModel, Field, field_validator

Now define each field with its constraints. The Field(...) function is where the magic happens:

class LoanApplication(BaseModel):
    # External credit bureau scores
    # These are normalized between 0 and 1 by the credit bureaus
    ext_source_1: float = Field(
        ...,      # ... means "this field is required"
        ge=0.0,   # ge = "greater than or equal to"
        le=1.0,   # le = "less than or equal to"
        description="Normalized score from external data source 1"
    )
    ext_source_2: float = Field(
        ..., ge=0.0, le=1.0,
        description="Normalized score from external data source 2"
    )
    ext_source_3: float = Field(
        ..., ge=0.0, le=1.0,
        description="Normalized score from external data source 3"
    )

What happens if someone sends ext_source_1: 5.0? FastAPI catches it and returns:

{"detail": [{"msg": "Input should be less than or equal to 1"}]}

with HTTP status 422 (Unprocessable Entity). No code needed on your part.
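You don't need a running server to see this validation in action — Pydantic models can be exercised directly. A minimal stand-in for one field of the schema (ScoreInput is an illustrative name, not part of the project code):

```python
from pydantic import BaseModel, Field, ValidationError

# Minimal stand-in for one field of the LoanApplication schema
class ScoreInput(BaseModel):
    ext_source_1: float = Field(..., ge=0.0, le=1.0)

print(ScoreInput(ext_source_1=0.55))  # valid — constructs fine

try:
    ScoreInput(ext_source_1=5.0)      # violates le=1.0
except ValidationError as e:
    print(len(e.errors()), "validation error(s)")
```

FastAPI does exactly this check on every request body and converts the ValidationError into the 422 response shown above.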
Financial fields:

    amt_income_total: float = Field(
        ...,
        gt=0,  # gt = "greater than" (strictly positive — zero income is not valid)
        description="Total annual income"
    )
    amt_credit: float = Field(..., gt=0, description="Credit amount of the loan")
    amt_annuity: float = Field(..., gt=0, description="Loan annuity (monthly payment)")
    amt_goods_price: float = Field(..., gt=0, description="Price of the goods being financed")

Personal information:

    code_gender: int = Field(..., ge=0, le=1, description="Gender (0=Male, 1=Female)")
    flag_own_car: int = Field(..., ge=0, le=1, description="Owns a car? (0=No, 1=Yes)")
    flag_own_realty: int = Field(..., ge=0, le=1, description="Owns real estate? (0=No, 1=Yes)")
    cnt_children: int = Field(..., ge=0, le=20, description="Number of children")

Derived features:

    age_years: float = Field(..., ge=18, le=80, description="Applicant's age in years")
    years_employed: float = Field(..., ge=0, le=50, description="Years of employment")
    years_id_publish: float = Field(..., ge=0, le=60, description="Years since ID was published")
    education_level: int = Field(
        ..., ge=0, le=4,
        description="0=Lower secondary, 1=Secondary, 2=Incomplete higher, 3=Higher, 4=Academic"
    )

We can also add custom business logic validation. This goes beyond simple range checks:

    @field_validator('amt_credit')
    @classmethod
    def credit_must_be_reasonable(cls, v, info):
        """A credit-to-income ratio above 100 is unrealistic."""
        if 'amt_income_total' in info.data:
            ratio = v / (info.data['amt_income_total'] + 1)
            if ratio > 100:
                raise ValueError(
                    f"Credit-to-income ratio ({ratio:.0f}x) seems unrealistic"
                )
        return v

Defining the response schema. By defining what the API returns, FastAPI auto-generates documentation and ensures our responses are always consistent. Consumers of our API know exactly what to expect.
class PredictionResponse(BaseModel):
    prediction: int = Field(description="0=No Default, 1=Default")
    probability_of_default: float = Field(description="Probability from 0.0 to 1.0")
    risk_category: str = Field(description="Low, Medium, or High")

class HealthResponse(BaseModel):
    status: str
    model_loaded: bool

5.4 — Building the FastAPI application

Wiring together the model loader, the validation schemas, and the HTTP endpoints. This is the actual application that will run in production, receiving requests and returning predictions.

Create app/main.py:

from fastapi import FastAPI, HTTPException
from contextlib import asynccontextmanager
import pandas as pd
import numpy as np
import time
import logging
import json
from datetime import datetime, timezone

from app.schemas import LoanApplication, PredictionResponse, HealthResponse
from app.model_loader import load_model, get_model, get_feature_columns

Setting up structured logging. Every prediction the API makes should be recorded. These logs are the foundation of monitoring — they let us detect drift, track performance, and debug issues. We use Python's built-in logging module.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("credit_scoring_api")

Defining the application lifespan (startup and shutdown). We need the model to be loaded before the first request arrives. The lifespan function runs code at startup (before yield) and shutdown (after yield).

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load model at startup, clean up at shutdown."""
    logger.info("Starting up — loading model...")
    load_model()
    logger.info("Ready to serve predictions.")
    yield  # The app runs while we're "inside" the yield
    logger.info("Shutting down.")

Creating the FastAPI app instance. This is the central object that routes incoming HTTP requests to the right handler functions.

app = FastAPI(
    title="Home Credit Scoring API",
    description="Predict loan default probability using the Home Credit dataset",
    version="1.0.0",
    lifespan=lifespan,
)

Adding a health check endpoint. Every production API needs one. Load balancers use it to know if the service is alive. Monitoring tools use it to track uptime. The CI/CD pipeline uses it to verify that a deployment succeeded. It's a simple "are you there?" ping.
```python
@app.get("/health", response_model=HealthResponse)
async def health_check():
    try:
        model = get_model()
        return HealthResponse(status="healthy", model_loaded=True)
    except RuntimeError:
        return HealthResponse(status="unhealthy", model_loaded=False)
```

Building the prediction endpoint — the heart of the API. This is what clients will actually call to get predictions. It receives a loan application, validates it (Pydantic does this automatically), computes the feature ratios, runs the model, logs everything, and returns the result.

```python
@app.post("/predict", response_model=PredictionResponse)
async def predict(application: LoanApplication):
    start_time = time.time()
    try:
        model = get_model()
        feature_columns = get_feature_columns()
```

Converting the validated input into the exact format the model expects. The model was trained on a DataFrame with specific column names in a specific order, and we need to recreate that exactly. Note that we compute the engineered features here — the API receives raw values and derives the ratios, so users don't have to compute them.
```python
        features = {
            'EXT_SOURCE_1': application.ext_source_1,
            'EXT_SOURCE_2': application.ext_source_2,
            'EXT_SOURCE_3': application.ext_source_3,
            'AMT_INCOME_TOTAL': application.amt_income_total,
            'AMT_CREDIT': application.amt_credit,
            'AMT_ANNUITY': application.amt_annuity,
            'AMT_GOODS_PRICE': application.amt_goods_price,
            'CODE_GENDER': application.code_gender,
            'FLAG_OWN_CAR': application.flag_own_car,
            'FLAG_OWN_REALTY': application.flag_own_realty,
            'CNT_CHILDREN': application.cnt_children,
            'AGE_YEARS': application.age_years,
            'YEARS_EMPLOYED': application.years_employed,
            'YEARS_ID_PUBLISH': application.years_id_publish,
            'EDUCATION_LEVEL': application.education_level,
            # Engineered features — computed server-side
            'CREDIT_INCOME_RATIO': application.amt_credit / (application.amt_income_total + 1),
            'ANNUITY_INCOME_RATIO': application.amt_annuity / (application.amt_income_total + 1),
            'CREDIT_GOODS_RATIO': application.amt_credit / (application.amt_goods_price + 1),
        }

        # Create DataFrame with columns in the exact training order
        input_data = pd.DataFrame([features])[feature_columns]
```

Getting the prediction and probability from the model. We return both because they serve different purposes. The binary prediction (0/1) is a simple decision. The probability (0.0-1.0) is more useful in practice — it lets the business set its own risk threshold. A conservative bank might reject anyone above a 10% probability of default; an aggressive lender might accept up to 30%.

```python
        prediction = int(model.predict(input_data)[0])
        probability = float(model.predict_proba(input_data)[0][1])
```

Mapping the probability to a human-readable risk category. Non-technical stakeholders don't want to interpret "probability 0.18" — they want to see "Low risk" or "High risk." This translation makes the API output usable by business people, not just data scientists.

```python
        if probability < 0.3:
            risk_category = "Low"
        elif probability < 0.6:
            risk_category = "Medium"
        else:
            risk_category = "High"
```

Logging every prediction with full context. These logs are the lifeblood of production monitoring.
They will be used for:

- Data drift detection: comparing production inputs over time against training data
- Performance monitoring: tracking how inference time evolves
- Debugging: when a prediction seems wrong, logs let us reproduce exactly what happened
- Auditing: in regulated industries like finance, you need a record of every automated decision

```python
        inference_time_ms = (time.time() - start_time) * 1000
        log_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": "prediction",
            "inputs": application.model_dump(),
            "outputs": {
                "prediction": prediction,
                "probability_of_default": round(probability, 4),
                "risk_category": risk_category,
            },
            "inference_time_ms": round(inference_time_ms, 2),
        }
        logger.info(json.dumps(log_entry))
```

Why JSON? Because it's machine-parseable. Monitoring tools (ELK Stack, Datadog, CloudWatch) can automatically index every field in the JSON, making it searchable and aggregatable. A plain-text log like "Prediction: 0 for client age 40" can't be automatically parsed.

Handling errors gracefully. In production, unexpected things happen — corrupted input data, a model that fails on certain edge cases, memory issues.
Without error handling, the API would crash with an ugly Python traceback. With it, the client gets a clear error message and the error is logged for debugging.

```python
        return PredictionResponse(
            prediction=prediction,
            probability_of_default=round(probability, 4),
            risk_category=risk_category,
        )

    except Exception as e:
        logger.error(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": "prediction_error",
            "error": str(e),
            "inputs": application.model_dump(),
        }))
        raise HTTPException(status_code=500, detail=f"Prediction failed: {str(e)}")
```

5.5 — Test it locally

```bash
pip install fastapi uvicorn scikit-learn joblib pandas pydantic numpy
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Open http://localhost:8000/docs — FastAPI automatically generates an interactive Swagger UI where you can test every endpoint directly in your browser. No Postman or curl needed (though those work too).
Test with a realistic Home Credit applicant:

```bash
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "ext_source_1": 0.5,
    "ext_source_2": 0.65,
    "ext_source_3": 0.48,
    "amt_income_total": 202500,
    "amt_credit": 406597,
    "amt_annuity": 24700,
    "amt_goods_price": 351000,
    "code_gender": 1,
    "flag_own_car": 0,
    "flag_own_realty": 1,
    "cnt_children": 0,
    "age_years": 39.9,
    "years_employed": 5.3,
    "years_id_publish": 8.5,
    "education_level": 1
  }'
```

```bash
git add app/
git commit -m "feat: implement FastAPI prediction API with validation and logging"
```

6. Writing Automated Tests

Why tests matter

Writing code that verifies our API works correctly. Imagine you change one line of code and accidentally break the input validation. Without tests, that bug goes straight to production and the API starts silently returning wrong predictions — or crashing. With automated tests integrated into CI/CD, the bug gets caught in seconds, before it ever reaches a user.

Tests are your safety net. They give you confidence to make changes — because if you break something, you'll know immediately.

6.1 — Setting up the test client

Creating a test client that can simulate HTTP requests without starting a real server. Starting a real server for tests would be slow and complex. FastAPI's TestClient lets us test everything in-memory, in milliseconds.
```python
# tests/test_api.py
import pytest
from fastapi.testclient import TestClient
from app.main import app

client = TestClient(app)
```

6.2 — A realistic test payload

Defining a valid loan application that we'll reuse across many tests. This avoids repeating the same 15 fields in every test. It represents a typical Home Credit applicant.

```python
VALID_APPLICANT = {
    "ext_source_1": 0.5,
    "ext_source_2": 0.65,
    "ext_source_3": 0.48,
    "amt_income_total": 202500,
    "amt_credit": 406597,
    "amt_annuity": 24700,
    "amt_goods_price": 351000,
    "code_gender": 1,
    "flag_own_car": 0,
    "flag_own_realty": 1,
    "cnt_children": 0,
    "age_years": 39.9,
    "years_employed": 5.3,
    "years_id_publish": 8.5,
    "education_level": 1,
}
```

6.3 — Health check tests

Verifying the most basic functionality. If the health check fails, nothing else will work. This is the first thing to test.
```python
def test_health_returns_200():
    """The health endpoint should always return HTTP 200 if the server is running."""
    response = client.get("/health")
    assert response.status_code == 200

def test_health_reports_model_loaded():
    """After startup, the model should be loaded and ready."""
    response = client.get("/health")
    data = response.json()
    assert data["status"] == "healthy"
    assert data["model_loaded"] is True
```

What's assert? It means "this must be true, or the test fails." If response.status_code is 500 instead of 200, pytest will report this test as failed and show you exactly what the value was.

6.4 — Valid prediction tests

Testing that the API returns correct, well-formatted predictions for valid input. This is the "happy path" — verifying that the core functionality works as expected.
```python
def test_valid_prediction_returns_200():
    """A well-formed request should return HTTP 200 (OK)."""
    response = client.post("/predict", json=VALID_APPLICANT)
    assert response.status_code == 200

def test_response_has_all_fields():
    """The response must include prediction, probability, and risk category."""
    response = client.post("/predict", json=VALID_APPLICANT)
    data = response.json()
    assert "prediction" in data
    assert "probability_of_default" in data
    assert "risk_category" in data

def test_prediction_is_binary():
    """Prediction should be exactly 0 or 1, nothing else."""
    response = client.post("/predict", json=VALID_APPLICANT)
    assert response.json()["prediction"] in [0, 1]

def test_probability_in_valid_range():
    """Default probability must be between 0.0 and 1.0."""
    response = client.post("/predict", json=VALID_APPLICANT)
    prob = response.json()["probability_of_default"]
    assert 0.0 <= prob <= 1.0

def test_risk_category_is_valid():
    """Risk category must be one of the three defined levels."""
    response = client.post("/predict", json=VALID_APPLICANT)
    assert response.json()["risk_category"] in ["Low", "Medium", "High"]
```

6.5 — Invalid input tests

Testing that the API properly rejects bad data. This is just as important as testing valid inputs. Our API receives data from the outside world — people will send wrong types, out-of-range values, and missing fields. Every one of these should return a clear 422 error, not a crash or a garbage prediction.

```python
def test_ext_source_above_1_rejected():
    """External scores are normalized 0-1. Values above 1 are invalid."""
    bad = {**VALID_APPLICANT, "ext_source_1": 5.0}
    response = client.post("/predict", json=bad)
    assert response.status_code == 422

def test_negative_income_rejected():
    """Income must be strictly positive. Zero or negative is not valid."""
    bad = {**VALID_APPLICANT, "amt_income_total": -1000}
    response = client.post("/predict", json=bad)
    assert response.status_code == 422

def test_missing_required_field_rejected():
    """If a required field is omitted, the request must fail."""
    incomplete = {k: v for k, v in VALID_APPLICANT.items() if k != "ext_source_2"}
    response = client.post("/predict", json=incomplete)
    assert response.status_code == 422

def test_wrong_type_rejected():
    """Sending a string where a number is expected must fail."""
    bad = {**VALID_APPLICANT, "age_years": "forty"}
    response = client.post("/predict", json=bad)
    assert response.status_code == 422

def test_underage_applicant_rejected():
    """Applicants must be at least 18 years old."""
    bad = {**VALID_APPLICANT, "age_years": 15}
    response = client.post("/predict", json=bad)
    assert response.status_code == 422

def test_zero_credit_rejected():
    """A loan of $0 doesn't make sense."""
    bad = {**VALID_APPLICANT, "amt_credit": 0}
    response = client.post("/predict", json=bad)
    assert response.status_code == 422
```
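For intuition, the 422 rejections these tests expect all come down to simple presence, type, and range constraints. Here is a standalone sketch of a few of those rules in plain Python (illustrative only; in the real API, the Pydantic schema enforces them):

```python
def applicant_errors(data: dict) -> list[str]:
    """Return validation problems for an applicant payload (a subset of the real rules)."""
    errors = []
    # External scores are normalized to [0, 1]
    for field in ("ext_source_1", "ext_source_2", "ext_source_3"):
        value = data.get(field)
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            errors.append(f"{field} must be a number in [0, 1]")
    # Monetary amounts must be strictly positive
    for field in ("amt_income_total", "amt_credit"):
        value = data.get(field)
        if not isinstance(value, (int, float)) or value <= 0:
            errors.append(f"{field} must be > 0")
    # Applicants must be adults
    age = data.get("age_years")
    if not isinstance(age, (int, float)) or age < 18:
        errors.append("age_years must be >= 18")
    return errors
```

A missing field comes back as None and fails the isinstance check, which is the same spirit as Pydantic returning 422 for an omitted field.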
6.6 — Model-level tests

Testing the model artifact directly, without going through the API. If the model file is corrupted or incompatible, we want to know immediately — not discover it when the API crashes.
```python
# tests/test_model.py
import joblib
import pandas as pd

def test_model_loads_successfully():
    model = joblib.load("model/credit_model.pkl")
    assert model is not None

def test_model_has_required_methods():
    model = joblib.load("model/credit_model.pkl")
    assert hasattr(model, 'predict')
    assert hasattr(model, 'predict_proba')
```

6.7 — Run the tests

```bash
pip install pytest httpx
pytest tests/ -v
```

The -v flag shows verbose output — one line per test, with PASS/FAIL status. You should see all green.

```bash
git add tests/
git commit -m "feat: add unit and integration tests for API and model"
```

7. Containerizing with Docker

What is Docker and why do we need it?

The problem: Your API works perfectly on your laptop. You deploy it to a server. It crashes because the server has Python 3.9 instead of 3.11, or it's missing a system library, or a dependency conflicts with something already installed.

The solution: Docker packages your application along with its entire environment — the OS, Python, all libraries, everything — into a self-contained unit called a container. Think of it like shipping your entire laptop instead of just the code. The container runs identically everywhere: your laptop, your colleague's machine, a cloud server, a Kubernetes cluster.

Docker containers package your app with everything it needs to run identically everywhere.
(source: docker.com)

7.1 — The requirements file

Listing all Python dependencies with version constraints. If you just run pip install scikit-learn, pip installs the latest version — which might be different tomorrow. Constraining versions helps ensure that the same code produces the same results today, next month, and next year. This is called reproducibility, and it's essential for production systems. (Note that >= only sets a floor; for strictly reproducible builds, pin exact versions with ==.)

```text
# requirements.txt
fastapi>=0.104.0
uvicorn>=0.24.0
scikit-learn>=1.3.0
joblib>=1.3.0
pandas>=2.0.0
pydantic>=2.0.0
numpy>=1.24.0
```

7.2 — The Dockerfile, instruction by instruction

Writing a recipe that tells Docker how to build our container. The Dockerfile is read top to bottom. Each line creates a "layer" in the image. Docker caches layers, so unchanged layers are reused — this makes rebuilds fast.

```dockerfile
FROM python:3.11-slim
```

We start from a base image that already has Python 3.11 installed. The slim variant is about 150MB instead of 900MB for the full version — we don't need compilers in production.

```dockerfile
WORKDIR /app
```

This sets the working directory inside the container. All subsequent commands run from /app, like doing cd /app.

```dockerfile
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```

We copy the requirements file first and install dependencies. Docker caches each layer, so if requirements.txt hasn't changed since the last build, Docker skips the slow pip install entirely. Since dependencies change rarely but code changes frequently, this ordering saves minutes on every rebuild. --no-cache-dir avoids storing downloaded packages — we don't need them after installation.
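To keep this caching effective and the build context small, projects typically also add a .dockerignore file so that datasets, virtual environments, and git history never get sent to the Docker daemon. A minimal sketch (the exact entries depend on your repo layout; data/ and notebooks/ here are assumptions):

```text
# .dockerignore (illustrative)
.git
__pycache__/
*.pyc
.venv/
data/
notebooks/
tests/
```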
```dockerfile
COPY app/ ./app/
COPY model/ ./model/
```

Now we copy our application code and model. This layer comes after dependencies because code changes more frequently. If we copied code first, every code change would invalidate the dependency cache.

```dockerfile
EXPOSE 8000
```

This documents that the container listens on port 8000. It's purely informational — the actual port mapping happens at runtime with docker run -p.

```dockerfile
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

The startup command. --host 0.0.0.0 is critical inside containers — without it, uvicorn only listens on localhost, which means nothing outside the container can reach it. 0.0.0.0 means "listen on all network interfaces."

7.3 — Build, run, and verify

```bash
# Build the image (give it a name with -t)
docker build -t credit-scoring-api .

# Run the container
# -p 8000:8000 maps your machine's port 8000 to the container's port 8000
docker run -p 8000:8000 credit-scoring-api

# Verify it works
curl http://localhost:8000/health
```

```bash
git add Dockerfile requirements.txt
git commit -m "feat: add Dockerfile for containerized deployment"
```

8. Building a CI/CD Pipeline

What is CI/CD and why automate?

What CI/CD means:

CI (Continuous Integration): Every time you push code, automated tests run to catch bugs before they reach production.

CD (Continuous Deployment): If all tests pass, the code is automatically deployed.
No manual steps, no "I forgot to run the tests."

Without CI/CD, deployment looks like: manually run tests → manually build the Docker image → manually push it → manually restart the service. Each "manually" is an opportunity for human error. With CI/CD, you push code and everything happens automatically — and if anything fails, the deployment stops.

GitHub Actions is GitHub's built-in CI/CD system. You define your pipeline in a YAML file, and GitHub runs it on their servers. (GitHub Actions docs)

8.1 — Understanding the pipeline flow

Our pipeline has 3 stages that run in sequence:

```text
PUSH to main
     │
     ▼
┌──────────┐      ┌──────────┐      ┌──────────┐
│   TEST   │ ──▶  │  BUILD   │ ──▶  │  DEPLOY  │
│  pytest  │      │  docker  │      │ push to  │
│          │      │  build   │      │ registry │
└──────────┘      └──────────┘      └──────────┘
     │                 │                 │
  If FAIL:          If FAIL:          If FAIL:
  STOP HERE         STOP HERE         STOP HERE
```

Why in sequence? Because each stage depends on the previous one succeeding. If tests fail, there's no point building a Docker image of broken code. If the Docker build fails, there's no point trying to deploy. This "fail fast" approach saves time and prevents broken code from reaching production.

8.2 — The YAML file, section by section

Create .github/workflows/ci-cd.yml:

Defining when the pipeline runs.
We want it to run on every push to main (to catch bugs immediately) and on every pull request targeting main (to catch bugs before they're even merged).

```yaml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
```

Stage 1 — TEST: Running our pytest suite in a fresh environment. Testing in a fresh environment (not your laptop) catches issues like "it works because I have library X installed that isn't in requirements.txt."

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest httpx

      - name: Generate model
        run: python train_model.py

      - name: Run tests
        run: pytest tests/ -v --tb=short
```

Stage 2 — BUILD (only runs if tests pass): Building the Docker image and verifying the container actually starts and responds. Sometimes tests pass but the Docker build fails (missing file, wrong path, incompatible base image). The smoke test (curl --fail) confirms the API is alive inside the container.

```yaml
  build:
    runs-on: ubuntu-latest
    needs: test  # ← "needs" means: only run if the "test" job succeeded
    steps:
      - uses: actions/checkout@v4

      - name: Generate model
        run: |
          pip install scikit-learn pandas joblib numpy
          python train_model.py

      - name: Build Docker image
        run: docker build -t credit-scoring-api:${{ github.sha }} .
```
```yaml
      - name: Smoke test the container
        run: |
          docker run -d -p 8000:8000 --name api credit-scoring-api:${{ github.sha }}
          sleep 10
          curl --fail http://localhost:8000/health || exit 1
          docker stop api
```

Stage 3 — DEPLOY (only from the main branch, and only if the build passed): Pushing the Docker image to a registry (Docker Hub) so it can be pulled by production servers. A registry is like a library for Docker images. Once the image is there, any server can pull and run it.

```yaml
  deploy:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'  # Only deploy from main, not PRs
    steps:
      - uses: actions/checkout@v4

      - name: Generate model
        run: |
          pip install scikit-learn pandas joblib numpy
          python train_model.py

      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_TOKEN }}

      - name: Push to Docker Hub
        run: |
          docker build -t ${{ secrets.DOCKER_USERNAME }}/credit-scoring-api:latest .
          docker push ${{ secrets.DOCKER_USERNAME }}/credit-scoring-api:latest
```
What are secrets? They're encrypted environment variables stored under GitHub Settings → Secrets. They're never visible in logs, even if the pipeline prints environment variables. Never hardcode credentials in code or YAML files — if your repo is public (or ever becomes public), those credentials are compromised.

```bash
git add .github/
git commit -m "feat: add CI/CD pipeline with GitHub Actions"
```

9. Logging Production Data

Why logging is critical

Making sure every prediction the API makes is recorded with full context. Once a model is in production, you need answers to questions like:

- "Is the API getting slower over the past week?" → Check inference_time_ms trends
- "Has the data distribution changed?" → Compare logged inputs against training data
- "Why did this client get rejected?" → Look up the exact inputs and outputs for that request
- "How many predictions did we serve today?" → Count log entries

Without logs, you're flying completely blind. The API could be returning wrong predictions for weeks and nobody would know.

9.1 — What data to collect and why
To detect drift: compare production distributions against training data Prediction (0 or 1) To monitor prediction distribution: if suddenly 50% are defaults, something is wrong Probability (0.0 to 1.0) To monitor score distribution: a slow shift in average probability indicates model degradation Inference time (milliseconds) To detect performance issues: if it's getting slower, we need to investigate Timestamp To analyze trends over time and correlate with external events Errors To debug failures and identify recurring issues What We Log Why We Need It All input features (ext_source scores, income, credit amount...) To detect drift: compare production distributions against training data Prediction (0 or 1) To monitor prediction distribution: if suddenly 50% are defaults, something is wrong Probability (0.0 to 1.0) To monitor score distribution: a slow shift in average probability indicates model degradation Inference time (milliseconds) To detect performance issues: if it's getting slower, we need to investigate Timestamp To analyze trends over time and correlate with external events Errors To debug failures and identify recurring issues What We Log Why We Need It What We Log What We Log Why We Need It Why We Need It All input features (ext_source scores, income, credit amount...) To detect drift: compare production distributions against training data All input features (ext_source scores, income, credit amount...) All input features (ext_source scores, income, credit amount...) 
All input features To detect drift: compare production distributions against training data To detect drift: compare production distributions against training data Prediction (0 or 1) To monitor prediction distribution: if suddenly 50% are defaults, something is wrong Prediction (0 or 1) Prediction (0 or 1) Prediction To monitor prediction distribution: if suddenly 50% are defaults, something is wrong To monitor prediction distribution: if suddenly 50% are defaults, something is wrong Probability (0.0 to 1.0) To monitor score distribution: a slow shift in average probability indicates model degradation Probability (0.0 to 1.0) Probability (0.0 to 1.0) Probability To monitor score distribution: a slow shift in average probability indicates model degradation To monitor score distribution: a slow shift in average probability indicates model degradation Inference time (milliseconds) To detect performance issues: if it's getting slower, we need to investigate Inference time (milliseconds) Inference time (milliseconds) Inference time To detect performance issues: if it's getting slower, we need to investigate To detect performance issues: if it's getting slower, we need to investigate Timestamp To analyze trends over time and correlate with external events Timestamp Timestamp Timestamp To analyze trends over time and correlate with external events To analyze trends over time and correlate with external events Errors To debug failures and identify recurring issues Errors Errors Errors To debug failures and identify recurring issues To debug failures and identify recurring issues We already implemented this in our main.py (section 5.4). Each prediction generates a structured JSON log line containing all the fields above. This data forms the foundation for the drift analysis in the next section. 
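To make the idea concrete, here is a minimal sketch of such a logging helper. It is a hypothetical illustration, not the exact code from main.py: the function name, field names, and file path are illustrative choices.

```python
import json
import os
from datetime import datetime, timezone

def log_prediction(features: dict, prediction: int, probability: float,
                   inference_time_ms: float,
                   log_path: str = "logs/predictions.jsonl") -> dict:
    """Append one structured JSON line per prediction (JSONL format)."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "inputs": features,                      # raw inputs, needed later for drift checks
        "prediction": prediction,                # 0 = repaid, 1 = default
        "probability": round(probability, 4),    # for score-distribution monitoring
        "inference_time_ms": round(inference_time_ms, 2),  # for latency monitoring
    }
    os.makedirs(os.path.dirname(log_path) or ".", exist_ok=True)
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

One JSON object per line (JSONL) keeps the file append-only and trivially loadable for later analysis, e.g. with `pd.read_json(log_path, lines=True)`.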
In a real production environment, these JSON logs would be shipped to a centralized platform — Elasticsearch/Kibana for search and visualization, or Datadog/CloudWatch for monitoring and alerting. For this tutorial, local files demonstrate the same principle.

10. Data Drift Detection

What is data drift and why does it matter?

Your model was trained on data from a specific moment in time. It learned the statistical patterns of that data. But the real world is not static — economic conditions shift, customer demographics change, lending policies evolve. When the data your model sees in production starts looking significantly different from the data it was trained on, that's data drift.

A model trained on 2022 Home Credit data might perform terribly on 2024 data if:

- Inflation increased incomes and credit amounts significantly
- A new younger customer segment started applying
- Credit bureau scoring algorithms were updated (changing EXT_SOURCE distributions)
- An economic recession changed default patterns

The model would still return predictions — it wouldn't crash. But those predictions would be based on patterns that no longer exist. This "silent failure" is one of the most dangerous things in production ML, and drift detection is your early warning system.

Data drift occurs when production data diverges from training data. (source: Evidently AI)
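To build intuition for what the drift tests later in this section do, here is a small self-contained sketch of a two-sample Kolmogorov-Smirnov test with scipy, the same kind of test Evidently applies to numeric features. The data is synthetic (a stand-in for an EXT_SOURCE-like score), and the 0.05 cutoff is just the common significance convention:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins: a score at training time vs. in production,
# where production has shifted upward by 0.05 (like a bureau recalibration)
reference = np.clip(rng.normal(0.50, 0.10, 5000), 0, 1)
production = np.clip(rng.normal(0.55, 0.10, 5000), 0, 1)

# The two-sample KS test compares the empirical CDFs of the two samples
statistic, p_value = stats.ks_2samp(reference, production)

# Convention: p < 0.05 means the distributions differ significantly
drifted = p_value < 0.05
print(f"KS statistic: {statistic:.3f}, p-value: {p_value:.2e}, drift: {drifted}")
```

For categorical features the same idea applies with a chi-square test instead, since a KS test only makes sense for continuous distributions.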
10.1 — Load reference data

Loading the training data we saved earlier as our "reference" — the baseline against which we'll measure drift. To detect drift, you need to compare "then" (training) versus "now" (production). The reference data defines what "normal" looks like.

```python
import pandas as pd
import numpy as np

reference_data = pd.read_csv('data/reference_data.csv')
print(f"Reference: {reference_data.shape}")
```

10.2 — Simulate production data with realistic drift

Creating fake production data that looks like what we'd see "6 months later" with realistic shifts. In a real deployment, this data would come from your API logs (the inputs you've been recording — see section 9). Here we simulate it to demonstrate the detection process. We introduce three types of changes:

- EXT_SOURCE scores shift slightly (credit bureau algorithms get updated)
- Financial amounts increase (inflation, economic growth)
- Applicant demographics shift (younger customers join the platform)

```python
np.random.seed(123)
n_prod = 5000
```

Simulating shifted external scores. Credit bureaus regularly update their scoring models. A small systematic shift of +0.05 in EXT_SOURCE_1 could happen when the bureau recalibrates — and it changes what those scores mean for our model.
```python
prod_ext_1 = reference_data['EXT_SOURCE_1'].dropna().sample(n_prod, replace=True).values + \
             np.random.normal(0.05, 0.02, n_prod)
prod_ext_1 = np.clip(prod_ext_1, 0, 1)  # Keep in valid range

prod_ext_2 = reference_data['EXT_SOURCE_2'].dropna().sample(n_prod, replace=True).values + \
             np.random.normal(0.03, 0.01, n_prod)
prod_ext_2 = np.clip(prod_ext_2, 0, 1)

# EXT_SOURCE_3 stays stable — not all features drift simultaneously
prod_ext_3 = reference_data['EXT_SOURCE_3'].dropna().sample(n_prod, replace=True).values
prod_ext_3 = np.clip(prod_ext_3, 0, 1)
```

Simulating inflation effects on financial features. If average incomes rise 8% due to inflation but loan amounts rise 12% (because property prices increase faster than wages), the debt burden increases — and our model's risk assessments may be off.
```python
prod_income = reference_data['AMT_INCOME_TOTAL'].sample(n_prod, replace=True).values * 1.08
prod_credit = reference_data['AMT_CREDIT'].sample(n_prod, replace=True).values * 1.12
prod_annuity = reference_data['AMT_ANNUITY'].sample(n_prod, replace=True).values * 1.10
prod_goods = reference_data['AMT_GOODS_PRICE'].sample(n_prod, replace=True).values * 1.15
```

Simulating a younger customer base. If Home Credit launches a marketing campaign targeting younger people, the age distribution shifts. Younger applicants have shorter credit histories and less stable employment — the model might underestimate their risk because it was trained mostly on older applicants.

```python
prod_age = reference_data['AGE_YEARS'].sample(n_prod, replace=True).values - \
           np.random.uniform(0, 5, n_prod)
prod_age = np.clip(prod_age, 20, 70)
```

Assembling all simulated features into one DataFrame, including recomputing engineered features. The drift analysis needs to compare the exact same features between reference and production.
```python
current_data = pd.DataFrame({
    'EXT_SOURCE_1': prod_ext_1,
    'EXT_SOURCE_2': prod_ext_2,
    'EXT_SOURCE_3': prod_ext_3,
    'AMT_INCOME_TOTAL': prod_income,
    'AMT_CREDIT': prod_credit,
    'AMT_ANNUITY': prod_annuity,
    'AMT_GOODS_PRICE': prod_goods,
    'AGE_YEARS': prod_age,
    'CODE_GENDER': reference_data['CODE_GENDER'].sample(n_prod, replace=True).values,
    'FLAG_OWN_CAR': reference_data['FLAG_OWN_CAR'].sample(n_prod, replace=True).values,
    'FLAG_OWN_REALTY': reference_data['FLAG_OWN_REALTY'].sample(n_prod, replace=True).values,
    'CNT_CHILDREN': reference_data['CNT_CHILDREN'].sample(n_prod, replace=True).values,
    'YEARS_EMPLOYED': reference_data['YEARS_EMPLOYED'].sample(n_prod, replace=True).values,
    'YEARS_ID_PUBLISH': reference_data['YEARS_ID_PUBLISH'].sample(n_prod, replace=True).values,
    'EDUCATION_LEVEL': reference_data['EDUCATION_LEVEL'].sample(n_prod, replace=True).values,
})

# Recompute the engineered ratio features exactly as in training
current_data['CREDIT_INCOME_RATIO'] = current_data['AMT_CREDIT'] / (current_data['AMT_INCOME_TOTAL'] + 1)
current_data['ANNUITY_INCOME_RATIO'] = current_data['AMT_ANNUITY'] / (current_data['AMT_INCOME_TOTAL'] + 1)
current_data['CREDIT_GOODS_RATIO'] = current_data['AMT_CREDIT'] / (current_data['AMT_GOODS_PRICE'] + 1)
```

10.3 — Visual comparison

Plotting the distributions of key features side by side. Before running statistical tests, always look at the data. Visualizations reveal patterns that numbers might miss, and they help you sanity-check the statistical results.

```python
import matplotlib.pyplot as plt

key_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'AMT_INCOME_TOTAL',
                'AMT_CREDIT', 'AGE_YEARS', 'CREDIT_INCOME_RATIO']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.suptitle('Reference (blue) vs Production (orange)', fontsize=14)

for idx, feature in enumerate(key_features):
    ax = axes[idx // 3][idx % 3]
    ax.hist(reference_data[feature].dropna(), bins=40, alpha=0.5,
            label='Reference', color='steelblue', density=True)
    ax.hist(current_data[feature].dropna(), bins=40, alpha=0.5,
            label='Production', color='darkorange', density=True)
    ax.set_title(feature)
    ax.legend(fontsize=8)

plt.tight_layout()
plt.show()
```

10.4 — Run Evidently drift report

Using Evidently AI to run formal statistical tests for drift on every feature. Visual comparison is good for getting intuition, but you need statistical rigor for automated decisions. Evidently uses appropriate statistical tests (Kolmogorov-Smirnov for numeric features, chi-square for categorical) and reports whether each feature has significantly drifted. (Evidently AI docs)

```python
from evidently.report import Report
from evidently.metrics import DatasetDriftMetric, DataDriftTable

drift_report = Report(metrics=[
    DatasetDriftMetric(),  # Overall: is there significant drift?
    DataDriftTable(),      # Per-feature: which features drifted?
])

drift_report.run(
    reference_data=reference_data,
    current_data=current_data,
)

# Save as interactive HTML
import os
os.makedirs('monitoring', exist_ok=True)
drift_report.save_html("monitoring/drift_report.html")
```

10.5 — Extract and display results

Extracting the drift results programmatically. The HTML report is great for humans exploring interactively. But for automated monitoring (e.g., a weekly script that sends an alert if drift is detected), you need to extract results as data.
```python
report_dict = drift_report.as_dict()
ds = report_dict['metrics'][0]['result']

print(f"Overall drift detected: {'YES' if ds['dataset_drift'] else 'NO'}")
print(f"Drifted features: {ds['number_of_drifted_columns']} / {ds['number_of_columns']}")
```

Showing per-feature drift details. Knowing that "drift was detected" isn't enough. You need to know which features drifted and how much, so you can investigate the root cause.

```python
drift_table = report_dict['metrics'][1]['result']

print(f"\n{'Feature':<25} {'Drifted?':<10} {'Score':<12} {'Test'}")
print("-" * 65)
for col, info in drift_table['drift_by_columns'].items():
    status = "YES" if info['drift_detected'] else "no"
    score = info['drift_score']
    test = info['stattest_name']
    flag = " << ALERT" if info['drift_detected'] else ""
    print(f"{col:<25} {status:<10} {score:<12.6f} {test}{flag}")
```

10.6 — Interpretation and action plan

Translating statistical results into business actions. Detecting drift is the easy part. The hard part — and the part that actually matters — is deciding what to do about it. A drift detection without an action plan is just an interesting observation.
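As a sketch of what that weekly automated check could look like, here is a hypothetical alert helper built on the result dictionary extracted in section 10.5 (the keys match Evidently's DatasetDriftMetric output; the 30% threshold and the alerting behavior are illustrative choices, not part of this tutorial's pipeline):

```python
def check_drift_and_alert(drift_result: dict, max_drift_share: float = 0.3) -> bool:
    """Return True (alert) when drift is detected or too many columns drifted.

    Expects the dict produced by Evidently's DatasetDriftMetric result, with keys
    'dataset_drift', 'number_of_drifted_columns', and 'number_of_columns'.
    """
    share = drift_result['number_of_drifted_columns'] / drift_result['number_of_columns']
    if drift_result['dataset_drift'] or share > max_drift_share:
        # In a real job this line would page someone: Slack webhook, email, PagerDuty...
        print(f"ALERT: {share:.0%} of features drifted, investigate before trusting predictions")
        return True
    print(f"OK: {share:.0%} of features drifted, below the {max_drift_share:.0%} threshold")
    return False
```

Called as `check_drift_and_alert(ds)` from a scheduled job (cron, or a GitHub Actions `schedule` trigger), this turns the report into an early-warning system rather than a document someone has to remember to open.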
```python
if ds['dataset_drift']:
    print("""
DRIFT DETECTED — Action Required

What this means for the Home Credit model:
The data arriving in production is statistically different
from the data the model was trained on.

Specific shifts identified:
- EXT_SOURCE scores shifted (credit bureau scoring updates)
- Financial amounts increased (inflation / economic growth)
- Applicant age decreased (new younger customer segment)

Recommended actions (in priority order):
1. IMMEDIATE: Evaluate model AUC on recent labeled production data.
   If AUC dropped below 0.70, the model is degraded.
2. SHORT-TERM: Investigate each drifted feature individually.
   Is this a data pipeline bug or a genuine real-world shift?
3. MEDIUM-TERM: If performance degraded, retrain the model
   using recent data that includes the new distributions.
4. LONG-TERM: Set up automated weekly drift monitoring with
   alerts when drift exceeds thresholds.
""")
```

Creating a statistical comparison table. This table gives you concrete numbers to share with stakeholders.
"AMT_CREDIT increased 12% on average" is more actionable than "drift was detected on AMT_CREDIT." comp = pd.DataFrame({ 'Feature': reference_data.columns, 'Train Mean': reference_data.mean().round(2).values, 'Prod Mean': current_data[reference_data.columns].mean().round(2).values, }) comp['Shift %'] = ((comp['Prod Mean'] - comp['Train Mean']) / (comp['Train Mean'] + 0.001) * 100).round(1) print(comp.to_string(index=False)) comp = pd.DataFrame({ 'Feature': reference_data.columns, 'Train Mean': reference_data.mean().round(2).values, 'Prod Mean': current_data[reference_data.columns].mean().round(2).values, }) comp['Shift %'] = ((comp['Prod Mean'] - comp['Train Mean']) / (comp['Train Mean'] + 0.001) * 100).round(1) print(comp.to_string(index=False)) git add notebooks/ monitoring/ git commit -m "feat: add data drift analysis with Evidently AI" git add notebooks/ monitoring/ git commit -m "feat: add data drift analysis with Evidently AI" 11. Performance Optimization The Fiverr client specifically wanted a responsive API. If a customer is waiting for a loan decision and the API takes 10 seconds, that's a terrible user experience. In production, you often have latency budgets (e.g., "predictions must complete in under 100ms"). The optimization workflow is always the same: Measure → Identify bottleneck → Optimize → Measure again. Never optimize without measuring first — you might spend hours optimizing something that isn't actually slow. Measure → Identify bottleneck → Optimize → Measure again 11.1 — Profile with cProfile Using Python's built-in profiler to see exactly which functions take the most time. "The model is slow" is not actionable. "73% of time is spent in predict_proba, specifically in the decision_function call" is actionable. 
```python
import cProfile
import pstats
import io
import time
import joblib
from statistics import mean, stdev

model = joblib.load("model/credit_model.pkl")
feature_columns = joblib.load("model/feature_columns.pkl")
test_row = X_test.iloc[[0]]  # A single row for testing
```

We profile 100 predictions rather than just 1 to get statistically meaningful results:

```python
profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    model.predict_proba(test_row)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative')
stats.print_stats(10)
print(stream.getvalue())
```

11.2 — Establish baseline

Running 1,000 predictions and recording the time for each one. A single measurement can be misleading (maybe the OS was busy for that one call). We need a distribution: the mean tells us typical performance, the standard deviation tells us consistency, and the p95 tells us the worst case for 95% of requests.
```python
n_iterations = 1000
times_sklearn = []

for _ in range(n_iterations):
    start = time.perf_counter()
    model.predict_proba(test_row)
    end = time.perf_counter()
    times_sklearn.append((end - start) * 1000)  # Convert to milliseconds

print(f"Baseline (scikit-learn) — {n_iterations} iterations:")
print(f"  Mean: {mean(times_sklearn):.3f} ms")
print(f"  Std:  {stdev(times_sklearn):.3f} ms")
print(f"  p95:  {np.percentile(times_sklearn, 95):.3f} ms")
```

11.3 — Optimize with ONNX Runtime

Converting our scikit-learn model to the ONNX format and running it with ONNX Runtime. ONNX (Open Neural Network Exchange) is a standard format for ML models, and ONNX Runtime is a highly optimized inference engine built by Microsoft. It applies optimizations like graph simplification, operator fusion, and hardware-specific acceleration that scikit-learn doesn't do. For gradient boosting models, ONNX Runtime can be significantly faster.
(ONNX Runtime docs)

First, convert the model:

```python
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort
import onnx

n_features = len(feature_columns)
initial_type = [('float_input', FloatTensorType([None, n_features]))]

onnx_model = convert_sklearn(model, initial_types=initial_type, target_opset=12)
onnx.save_model(onnx_model, "model/credit_model.onnx")
print(f"ONNX model saved ({n_features} features)")
```

Now benchmark it:

```python
session = ort.InferenceSession("model/credit_model.onnx")
input_name = session.get_inputs()[0].name
test_np = test_row.values.astype(np.float32)  # ONNX needs numpy float32, not a DataFrame

times_onnx = []
for _ in range(n_iterations):
    start = time.perf_counter()
    session.run(None, {input_name: test_np})
    end = time.perf_counter()
    times_onnx.append((end - start) * 1000)

print(f"ONNX Runtime — {n_iterations} iterations:")
print(f"  Mean: {mean(times_onnx):.3f} ms")
print(f"  p95:  {np.percentile(times_onnx, 95):.3f} ms")
```

11.4 — Compare and verify

Comparing
the two approaches and — critically — verifying that the optimized model produces the same predictions. Speed is worthless if accuracy changes. An optimization that makes the model 10x faster but changes predictions by even 0.1% could have real financial consequences in production.

```python
speedup = mean(times_sklearn) / mean(times_onnx)
improvement = (1 - mean(times_onnx) / mean(times_sklearn)) * 100

print(f"sklearn: {mean(times_sklearn):.3f} ms")
print(f"ONNX:    {mean(times_onnx):.3f} ms")
print(f"Speedup: {speedup:.2f}x")
print(f"Improvement: {improvement:.1f}%")

# CRITICAL: verify predictions match
sklearn_proba = model.predict_proba(test_row)[0]
onnx_result = session.run(None, {input_name: test_np})
# skl2onnx classifiers output [labels, probabilities]; by default the
# probabilities come back as a list of {class: proba} dicts
onnx_proba = np.array(list(onnx_result[1][0].values()))

print(f"\nsklearn probas: {sklearn_proba}")
print(f"ONNX probas:    {onnx_proba}")
if np.allclose(sklearn_proba, onnx_proba, atol=1e-4):
    print("Predictions match — safe to deploy the optimized version.")
else:
    print("Predictions diverge — investigate before deploying.")
```

```bash
git add optimization/
git commit -m "feat: add performance profiling and ONNX optimization"
```

12.
The Final Architecture

Let's step back and look at the complete system we've built:

```
┌──────────────────────┐
│      Developer       │
│   (pushes to Git)    │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│   GitHub + CI/CD     │
│ Test → Build → Push  │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│   Docker Container   │
│  ┌────────────────┐  │
│  │  FastAPI API   │  │
│  │  + GBM Model   │  │
│  │  + JSON Logs   │  │
│  └────────────────┘  │
└──────────┬───────────┘
           │
┌──────────┴───────────┐
▼                      ▼
┌─────────────┐   ┌───────────────┐
│ Predictions │   │  Log Storage  │
│ (score,     │   │  (inputs,     │
│  proba,     │   │   outputs,    │
│  risk)      │   │   latency)    │
└─────────────┘   └───────┬───────┘
                          ▼
                  ┌────────────────┐
                  │ Drift Analysis │
                  │ (Evidently AI) │
                  │ + Performance  │
                  │   Profiling    │
                  └────────────────┘
```

| Component | Problem It Solves |
|---|---|
| Git + GitHub | "What changed, when, and why?" |
| FastAPI + Pydantic | "How do others use the model safely?" |
| Pytest | "Will this code change break something?" |
| Docker | "It works on my machine" → "It works everywhere" |
| GitHub Actions CI/CD | "Did someone forget to run tests?" |
| JSON Logging | "What's happening in production?" |
| Evidently AI | "Is the model still relevant?" |
| ONNX Runtime | "Can we make it faster?" |

13. Key Takeaways

After spending a weekend on this (and impressing the Fiverr client), here's what stuck:

Data preparation is more than half the work. The Home Credit dataset has 122 columns, anomalous values (365243 in DAYS_EMPLOYED), heavy class imbalance (92%/8%), and lots of missing values. Real data is always messy. Embrace it.

Load models once at startup, never per request.
This single design choice can make your API 100x faster. It seems obvious in hindsight, but it's the #1 performance mistake in ML APIs.

Validate everything at the boundary. Production data is unpredictable. Pydantic caught edge cases I never would have thought of — negative incomes, ages of 200, strings where numbers should be.

Test invalid inputs, not just valid ones. Half our tests verify that the API properly rejects bad data. This is what prevents silent failures.

Docker eliminates "works on my machine." It's non-negotiable for production ML. Learn it well.

CI/CD is your automated quality gate. It makes it physically impossible to deploy code that doesn't pass tests.

Monitor for drift, or your model will silently degrade. The Home Credit data captures a specific moment. In production, everything shifts — incomes, demographics, bureau scores. Without drift detection, you won't know until someone notices bad predictions manually.

Profile before optimizing. Don't guess where bottlenecks are — measure them. And always verify that optimization doesn't change predictions.

Start simple, iterate. Get a basic API working. Then add Docker. Then CI/CD. Then monitoring. Each layer builds on the previous one. Trying to do everything at once leads to nothing working.
Resources

- Home Credit Default Risk — Kaggle Competition
- FastAPI Documentation
- Docker Get Started
- GitHub Actions Documentation
- Evidently AI Documentation
- ONNX Runtime
- Pydantic V2 Documentation
- scikit-learn Pipeline Documentation
- Pytest Documentation

If you found this useful, clap or share. I'm always happy to discuss MLOps and the messy reality of putting ML models into production.

The complete code is available on GitHub.