The 7-Step ML Workflow for Imbalanced Clinical Risk Prediction

Written by eferhire | Published 2026/03/23
Tech Story Tags: healthcare-ai | healthcare | ai | machine-learning | artificial-intelligence | hypertension-prediction | class-imbalance | smote-tomek

TL;DR: Skip the accuracy trap: a 7-step ML workflow for imbalanced clinical risk prediction using stacking, SMOTE Tomek, and honest validation.

Hypertension risk prediction sounds straightforward until you touch real clinical-style data. Labels are often imbalanced, features can be messy, and it is easy to report great metrics that disappear the moment the model meets a new cohort.

My work sits at the intersection of healthcare AI, predictive modelling, and practical data delivery. In this post, I will focus on one pattern that consistently performs well for tabular medical risk prediction: stacked tree-based ensembles combined with SMOTE Tomek for imbalance handling, evaluated with sensitivity-first thinking and strict leakage control.

This is the same mindset I apply when supporting high-volume healthcare operational datasets where data quality, validation checks, and documentation matter as much as the model itself.

Why class imbalance changes everything

In many hypertension datasets, the positive class is smaller than the negative class. If you optimise for accuracy, you can build a model that looks “good” while missing a large fraction of the patients you actually care about identifying.

That is why I treat the modelling goal as a decision problem, not just a score maximisation problem. For screening-style use cases, the metrics that matter are:

  • Sensitivity (recall) for the positive class

  • Precision at a chosen operating threshold

  • AUPRC, since it is more informative than AUROC under imbalance

  • Calibration if probabilities will be used for triage
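All four can be computed with scikit-learn. The sketch below uses synthetic labels and scores as stand-ins for real model output, just to show the calls:

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    precision_score,
    recall_score,
)

rng = np.random.default_rng(0)

# Synthetic imbalanced labels (~10% positive) and plausible model scores
y_true = (rng.random(1000) < 0.10).astype(int)
y_prob = np.clip(0.15 + 0.5 * y_true + 0.15 * rng.standard_normal(1000), 0.0, 1.0)

threshold = 0.4
y_pred = (y_prob >= threshold).astype(int)

print("Sensitivity:", recall_score(y_true, y_pred))        # recall for the positive class
print("Precision @ threshold:", precision_score(y_true, y_pred))
print("AUPRC:", average_precision_score(y_true, y_prob))   # average precision estimates AUPRC
print("Brier score:", brier_score_loss(y_true, y_prob))    # lower means better calibrated
```

The Brier score is only a first check on calibration; a reliability curve tells you more if probabilities will drive triage.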

Step 1: Define the label and prevent leakage

Leakage is the fastest way to get impressive results that fail in practice. In medical risk prediction, leakage can come from:

  • Features that indirectly encode the label
  • Measurements taken after diagnosis or treatment
  • Duplicate patient records across the train and test sets
  • Time leakage when predicting future outcomes

A simple rule that saves projects is this: do not let the model see information that would not exist at the time the prediction is made. If you have repeated encounters per patient, split by patient so the same person cannot appear in both training and validation.
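A patient-level split can be enforced with scikit-learn's GroupKFold. The `patient_ids` array below is a hypothetical stand-in for a real encounter table:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)

# Hypothetical layout: repeated encounters per patient
n_encounters = 200
patient_ids = rng.integers(0, 50, size=n_encounters)  # 50 patients, multiple visits each
X = rng.standard_normal((n_encounters, 5))
y = (rng.random(n_encounters) < 0.2).astype(int)

gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=patient_ids):
    train_patients = set(patient_ids[train_idx])
    val_patients = set(patient_ids[val_idx])
    # No patient may appear on both sides of the split
    assert train_patients.isdisjoint(val_patients)
```

For time leakage, the same idea applies with a chronological split instead of a random one.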

Step 2: Build baselines before stacking

Before stacking, I want one or two strong baselines. For tabular healthcare risk prediction, tree-based methods are often effective because they capture nonlinear interactions and handle mixed feature types.

A typical baseline set:

  • Logistic regression with class weights for a sanity check
  • Random forest or extra trees
  • Gradient boosting, such as XGBoost, when available

Baselines also tell you what is hard about the dataset and whether the target is learnable.
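A minimal baseline comparison might look like this, with make_classification standing in for a real dataset and AUPRC as the scoring metric:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for a real dataset (~15% positive)
X, y = make_classification(
    n_samples=2000, n_features=12, weights=[0.85, 0.15], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
baselines = {
    "logreg": LogisticRegression(max_iter=2000, class_weight="balanced"),
    "rf": RandomForestClassifier(
        n_estimators=200, class_weight="balanced_subsample", random_state=42
    ),
}

for name, clf in baselines.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="average_precision")
    print(f"{name}: AUPRC {scores.mean():.3f} +/- {scores.std():.3f}")
```

If a tuned stack cannot clearly beat these numbers, the added complexity is not paying for itself.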

Step 3: Handle imbalance with SMOTE Tomek carefully

SMOTE Tomek combines:

  • SMOTE, which synthesises minority class examples
  • Tomek links, which remove borderline overlap cases

It can improve minority class recall, but only if done correctly. The key constraint is simple:

Resampling must happen only on the training fold inside cross-validation.

If you oversample before splitting, you risk leakage and inflated validation scores.
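The inflation is easy to demonstrate even without SMOTE: naive duplication oversampling applied to the whole dataset before cross-validation puts copies of the same minority rows on both sides of each split. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=1500, n_features=10, weights=[0.9, 0.1], flip_y=0.05, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Honest: no resampling before splitting
honest = cross_val_score(clf, X, y, cv=cv, scoring="average_precision").mean()

# Leaky: minority rows duplicated across the WHOLE dataset before CV, so
# copies of the same rows land in both training and validation folds, and
# the score is also computed at an artificial class ratio
minority = np.flatnonzero(y == 1)
dup = np.concatenate([np.arange(len(y)), np.repeat(minority, 8)])
leaky = cross_val_score(clf, X[dup], y[dup], cv=cv, scoring="average_precision").mean()

print(f"Honest AUPRC: {honest:.3f}")
print(f"Leaky  AUPRC: {leaky:.3f}")  # inflated, and an illusion
```

With SMOTE Tomek the fix is structural: put the resampler inside an imblearn Pipeline, as the full example later in this post does, so it runs only on each training fold.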

Step 4: Use stacking to reduce model-specific blind spots

Stacking helps when different base models make different errors. A small stack is often enough:

  • Base learners: two or three tree ensembles
  • Meta learner: logistic regression for stability and interpretability

The safe way to train a stack is to ensure the meta learner is trained on out-of-fold predictions, not on predictions from models that saw the same rows.

Step 5: Choose a decision threshold on purpose

Healthcare-style prediction is rarely about “probability above 0.5.” It is about aligning the model with operational reality.

I typically choose a threshold by targeting a minimum sensitivity, then checking the resulting precision and alert volume. That makes the model usable for screening and makes the evaluation honest.

Stacking + SMOTE Tomek + sensitivity-first thresholding

The example below is compact enough for a blog post, but complete enough to be practical. It shows:

  • SMOTE Tomek inside the pipeline
  • Stratified cross-validation
  • Out-of-fold probability estimates
  • Threshold selection to hit a sensitivity target
  • A simple stacked ensemble
import numpy as np

from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTETomek

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    precision_recall_curve,
    confusion_matrix,
)
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression


def choose_threshold_for_min_sensitivity(y_true, y_prob, min_sens=0.85):
    """
    Pick a threshold that achieves at least min_sens (recall for positive class),
    and among those thresholds choose the one with the best precision.
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # precision/recall have one more entry than thresholds; pad to align lengths
    thresholds = np.r_[thresholds, 1.0]

    ok = recall >= min_sens
    if not np.any(ok):
        # if target sensitivity cannot be reached, default to a low threshold
        return 0.1

    best = np.argmax(precision[ok])
    return float(thresholds[ok][best])


def report_at_threshold(y_true, y_prob, threshold):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    return {
        "threshold": threshold,
        "tp": int(tp), "fp": int(fp), "tn": int(tn), "fn": int(fn),
        "sensitivity": float(sensitivity),
        "specificity": float(specificity),
        "precision": float(precision),
    }


# X: feature matrix (pandas DataFrame or numpy array)
# y: binary labels (0/1)
# Replace with your dataset:
# X, y = ...

rf = RandomForestClassifier(
    n_estimators=400,
    random_state=42,
    class_weight="balanced_subsample",
    n_jobs=-1,
)

et = ExtraTreesClassifier(
    n_estimators=600,
    random_state=42,
    class_weight="balanced",
    n_jobs=-1,
)

meta = LogisticRegression(max_iter=2000)

stack = StackingClassifier(
    estimators=[("rf", rf), ("et", et)],
    final_estimator=meta,
    stack_method="predict_proba",
    n_jobs=-1,
)

model = Pipeline(steps=[
    ("balance", SMOTETomek(random_state=42)),
    ("clf", stack),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold predicted probabilities (an honest performance estimate)
oof_prob = cross_val_predict(
    model,
    X, y,
    cv=cv,
    method="predict_proba",
    n_jobs=-1,
)[:, 1]

auroc = roc_auc_score(y, oof_prob)
auprc = average_precision_score(y, oof_prob)

threshold = choose_threshold_for_min_sensitivity(y, oof_prob, min_sens=0.90)
summary = report_at_threshold(y, oof_prob, threshold)

print("AUROC:", round(auroc, 4))
print("AUPRC:", round(auprc, 4))
print("Threshold:", summary["threshold"])
print("Sensitivity:", round(summary["sensitivity"], 4))
print("Specificity:", round(summary["specificity"], 4))
print("Precision:", round(summary["precision"], 4))
print("Confusion (tp, fp, tn, fn):", summary["tp"], summary["fp"], summary["tn"], summary["fn"])

What this code is doing correctly:

  • SMOTE Tomek is applied inside the pipeline, so it happens only on training folds

  • Evaluation is done using out-of-fold predictions, which reduces overly optimistic estimates

  • The threshold is selected to meet a sensitivity target, which fits screening-style objectives for hypertension

Step 6: Validate generalisation, not just performance

A single split is rarely enough. I look for stability across folds, and I pay special attention to false negatives. In a hypertension setting, false negatives are the cases you most want to understand because they represent missed risk.

If subgroup information exists, evaluate across groups. If the dataset spans time, evaluate across time windows. If multiple cohorts exist, validate across cohorts. This is where many models fail, and it is better to find that out early.
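Subgroup evaluation is a short loop once you have out-of-fold probabilities. The `subgroup` column below is a hypothetical stand-in for something like site, age band, or sex:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Synthetic stand-ins for out-of-fold labels and probabilities
n = 1000
y_true = (rng.random(n) < 0.15).astype(int)
y_prob = np.clip(0.15 + 0.5 * y_true + 0.2 * rng.standard_normal(n), 0.0, 1.0)
subgroup = rng.choice(["A", "B", "C"], size=n)  # hypothetical grouping column

for g in np.unique(subgroup):
    mask = subgroup == g
    if y_true[mask].min() == y_true[mask].max():
        continue  # AUROC is undefined when a group contains only one class
    auc = roc_auc_score(y_true[mask], y_prob[mask])
    print(f"group {g}: n={mask.sum()}, AUROC={auc:.3f}")
```

A large gap between groups is a finding in itself, whether the cause is data volume, label quality, or a genuinely different population.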

Step 7: Treat documentation and monitoring as part of the model

In healthcare adjacent contexts, a model is not just a notebook. It is an artefact with assumptions.

I document:

  • label definition and inclusion criteria
  • preprocessing steps and feature list
  • validation approach and threshold logic
  • model versioning and reproducibility details
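That documentation can live as a small machine-readable record next to the trained artefact. The fields and values below are illustrative, not from a real project:

```python
import json

# A minimal, hypothetical model card stored alongside the model artefact
model_card = {
    "label_definition": "hypertension diagnosis within 12 months (example)",
    "inclusion_criteria": "adults with >= 1 encounter in the observation window",
    "features": ["age", "bmi", "systolic_bp", "smoking_status"],  # illustrative names
    "validation": "5-fold stratified CV, out-of-fold estimates, patient-level splits",
    "threshold": {"value": 0.31, "rule": "min sensitivity 0.90, best precision"},
    "model_version": "stack-rf-et-lr-v1",
    "random_state": 42,
}

card_json = json.dumps(model_card, indent=2)
print(card_json)
```

Whatever format you choose, the point is that the assumptions travel with the model instead of living only in a notebook.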

If deployed, I monitor:

  • input drift

  • prediction drift

  • performance proxies when ground truth is delayed
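One simple input-drift check compares a feature's live distribution to its training-time reference with a two-sample Kolmogorov-Smirnov test. The data and alert thresholds below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)

# Reference window (training-time feature) vs live window with a simulated shift
reference = rng.normal(loc=120, scale=15, size=5000)  # e.g. systolic BP at training time
live = rng.normal(loc=128, scale=15, size=1000)       # shifted distribution in production

stat, p_value = ks_2samp(reference, live)
print(f"KS statistic={stat:.3f}, p={p_value:.2e}")

# Hypothetical alerting rule: flag the feature when the shift is both
# statistically detectable and practically large
if p_value < 0.01 and stat > 0.1:
    print("Drift alert: investigate this feature before trusting new predictions")
```

The statistic threshold matters as much as the p-value: with enough live data, even a trivially small shift becomes statistically significant.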

Conclusion

Hypertension prediction on imbalanced data is not solved by a single algorithm. It is solved by discipline: leakage control, sensitivity-aligned evaluation, proper handling of imbalances, and validation that reflects how the model will be used.

Stacked tree-based ensembles combined with SMOTE Tomek can be a strong approach when the goal is to improve recall for high-risk patients while maintaining acceptable precision. The real value is not in the model choice alone, but in the workflow that makes the results trustworthy.

