Hypertension risk prediction sounds straightforward until you touch real clinical-style data. Labels are often imbalanced, features can be messy, and it is easy to report great metrics that disappear the moment the model meets a new cohort.
My work sits at the intersection of healthcare AI, predictive modelling, and practical data delivery. In this post, I will focus on one pattern that consistently performs well for tabular medical risk prediction: stacked tree-based ensembles combined with SMOTE Tomek for imbalance handling, evaluated with sensitivity-first thinking and strict leakage control.
This is the same mindset I apply when supporting high-volume healthcare operational datasets where data quality, validation checks, and documentation matter as much as the model itself.
Why class imbalance changes everything
In many hypertension datasets, the positive class is smaller than the negative class. If you optimise for accuracy, you can build a model that looks “good” while missing a large fraction of the patients you actually care about identifying.
That is why I treat the modelling goal as a decision problem, not just a score maximisation problem. For screening-style use cases, the metrics that matter are:
- Sensitivity (recall) for the positive class
- Precision at a chosen operating threshold
- AUPRC, since it is more informative than AUROC under imbalance
- Calibration, if probabilities will be used for triage
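On the calibration point, a quick reliability check can be run directly on out-of-fold probabilities. A minimal sketch using scikit-learn's `calibration_curve`, with synthetic probabilities standing in for real model output:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Hypothetical out-of-fold probabilities; labels drawn so the
# probabilities are well calibrated by construction.
rng = np.random.default_rng(42)
y_prob = rng.random(2000)
y_true = (rng.random(2000) < y_prob).astype(int)

# Reliability curve: bin the predictions and compare the mean predicted
# probability to the observed positive rate within each bin.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted={p:.2f}  observed={f:.2f}")
```

If predicted and observed values diverge badly, the raw probabilities should not be used for triage without recalibration.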
Step 1: Define the label and prevent leakage
Leakage is the fastest way to get impressive results that fail in practice. In medical risk prediction, leakage can come from:
- Features that indirectly encode the label
- Measurements taken after diagnosis or treatment
- Duplicate patient records across the train and test splits
- Time leakage when predicting future outcomes
A simple rule that saves projects is this: do not let the model see information that would not exist at the time the prediction is made. If you have repeated encounters per patient, split by patient so the same person cannot appear in both training and validation.
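A patient-level split is a one-liner with scikit-learn's `GroupShuffleSplit`; the toy data below is a stand-in for repeated encounters per patient:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical example: 8 encounters from 4 patients.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
patient_ids = np.array(["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"])

# GroupShuffleSplit keeps every encounter of a patient on one side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

train_patients = set(patient_ids[train_idx])
test_patients = set(patient_ids[test_idx])
assert train_patients.isdisjoint(test_patients)  # no patient crosses the split
```

For cross-validation the same idea applies via `StratifiedGroupKFold` or `GroupKFold`.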
Step 2: Build baselines before stacking
Before stacking, I want one or two strong baselines. For tabular healthcare risk prediction, tree-based methods are often effective because they capture nonlinear interactions and handle mixed feature types.
A typical baseline set:
- Logistic regression with class weights for a sanity check
- Random forest or extra trees
- Gradient boosting, such as XGBoost, when available
Baselines also tell you what is hard about the dataset and whether the target is learnable.
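As a sketch, here is a class-weighted logistic regression baseline on a synthetic imbalanced dataset (standing in for real hypertension data), scored with average precision rather than accuracy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a hypertension dataset (roughly 10% positive class).
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42
)

baseline = LogisticRegression(class_weight="balanced", max_iter=2000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score with average precision (AUPRC), not accuracy, to respect the imbalance.
scores = cross_val_score(baseline, X, y, cv=cv, scoring="average_precision")
print(f"Baseline AUPRC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If a stacked ensemble later cannot beat this number by a meaningful margin, the added complexity is not paying for itself.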
Step 3: Handle imbalance with SMOTE Tomek carefully
SMOTE Tomek combines:
- SMOTE, which synthesises minority class examples
- Tomek links, which remove borderline overlap cases
It can improve minority class recall, but only if done correctly. The key constraint is simple:
Resampling must happen only on the training fold inside cross-validation.
If you oversample before splitting, you risk leakage and inflated validation scores.
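The difference is easy to demonstrate. The sketch below uses simple random oversampling in place of SMOTE Tomek to stay dependency-free, but the leakage mechanics are identical: resampling before the split lets copies of minority rows land on both sides of a fold and inflates the validation score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Random duplication stands in for SMOTE here; the leakage pattern is the same.
def oversample(X, y, rng):
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    return X[idx], y[idx]

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: oversample first, then split. Duplicated minority rows appear in
# both the training and validation side of a fold and get memorised.
X_leak, y_leak = oversample(X, y, rng)
leaky = []
for tr, te in cv.split(X_leak, y_leak):
    clf = DecisionTreeClassifier(random_state=0).fit(X_leak[tr], y_leak[tr])
    leaky.append(average_precision_score(y_leak[te], clf.predict_proba(X_leak[te])[:, 1]))

# Safe: split first, then oversample only the training fold.
safe = []
for tr, te in cv.split(X, y):
    X_tr, y_tr = oversample(X[tr], y[tr], rng)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    safe.append(average_precision_score(y[te], clf.predict_proba(X[te])[:, 1]))

print(f"leaky AUPRC: {np.mean(leaky):.3f}")  # inflated by memorised duplicates
print(f"safe AUPRC:  {np.mean(safe):.3f}")
```

The gap between the two numbers is exactly the optimism you would otherwise report as performance.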
Step 4: Use stacking to reduce model-specific blind spots
Stacking helps when different base models make different errors. A small stack is often enough:
- Base learners: two or three tree ensembles
- Meta learner: logistic regression for stability and interpretability
The safe way to train stacking is to ensure the meta learner is trained on out of fold predictions, not predictions from models that saw the same rows.
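This is what scikit-learn's `StackingClassifier` does internally via its `cv` parameter; a manual sketch of the out-of-fold meta-feature construction (on synthetic data) makes the idea explicit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    ExtraTreesClassifier(n_estimators=100, random_state=42),
]

# Each column of the meta-feature matrix is one base model's out-of-fold
# probability, so the meta learner never sees in-sample predictions.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
    for m in base_models
])

meta = LogisticRegression(max_iter=2000).fit(meta_features, y)
print("meta coefficients:", np.round(meta.coef_, 3))
```

The logistic-regression coefficients also double as a crude readout of how much the meta learner trusts each base model.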
Step 5: Choose a decision threshold on purpose
Healthcare-style prediction is rarely about “probability above 0.5.” It is about aligning the model with operational reality.
I typically choose a threshold by targeting a minimum sensitivity, then checking the resulting precision and alert volume. That makes the model usable for screening and makes the evaluation honest.
Stacking + SMOTE Tomek + sensitivity-first thresholding
The following example is compact enough for a blog post, yet complete enough to be practical. It shows:
- SMOTE Tomek inside the pipeline
- Stratified cross-validation
- Out-of-fold probability estimates
- Threshold selection to hit a sensitivity target
- A simple stacked ensemble
```python
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTETomek
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    precision_recall_curve,
    confusion_matrix,
)
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression


def choose_threshold_for_min_sensitivity(y_true, y_prob, min_sens=0.85):
    """
    Pick a threshold that achieves at least min_sens (recall for the positive class),
    and among those thresholds choose the one with the best precision.
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    thresholds = np.r_[thresholds, 1.0]  # align lengths
    ok = recall >= min_sens
    if not np.any(ok):
        # if the target sensitivity cannot be reached, default to a low threshold
        return 0.1
    best = np.argmax(precision[ok])
    return float(thresholds[ok][best])


def report_at_threshold(y_true, y_prob, threshold):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "threshold": threshold,
        "tp": int(tp), "fp": int(fp), "tn": int(tn), "fn": int(fn),
        "sensitivity": float(sensitivity),
        "specificity": float(specificity),
        "precision": float(precision),
    }


# X: feature matrix (pandas DataFrame or numpy array)
# y: binary labels (0/1)
# Replace with your dataset:
# X, y = ...

rf = RandomForestClassifier(
    n_estimators=400,
    random_state=42,
    class_weight="balanced_subsample",
    n_jobs=-1,
)
et = ExtraTreesClassifier(
    n_estimators=600,
    random_state=42,
    class_weight="balanced",
    n_jobs=-1,
)
meta = LogisticRegression(max_iter=2000)

stack = StackingClassifier(
    estimators=[("rf", rf), ("et", et)],
    final_estimator=meta,
    stack_method="predict_proba",
    n_jobs=-1,
)

model = Pipeline(steps=[
    ("balance", SMOTETomek(random_state=42)),
    ("clf", stack),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold predicted probabilities (honest performance estimate)
oof_prob = cross_val_predict(
    model,
    X, y,
    cv=cv,
    method="predict_proba",
    n_jobs=-1,
)[:, 1]

auroc = roc_auc_score(y, oof_prob)
auprc = average_precision_score(y, oof_prob)

threshold = choose_threshold_for_min_sensitivity(y, oof_prob, min_sens=0.90)
summary = report_at_threshold(y, oof_prob, threshold)

print("AUROC:", round(auroc, 4))
print("AUPRC:", round(auprc, 4))
print("Threshold:", summary["threshold"])
print("Sensitivity:", round(summary["sensitivity"], 4))
print("Specificity:", round(summary["specificity"], 4))
print("Precision:", round(summary["precision"], 4))
print("Confusion (tp, fp, tn, fn):", summary["tp"], summary["fp"], summary["tn"], summary["fn"])
```
What this code is doing correctly:
- SMOTE Tomek is applied inside the pipeline, so it happens only on training folds
- Evaluation is done using out-of-fold predictions, which reduces overly optimistic estimates
- The threshold is selected to meet a sensitivity target, which fits the hypertension screening-style objective
Step 6: Validate generalisation, not just performance
A single split is rarely enough. I look for stability across folds, and I pay special attention to false negatives. In a hypertension setting, false negatives are the cases you most want to understand because they represent missed risk.
If subgroup information exists, evaluate across groups. If the dataset spans time, evaluate across time windows. If multiple cohorts exist, validate across cohorts. This is where many models fail, and it is better to find that out early.
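A minimal sketch of a per-subgroup sensitivity report, using synthetic predictions and an illustrative site column as the subgroup:

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical out-of-fold predictions plus a subgroup column (e.g. site or sex).
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.8, y_true, 1 - y_true)  # noisy predictions
group = rng.choice(["site_a", "site_b"], size=500)

# Report sensitivity per subgroup; large gaps flag generalisation problems.
results = {}
for g in np.unique(group):
    mask = group == g
    results[g] = recall_score(y_true[mask], y_pred[mask])
    print(f"{g}: sensitivity={results[g]:.3f}, n={mask.sum()}")
```

The same loop works for time windows or cohorts: swap the group column for a period or cohort identifier.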
Step 7: Treat documentation and monitoring as part of the model
In healthcare adjacent contexts, a model is not just a notebook. It is an artefact with assumptions.
I document:
- label definition and inclusion criteria
- preprocessing steps and feature list
- validation approach and threshold logic
- model versioning and reproducibility details
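One way to make this concrete is a small structured record saved next to the model artefact. The field names and values below are illustrative, not a standard schema:

```python
import json

# Illustrative artefact record (fields and values are hypothetical).
model_card = {
    "label_definition": "hypertension diagnosis within 12 months of index visit",
    "inclusion_criteria": "adults with at least one BP measurement, no prior diagnosis",
    "features": ["age", "bmi", "systolic_bp", "diastolic_bp"],
    "validation": {
        "scheme": "5-fold stratified CV, grouped by patient",
        "threshold_rule": "minimum sensitivity 0.90, then best precision",
    },
    "version": {
        "model": "stack-rf-et-v1",
        "data_snapshot": "YYYY-MM-DD",
        "random_state": 42,
    },
}
print(json.dumps(model_card, indent=2))
```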
If deployed, I monitor:
- input drift
- prediction drift
- performance proxies when ground truth is delayed
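Input drift can be tracked per feature with the population stability index (PSI). A dependency-free sketch, with the common rule-of-thumb cut-offs noted as an assumption rather than a standard:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference feature distribution and new data.
    Rule of thumb (an assumption, not from this post): < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 investigate."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
reference = rng.normal(120, 15, size=5000)  # e.g. systolic BP at training time
shifted = rng.normal(130, 15, size=5000)    # hypothetical drifted production data

print(f"PSI (no drift):   {population_stability_index(reference, reference):.4f}")
print(f"PSI (mean shift): {population_stability_index(reference, shifted):.4f}")
```

Prediction drift can be monitored the same way, by applying the function to the score distribution instead of a feature.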
Conclusion
Hypertension prediction on imbalanced data is not solved by a single algorithm. It is solved by discipline: leakage control, sensitivity-aligned evaluation, proper handling of imbalances, and validation that reflects how the model will be used.
Stacked tree-based ensembles combined with SMOTE Tomek can be a strong approach when the goal is to improve recall for high-risk patients while maintaining acceptable precision. The real value is not in the model choice alone, but in the workflow that makes the results trustworthy.
