Healthcare operations data is rarely “a dataset.” It is a living system. Forms change, codes evolve, staff enter data differently across sites, and upstream systems get patched without warning. If you train a model on top of that without guardrails, you do not have an ML pipeline. You have a one-time experiment.
This post walks through a concise, real-world pipeline for turning messy healthcare ops data into ML-ready features you can trust, rerun, and explain.
Treat data quality as product requirements
Start by writing a simple data contract for your use case. Not what the database allows, but what reality allows.
Examples that show up in real ops workflows:
- Referral date cannot be after discharge date
- Appointment outcome must come from a known set
- Age and vital signs must be in plausible ranges
- One event should not appear twice under different IDs
- Records missing critical fields must go to a “cannot score” bucket
This is what prevents silent corruption. It also makes conversations with stakeholders easier because the rules are explicit.
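These rules can live as a small declarative list next to your pipeline code, which is also a natural place to route failing records to the “cannot score” bucket. Here is a minimal sketch; the rule names, column names, and allowed values are illustrative assumptions, not a fixed schema:
import pandas as pd

# Each rule maps a name to a predicate that is True where the rule is violated.
CONTRACT = {
    "referral_after_discharge": lambda df: (
        pd.to_datetime(df["referral_date"], errors="coerce")
        > pd.to_datetime(df["discharge_date"], errors="coerce")
    ),
    "unknown_appointment_outcome": lambda df: (
        df["appointment_outcome"].notna()
        & ~df["appointment_outcome"].isin(
            {"Completed", "Cancelled", "Did Not Attend", "Rescheduled"})
    ),
    "implausible_age": lambda df: (df["age"] < 0) | (df["age"] > 120),
}

def split_scorable(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    # Rows that violate any rule go to the "cannot score" bucket
    # instead of silently flowing into training or scoring.
    violated = pd.Series(False, index=df.index)
    for predicate in CONTRACT.values():
        violated |= predicate(df)
    return df[~violated], df[violated]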
Validate early with checks that catch breakage fast
Most production issues are obvious if you measure the right things. You do not need complex tooling to catch 80 per cent of problems.
Run three types of checks on every refresh:
- Volume and completeness: row count changes, missingness spikes in key columns
- Validity: allowed values, plausible ranges, date ordering rules
- Duplication: duplicate keys, repeated events, sudden increases in duplicates
Here is a small, reusable pattern:
import pandas as pd

def dq_report(df: pd.DataFrame) -> dict:
    # Volume and completeness: row count plus the most-missing columns
    report = {
        "rows": int(len(df)),
        "missing_pct_top": (df.isna().mean()
                            .sort_values(ascending=False)
                            .head(8) * 100).round(2).to_dict(),
        "violations": {},
    }
    # Validity: plausible ranges
    if "age" in df.columns:
        bad = df["age"].notna() & ((df["age"] < 0) | (df["age"] > 120))
        report["violations"]["age_out_of_range"] = int(bad.sum())
    # Validity: allowed values
    if "appointment_status" in df.columns:
        allowed = {"Completed", "Cancelled", "Did Not Attend", "Rescheduled"}
        bad = df["appointment_status"].notna() & (~df["appointment_status"].isin(allowed))
        report["violations"]["status_invalid"] = int(bad.sum())
    # Validity: date ordering
    if {"referral_date", "discharge_date"}.issubset(df.columns):
        r = pd.to_datetime(df["referral_date"], errors="coerce")
        d = pd.to_datetime(df["discharge_date"], errors="coerce")
        bad = r.notna() & d.notna() & (r > d)
        report["violations"]["referral_after_discharge"] = int(bad.sum())
    return report
The key is not the exact rules. The key is that you run them consistently and store the report so you can spot trends.
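For example, appending each report to a simple run log is enough to spot trends in missingness and violation counts. A minimal sketch, assuming one JSON line per refresh (the file name is arbitrary):
import json
from datetime import datetime, timezone
from pathlib import Path

def store_report(report: dict, path: str = "dq_reports.jsonl") -> None:
    # One JSON line per refresh, stamped with the run time,
    # so spikes in missingness or violations are easy to plot later.
    record = {"run_at": datetime.now(timezone.utc).isoformat(), **report}
    with Path(path).open("a") as f:
        f.write(json.dumps(record) + "\n")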
Engineer features that survive workflow changes
Ops data changes, so fragile features break. I prioritise robust, explainable features that remain meaningful across system updates:
- Time deltas: referral to first contact, referral to appointment, appointment to discharge
- Counts and rates: visits in last 30 days, cancellations in last 90 days, did-not-attend (DNA) rate
- Rolling aggregates: 7 day and 30 day activity windows
- Missingness flags: “missing” is often a signal in real datasets
- Last known state: last appointment outcome, last contact method
A simple test: if a feature could change because someone renamed a code list, add a validation check for that feature’s inputs or do not use it.
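To make that concrete, here is a sketch of a few of these features built per patient. The column names (patient_id, referral_date, first_contact_date, appointment_date, appointment_status) are assumptions about an events table with one row per appointment, not a required schema:
import pandas as pd

def build_patient_features(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    e = events.copy()
    for col in ["referral_date", "first_contact_date", "appointment_date"]:
        e[col] = pd.to_datetime(e[col], errors="coerce")

    # Time delta in days, plus an explicit missingness flag
    e["days_referral_to_first_contact"] = (e["first_contact_date"] - e["referral_date"]).dt.days
    e["first_contact_missing"] = e["first_contact_date"].isna().astype(int)

    # Counts and rates over a trailing 90-day window
    recent = e[e["appointment_date"].between(as_of - pd.Timedelta(days=90), as_of)]
    by_patient = recent.groupby("patient_id")
    feats = pd.DataFrame({
        "appts_last_90d": by_patient.size(),
        "dna_rate_90d": by_patient["appointment_status"]
            .apply(lambda s: (s == "Did Not Attend").mean()),
        "median_days_to_first_contact": by_patient["days_referral_to_first_contact"].median(),
        "first_contact_missing_rate": by_patient["first_contact_missing"].mean(),
    })

    # Last known state: most recent appointment outcome per patient
    last_status = (e.sort_values("appointment_date")
                     .groupby("patient_id")["appointment_status"]
                     .last()
                     .rename("last_appointment_status"))
    return feats.join(last_status, how="outer")
Note the explicit missingness flag: rather than imputing silently, the fact that first contact was never recorded becomes a feature in its own right.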
Make it reproducible and audit-friendly
In healthcare-adjacent work, “trust me” does not scale. Your pipeline should be able to answer:
- Which sources created this dataset?
- What filters and exclusions were applied?
- What features were derived and how?
- Which version of the code produced this output?
- What changed since the last run?
Practical habits (see the sketch after this list):
- Log row counts after each transformation
- Version your feature definitions
- Store data dictionaries
- Keep a “cannot score” path for invalid records
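These habits are cheap to automate. Here is a minimal sketch of a run manifest that answers the questions above; the fields, the git call, and the idea of passing in per-step row counts are assumptions about your setup, not a prescribed format:
import json
import subprocess
from datetime import datetime, timezone

def run_manifest(sources: list[str], row_counts_by_step: dict[str, int],
                 feature_version: str) -> str:
    # Capture which code and inputs produced this output, and how row counts
    # changed along the way, so "what changed since the last run?" is answerable.
    try:
        commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = "unknown"
    return json.dumps({
        "run_at": datetime.now(timezone.utc).isoformat(),
        "sources": sources,
        "row_counts_by_step": row_counts_by_step,
        "feature_definitions_version": feature_version,
        "code_commit": commit,
    }, indent=2)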
Only then should you worry about the model
Good modelling cannot rescue bad inputs. Once your data checks are stable and your features are reproducible, you can move on to modelling with confidence, whether that is stacked ensembles for risk prediction or deep learning for sequential signals.
The real win is not a slightly higher AUROC. The win is a pipeline that keeps producing reliable features next month when the upstream workflow changes.
Closing
If you can turn messy healthcare ops data into stable, validated, explainable features, you have done the hardest part of healthcare ML. Everything else becomes a choice, not a gamble.
