Detecting Medicare Provider Fraud with Machine Learning by@mazhar4ai


Project Objectives

Provider fraud is one of the biggest problems facing Medicare. According to the government, total Medicare spending has increased dramatically due to fraud in Medicare claims.

Healthcare fraud is an organized crime that involves networks of providers, physicians, and beneficiaries acting together to file fraudulent claims. Rigorous analysis of Medicare data has identified many physicians who engage in fraud, for example by using an ambiguous diagnosis code to justify the costliest procedures and drugs.

Insurance companies are the institutions most affected by these bad practices. In response, they raise their premiums, and as a result healthcare becomes costlier by the day. Healthcare fraud and abuse take many forms. Some of the most common types of provider fraud are:

a) Billing for services that were not provided.

b) Duplicate submission of a claim for the same service.

c) Misrepresenting the service provided.

d) Charging for more complex or expensive services than were actually provided.

e) Billing for a covered service when the service actually provided was not covered.

Provider (Healthcare Provider)- The term provider refers to an individual doctor or a hospital facility that provides medical services to the patient.

Medicare- Medicare is an affordable US federal health insurance program for people aged 65 or older and people with disabilities.

Claims- In the US, a healthcare provider submits an electronic file for each patient who has health insurance. These files are submitted to the respective health insurer and contain the patient's conditions, the services provided and their charges, and the patient's diagnoses, along with patient and provider details. All diagnosis and service information is coded according to American medical coding standards.

Problem Statement

The goal of this project is to predict potentially fraudulent providers based on the claims they file, using machine learning. Along with this, we will discover important features helpful in detecting the behavior of potentially fraudulent providers. Further, we will study fraudulent patterns in providers' claims to understand their future behavior.

Introduction to the Dataset

For the purpose of this project, we are considering Inpatient and Outpatient claims and Beneficiary details of customers. Let's see their details:

A) Inpatient Data

This data provides insights into the claims filed for those patients who are admitted to the hospitals. It contains the services provided by the provider. It also provides additional details like their admission and discharge dates and admit diagnosis codes.

B) Outpatient Data

This data provides details about the claims filed for those patients who visit hospitals and were not admitted. Outpatient claims can be filed for many services like surgeries, emergency services, observations, therapies, etc.

C) Beneficiary Details Data

This data contains patients’ details like DOB, DOD, Gender, race, pre-existing health conditions and diseases indicator, the region they belong to, etc.

D) Train/Test Data

There is also a train dataset containing a list of providers labeled as potential fraud or non-fraud. We will use this to label the above datasets for supervised learning. The test dataset contains only the provider list; our task is to predict the potentially fraudulent providers.

Related Research Paper

  1. Title- Predicting Healthcare Fraud in Medicaid: A Multidimensional Data Model and Analysis Techniques for Fraud Detection

This research paper addresses Medicaid healthcare provider fraud detection. For effective fraud detection, one has to look at the data beyond the transaction level. The paper builds upon fraud-type classifications and the Medicaid environment to develop a Medicaid multidimensional schema, along with a set of multidimensional data models and analysis techniques that help predict the likelihood of fraudulent activities.

Within the healthcare system three main parties commit fraud: healthcare providers, beneficiaries (patients), and insurance carriers. According to Sparrow there are two different types of fraud: “hit-and-run” and “steal a little, all the time”. “Hit-and-run” perpetrators simply submit many fraudulent claims, receive payment, and disappear. “Steal a little, all the time” perpetrators work to ensure the fraud goes unnoticed and bill fraudulently over a long period of time. Such a provider may hide false claims within large batches of valid claims and, when caught, will claim it was an error, repay the money, and continue the behavior.

The FBI highlights and categorizes some of the most prevalent known Medicaid fraud schemes:

  • Phantom Billing – Submitting claims for services not provided.
  • Duplicate Billing – Submitting similar claims more than once.
  • Bill Padding – Submitting claims for unneeded ancillary services to Medicaid.
  • Upcoding – Billing for a service with a higher reimbursement rate than the service provided.
  • Unbundling – Submitting several claims for various services that should only be billed as one service.
  • Excessive or Unnecessary Services – Providing medically excessive or unnecessary services to a patient.
  • Kickbacks – A kickback is a form of negotiated bribery in which a commission is paid to the bribe-taker (provider or patient) as a quid pro quo for services rendered.

Sparrow proposes that for effective fraud detection one has to look at the data beyond the transaction level, defining seven levels of healthcare fraud control.

Levels of healthcare fraud control, adapted from Sparrow (figure omitted).


A general core, present in each claim, can be extracted from the different claim forms: patient, provider, diagnoses, procedures, and amounts charged. For each claim-line, a type field links to type-specific detailed information. Based on the desired views, the following dimensions are included: date (claim filed, service, paid), provider (executing, referring, billing), patient, insurer policy, treatment, diagnosis, claim type, drug, outcome, location.

The following numeric facts can be distinguished, some computed from the others: Covered charges ($), Non-covered charges ($), Total charges ($), Units of service, Number of days between claim filed and paid, Number of days between service and claim paid, Distance between provider and patient, Number of days between service and claim filed, Covered price per unit, Total price per unit, and Treatment duration. The figure shows the resulting multidimensional schema.


Data models addressing levels of fraud (figures omitted).

Using this paper, we can see that these levels are also useful for detecting fraud in our data, and we can construct level-wise data by grouping, e.g., by provider, by patient, by patient and provider together, etc.

Some of the levels cannot be built because of dataset limitations; e.g., building level 4 requires insurer information, which is unavailable in the dataset. We will implement levels based on the availability of features in the dataset.
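As a rough sketch of how such level-wise views could be built, assuming a pandas DataFrame with the claim columns named as in the CMS-style files (the mini table below is made up for illustration):

```python
import pandas as pd

# Made-up mini claims table; column names follow the CMS-style dataset
claims = pd.DataFrame({
    "Provider": ["PRV1", "PRV1", "PRV2"],
    "BeneID":   ["B1", "B2", "B1"],
    "InscClaimAmtReimbursed": [100.0, 300.0, 50.0],
})

# Level: per provider (all claims filed by one provider)
per_provider = claims.groupby("Provider")["InscClaimAmtReimbursed"].agg(
    ["count", "sum", "mean"])

# Level: provider x patient (repeated billing of the same beneficiary)
per_provider_patient = (
    claims.groupby(["Provider", "BeneID"])["InscClaimAmtReimbursed"]
    .sum()
    .reset_index())
```

Higher levels (e.g., insurer-wide) would follow the same pattern with different grouping keys, where the data allows.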

  2. Title- Electronic Fraud Detection in the U.S. Medicaid Healthcare Program: Lessons Learned from other Industries

In this paper, the authors conduct a systematic literature study to analyze the applicability of existing electronic fraud detection techniques from similar industries to the US Medicaid program. Although the research focuses on Medicaid, the ideas can be generalized to Medicare, as it is also a government-funded healthcare program.

Healthcare fraud in the United States is a severe problem that costs the government billions of dollars each year. Roughly one-third of all US healthcare costs are attributable to fraud, waste, and abuse. Third-party payers for healthcare services (insurance companies and government-run programs) must deal with fraudulent practitioners, organized criminal schemes, and honest providers who make unintended mistakes while billing for their legitimate services.

Medicaid and Medicare are two government programs that provide medical and health-related services to specific groups of people in the United States. Medicare is a federal program that has consistent rules across the fifty states and covers almost everyone 65 years of age or older as well as people with disabilities. Medicaid is a state-administered program in which each state provides a unique healthcare program for people with low or no income.

Type of fraud in the Healthcare industry:



The structured literature review about fraud detection systems in several industries resulted in an overview of applied fraud detection techniques:


The credit card and telecommunications industries possess real-time data, resolve reported cases of fraud quickly, and, as such, are able to maintain high-quality databases of labeled data that can be used for supervised learning. Medicaid data is dispersed and unlabeled, and there are no signals that this will change in the near future. Multiple stakeholders at the federal and local levels, misaligned incentives, and fragmented responsibility hamper the process of labeling and sharing data. Thus, supervised learning techniques are severely restricted.

Supervised classification models are particularly appropriate for healthcare fraud, as they can be trained and adjusted to detect sophisticated and evolving fraud schemes. In the credit card industry, supervised techniques like neural networks, support vector machines, and random forests form the basis of sophisticated and effective fraud detection; the drawback is that new fraud schemes are not immediately detectable due to the lag in discovering and labeling new fraud for training data. In the telecommunications industry, extensive, high-quality data is available to construct accurate profiles, so unsupervised techniques such as profiling and anomaly detection are applied to complement supervised learning.

  3. Title- Gradient Boosted Decision Tree Algorithms for Medicare Fraud Detection

Insurance fraud spuriously inflates the cost of healthcare. The authors use Medicare claims data as input to various algorithms to gauge their performance in fraud detection. The claims data contain categorical features, some of which have thousands of possible values. To the best of the authors' knowledge, this is the first study on using CatBoost and LightGBM to encode categorical data for Medicare fraud detection.

The data source is publicly available claims data that CMS publishes annually. In addition, there are data on providers that are prohibited from billing Medicare, known as the List of Excluded Individuals and Entities (LEIE). The LEIE data was used to label the claims data as possible fraud or non-fraud. The dataset is highly imbalanced, so the authors used under-sampling techniques to address the class imbalance.

The study uses three types of GBDTs, which produced strong results in previous research: CatBoost, XGBoost, and LightGBM. To provide context for comparison, Random Forest and Logistic Regression are included as well.

Since CatBoost and LightGBM automatically handle categorical features without pre-processing, the HCPCS code, provider gender, provider type, provider state, drug brand name, and drug generic name can be used directly as features for CatBoost and LightGBM. As a result, a dataset with fewer features can be used.
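To illustrate why this matters, here is a small pandas sketch (with made-up codes) of the dimensionality that one-hot encoding would add for a single high-cardinality feature; CatBoost and LightGBM avoid this by accepting the raw column (the commented calls are API sketches, not executed here):

```python
import pandas as pd

# Hypothetical high-cardinality categorical feature, e.g. an HCPCS code
df = pd.DataFrame({"hcpcs_code": ["99213", "99214", "G0008", "99213", "J1100"]})

# One-hot encoding (needed for plain sklearn estimators) adds one column per code
one_hot = pd.get_dummies(df["hcpcs_code"])
n_new_columns = one_hot.shape[1]  # 4 here; thousands in the real claims data

# CatBoost/LightGBM take the raw column instead (sketch, not executed):
#   CatBoostClassifier(cat_features=["hcpcs_code"]).fit(X, y)
#   LightGBM: X["hcpcs_code"] = X["hcpcs_code"].astype("category")
```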

To make a fair baseline comparison, hyper-parameters are kept as close to default values as possible for all classifiers, with no hyper-parameter tuning. The Python APIs (Python version 3.7.3) are used for all classifiers, one GPU is used to fit the CatBoost and XGBoost models, and scikit-learn [28] is used for stratified fivefold cross-validation and AUC calculation.
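The evaluation protocol (stratified fivefold cross-validation scored by AUC) can be reproduced with scikit-learn roughly as follows, using synthetic stand-in data rather than the CMS claims:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the labeled claims features
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Stratified 5-fold CV preserves the fraud ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=cv, scoring="roc_auc")
mean_auc = aucs.mean()
```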

Results and Comparisons of different classifiers:


Existing Solution

There are many resources on this topic for this dataset. I used this resource for a detailed summary, as it covers all the other available solutions.

Data Preprocessing

  • In the Beneficiary data there are two columns, DOB and DOD. Both were used to compute the patient's age and an indicator of whether the patient is dead.

  • In the Inpatient data there are two columns, claim start date and claim end date. Using these, a new column is created for the number of days the patient was admitted to the hospital.

  • Since Inpatient and Outpatient data have similar columns, both are merged into a single dataset.

  • Beneficiary data is also merged with the above dataset on BeneID. Now all three datasets are merged into a single dataset containing the union of all columns.

  • Now this dataset is merged with the Train data on ProviderID so it can be labeled with 0 or 1 (non-fraud and fraud). The Train data contains the provider list with fraud and non-fraud labels.
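A minimal pandas sketch of these preprocessing steps, assuming AdmitForDays is derived from the admission and discharge dates, on tiny made-up frames that mirror the dataset's column names:

```python
import pandas as pd

# Tiny made-up frames mirroring the dataset's column names
inpatient = pd.DataFrame({"BeneID": ["B1"], "Provider": ["PRV1"],
                          "AdmissionDt": ["2009-01-05"],
                          "DischargeDt": ["2009-01-09"]})
beneficiary = pd.DataFrame({"BeneID": ["B1"],
                            "DOB": ["1940-03-01"], "DOD": [pd.NaT]})
train = pd.DataFrame({"Provider": ["PRV1"], "PotentialFraud": ["Yes"]})

# Days admitted (+1 so a same-day stay counts as one day)
inpatient["AdmitForDays"] = (
    pd.to_datetime(inpatient["DischargeDt"])
    - pd.to_datetime(inpatient["AdmissionDt"])).dt.days + 1

# Dead-or-alive indicator derived from DOD
beneficiary["WhetherDead"] = beneficiary["DOD"].notna().astype(int)

# Merge claims with beneficiary details, then attach the label by provider
merged = (inpatient.merge(beneficiary, on="BeneID", how="left")
                   .merge(train, on="Provider", how="left"))
```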


  • Percent Distribution of fraud class in –
    • transactional data (merged data)


    • train data


  • State-wise beneficiary distribution
  • Top 10 procedures involved in Healthcare fraud


  • Top 20 attending physicians involved in healthcare fraud


Some Feature Engineering

  • Train and test data are merged to engineer some features, such as averages, more accurately, but only the test data will be used for evaluation

  • This merged data is grouped by provider, and averages are computed for some columns, for example:

    • InscClaimAmtReimbursed

    • DeductibleAmtPaid

    • IPAnnualReimbursementAmt

    • PerProviderAvg_OPAnnualReimbursementAmt

    • PerProviderAvg_NoOfMonths_PartACov

    • AdmitForDays

  • In the same way, the merged data is grouped by other columns: BeneID, AttendingPhysician, OperatingPhysician, diagnosis code groups 1/2/3…, claim procedure codes 1/2/3…

  • This averaging is done to impute values for large categorical features

  • The merged data is also grouped by two or more columns, as fraud involves the provider, customer, attending physician, etc. acting together, so these averages may carry some importance for fraud detection. Some of the grouped-by column combinations are:

    • Provider and BeneID

    • Provider and Attending physician

    • Provider and diagnosis code group 1, 2, 3..

    • Provider, BeneID and Attending physician

  • Grouping by the first two characters of the diagnosis code may not be a wise decision, as 120+ groups are created, giving 120+ dimensions for each diagnosis code column and increasing the computational complexity
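The per-entity averaging described above maps naturally onto pandas groupby-transform, which broadcasts a group statistic back to every claim row; a sketch with made-up values:

```python
import pandas as pd

claims = pd.DataFrame({
    "Provider": ["PRV1", "PRV1", "PRV2"],
    "AttendingPhysician": ["PHY1", "PHY2", "PHY1"],
    "InscClaimAmtReimbursed": [100.0, 300.0, 500.0],
})

# Per-entity average, broadcast back to each claim row
for col in ["Provider", "AttendingPhysician"]:
    claims[f"Per{col}Avg_InscClaimAmtReimbursed"] = (
        claims.groupby(col)["InscClaimAmtReimbursed"].transform("mean"))

# Claim count for an entity combination (provider with attending physician)
claims["ClmCount_Provider_AttendingPhysician"] = (
    claims.groupby(["Provider", "AttendingPhysician"])["InscClaimAmtReimbursed"]
    .transform("count"))
```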

Some data preprocessing after the above engineering

  • All the numerical columns are filled with 0 for NA values

  • Some columns are then removed from the data, as they are either not useful or already converted into averaged or other features. The removed columns are:

remove_these_columns=['BeneID', 'ClaimID', 'ClaimStartDt','ClaimEndDt','AttendingPhysician', 
'OperatingPhysician', 'OtherPhysician', 'ClmDiagnosisCode_1', 'ClmDiagnosisCode_2', 
'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4', 'ClmDiagnosisCode_5', 'ClmDiagnosisCode_6', 
'ClmDiagnosisCode_7', 'ClmDiagnosisCode_8', 'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10',
'ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3','ClmProcedureCode_4', 
'ClmProcedureCode_5', 'ClmProcedureCode_6', 'ClmAdmitDiagnosisCode', 'AdmissionDt',
'DischargeDt', 'DiagnosisGroupCode','DOB', 'DOD', 'State', 'County']

  • Gender and Race are categorical, so they are converted to 0/1 indicator features

  • The target value is also replaced with 0 and 1 instead of yes and no

  • Now ‘sum’ aggregations are done on the data grouped by provider and potential fraud

  • Features and labels are split into X and y

  • Data standardization is applied to the entire dataset

  • X and y are split into train and validation sets
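Taken together, those steps form a short pipeline; a sketch on a made-up labeled frame (the real feature set is much wider):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up claim-level frame already carrying the provider's 0/1 label
df = pd.DataFrame({
    "Provider": ["PRV1", "PRV1", "PRV2", "PRV3"],
    "PotentialFraud": [1, 1, 0, 0],
    "InscClaimAmtReimbursed": [100.0, None, 50.0, 70.0],
})

df = df.fillna(0)  # numeric NAs -> 0

# 'sum' aggregation grouped by provider and label
agg = df.groupby(["Provider", "PotentialFraud"], as_index=False).sum()

X = agg.drop(columns=["Provider", "PotentialFraud"])
y = agg["PotentialFraud"]

X_std = StandardScaler().fit_transform(X)  # standardize all features

# Train/validation split (toy data is too small to stratify here)
X_tr, X_val, y_tr, y_val = train_test_split(X_std, y, test_size=0.34,
                                            random_state=0)
```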


Many classifiers are used and their performance compared.

  • Logistic Regression

The data is trained with Logistic Regression using balanced class weights and cross-validation, as the data is imbalanced. The performance of the model in terms of various metrics:

Accuracy Train:         0.922365988909427
Accuracy Val:           0.9125077017868145
Sensitivity Train :     0.7627118644067796
Sensitivity Val:        0.6776315789473685
Specificity Train:      0.9388290125254879
Specificity Val:        0.9367777022433719
Kappa Value :           0.5438304105142315
AUC         :           0.8072046405953702 (threshold 0.60)
F1-Score Train  :       0.6474820143884892
F1-Score Val  :         0.5919540229885056

We can see that accuracy is very high for Logistic Regression, though on imbalanced data accuracy alone can be misleading.
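A hedged sketch of this setup with scikit-learn, using synthetic imbalanced data in place of the engineered provider features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the engineered claim features
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' up-weights the minority (fraud) class
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_val, clf.predict(X_val)).ravel()
sensitivity = tp / (tp + fn)   # recall on the fraud class
specificity = tn / (tn + fp)
auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
```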

  • Random Forest

This time an ensemble Random Forest is trained on the same dataset. The Random Forest classifier is initialized with 500 base learners and depth 4. These are the scores of the Random Forest evaluation.

Accuracy Train :               0.8885661473461843
Accuracy Test :                0.8712261244608749
Sensitivity :                  0.8157894736842105
Specificity :                  0.8769544527532291
Kappa Value :                  0.47733173495472203
AUC         :                  0.8463719632187199
F1-Score Train                 0.6026365348399246
F1-Score Validation :          0.5426695842450766

Accuracy and F1 score are not better than LR, but sensitivity and specificity are better. Overall, LR performs better here.

These are the top 20 important features and their scores:

Variable: PerProviderAvg_InscClaimAmtReimbursed                 Importance: 0.08
Variable: InscClaimAmtReimbursed                                 Importance: 0.07
Variable: PerAttendingPhysicianAvg_InscClaimAmtReimbursed       Importance: 0.07
Variable: PerOperatingPhysicianAvg_InscClaimAmtReimbursed       Importance: 0.06
Variable: PerClmAdmitDiagnosisCodeAvg_InscClaimAmtReimbursed     Importance: 0.04
Variable: PerClmAdmitDiagnosisCodeAvg_DeductibleAmtPaid          Importance: 0.04
Variable: PerClmDiagnosisCode_1Avg_DeductibleAmtPaid             Importance: 0.04
Variable: PerOperatingPhysicianAvg_IPAnnualReimbursementAmt       Importance: 0.03
Variable: ClmCount_Provider_ClmDiagnosisCode_7                   Importance: 0.03
Variable: ClmCount_Provider_ClmDiagnosisCode_8                   Importance: 0.03
Variable: ClmCount_Provider_ClmDiagnosisCode_9                   Importance: 0.03
Variable: DeductibleAmtPaid                                      Importance: 0.02
Variable: AdmitForDays                                           Importance: 0.02
Variable: PerProviderAvg_DeductibleAmtPaid                       Importance: 0.02
Variable: PerAttendingPhysicianAvg_DeductibleAmtPaid             Importance: 0.02

Some more classifiers are tried such as-

clfs = {
    'svm1': SVC(C=0.01, kernel='linear', probability=True),
    'svm2': SVC(C=0.01, kernel='rbf', probability=True),
    'svm3': SVC(C=0.01, kernel='poly', degree=2, probability=True),
    'ada': AdaBoostClassifier(),
    'dtc': DecisionTreeClassifier(class_weight='balanced'),
    'gbc': GradientBoostingClassifier(),
    'lr': LogisticRegression(class_weight='balanced'),
    'xgb': XGBClassifier(booster='gbtree')
}

They are all trained on the same dataset, and the performance results (validation F1 scores) are listed below:

{'ada': 0.5309090909090909,
 'dtc': 0.42857142857142855,
 'gbc': 0.5873015873015873,
 'lr': 0.5974683544303796,
 'svm1': 0.5069124423963134,
 'svm2': 0.5483870967741936,
 'svm3': 0.41791044776119407,
 'xgb': 0.5477178423236515}

We can see that the linear model LR beats all other models in terms of F1 score. Since the sensitivity score is also high, the trained model is not simply predicting the majority class.
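The comparison loop behind these numbers can be sketched like this (only two of the models shown, scored by validation F1 on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Two of the compared models; the full dict also holds SVMs, AdaBoost, GBC, XGB
clfs = {
    "dtc": DecisionTreeClassifier(class_weight="balanced", random_state=0),
    "lr": LogisticRegression(class_weight="balanced", max_iter=1000),
}

X, y = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Fit each model and record its validation F1 score
scores = {name: f1_score(y_val, model.fit(X_tr, y_tr).predict(X_val))
          for name, model in clfs.items()}
```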

These were the experiments done in this solution. A DL model is also used in the sample solution, trained and evaluated on the provider dataset, but since this case study is limited to machine learning, I am not discussing that model.

First Cut Approach

  1. Claim diagnosis codes are very important in fraud activity. For example, if a provider files a claim for a cold, hardly anyone can verify whether the patient really had a cold a few days ago, so providers may use such codes frequently for reimbursement. The key idea is that similar diseases usually share the same first two characters of the diagnosis (dx) code. So, in feature engineering, I would group diagnosis codes by their first two characters, then create new features by aggregating other numerical features (mean or another statistic) over each diagnosis group. Below are two bar graphs: 1. the top 10 diagnosis groups involved, and 2. the top 10 diagnoses involved. There is some overlap, but the top 3 are very different, so transforming features by diagnosis-code group may also improve model performance.


  2. Since we will train and test by grouping on providers, I have another feature engineering idea: applying the idea of TF-IDF to dx codes. When we group by provider, each provider's row can be treated as a document and each dx code used by the provider as a word; the total number of rows after grouping is the total number of documents. For a missing dx code, the TF-IDF value would be 0.0. This transformation would be done separately for each dx code column. If fraudulent providers share a pattern of using specific dx codes, their TF-IDF scores would be similar. We can also generate more TF-IDF-like features on dx codes by grouping on other features, such as the beneficiary or attending physician (basically grouping by all entities that could be involved in fraud activity, per domain knowledge).

  3. See below the example of a single provider. This can be thought of as a single document: each dx in dx code column 1 is a word, and the total number of unique dx codes in column 1 is the vocabulary. Similarly, we can do this for each dx code column, generating 10 new features for each grouping entity.


  4. In the previous solution, there were no features specific to each dx code; all features were generated by grouping on these dx codes and aggregating the mean of other numerical features

  5. All ML models should be fine-tuned with different hyperparameters, which could give better results.

  6. We can also train the model on the entire data without grouping, where there is more opportunity to learn patterns. When grouping by provider, we must aggregate all features with some statistic (sum, mean, median, etc.), so some information or patterns may be lost in the aggregation.

  7. We would implement the different levels of fraud analysis from research paper 1 above. Seven levels are discussed, but we would implement only those for which data is available; per the available data, up to level 4 can be implemented.

  8. As per the 3rd research paper discussed above, we would try the GBDT variations CatBoost and LightGBM, as well as Random Forest, SVM, and Logistic Regression classifiers, to compare the performance of each.

Exploratory Data Analysis (EDA)

  • Plotting the frequencies of the fraud and non-fraud classes in the claim transactional data

Percent Distribution of Potential Fraud class:- 
0    61.878931
1    38.121069
Name: PotentialFraud, dtype: float64
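A distribution like the one above comes from a simple value_counts; for reference, on a made-up label column:

```python
import pandas as pd

# Made-up label column standing in for PotentialFraud on the merged data
labels = pd.Series([0] * 62 + [1] * 38, name="PotentialFraud")

percent = labels.value_counts(normalize=True) * 100  # percent per class
```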



  • There is class imbalance, but there is enough data in the minority class for the model to learn.
  • 38% of transactions are fraudulent, which is very high.

  • Plotting the frequencies of fraud and non-fraud providers in provider-labeled data

    Percent Distribution of Potential Fraud class:- 
    No     90.64695
    Yes     9.35305
    Name: PotentialFraud, dtype: float64



  • The fraud-provider label data is highly imbalanced.
  • Only 9% of providers are listed as fraud, which is very low.
  • Those 9% of fraud providers account for 38% of transactions.
  • Since we are training the model at the transaction level, we don't need to worry about class imbalance.

Univariate and Bivariate Analysis

  • State-wise Beneficiary involved in Healthcare Fraud



    • State 5 has the most fraud transactions
    • State 45 has the most non-fraud transactions
    • States 5, 31, 49, and 46 have more fraud cases than non-fraud cases
    • These states' transactions could help the model predict fraud cases
  • Race-wise Beneficiary involved in Healthcare Fraud



  • There is no Race 4 data

  • Race 1 beneficiaries have the most fraud and also the most non-fraud cases

  • There is no clear race pattern separating fraud from non-fraud cases, so this feature may have low importance

    Individual Beneficiaries involved in Healthcare Fraud



  • Beneficiary BENE42721 is never involved in fraud
  • Beneficiary BENE143400 is involved only in fraud
  • Some beneficiaries are mostly involved in fraud and some are mostly not
  • There is a pattern for each individual beneficiary, so this may be a feature with good importance

  • Distribution of # of claims billed by fraud and nonfraud providers



  • The distributions for fraud and non-fraud providers are very different, so this feature will greatly help the model predict the class
  • The non-fraud providers' curve is sharply peaked, i.e., non-fraud providers have very high density where fraud providers have very low density

  • Distribution of Average AdmitForDays for each fraud and nonfraud providers



  • It also shows different distributions for fraud and non-fraud providers; there is some overlap, but this feature will still help the model predict the class
  • The fraud providers' curve is sharply peaked, i.e., fraud providers have very high density where non-fraud providers have very low density
  • Fraud providers tend to bill most of their claims with an average of 4-9 admit days

  • Distribution of average DeductibleAmtPaid of claims billed by fraud and non-fraud providers