Detecting Medicare Provider Fraud with Machine Learning

by Mazhar Ali (@mazhar4ai), July 9th, 2022

Too Long; Didn't Read

The goal of this project is to predict potentially fraudulent providers from the claims they file, using machine learning algorithms. Total Medicare spending has increased exponentially due to fraud in Medicare claims. Healthcare fraud is an organized crime in which providers, physicians, and beneficiaries act together to file fraudulent claims. We will study fraudulent patterns in providers' claims to understand their future behaviour. We consider Inpatient claims, Outpatient claims, and Beneficiary details. Outpatient claims can be filed for many services, such as surgeries, emergency services, observations, and therapies.

Project Objectives

Provider fraud is one of the biggest problems facing Medicare. According to the government, total Medicare spending has increased exponentially due to fraud in Medicare claims.

Healthcare fraud is an organized crime in which providers, physicians, and beneficiaries act together to file fraudulent claims. Rigorous analysis of Medicare data has identified many physicians who engage in fraud. They use ambiguous diagnosis codes to justify the costliest procedures and drugs.

Insurance companies are the institutions most affected by these practices. In response, they raise their premiums, and as a result healthcare is becoming costlier day by day. Healthcare fraud and abuse take many forms. Some of the most common types of fraud by providers are:

a) Billing for services that were not provided.

b) Duplicate submission of a claim for the same service.

c) Misrepresenting the service provided.

d) Charging for more complex or expensive services than were actually provided.

e) Billing for a covered service when the service actually provided was not covered.

Provider (Healthcare Provider)- The term provider refers to an individual doctor or a hospital facility that provides medical services to patients.

Medicare- Medicare is an affordable US federal health insurance program for older people and people with disabilities.

Claims- In the US, a healthcare provider submits an electronic file for each patient who has health insurance. These files are submitted to the patient's health insurer. They contain the patient's conditions and diagnosis, the services provided and their charges, and details of both the patient and the provider. All information about diagnoses and services provided is coded according to the American Medical Standard.

Problem Statement

The goal of this project is to predict potentially fraudulent providers based on the claims they file, using a machine learning algorithm. Along with this, we will identify the features that are most helpful in detecting the behavior of potentially fraudulent providers. Further, we will study fraudulent patterns in providers' claims to understand their future behavior.

Introduction to the Dataset

For the purpose of this project, we are considering Inpatient and Outpatient claims and Beneficiary details of customers. Let's see their details:

A) Inpatient Data

This data provides insights into the claims filed for those patients who are admitted to the hospitals. It contains the services provided by the provider. It also provides additional details like their admission and discharge dates and admit diagnosis codes.

B) Outpatient Data

This data provides details about the claims filed for those patients who visit hospitals and were not admitted. Outpatient claims can be filed for many services like surgeries, emergency services, observations, therapies, etc.

C) Beneficiary Details Data

This data contains patients’ details like DOB, DOD, Gender, race, pre-existing health conditions and diseases indicator, the region they belong to, etc.

D) Train/Test Data

There is also a train dataset that contains the list of providers labeled as potential fraud or non-fraud. We will use it to label the above datasets for supervised learning. The test dataset contains only the provider list. Our task is to predict which providers are potentially fraudulent.

Related Research Paper

  1. Title- Predicting Healthcare Fraud in Medicaid: A Multidimensional Data Model and Analysis Techniques for Fraud Detection

This research paper addresses Medicaid healthcare provider fraud detection. For effective fraud detection, one has to look at the data beyond the transaction level. The paper builds upon fraud type classifications and the Medicaid environment to develop a Medicaid multidimensional schema, and provides a set of multidimensional data models and analysis techniques that help to predict the likelihood of fraudulent activities.

Within the healthcare system three main parties commit fraud: healthcare providers, beneficiaries (patients), and insurance carriers. According to Sparrow there are two different types of fraud: “hit-and-run” and “steal a little, all the time”. “Hit-and-run” perpetrators simply submit many fraudulent claims, receive payment, and disappear. “Steal a little, all the time” perpetrators work to ensure fraud goes unnoticed and bill fraudulently over a long period of time. The provider may hide false claims within large batches of valid claims and, when caught, will claim it an error, repay the money, and continue the behavior.

The FBI highlights and categorizes some of the most prevalent known Medicaid fraud schemes:

  • Phantom Billing – Submitting claims for services not provided.
  • Duplicate Billing – Submitting similar claims more than once.
  • Bill Padding – Submitting claims for unneeded ancillary services to Medicaid.
  • Upcoding – Billing for a service with a higher reimbursement rate than the service provided.
  • Unbundling – Submitting several claims for various services that should only be billed as one service.
  • Excessive or Unnecessary Services – Provides medically excessive or unnecessary services to a patient.
  • Kickbacks – A kickback is a form of negotiated bribery in which a commission is paid to the bribe-taker (provider or patient) as a quid pro quo for services rendered.

Sparrow proposes that for effective fraud detection one has to look at the data beyond the transaction level, defining seven levels of healthcare fraud control.

Levels of healthcare fraud control, adapted, from Sparrow:

A general core that exists in every claim can be extracted from the different claim forms: patient, provider, diagnoses, procedures, and amounts charged. For each claim-line, a type field links to type-specific detailed information. Based on the desired views, the following dimensions are included: date (claim filed, service, paid), provider (executing, referring, billing), patient, insurer policy, treatment, diagnosis, claim type, drug, outcome, location.

The following numeric facts can be distinguished, some of them computed from the others: Covered charges ($), Non-covered charges ($), Total charges ($), Units of service, Number of days between claim filed and paid, Number of days between service and claim paid, Distance between provider and patient, Number of days between service and claim filed, Covered price per unit, Total price per unit, and Treatment duration. The figure shows the resulting multidimensional schema.
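As a rough illustration of how some of these numeric facts derive from the others, here is a minimal sketch. The `ClaimLine` class, its field names, and the sample values are hypothetical and chosen only for this example; they are not taken from the paper's schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ClaimLine:
    # Hypothetical field names, for illustration only
    service_date: date
    filed_date: date
    paid_date: date
    covered_charges: float
    non_covered_charges: float
    units_of_service: int


def derived_facts(c: ClaimLine) -> dict:
    """Compute the facts that follow from the base facts of one claim line."""
    total = c.covered_charges + c.non_covered_charges
    return {
        "total_charges": total,
        "days_service_to_filed": (c.filed_date - c.service_date).days,
        "days_filed_to_paid": (c.paid_date - c.filed_date).days,
        "days_service_to_paid": (c.paid_date - c.service_date).days,
        "covered_price_per_unit": c.covered_charges / c.units_of_service,
        "total_price_per_unit": total / c.units_of_service,
    }


# Made-up claim line
claim = ClaimLine(date(2009, 1, 5), date(2009, 1, 12), date(2009, 2, 1),
                  covered_charges=300.0, non_covered_charges=100.0,
                  units_of_service=4)
facts = derived_facts(claim)
print(facts)
```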

Data models addressing levels of fraud:

From this paper we can see that all these levels are useful in detecting fraud. We can build the data for each level by grouping, for example by provider, by patient, by patient and provider, etc.

Some of the levels cannot be built because of dataset limitations, e.g. grouping for level 4 requires insurer information, which is unavailable in the dataset. We will implement levels based on the availability of features in the dataset.

  2. Title- Electronic Fraud Detection in the U.S. Medicaid Healthcare Program: Lessons Learned from other Industries

In this paper, we conduct a systematic literature study to analyze the applicability of existing electronic fraud detection techniques in similar industries to the US Medicaid program. Although the research was done on Medicaid, the idea can be generalized to Medicare, as it is also a government-funded healthcare program.

Healthcare fraud in the United States is a severe problem that costs the government billions of dollars each year. Roughly one-third of all US healthcare costs are attributable to fraud, waste, and abuse. Third-party payers for healthcare services (insurance companies and government-run programs) must deal with fraudulent practitioners, organized criminal schemes, and honest providers who make unintended mistakes while billing for their legitimate services.

Medicaid and Medicare are two government programs that provide medical and health-related services to specific groups of people in the United States. Medicare is a federal program that has consistent rules across the fifty states and covers almost everyone 65 years of age or older, as well as disabled persons. Medicaid is a state-administered program in which each state provides a unique health care program for people with low income or no income.

Type of fraud in the Healthcare industry:

The structured literature review about fraud detection systems in several industries resulted in an overview of applied fraud detection techniques:

The credit card and telecommunications industries possess real-time data, resolve reported cases of fraud quickly, and, as such, are able to maintain high-quality databases of labeled data that can be used for supervised learning. Medicaid data is dispersed and unlabeled, and there are no signals that this will change in the near future. Multiple stakeholders at the federal and local levels, misaligned incentives, and fragmented responsibility hamper the process of labeling and sharing data. Thus, supervised learning techniques are severely restricted.

Supervised classification models are particularly appropriate for use in health care fraud, as they can be trained and adjusted to detect sophisticated and evolving fraud schemes. In the credit card industry, supervised classification techniques like neural networks, support vector machines, and random forests form the basis for sophisticated and effective fraud detection. The drawback to these techniques is that new fraud schemes are not immediately detectable due to the lag of discovering and labeling new fraud in training data. In the telecommunications industry, unsupervised techniques such as profiling and anomaly detection are applied to complement supervised learning. In the telecommunications industry, extensive, high-quality data is available that is used to construct accurate profiles.

  3. Title- Gradient Boosted Decision Tree Algorithms for Medicare Fraud Detection

Insurance fraud spuriously inflates the cost of Healthcare. We use Medicare claims data as input to various algorithms to gauge their performance in fraud detection. The claims data contain categorical features, some of which have thousands of possible values. To the best of our knowledge, this is the first study on using CatBoost and LightGBM to encode categorical data for Medicare fraud detection.

The data source is publicly available claims data that CMS publishes annually. Moreover, we have data on providers that are prohibited from billing Medicare. These data are known as the List of Excluded Individuals and Entities (LEIE). In the research, LEIE data was used to label claims data as possible fraud and non-fraud. The dataset is highly imbalanced, so the authors used under-sampling techniques to address the class imbalance.
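To make the under-sampling idea concrete, here is a minimal sketch of random under-sampling of the majority class. The 1:1 target ratio, the function name `undersample`, and the toy rows are assumptions for illustration; the exact sampling ratios used in the paper are not stated here.

```python
import random


def undersample(rows, label_of, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Assumes binary labels 0/1 and a 1:1 target ratio (an illustrative choice).
    """
    rng = random.Random(seed)
    pos = [r for r in rows if label_of(r) == 1]
    neg = [r for r in rows if label_of(r) == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))  # sample without replacement
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


# Made-up rows: (claim_id, label), 5 fraud and 95 non-fraud
rows = [(i, 1 if i < 5 else 0) for i in range(100)]
balanced = undersample(rows, label_of=lambda r: r[1])
n_pos = sum(1 for r in balanced if r[1] == 1)
n_neg = sum(1 for r in balanced if r[1] == 0)
print(n_pos, n_neg)
```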

Here we use three different types of classifiers. In previous research, we obtain strong results with GBDTs so we use three types of GBDTs: CatBoost, XGBoost and LightGBM. To provide a context for comparison, we include Random Forest and Logistic Regression.

Since CatBoost and LightGBM automatically handle categorical features without needing to pre-process data, we are able to easily use the HCPCS code, provider gender, provider type, provider state, drug brand name or drug generic name as features for CatBoost and LightGBM. As a result, we are able to use a dataset having fewer features.

To make a fair baseline comparison, we decided to use hyper-parameters as close to default values as possible for all classifiers. We do not attempt any hyper-parameter tuning. We use the Python application programming interfaces (API’s) for all classifiers. The Python version we use is 3.7.3. We use one GPU to fit both CatBoost and XGBoost models. We use Python Scikit-learn [28] for stratified fivefold cross-validation and AUC calculation.

Results and Comparisons of different classifiers:

Existing Solution

There are many resources on this topic for this dataset. I used this resource for a detailed summary, as it covers all the other solutions available.

Data Preprocessing

  • In the Beneficiary data there are two columns, one for DOB and one for DOD. Both were used to compute the patient's age and an indicator of whether the patient is dead.

  • In the Inpatient data there are two columns, claim start date and claim end date. From these, a new column is created for the number of days the patient was admitted to the hospital.

  • Since the Inpatient and Outpatient data have similar columns, they are merged into a single dataset.

  • Beneficiary data is also merged with the above dataset on BeneID. All three datasets are now merged into a single dataset that contains the union of all columns.

  • This dataset is then merged with the Train data on ProviderID so that each row is labeled 0 or 1 (non_fraud and fraud). Train data contains the provider list with fraud and non_fraud labels.
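The preprocessing steps above can be sketched with pandas roughly as follows. The tiny tables are made up, the provider key is assumed to be a column named `Provider`, and the reference date used to compute the age of living patients is an assumption; the original solution may differ in these details.

```python
import pandas as pd

# Tiny synthetic tables; values are made up for illustration
bene = pd.DataFrame({
    "BeneID": ["B1", "B2"],
    "DOB": ["1940-01-01", "1950-06-15"],
    "DOD": [None, "2009-03-01"],
})
inpatient = pd.DataFrame({
    "BeneID": ["B1"], "ClaimID": ["C1"], "Provider": ["P1"],
    "ClaimStartDt": ["2009-01-01"], "ClaimEndDt": ["2009-01-05"],
})
outpatient = pd.DataFrame({
    "BeneID": ["B2"], "ClaimID": ["C2"], "Provider": ["P2"],
    "ClaimStartDt": ["2009-02-01"], "ClaimEndDt": ["2009-02-01"],
})
labels = pd.DataFrame({"Provider": ["P1", "P2"],
                       "PotentialFraud": ["Yes", "No"]})

# Age and death indicator from DOB / DOD
bene["DOB"] = pd.to_datetime(bene["DOB"])
bene["DOD"] = pd.to_datetime(bene["DOD"])
bene["IsDead"] = bene["DOD"].notna().astype(int)
ref_date = pd.Timestamp("2009-12-01")  # assumed reference date for living patients
bene["Age"] = (bene["DOD"].fillna(ref_date) - bene["DOB"]).dt.days // 365

# Number of admitted days from claim start / end dates (inpatient only)
for df in (inpatient, outpatient):
    df["ClaimStartDt"] = pd.to_datetime(df["ClaimStartDt"])
    df["ClaimEndDt"] = pd.to_datetime(df["ClaimEndDt"])
inpatient["AdmitForDays"] = (inpatient["ClaimEndDt"]
                             - inpatient["ClaimStartDt"]).dt.days

# Stack inpatient and outpatient claims (union of columns), then merge
claims = pd.concat([inpatient, outpatient], ignore_index=True, sort=False)
claims["AdmitForDays"] = claims["AdmitForDays"].fillna(0)
claims = claims.merge(bene, on="BeneID", how="left")
claims = claims.merge(labels, on="Provider", how="left")
claims["PotentialFraud"] = claims["PotentialFraud"].map({"Yes": 1, "No": 0})
print(claims[["Provider", "AdmitForDays", "IsDead", "PotentialFraud"]])
```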


  • Percent Distribution of fraud class in –
    • transactional data(merged data)-

    • train data

  • State-wise beneficiary distribution
  • Top 10 procedures involved in Healthcare fraud

  • Top 20 attending physicians involved in healthcare fraud

Some Feature Engineering

  • Train and test data are merged so that features such as averages can be engineered more accurately. Only the test data will be used for evaluation

  • This merged data is grouped by provider, and the average of some columns is computed. For example-

    • InscClaimAmtReimbursed

    • DeductibleAmtPaid

    • IPAnnualReimbursementAmt

    • PerProviderAvg_OPAnnualReimbursementAmt

    • PerProviderAvg_NoOfMonths_PartACov

    • AdmitForDays

  • In the same way, the merged data is grouped by other columns: BeneID, Attending Physician, Operating Physician, Diagnosis code group_1/2/3.., Claim procedure code_1/2/3...

  • This averaging is done to replace categorical features that have many distinct values with numeric values

  • The merged data is also grouped by two or more columns, since fraud involves the provider, customer, attending physician, etc. acting together, so these averages may be informative for fraud detection. Some of the column combinations are-

    • Provider and BeneID

    • Provider and Attending physician

    • Provider and diagnosis code group 1, 2, 3..

    • Provider, BeneID and Attending physician

  • Grouping by the first two characters of the diagnosis code may not be a wise decision: it creates 120+ groups, and therefore 120+ dimensions for each diagnosis code column, which increases the computational complexity
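The grouped-average features described above can be sketched with a pandas `groupby(...).transform(...)`. The helper `add_group_mean`, the naming scheme for multi-column groups, and the sample data are assumptions for illustration; the single-column names follow the `PerProviderAvg_...` pattern that appears in the feature-importance list.

```python
import pandas as pd

# Made-up claims
claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P2", "P2", "P2"],
    "BeneID": ["B1", "B2", "B1", "B3", "B3"],
    "InscClaimAmtReimbursed": [100.0, 300.0, 50.0, 70.0, 90.0],
})


def add_group_mean(df, keys, col):
    """Add a column holding the mean of `col` within each group of `keys`."""
    name = "Per" + "".join(keys) + "Avg_" + col
    df[name] = df.groupby(keys)[col].transform("mean")
    return name


# Average per provider, and per (provider, beneficiary) pair
by_provider = add_group_mean(claims, ["Provider"], "InscClaimAmtReimbursed")
by_pair = add_group_mean(claims, ["Provider", "BeneID"], "InscClaimAmtReimbursed")

# Claim count per provider (counts non-null rows in each group)
claims["ClmCount_Provider"] = claims.groupby("Provider")["BeneID"].transform("count")
print(claims)
```

Using `transform` keeps one row per claim, so the group statistics can sit next to the original columns before any later aggregation.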

Some data preprocessing after above engineering

  • All the numerical columns filled with 0s for NA values

  • Some columns are now removed from the data, either because they are not useful or because they have been converted into an averaging feature or some other feature. These columns are –

remove_these_columns=['BeneID', 'ClaimID', 'ClaimStartDt','ClaimEndDt','AttendingPhysician', 
'OperatingPhysician', 'OtherPhysician', 'ClmDiagnosisCode_1', 'ClmDiagnosisCode_2', 
'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4', 'ClmDiagnosisCode_5', 'ClmDiagnosisCode_6', 
'ClmDiagnosisCode_7', 'ClmDiagnosisCode_8', 'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10',
'ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3','ClmProcedureCode_4', 
'ClmProcedureCode_5', 'ClmProcedureCode_6', 'ClmAdmitDiagnosisCode', 'AdmissionDt',
'DischargeDt', 'DiagnosisGroupCode','DOB', 'DOD', 'State', 'County']

  • Gender and Race are categorical, so they are converted to 0/1 indicator features

  • The target value is also replaced with 1 and 0 instead of yes and no

  • ‘sum’ aggregations are then computed on the data grouped by provider and potential fraud

  • Features and Labels are split into x and y

  • Data standardizations are done on the entire data

  • X and y data is split into train and validation data
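A minimal sketch of these last steps, assuming pandas and scikit-learn and a made-up table, could look like this. It follows the order described above (standardize everything, then split); fitting the scaler on the training split only is a common alternative that avoids leaking validation statistics into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up claim-level data
claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"],
    "Gender": [1, 2, 1, 1, 2, 2, 1, 2],
    "InscClaimAmtReimbursed": [100, 300, 50, 70, 90, 10, 500, 20],
    "PotentialFraud": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes"],
})

# Categorical column -> 0/1 indicator columns; target -> 1/0
claims = pd.get_dummies(claims, columns=["Gender"], dtype=int)
claims["PotentialFraud"] = claims["PotentialFraud"].map({"Yes": 1, "No": 0})

# One row per provider: sum every feature within (provider, label)
per_provider = claims.groupby(["Provider", "PotentialFraud"], as_index=False).sum()

X = per_provider.drop(columns=["Provider", "PotentialFraud"])
y = per_provider["PotentialFraud"]

# Standardize, then split into train and validation sets (stratified by label)
X_std = StandardScaler().fit_transform(X)
X_train, X_val, y_train, y_val = train_test_split(
    X_std, y, test_size=0.5, stratify=y, random_state=0)
print(X_train.shape, X_val.shape)
```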


Several classifiers are trained and their performance is compared.

  • Logistic Regression

Logistic regression is trained with balanced class weights and cross-validation, since the data is imbalanced. The performance of the model in terms of various metrics is:

Accuracy Train:         0.922365988909427
Accuracy Val:           0.9125077017868145
Sensitivity Train :     0.7627118644067796
Sensitivity Val:        0.6776315789473685
Specificity Train:      0.9388290125254879
Specificity Val:        0.9367777022433719
Kappa Value :           0.5438304105142315
AUC         :           0.8072046405953702 (threshold 0.60)
F1-Score Train  :       0.6474820143884892
F1-Score Val  :         0.5919540229885056

We can see that accuracy is high for logistic regression, although with about 90% of providers labeled non-fraud, accuracy alone is not very informative.
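For context on the balanced weighting used here: scikit-learn's `class_weight='balanced'` option weights each class by `n_samples / (n_classes * class_count)`. The sketch below recomputes that formula in plain Python on made-up labels (roughly 90% non-fraud, 10% fraud); the function name and the counts are illustrative, not values from the dataset.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


y = [0] * 900 + [1] * 100  # made-up labels
weights = balanced_class_weights(y)
print(weights)
```

With these counts the minority class gets a weight nine times larger than the majority class, so errors on fraud providers cost more during training.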

  • Random Forest

This time an ensemble model, random forest, is trained on the same dataset. The random forest classifier is initialized with 500 base learners and a maximum depth of 4. These are its evaluation scores.

Accuracy Train :               0.8885661473461843
Accuracy Test :                0.8712261244608749
Sensitivity :                  0.8157894736842105
Specificity :                  0.8769544527532291
Kappa Value :                  0.47733173495472203
AUC         :                  0.8463719632187199
F1-Score Train                 0.6026365348399246
F1-Score Validation :          0.5426695842450766

Accuracy, specificity, and F1 score are lower than for LR, but sensitivity and AUC are higher. Overall, judged by F1 score, LR performs better here.

These are the top20 important features and scores-

Variable: PerProviderAvg_InscClaimAmtReimbursed                 Importance: 0.08
Variable: InscClaimAmtReimbursed                                 Importance: 0.07
Variable: PerAttendingPhysicianAvg_InscClaimAmtReimbursed       Importance: 0.07
Variable: PerOperatingPhysicianAvg_InscClaimAmtReimbursed       Importance: 0.06
Variable: PerClmAdmitDiagnosisCodeAvg_InscClaimAmtReimbursed     Importance: 0.04
Variable: PerClmAdmitDiagnosisCodeAvg_DeductibleAmtPaid          Importance: 0.04
Variable: PerClmDiagnosisCode_1Avg_DeductibleAmtPaid             Importance: 0.04
Variable: PerOperatingPhysicianAvg_IPAnnualReimbursementAmt       Importance: 0.03
Variable: ClmCount_Provider_ClmDiagnosisCode_7                   Importance: 0.03
Variable: ClmCount_Provider_ClmDiagnosisCode_8                   Importance: 0.03
Variable: ClmCount_Provider_ClmDiagnosisCode_9                   Importance: 0.03
Variable: DeductibleAmtPaid                                      Importance: 0.02
Variable: AdmitForDays                                           Importance: 0.02
Variable: PerProviderAvg_DeductibleAmtPaid                       Importance: 0.02
Variable: PerAttendingPhysicianAvg_DeductibleAmtPaid             Importance: 0.02

Some more classifiers are tried such as-

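from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Candidate classifiers, keyed by a short name
clfs = {
    'svm1': SVC(C=0.01, kernel='linear', probability=True),
    'svm2': SVC(C=0.01, kernel='rbf', probability=True),
    'svm3': SVC(C=0.01, kernel='poly', degree=2, probability=True),
    'ada': AdaBoostClassifier(),
    'dtc': DecisionTreeClassifier(class_weight='balanced'),
    'gbc': GradientBoostingClassifier(),
    'lr': LogisticRegression(class_weight='balanced'),
    'xgb': XGBClassifier(booster='gbtree'),
}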

They all are trained on the same dataset and performance results are listed below-

{'ada': 0.5309090909090909,
 'dtc': 0.42857142857142855,
 'gbc': 0.5873015873015873,
 'lr': 0.5974683544303796,
 'svm1': 0.5069124423963134,
 'svm2': 0.5483870967741936,
 'svm3': 0.41791044776119407,
 'xgb': 0.5477178423236515}

We can see that the linear model LR beats all other models in terms of F1 score. Since its sensitivity is also reasonably high, the trained model is not simply predicting the majority class.

These were the experiments done in this solution. The sample solution also trains a DL model and evaluates it on the provider dataset. Since this case study is limited to machine learning, I am not discussing that model.

First Cut Approach

  1. The claim diagnosis code is very important in fraud activity. For example, if a provider files a claim for a cold, it is hard for anyone to verify afterwards whether the patient really had a cold a few days earlier. So providers may use such codes frequently to obtain reimbursement. The key idea is that similar diseases usually share the same first two characters. So in feature engineering, I would group diagnosis codes by the first two characters of the dx code (diagnosis code). After that, I would create new features by transforming other numerical features into the mean, or some other statistic, grouped by diagnosis group. Here I am showing two bar graphs: 1. the top 10 diagnosis groups involved and 2. the top 10 diagnoses involved. There are some overlaps, but the top 3 are very different diagnosis code groups. So transforming features by grouping diagnosis codes may also improve model performance.

  2. Since we will train and test on data grouped by provider, I have another feature engineering idea: applying TF-IDF to dx codes. When we group by provider, each provider row can be considered a document and each dx code used by the provider a word. For a missing dx code we would use a tf-idf value of 0.0. This transformation would be done separately for each dx code column. The total # of rows after grouping by provider is the total # of documents, and each dx code in each dx column is a word. If fraud providers share a pattern of using specific dx codes, their tf-idf scores would be similar. We can also generate more tf-idf-like features on dx codes by grouping on other features, such as the beneficiary, the attending physician, etc. (basic grouping by all entities that can possibly be involved in fraud activity, based on domain knowledge).

  3. See below the example of a single provider, which can be thought of as a single document. Each dx in dx code column 1 is a word. The set of unique dx codes in dx code column 1 is the vocabulary. Similarly, we can do this for each column of dx codes. This will generate 10 new features for each grouping entity.

  4. In the previous solution, there were no features specific to each dx code. All the features were generated by grouping on these dx codes and aggregating other numerical features by their mean.

  5. All ML models should be fine-tuned with different hyper-parameters, which could give better results.

  6. We can also train the model on the entire data without grouping, which gives more opportunity to learn patterns. When we group by provider we have to aggregate all the features by some statistic like sum, mean, median, etc., so some information or pattern may be lost in the aggregation.

  7. We would implement the different levels of fraud analysis from research paper 1 listed above. There are 7 levels discussed, but we would implement only those for which data is available. As per the available data, up to level 4 can be implemented.

  8. As per the 3rd research paper discussed above, we would try the GBDT variations CatBoost and LightGBM, as well as others like Random Forest, SVM, and Logistic Regression, to compare the performance of each classifier.
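The diagnosis-code grouping from item 1 can be sketched as follows. The helper `dx_group` and the sample codes are illustrative assumptions; missing codes are mapped to `None` so they can be skipped or handled separately.

```python
from collections import Counter


def dx_group(code):
    """Return the first two characters of a diagnosis code, or None if missing."""
    if code is None or str(code).strip() == "":
        return None
    return str(code).strip()[:2]


# Made-up diagnosis codes for a handful of claims
codes = ["4019", "4011", "25000", "2724", None, "V5869"]
groups = [dx_group(c) for c in codes]

# Most frequent diagnosis group among the non-missing codes
top_group = Counter(g for g in groups if g is not None).most_common(1)[0]
print(groups, top_group)
```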

Exploratory Data Analysis (EDA)

  • Plotting the frequencies of fraud and nonfraud classes in claim transactional data

Percent Distribution of Potential Fraud class:- 
0    61.878931
1    38.121069
Name: PotentialFraud, dtype: float64


  • There is class imbalance, but the minority class has enough data for the model to learn from.
  • 38% of transactions are labeled fraud, which is very high.

  • Plotting the frequencies of fraud and non-fraud providers in provider-labeled data

    Percent Distribution of Potential Fraud class:- 
    No     90.64695
    Yes     9.35305
    Name: PotentialFraud, dtype: float64


  • The fraud provider labels data is highly imbalanced.
  • Only 9% of providers are listed as fraud, which is very low.
  • These 9% of providers account for 38% of transactions.
  • Since we are training the model at the transaction level, we don't need to worry about class imbalance.
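The gap between the two distributions arises because a provider's label is copied to every claim it files, so a few high-volume providers can dominate the claim-level counts. A small pandas sketch on made-up data (three providers, one of them labeled fraud) shows the effect:

```python
import pandas as pd

# Made-up claims: provider P1 is labeled fraud and files most of the claims
claims = pd.DataFrame({
    "Provider": ["P1"] * 6 + ["P2"] * 2 + ["P3"] * 2,
    "PotentialFraud": [1] * 6 + [0] * 4,
})

# Percent distribution at the claim (transaction) level
claim_pct = claims["PotentialFraud"].value_counts(normalize=True) * 100

# Percent distribution at the provider level (one row per provider)
provider_pct = (claims.drop_duplicates("Provider")["PotentialFraud"]
                .value_counts(normalize=True) * 100)

print(claim_pct)
print(provider_pct)
```

Here one fraud provider out of three accounts for more than half of the claims.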

Univariate and Bivariate Analysis

  • State-wise Beneficiary involved in Healthcare Fraud


    • State 5 has the largest number of fraud transactions
    • State 45 has the largest number of non-fraud transactions
    • States 5, 31, 49 and 46 have more fraud cases than non-fraud cases
    • Transactions from these states could help the model predict fraud cases
  • Race-wise Beneficiary involved in Healthcare Fraud


  • There is no Race 4 data

  • Race 1 beneficiaries have the largest number of both fraud and non-fraud cases

  • Race shows no pattern across fraud and non-fraud cases, so this feature may have low importance

  • Individual Beneficiaries involved in Healthcare Fraud


  • Beneficiary BENE42721 is not involved in fraud
  • Beneficiary BENE143400 is involved only in fraud
  • Some beneficiaries are mostly involved in fraud and some are mostly not involved
  • There is some pattern for individual beneficiaries, so this feature may have good importance

  • Distribution of # of claims billed by fraud and nonfraud providers


  • The distributions for fraud and non-fraud providers are very different, so this feature should help the model considerably in predicting the class
  • The non-fraud curve is sharply peaked, meaning non-fraud providers are concentrated in a narrow range of claim counts, while fraud providers are spread out at low density

  • Distribution of Average AdmitForDays for each fraud and nonfraud providers


  • The distributions for fraud and non-fraud providers also differ here. There is some overlap, but this feature should still help the model predict the class
  • The fraud curve is sharply peaked, meaning fraud providers are concentrated in a narrow range, while non-fraud providers are spread out at low density
  • Fraud providers tend to bill most of their claims with an average of 4-9 admit days

  • Distribution of average DeductibleAmtPaid of claims billed by fraud and non fraud providers


  • The distributions for fraud and non-fraud providers also differ here. There is some overlap, but this feature should still help the model predict the class
  • The non-fraud curve is sharply peaked, meaning non-fraud providers are concentrated in a narrow range, while fraud providers are spread out at low density
  • Most of the claims billed by non-fraud providers have an average DeductibleAmtPaid between -150 and 200 $

  • Distribution of average Age of Beneficiary in claims billed by fraud and nonfraud providers


  • The distributions for fraud and non-fraud providers also differ here. There is some overlap, but this feature should still help the model predict the class
  • The fraud curve is sharply peaked, meaning fraud providers are concentrated in a narrow range, while non-fraud providers are spread out at low density
  • Most of the claims billed by fraud providers have an average beneficiary age of 70-80
  • This suggests that fraud providers mostly target elderly beneficiaries (patients) for their fraud activity

Overall Observation:

  • In the fraud provider labels data there are very few fraud providers (9%), but in the transaction data 38% of claims are billed by these 9% of providers
  • Since we are training the model on transaction data, not at the provider level, we don't need to worry about data imbalance
  • In the bar graphs, some features show patterns related to the classes and some show no observable pattern
  • In the distribution plots, we grouped data by provider and saw clearly observable differences between the distributions
  • These large differences between the class distributions strongly motivate grouping the data by entities like Provider, BeneID, ClaimDxCode, etc. and aggregating with mean, count, etc.

Multivariate analysis

Observations: With this large number of data points, the plot shows many fraud transactions to be linearly separable. Fraud providers mostly have 0 admit days and file a high number of claims. A second group of fraud providers mostly has between 5 and 10 admit days and also files a high number of claims. In simple terms, fraud providers bill most of their claims with either 0 admit days or 5-10 admit days.

Observations: Pair plots are drawn for 3 random features from the dataset, 3 times. Since the number of features is high, we cannot plot all of them.

After analyzing all the pair plots we can conclude that -

  • The features ClmProcedureCode_2TF and PerClmProcedureCode_1Avg_InscClaimAmtReimbursed show good clustering of fraud and non_fraud providers

  • Fraud providers mostly bill claims for patients with ChronicCond_stroke=1

  • Fraud providers tend to bill more claims than non_fraud providers using ClmDiagnosisCode_6

  • Some features show random behavior in the plots but may still contribute to classification

Observations: Many fraud providers have a different interquartile range (IQR) than non_fraud providers, so this feature may help in classifying some fraud providers

Observations: For this feature (ClmCount_Provider), fraud and non_fraud providers have very different IQRs, so it may be one of the most important features for classification

Observations: The heatmap is plotted for 10 random features. A plot of all features would not be readable. The heatmap shows the correlation between each pair of features. A value close to 1 means a strong positive correlation: as x increases, y also increases. A value close to -1 means a strong negative correlation: as x increases, y decreases.

  • 3rd feature on x-axis is highly correlated with 4th and 7th feature on y-axis

  • 4th feature on x-axis is highly correlated with 4th and 8th feature on y-axis

  • 7th feature on x-axis is highly correlated with 7th and 8th feature on y-axis
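Each cell of such a heatmap is a correlation coefficient between two features. As a small worked example, here is the Pearson correlation computed by hand on made-up values (assuming the heatmap uses Pearson correlation, which is the default in pandas' `DataFrame.corr`):

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


x = [1.0, 2.0, 3.0, 4.0]
r_pos = pearson(x, [2.0, 4.0, 6.0, 8.0])   # y rises with x
r_neg = pearson(x, [8.0, 6.0, 4.0, 2.0])   # y falls as x rises
print(r_pos, r_neg)
```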

Observations: In 3D space of 3 features most of the fraud data points are linearly separable

Feature Engineering

We can always engineer additional features from existing ones using domain knowledge, statistical methods, mathematical methods, etc. We have engineered many features using statistics such as average and count. Some of the features engineered in this case study are shown below:

#average feature grouped by provider 

We have also generated some features from diagnosis codes and CPT codes. When the data is grouped by provider, each provider is treated as a document and each diagnosis code or CPT code as a term, and TF-IDF is computed for each code, since TF-IDF is also a way of extracting information.
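As an illustration of this idea, the sketch below computes TF-IDF over made-up diagnosis codes, one "document" per provider. It uses one common variant, term frequency as `count / total` and inverse document frequency as `log(N / df)`; the exact weighting used in the case study is not stated, so treat the formula and all names here as assumptions.

```python
import math
from collections import Counter

# Made-up diagnosis codes billed by each provider (one "document" per provider)
provider_codes = {
    "P1": ["4019", "4019", "25000"],
    "P2": ["4019", "2724"],
    "P3": ["2724", "2724", "2724"],
}


def tf_idf(docs):
    """Return {provider: {code: tf-idf}} with tf = count/total, idf = log(N/df)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for codes in docs.values():
        doc_freq.update(set(codes))  # count each code once per provider
    scores = {}
    for provider, codes in docs.items():
        counts = Counter(codes)
        total = len(codes)
        scores[provider] = {
            code: (cnt / total) * math.log(n_docs / doc_freq[code])
            for code, cnt in counts.items()
        }
    return scores


scores = tf_idf(provider_codes)
# A code the provider never used gets 0.0, as described for missing dx codes
missing = scores["P2"].get("25000", 0.0)
print(scores, missing)
```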

Modeling - Machine Learning Algorithms

Logistic Regression

from sklearn.linear_model import LogisticRegressionCV
log = LogisticRegressionCV(cv=5, class_weight='balanced', random_state=40)

Train Confusion Matrix : 
 [[125830  23127]
 [ 31012 210778]]
Test Confusion Matrix : 
 [[53924  9915]
 [13431 90194]]
Train Accuracy    :  0.8614474327378073
Test Accuracy     :  0.8605909329766398
Train Sensitivity :  0.8447404284457931
Test Sensitivity  :  0.8446874167828443
Train AUC         :  0.858240184031408
Test AUC          :  0.8575379182828576
Train F1-Score    :  0.8229588716771474
Test F1-Score     :  0.8220497888622955
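To make the relationship between the confusion matrix and the reported scores explicit, the sketch below derives the metrics by hand. It assumes the matrix is laid out as [[TP, FN], [FP, TN]] with the fraud class first; under that assumption it reproduces the train accuracy, sensitivity, and F1-score printed above. The function name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(matrix):
    """Derive summary metrics from a 2x2 matrix assumed to be [[TP, FN], [FP, TN]]."""
    (tp, fn), (fp, tn) = matrix
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)   # recall on the positive (fraud) class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Train confusion matrix reported for logistic regression.
train = metrics_from_confusion([[125830, 23127], [31012, 210778]])
for name, value in train.items():
    print(f"{name:18s} {value:.6f}")
```

The reported AUC values coincide with the mean of sensitivity and specificity computed here. That is what an AUC reduces to when it is computed from hard 0/1 predictions rather than from predicted probabilities, so the AUC figures are best read as balanced accuracy at the chosen threshold.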

Random Forest

## Let's apply Random Forest
rfc = RandomForestClassifier(n_estimators=256, class_weight='balanced', random_state=123, max_depth=6)
rfc.fit(X_train, y_train)  # fit the model

Train Confusion Matrix : 
 [[127246  21711]
 [ 25284 216506]]
Test Confusion Matrix : 
 [[54555  9284]
 [10906 92719]]
Train Accuracy    :  0.8797303626131486
Test Accuracy     :  0.879436774470931
Train Sensitivity :  0.8542465275213652
Test Sensitivity  :  0.8545716568241983
Train AUC         :  0.8748382230228522
Test AUC          :  0.8746633917414116
Train F1-Score    :  0.8441226321532936
Test F1-Score     :  0.8438515081206497
Top 10 features impacting model and their importance score :- 

Variable: ClmCount_Provider                                  Importance: 20.2
Variable: ClmDiagnosisCode_1TF-IDF                           Importance: 10.9
Variable: PerProviderAvg_AdmitForDays                        Importance: 8.6
Variable: ClmDiagnosisCode_1TF                               Importance: 8.5
Variable: PerProviderAvg_DeductibleAmtPaid                   Importance: 6.6
Variable: PerProviderAvg_InscClaimAmtReimbursed              Importance: 5.3
Variable: PerProviderAvg_NoOfMonths_PartACov                 Importance: 5.3
Variable: PerProviderAvg_NoOfMonths_PartBCov                 Importance: 4.4
Variable: PerProviderAvg_IPAnnualReimbursementAmt            Importance: 3.2
Variable: ClmCount_Provider_AttendingPhysician               Importance: 2.9
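A ranking like the one above can be produced from a fitted forest's `feature_importances_` attribute. The sketch below runs on synthetic data with hypothetical feature names, and it assumes the reported scores are the impurity-based importances scaled to percentages; the helper `top_features` is not from the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training data; feature names are hypothetical.
X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rfc = RandomForestClassifier(n_estimators=50, max_depth=6,
                             class_weight="balanced", random_state=123)
rfc.fit(X, y)


def top_features(model, names, k=10):
    """Return the k most important features as (name, percent) pairs."""
    importances = model.feature_importances_  # fractions that sum to 1
    order = np.argsort(importances)[::-1][:k]
    return [(names[i], round(100 * float(importances[i]), 1)) for i in order]


top = top_features(rfc, feature_names)
for name, score in top:
    print(f"Variable: {name:<50} Importance: {score}")
```

Impurity-based importances describe how often and how usefully a feature is used for splits in this particular model, so they are a ranking aid rather than evidence of a causal effect.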

AdaBoost, Decision Tree, and XGBoost Classifiers

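from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

clfs = {'ada': AdaBoostClassifier(), 'dtc': DecisionTreeClassifier(class_weight='balanced'), 'xgb': XGBClassifier(booster='gbtree')}
## Let's fit these models and check their performance

f1_scores = dict()
for clf_name in clfs:
    clf = clfs[clf_name].fit(X_train, y_train)
    # Label a claim as fraud when the predicted probability exceeds 0.5
    y_pred = (clf.predict_proba(X_val)[:, 1] > 0.5).astype(int)
    f1_scores[clf_name] = f1_score(y_val, y_pred)  # f1_score expects (y_true, y_pred)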

Printing F1 score

{'ada': 0.8719151056686082,
 'dtc': 0.9997493812850475,
 'xgb': 0.9999843353488517}


CatBoost

from catboost import CatBoostClassifier, Pool
X_train_p = Pool(data=X_train, label=y_train)  #, cat_features=X_train.columns.values)
model = CatBoostClassifier(n_estimators=100, learning_rate=1, depth=5, loss_function='Logloss')
catboost_numerical = model.fit(X_train_p)

Train Confusion Matrix : 
 [[148912     45]
 [     1 241789]]
Test Confusion Matrix : 
 [[ 63813     26]
 [     3 103622]]
Train Accuracy    :  0.99988227676732
Test Accuracy     :  0.9998268284526823
Train Sensitivity :  0.9996978993937848
Test Sensitivity  :  0.99959272544996
Train AUC         :  0.9998468817867224
Test AUC          :  0.9997818874535686
Train F1-Score    :  0.9998455702151947
Test F1-Score     :  0.9997728251929028
Top 10 features impacting model and their importance score :- 

Variable: ClmCount_Provider                                  Importance: 40.0
Variable: PerProviderAvg_InscClaimAmtReimbursed              Importance: 10.3
Variable: PerProviderAvg_AdmitForDays                        Importance: 9.1
Variable: PerProviderAvg_DeductibleAmtPaid                   Importance: 6.2
Variable: PerProviderAvg_OPAnnualDeductibleAmt               Importance: 6.0
Variable: PerProviderAvg_IPAnnualDeductibleAmt               Importance: 5.6
Variable: PerProviderAvg_OPAnnualReimbursementAmt            Importance: 5.4
Variable: PerProviderAvg_Age                                 Importance: 5.2
Variable: PerProviderAvg_NoOfMonths_PartACov                 Importance: 4.4
Variable: PerProviderAvg_IPAnnualReimbursementAmt            Importance: 3.8


LightGBM

import lightgbm as lgbm
model = lgbm.LGBMClassifier()
lgbm_numerical = model.fit(X_train, y_train)

Train Confusion Matrix : 
 [[148022    935]
 [    20 241770]]
Test Confusion Matrix : 
 [[ 63403    436]
 [    15 103610]]
Train Accuracy    :  0.9975559633215354
Test Accuracy     :  0.9973068838675775
Train Sensitivity :  0.9937230207375283
Test Sensitivity  :  0.9931703190839456
Train AUC         :  0.9968201521653646
Test AUC          :  0.9965127831849161
Train F1-Score    :  0.9967845009579157
Test F1-Score     :  0.9964559906331283
Top 10 features impacting model and their importance score :- 

Variable: ClmCount_Provider                                  Importance: 515
Variable: PerProviderAvg_InscClaimAmtReimbursed              Importance: 355
Variable: PerProviderAvg_DeductibleAmtPaid                   Importance: 270
Variable: PerProviderAvg_OPAnnualReimbursementAmt            Importance: 270
Variable: PerProviderAvg_Age                                 Importance: 248
Variable: PerProviderAvg_OPAnnualDeductibleAmt               Importance: 242
Variable: PerProviderAvg_IPAnnualDeductibleAmt               Importance: 238
Variable: PerProviderAvg_AdmitForDays                        Importance: 211
Variable: PerProviderAvg_IPAnnualReimbursementAmt            Importance: 208
Variable: PerProviderAvg_NoOfMonths_PartBCov                 Importance: 205

Summary Table of all experimented classifiers


  • Of all the models used, XGBoost gives the best F1 score (0.99998) on the validation data
  • The second-best model is CatBoost, with an F1 score of 0.99977 on the validation data
  • DecisionTree is the simplest model, yet it gives much better results than Logistic Regression (LR) and Random Forest (RFC)
  • Linear models do not perform well on this data
  • Tree-based boosting performs well on this data
  • Since XGBoost gives the best results, we will save this model for testing and evaluation in the final notebook

Future work

  • With higher-quality data (some of the information in this dataset is dummy data), we could build a more trustworthy model on real data
  • If the provider information in the dataset were real, we could use the List of Excluded Individuals and Entities (LEIE) to label fraud providers
  • We can vectorize the diagnosis and procedure columns through embeddings
  • We can also train a deep learning (DL) model, which would have more opportunity to learn patterns
  • We can also use the CMS Medicare PUF datasets (Part B, Part D, and DMEPOS) to train on real data



Streamlit Deployed App