Provider fraud is one of the biggest problems facing Medicare. According to the government, total Medicare spending has increased sharply because of fraudulent claims.
Healthcare fraud is often an organized crime: groups of providers, physicians, and beneficiaries act together to file fraudulent claims. Rigorous analysis of Medicare data has identified many physicians who engage in fraud, for example by using ambiguous diagnosis codes to bill for the costliest procedures and drugs.
Insurance companies are the institutions most affected by these practices. In response they raise premiums, and as a result healthcare becomes more expensive every day. Healthcare fraud and abuse take many forms. Some of the most common types of fraud by providers are:
a) Billing for services that were not provided.
b) Duplicate submission of a claim for the same service.
c) Misrepresenting the service provided.
d) Charging for more complex or expensive services than were actually provided.
e) Billing for a covered service when the service actually provided was not covered.
Provider (Healthcare Provider) - an individual doctor or a hospital facility that provides medical services to patients.
Medicare - an affordable US federal health insurance program for older people and people with disabilities.
Claims - in the US, a healthcare provider submits an electronic claim file for each insured patient to the patient's health insurer. The file contains the patient's conditions, the services provided and their charges, and the diagnosis, along with patient and provider details. Every piece of information about the patient's diagnosis and the services provided is coded according to standard American medical coding conventions.
The goal of this project is to "predict potentially fraudulent providers" based on the claims filed by them, using machine learning algorithms. Along with this, we will also discover important features that help detect the behavior of potentially fraudulent providers. Further, we will study fraudulent patterns in providers' claims to understand their future behavior.
For the purpose of this project, we are considering Inpatient and Outpatient claims and Beneficiary details of customers. Let's see their details:
A) Inpatient Data
This data provides insights into the claims filed for those patients who are admitted to the hospitals. It contains the services provided by the provider. It also provides additional details like their admission and discharge dates and admit diagnosis codes.
B) Outpatient Data
This data provides details about the claims filed for those patients who visit hospitals and were not admitted. Outpatient claims can be filed for many services like surgeries, emergency services, observations, therapies, etc.
C) Beneficiary Details Data
This data contains patients’ details like DOB, DOD, Gender, race, pre-existing health conditions and diseases indicator, the region they belong to, etc.
D) Train/Test Data
There is also train data containing the list of providers labeled as potential fraud or non-fraud. We will use it to label the datasets above for supervised learning. The test dataset contains only the provider list; our task is to predict which of those providers are potentially fraudulent.
This research paper addresses Medicaid healthcare provider fraud detection. For effective fraud detection, one has to look at the data beyond the transaction level. The paper builds on fraud type classifications and the Medicaid environment to develop a Medicaid multidimensional schema, along with a set of multidimensional data models and analysis techniques that help predict the likelihood of fraudulent activities.
Within the healthcare system three main parties commit fraud: healthcare providers, beneficiaries (patients), and insurance carriers. According to Sparrow there are two different types of fraud: “hit-and-run” and “steal a little, all the time”. “Hit-and-run” perpetrators simply submit many fraudulent claims, receive payment, and disappear. “Steal a little, all the time” perpetrators work to ensure fraud goes unnoticed and bill fraudulently over a long period of time. The provider may hide false claims within large batches of valid claims and, when caught, will claim it was an error, repay the money, and continue the behavior.
The FBI highlights and categorizes some of the most prevalent known Medicaid fraud schemes:
Sparrow proposes that for effective fraud detection one has to look at the data beyond the transaction level, defining seven levels of healthcare fraud control.
Levels of healthcare fraud control, adapted from Sparrow:
A general core that exists in every claim can be extracted from the different claim forms: patient, provider, diagnoses, procedures, and amounts charged. For each claim line, a type field links to type-specific detailed information. Based on the desired views, the following dimensions are included: date (claim filed, service, paid), provider (executing, referring, billing), patient, insurer policy, treatment, diagnosis, claim type, drug, outcome, and location.
The following numeric facts can be distinguished, some computed from the other facts: Covered charges ($), Non-covered charges ($), Total charges ($), Units of service, Number of days between claim filed and paid, Number of days between service and claim paid, Distance between provider and patient, Number of days between service and claim filed, Covered price per unit, Total price per unit, and Treatment duration. The figure shows the resulting multidimensional schema.
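As a rough illustration, several of these derived facts can be computed directly from the base facts; the column names below are placeholders, not the paper's exact schema:

import pandas as pd

# Toy claim-line table with base facts from the schema (placeholder column names).
claims = pd.DataFrame({
    "covered_charges": [800.0, 1200.0],
    "total_charges": [1000.0, 1500.0],
    "units_of_service": [4, 5],
    "service_date": pd.to_datetime(["2009-01-05", "2009-02-10"]),
    "claim_filed_date": pd.to_datetime(["2009-01-20", "2009-02-12"]),
    "claim_paid_date": pd.to_datetime(["2009-02-01", "2009-03-01"]),
})

# Derived facts computed from the base facts.
claims["covered_price_per_unit"] = claims["covered_charges"] / claims["units_of_service"]
claims["total_price_per_unit"] = claims["total_charges"] / claims["units_of_service"]
claims["days_filed_to_paid"] = (claims["claim_paid_date"] - claims["claim_filed_date"]).dt.days
claims["days_service_to_filed"] = (claims["claim_filed_date"] - claims["service_date"]).dt.days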
Data models addressing levels of fraud:
Building on this paper, these levels are also useful for detecting fraud in our data: we can construct level-specific data by grouping the claims at the corresponding level, e.g. by provider, by patient, or by patient-provider pair.
Some of the levels cannot be built because of dataset limitations; for example, level 4 requires insurer information, which is unavailable in the dataset. We will implement only those levels whose required features are available.
In this paper, we conduct a systematic literature study to analyze the applicability of existing electronic fraud detection techniques in industries similar to the US Medicaid program. Although the study targets Medicaid, the ideas generalize to Medicare as well, since it is also a government-funded healthcare program.
Healthcare fraud in the United States is a severe problem that costs the government billions of dollars each year. Roughly one-third of all US healthcare costs are attributable to fraud, waste, and abuse. Third-party payers for healthcare services (insurance companies and government-run programs) must deal with fraudulent practitioners, organized criminal schemes, and honest providers who make unintended mistakes while billing for their legitimate services.
Medicaid and Medicare are two government programs that provide medical and health-related services to specific groups of people in the United States. Medicare is a federal program that has consistent rules across the fifty states and covers almost everyone 65 years of age or older, as well as people with disabilities. Medicaid is a state-administered program in which each state provides a unique healthcare program for people with low or no income.
Types of fraud in the Healthcare industry:
The structured literature review about fraud detection systems in several industries resulted in an overview of applied fraud detection techniques:
The credit card and telecommunications industries possess real-time data, resolve reported cases of fraud quickly, and, as such, are able to maintain high-quality databases of labeled data that can be used for supervised learning. Medicaid data is dispersed and unlabeled, and there are no signals that this will change in the near future. Multiple stakeholders at the federal and local levels, misaligned incentives, and fragmented responsibility hamper the process of labeling and sharing data. Thus, supervised learning techniques are severely restricted.
Supervised classification models are particularly appropriate for use in health care fraud, as they can be trained and adjusted to detect sophisticated and evolving fraud schemes. In the credit card industry, supervised classification techniques like neural networks, support vector machines, and random forests form the basis for sophisticated and effective fraud detection. The drawback to these techniques is that new fraud schemes are not immediately detectable due to the lag of discovering and labeling new fraud in training data. In the telecommunications industry, unsupervised techniques such as profiling and anomaly detection are applied to complement supervised learning. In the telecommunications industry, extensive, high-quality data is available that is used to construct accurate profiles.
Insurance fraud spuriously inflates the cost of Healthcare. We use Medicare claims data as input to various algorithms to gauge their performance in fraud detection. The claims data contain categorical features, some of which have thousands of possible values. To the best of our knowledge, this is the first study on using CatBoost and LightGBM to encode categorical data for Medicare fraud detection.
The data source is the publicly available claims data that CMS publishes annually. Moreover, there are data on providers that are prohibited from billing Medicare, known as the List of Excluded Individuals and Entities (LEIE). In this research, the LEIE data were used to label the claims data as possible fraud or non-fraud. The dataset is highly imbalanced, so under-sampling techniques were used to address the class imbalance.
Here we use three different types of classifiers. In previous research, we obtained strong results with GBDTs, so we use three types of GBDTs: CatBoost, XGBoost, and LightGBM. To provide a context for comparison, we include Random Forest and Logistic Regression.
Since CatBoost and LightGBM automatically handle categorical features without needing to pre-process data, we are able to easily use the HCPCS code, provider gender, provider type, provider state, drug brand name or drug generic name as features for CatBoost and LightGBM. As a result, we are able to use a dataset having fewer features.
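A minimal sketch of how CatBoost can consume raw categorical columns directly, without one-hot encoding; the toy data and column names below are placeholders for features like the HCPCS code, provider type, and provider state:

from catboost import CatBoostClassifier, Pool
import pandas as pd

# Toy data: categorical columns stay as raw strings, no one-hot encoding needed.
df = pd.DataFrame({
    "hcpcs_code":     ["99213", "J3490", "99214", "99213", "J3490", "99215"],
    "provider_state": ["FL", "TX", "FL", "NY", "TX", "FL"],
    "provider_type":  ["Internal Medicine", "Pharmacy", "Internal Medicine",
                       "Cardiology", "Pharmacy", "Cardiology"],
    "label":          [0, 1, 0, 1, 1, 0],
})
cat_cols = ["hcpcs_code", "provider_state", "provider_type"]

# CatBoost handles the categorical columns natively via cat_features.
pool = Pool(data=df[cat_cols], label=df["label"], cat_features=cat_cols)
model = CatBoostClassifier(n_estimators=50, depth=3, verbose=False)
model.fit(pool)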
To make a fair baseline comparison, we decided to use hyper-parameters as close to default values as possible for all classifiers. We do not attempt any hyper-parameter tuning. We use the Python application programming interfaces (APIs) for all classifiers. The Python version we use is 3.7.3. We use one GPU to fit both CatBoost and XGBoost models. We use Python Scikit-learn [28] for stratified fivefold cross-validation and AUC calculation.
Results and Comparisons of different classifiers:
https://www.kaggle.com/code/rohitrox/medical-provider-fraud-detection
There are many resources on this topic and dataset; I used this resource for a detailed summary because it also covers the other available solutions.
In the Beneficiary data there are two columns, one for DOB and the other for DOD. Both are used to compute the patient's age and an indicator of whether the patient is deceased.
In the Inpatient data there are two columns, claim start date and claim end date. Using these, a new column is created for the number of days the patient was admitted to the hospital. A minimal sketch of these date-derived features follows.
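A minimal sketch of these date-derived features, assuming the column names used later in this write-up (DOB, DOD, ClaimStartDt, ClaimEndDt, AdmitForDays); the file names, the reference date, and the WhetherDead indicator name are assumptions:

import pandas as pd

bene = pd.read_csv("Train_Beneficiarydata.csv")        # placeholder file name
inpatient = pd.read_csv("Train_Inpatientdata.csv")     # placeholder file name

bene["DOB"] = pd.to_datetime(bene["DOB"])
bene["DOD"] = pd.to_datetime(bene["DOD"])
# Age at death if deceased, otherwise age at an assumed reference date.
ref_date = bene["DOD"].fillna(pd.Timestamp("2009-12-01"))
bene["Age"] = (ref_date - bene["DOB"]).dt.days // 365
bene["WhetherDead"] = bene["DOD"].notna().astype(int)  # 1 if a date of death exists

inpatient["ClaimStartDt"] = pd.to_datetime(inpatient["ClaimStartDt"])
inpatient["ClaimEndDt"] = pd.to_datetime(inpatient["ClaimEndDt"])
inpatient["AdmitForDays"] = (inpatient["ClaimEndDt"] - inpatient["ClaimStartDt"]).dt.days + 1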
Since the Inpatient and Outpatient data have similar columns, they are merged into a single dataset.
The Beneficiary data is also merged with this dataset on BeneID, so all three datasets are combined into a single dataset containing the union of all columns.
Now this dataset is merged with the Train data on Provider so that each record is labeled 0 or 1 (non-fraud or fraud); the Train data contains the provider list with fraud and non-fraud labels. A rough sketch of these merges is shown below.
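A rough sketch of these merges (the file names are placeholders; BeneID, Provider, and PotentialFraud are column names used elsewhere in this write-up):

import pandas as pd

inpatient = pd.read_csv("Train_Inpatientdata.csv")     # placeholder file names
outpatient = pd.read_csv("Train_Outpatientdata.csv")
bene = pd.read_csv("Train_Beneficiarydata.csv")
train_labels = pd.read_csv("Train.csv")                # columns: Provider, PotentialFraud

# Union of inpatient and outpatient claims (they share most columns).
claims = pd.concat([inpatient, outpatient], ignore_index=True, sort=False)

# Attach beneficiary details, then the provider-level fraud label.
claims = claims.merge(bene, on="BeneID", how="left")
claims = claims.merge(train_labels, on="Provider", how="left")

# Label as 0/1 (non-fraud / fraud).
claims["PotentialFraud"] = claims["PotentialFraud"].map({"No": 0, "Yes": 1})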
Transactional data (merged data):
Train data:
The train and test data are merged so that features like averages can be engineered more accurately; the test portion is still used only for evaluation.
This merged data is grouped by provider and the average of some columns is computed, for example:
InscClaimAmtReimbursed
DeductibleAmtPaid
IPAnnualReimbursementAmt
PerProviderAvg_OPAnnualReimbursementAmt
PerProviderAvg_NoOfMonths_PartACov
AdmitForDays
In the same way, the merged data is grouped by some other columns: BeneID, AttendingPhysician, OperatingPhysician, diagnosis codes 1/2/3..., and claim procedure codes 1/2/3...
This averaging is done to turn high-cardinality categorical features into usable numeric values.
The merged data is also grouped by combinations of two or more columns, since fraud involves the joint behavior of provider, customer, attending physician, etc., so these averages may add useful signal for fraud detection. Some of the group-by combinations are listed below, followed by a minimal sketch:
Provider and BeneID
Provider and Attending physician
Provider and diagnosis code group 1, 2, 3..
Provider, BeneID and Attending physician
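A minimal sketch of such grouped averages using pandas transform, assuming the merged claim-level data is in raw_data as later in this write-up; the new feature names are illustrative, the source columns come from the dataset:

# Average reimbursement per (Provider, BeneID) and per (Provider, AttendingPhysician) pair.
raw_data["PerProviderBeneAvg_InscClaimAmtReimbursed"] = (
    raw_data.groupby(["Provider", "BeneID"])["InscClaimAmtReimbursed"].transform("mean")
)
raw_data["PerProviderAttPhysAvg_InscClaimAmtReimbursed"] = (
    raw_data.groupby(["Provider", "AttendingPhysician"])["InscClaimAmtReimbursed"].transform("mean")
)
# The same pattern applies to (Provider, ClmDiagnosisCode_1) and
# (Provider, BeneID, AttendingPhysician) groupings.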
Grouping diagnosis codes by their first two characters may not be a wise choice here, as 120+ groups would be created, i.e. 120+ extra dimensions for each diagnosis code column, which would increase the computational complexity.
All numerical columns have their NA values filled with 0.
Now some columns are removed from the data, either because they are not useful or because they have already been converted into averaged or other derived features. The removed columns are:
remove_these_columns=['BeneID', 'ClaimID', 'ClaimStartDt','ClaimEndDt','AttendingPhysician',
'OperatingPhysician', 'OtherPhysician', 'ClmDiagnosisCode_1', 'ClmDiagnosisCode_2',
'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4', 'ClmDiagnosisCode_5', 'ClmDiagnosisCode_6',
'ClmDiagnosisCode_7', 'ClmDiagnosisCode_8', 'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10',
'ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3','ClmProcedureCode_4',
'ClmProcedureCode_5', 'ClmProcedureCode_6', 'ClmAdmitDiagnosisCode', 'AdmissionDt',
'DischargeDt', 'DiagnosisGroupCode','DOB', 'DOD', 'State', 'County']
Gender and Race are categorical, so they are converted into 0/1 indicator features.
The target value is also replaced with 0 and 1 instead of No and Yes.
Now 'sum' aggregations are computed on the data grouped by Provider and PotentialFraud.
Features and labels are split into X and y.
Standardization is applied to the entire data.
X and y are split into train and validation sets. A minimal sketch of these preprocessing steps is shown below.
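A minimal sketch of these preprocessing steps, assuming the merged, feature-engineered claim data is in a DataFrame named data (the test_size and random_state values are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Gender and Race as 0/1 indicators; target mapped to 0/1.
data = pd.get_dummies(data, columns=["Gender", "Race"])
data["PotentialFraud"] = data["PotentialFraud"].map({"No": 0, "Yes": 1})

# Sum aggregation per provider (the label is constant within a provider).
provider_df = data.groupby(["Provider", "PotentialFraud"], as_index=False).sum(numeric_only=True)

# Split features/labels, standardize, then split into train and validation sets.
y = provider_df["PotentialFraud"]
X = provider_df.drop(columns=["Provider", "PotentialFraud"])
X_scaled = StandardScaler().fit_transform(X)
X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y)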
Several classifiers are trained and their performance compared.
Logistic Regression
The data is trained with Logistic Regression using balanced class weights and cross-validation, since the data is imbalanced. The model's performance in terms of various metrics:
Accuracy Train: 0.922365988909427
Accuracy Val: 0.9125077017868145
Sensitivity Train : 0.7627118644067796
Sensitivity Val: 0.6776315789473685
Specificity Train: 0.9388290125254879
Specificity Val: 0.9367777022433719
Kappa Value : 0.5438304105142315
AUC : 0.8072046405953702 (threshold 0.60)
F1-Score Train : 0.6474820143884892
F1-Score Val : 0.5919540229885056
We can see accuracy is very high for Logistic regression.
Random Forest
Next, an ensemble Random Forest is trained on the same dataset. The classifier is initialized with 500 base learners and max depth 4; a hedged sketch of that configuration follows, and the evaluation scores are listed after it.
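A hedged sketch of the described configuration (500 trees, max depth 4); the balanced class weights and random_state are assumptions, not necessarily what the referenced notebook used:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, max_depth=4,
                            class_weight="balanced",   # assumption: offset class imbalance
                            random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)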
Accuracy Train : 0.8885661473461843
Accuracy Test : 0.8712261244608749
Sensitivity : 0.8157894736842105
Specificity : 0.8769544527532291
Kappa Value : 0.47733173495472203
AUC : 0.8463719632187199
F1-Score Train 0.6026365348399246
F1-Score Validation : 0.5426695842450766
Accuracy and F1 score are not better than LR, although sensitivity is better. Overall, LR performs better here.
These are the top 20 important features and their scores:
Variable: PerProviderAvg_InscClaimAmtReimbursed Importance: 0.08
Variable: InscClaimAmtReimbursed Importance: 0.07
Variable: PerAttendingPhysicianAvg_InscClaimAmtReimbursed Importance: 0.07
Variable: PerOperatingPhysicianAvg_InscClaimAmtReimbursed Importance: 0.06
Variable: PerClmAdmitDiagnosisCodeAvg_InscClaimAmtReimbursed Importance: 0.04
Variable: PerClmAdmitDiagnosisCodeAvg_DeductibleAmtPaid Importance: 0.04
Variable: PerClmDiagnosisCode_1Avg_DeductibleAmtPaid Importance: 0.04
Variable: PerOperatingPhysicianAvg_IPAnnualReimbursementAmt Importance: 0.03
Variable: ClmCount_Provider_ClmDiagnosisCode_7 Importance: 0.03
Variable: ClmCount_Provider_ClmDiagnosisCode_8 Importance: 0.03
Variable: ClmCount_Provider_ClmDiagnosisCode_9 Importance: 0.03
Variable: DeductibleAmtPaid Importance: 0.02
Variable: AdmitForDays Importance: 0.02
Variable: PerProviderAvg_DeductibleAmtPaid Importance: 0.02
Variable: PerAttendingPhysicianAvg_DeductibleAmtPaid Importance: 0.02
Some more classifiers are tried, such as:
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

clfs = {
    'svm1': SVC(C=0.01, kernel='linear', probability=True),
    'svm2': SVC(C=0.01, kernel='rbf', probability=True),
    'svm3': SVC(C=0.01, kernel='poly', degree=2, probability=True),
    'ada': AdaBoostClassifier(),
    'dtc': DecisionTreeClassifier(class_weight='balanced'),
    'gbc': GradientBoostingClassifier(),
    'lr': LogisticRegression(class_weight='balanced'),
    'xgb': XGBClassifier(booster='gbtree')
}
All of them are trained on the same dataset; the performance results (F1 scores) are listed below:
{'ada': 0.5309090909090909,
'dtc': 0.42857142857142855,
'gbc': 0.5873015873015873,
'lr': 0.5974683544303796,
'svm1': 0.5069124423963134,
'svm2': 0.5483870967741936,
'svm3': 0.41791044776119407,
'xgb': 0.5477178423236515}
We can see that the linear LR model beats all the other models in terms of F1 score. Since the sensitivity score is also high, the trained model is not simply predicting the majority class.
Those were the experiments done in this solution. The sample solution also trains a DL model and evaluates it on the provider dataset; since this case study is limited to machine learning, that model is not discussed here.
The claim diagnosis code is very important in fraud activity. For example, if a provider files a claim for a cold, it is hard to verify whether the patient really had a cold a few days ago, so providers may use such codes frequently for reimbursement. The key idea is that similar diseases usually share the same first two characters of the diagnosis (dx) code. So, in feature engineering, I would group diagnosis codes into groups formed by the first two characters of the dx code. After that, I would create new features by aggregating other numerical features (mean or some other statistic) per diagnosis group. Below are two bar graphs: (1) the top 10 diagnosis groups involved and (2) the top 10 individual diagnoses involved. There is some overlap, but the top 3 diagnosis code groups are quite different, so features built on diagnosis code groups may also improve model performance. A minimal sketch of the grouping idea follows.
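A minimal sketch of this grouping idea, using column names from this write-up (the new column names are illustrative):

# Group diagnosis codes by their first two characters, then aggregate per group.
raw_data["DxGroup_1"] = raw_data["ClmDiagnosisCode_1"].astype(str).str[:2]
raw_data["PerDxGroup1Avg_InscClaimAmtReimbursed"] = (
    raw_data.groupby("DxGroup_1")["InscClaimAmtReimbursed"].transform("mean")
)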
Since we train and test after grouping by provider, I have another feature engineering idea: applying the TF-IDF idea to dx codes. When we group by provider, each provider row can be treated as a document and each dx code used by the provider as a word. For missing dx codes we would use a TF-IDF value of 0.0. This transformation is done separately for each dx code column: the total number of rows after grouping by provider is the total number of documents, and each dx code in each dx column is a word. If fraudulent providers use specific dx codes in a pattern, their TF-IDF scores should be similar. We can also generate more TF-IDF-like features on dx codes by grouping on other entities (beneficiary, attending physician, etc.), i.e. on any entity that could plausibly be involved in fraud activity (domain knowledge).
See below the example of a single provider; it can be thought of as a single document. Each dx in dx code column 1 is a word, and the total number of unique dx codes in that column is the vocabulary. We can repeat this for each dx code column, generating 10 new features for each grouping entity. A minimal sketch of this transformation follows.
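A minimal sketch of the TF-IDF computation for one dx code column using scikit-learn; in practice this would be repeated for each dx code column and the resulting values joined back as provider-level features (variable names are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

# One "document" per provider: all ClmDiagnosisCode_1 values used by that provider.
docs = (raw_data.dropna(subset=["ClmDiagnosisCode_1"])
        .groupby("Provider")["ClmDiagnosisCode_1"]
        .apply(lambda codes: " ".join(codes.astype(str))))

# Each dx code is a "word"; the token pattern keeps alphanumeric codes intact.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
tfidf_matrix = vectorizer.fit_transform(docs)   # rows: providers, columns: unique dx codes

# TF-IDF weights of every dx code for the first provider (missing codes stay 0.0).
first_provider_weights = tfidf_matrix[0].toarray().ravel()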
In the previous solution there were no features specific to individual dx codes; all features were generated by grouping on the dx codes and aggregating means of other numerical features.
All ML models should be fine-tuned with different hyperparameters; that could give better results.
We can also train a model on the full, ungrouped data, where there is more opportunity to learn patterns. When we group by provider, we have to aggregate every feature with some statistic (sum, mean, median, etc.), so some information or pattern may be lost in the aggregation.
We will implement the different levels of fraud analysis from research paper 1 above. Seven levels are discussed, but we will implement only those for which data is available; given the available data, up to level 4 can be implemented.
As per the 3rd research paper discussed above, we will try different GBDT variations (CatBoost and LightGBM) as well as other classifiers such as Random Forest, SVM, and Logistic Regression, and compare the performance of each.
Plotting the frequencies of the fraud and non-fraud classes in the claim transactional data
Percent Distribution of Potential Fraud class:-
0 61.878931
1 38.121069
Name: PotentialFraud, dtype: float64
Observation: about 38% of the claim transactions belong to providers labeled as potential fraud, so the transaction-level data is moderately imbalanced.
Plotting the frequencies of fraud and non-fraud providers in provider-labeled data
Percent Distribution of Potential Fraud class:-
No 90.64695
Yes 9.35305
Name: PotentialFraud, dtype: float64
Observation: only about 9.4% of providers are labeled as potential fraud, so the provider-level data is highly imbalanced.
State-wise Beneficiary involved in Healthcare Fraud
Observation:
Race-wise Beneficiary involved in Healthcare Fraud
Observation:
There is no Race 4 data.
Race 1 beneficiaries have the most fraud cases and also the most non-fraud cases.
There is no clear race-based pattern separating fraud and non-fraud cases, so this feature may have low importance.
Individual Beneficiaries involved in Healthcare Fraud
Observation:
Overall Observation:
Observations: With a large number of data points, this plot shows that many fraud transactions are linearly separable. Fraud providers mostly have 0 admit days while the number of claims they file is high; another pattern is fraud providers with admit days mostly between 5 and 10 who also file a high number of claims. In simple terms, fraud providers bill most of their claims with either 0 admit days or 5-10 admit days.
Observations: Pair plots are plotted three times, each between 3 randomly chosen features from the dataset; since the number of features is high, we cannot plot them all. A minimal sketch of how such a pair plot can be drawn follows.
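A minimal sketch of how such a pair plot over three randomly chosen features could be drawn with seaborn, assuming a provider-level frame like provider_df from the preprocessing sketch above:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Pick 3 random feature columns (excluding identifiers and the label) and color by class.
feature_cols = [c for c in provider_df.columns if c not in ("Provider", "PotentialFraud")]
chosen = list(np.random.choice(feature_cols, size=3, replace=False))
sns.pairplot(provider_df[chosen + ["PotentialFraud"]], hue="PotentialFraud", diag_kind="kde")
plt.show()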
After analyzing all the pair plots, we can conclude that:
the features ClmProcedureCode_2TF and PerClmProcedureCode_1Avg_InscClaimAmtReimbursed show good clustering of fraud and non-fraud providers
fraud providers mostly bill claims for patients with ChronicCond_stroke = 1
fraud providers tend to bill more claims than non-fraud providers using ClmDiagnosisCode_6
some features show random behavior in the plots but may still carry some importance for classification
Observations: Many fraud providers have a different IQR range than non_fraud, so this feature may help in classifying for some fraud providers
Observations: For this feature (ClmCount_Provider), fraud and non-fraud providers have very different IQR ranges, so it may be one of the most important features for classification.
Observations: The heatmap is plotted for 10 randomly chosen features; plotting all features would not be readable. The heatmap shows the correlation between pairs of features: values close to 1 mean a strong positive correlation (as x increases, y also increases), and values close to -1 mean a strong negative correlation (as x increases, y decreases).
3rd feature on x-axis is highly correlated with 4th and 7th feature on y-axis
4th feature on x-axis is highly correlated with 4th and 8th feature on y-axis
7th feature on x-axis is highly correlated with 7th and 8th feature on y-axis
Observations: In 3D space of 3 features most of the fraud data points are linearly separable
We can always engineer additional features from existing ones using domain knowledge, statistical methods, mathematical transformations, etc. In this case study we have engineered many features using statistics like average and count. Some of the features engineered are shown below:
#average feature grouped by provider
raw_data["PerProviderAvg_InscClaimAmtReimbursed"]=
raw_data.groupby('Provider')['InscClaimAmtReimbursed'].transform('mean')
raw_data["PerProviderAvg_DeductibleAmtPaid"]=
raw_data.groupby('Provider')['DeductibleAmtPaid'].transform('mean')
raw_data["PerProviderAvg_IPAnnualReimbursementAmt"]=
raw_data.groupby('Provider')['IPAnnualReimbursementAmt'].transform('mean')
raw_data["PerProviderAvg_IPAnnualDeductibleAmt"]=
raw_data.groupby('Provider')['IPAnnualDeductibleAmt'].transform('mean')
raw_data["PerProviderAvg_OPAnnualReimbursementAmt"]=
raw_data.groupby('Provider')['OPAnnualReimbursementAmt'].transform('mean')
raw_data["PerProviderAvg_OPAnnualDeductibleAmt"]=
raw_data.groupby('Provider')['OPAnnualDeductibleAmt'].transform('mean')
raw_data["PerProviderAvg_Age"]=
raw_data.groupby('Provider')['Age'].transform('mean')
raw_data["PerProviderAvg_NoOfMonths_PartACov"]=
raw_data.groupby('Provider')['NoOfMonths_PartACov'].transform('mean')
raw_data["PerProviderAvg_NoOfMonths_PartBCov"]=
raw_data.groupby('Provider')['NoOfMonths_PartBCov'].transform('mean')
raw_data["PerProviderAvg_AdmitForDays"]=
raw_data.groupby('Provider')['AdmitForDays'].transform('mean')
We have also generated features from the diagnosis and procedure (CPT) codes: when the data is grouped by provider, each diagnosis/CPT code is treated as a term and each provider as a document, and TF-IDF is calculated for each diagnosis/CPT code, since TF-IDF is another way of extracting information.
Logistic Regression
from sklearn.linear_model import LogisticRegressionCV

log = LogisticRegressionCV(cv=5, class_weight='balanced', random_state=40)
log.fit(X_train, y_train)
Train Confusion Matrix :
[[125830 23127]
[ 31012 210778]]
Test Confusion Matrix :
[[53924 9915]
[13431 90194]]
Train Accuracy : 0.8614474327378073
Test Accuracy : 0.8605909329766398
Train Sensitivity : 0.8447404284457931
Test Sensitivity : 0.8446874167828443
Train AUC : 0.858240184031408
Test AUC : 0.8575379182828576
Train F1-Score : 0.8229588716771474
Test F1-Score : 0.8220497888622955
Random Forest
## Lets Apply Random Forest
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=256, class_weight='balanced', random_state=123, max_depth=6)
rfc.fit(X_train, y_train)  # fit the model
Train Confusion Matrix :
[[127246 21711]
[ 25284 216506]]
Test Confusion Matrix :
[[54555 9284]
[10906 92719]]
Train Accuracy : 0.8797303626131486
Test Accuracy : 0.879436774470931
Train Sensitivity : 0.8542465275213652
Test Sensitivity : 0.8545716568241983
Train AUC : 0.8748382230228522
Test AUC : 0.8746633917414116
Train F1-Score : 0.8441226321532936
Test F1-Score : 0.8438515081206497
Top 10 features impacting model and their importance score :-
Variable: ClmCount_Provider Importance: 20.2
Variable: ClmDiagnosisCode_1TF-IDF Importance: 10.9
Variable: PerProviderAvg_AdmitForDays Importance: 8.6
Variable: ClmDiagnosisCode_1TF Importance: 8.5
Variable: PerProviderAvg_DeductibleAmtPaid Importance: 6.6
Variable: PerProviderAvg_InscClaimAmtReimbursed Importance: 5.3
Variable: PerProviderAvg_NoOfMonths_PartACov Importance: 5.3
Variable: PerProviderAvg_NoOfMonths_PartBCov Importance: 4.4
Variable: PerProviderAvg_IPAnnualReimbursementAmt Importance: 3.2
Variable: ClmCount_Provider_AttendingPhysician Importance: 2.9
AdaBoost, Decision Tree, and XGBoost classifiers
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

clfs = {'ada': AdaBoostClassifier(),
        'dtc': DecisionTreeClassifier(class_weight='balanced'),
        'xgb': XGBClassifier(booster='gbtree')}

## Lets Fit These models and check their performance
all_clfs = []
f1_scores = dict()
for clf_name in clfs:
    print(clf_name)
    clf = clfs[clf_name]
    clf.fit(X_train, y_train)
    all_clfs.append(clf)
    y_pred = (clf.predict_proba(X_val)[:, 1] > 0.5).astype(int)
    f1_scores[clf_name] = f1_score(y_val, y_pred)  # f1_score expects (y_true, y_pred)
Printing F1 score
{'ada': 0.8719151056686082,
'dtc': 0.9997493812850475,
'xgb': 0.9999843353488517}
CatBoost
from catboost import CatBoostClassifier, Pool
X_train_p = Pool(data=X_train, label=y_train)#, cat_features=X_train.columns.values)
model = CatBoostClassifier(n_estimators=100, learning_rate=1, depth=5, loss_function='Logloss')
catboost_numerical = model.fit(X_train_p)
Train Confusion Matrix :
[[148912 45]
[ 1 241789]]
Test Confusion Matrix :
[[ 63813 26]
[ 3 103622]]
Train Accuracy : 0.99988227676732
Test Accuracy : 0.9998268284526823
Train Sensitivity : 0.9996978993937848
Test Sensitivity : 0.99959272544996
Train AUC : 0.9998468817867224
Test AUC : 0.9997818874535686
Train F1-Score : 0.9998455702151947
Test F1-Score : 0.9997728251929028
Top 10 features impacting model and their importance score :-
Variable: ClmCount_Provider Importance: 40.0
Variable: PerProviderAvg_InscClaimAmtReimbursed Importance: 10.3
Variable: PerProviderAvg_AdmitForDays Importance: 9.1
Variable: PerProviderAvg_DeductibleAmtPaid Importance: 6.2
Variable: PerProviderAvg_OPAnnualDeductibleAmt Importance: 6.0
Variable: PerProviderAvg_IPAnnualDeductibleAmt Importance: 5.6
Variable: PerProviderAvg_OPAnnualReimbursementAmt Importance: 5.4
Variable: PerProviderAvg_Age Importance: 5.2
Variable: PerProviderAvg_NoOfMonths_PartACov Importance: 4.4
Variable: PerProviderAvg_IPAnnualReimbursementAmt Importance: 3.8
LightGBM
import lightgbm as lgbm
model = lgbm.LGBMClassifier()
lgbm_numerical = model.fit(X_train, y_train)
Train Confusion Matrix :
[[148022 935]
[ 20 241770]]
Test Confusion Matrix :
[[ 63403 436]
[ 15 103610]]
Train Accuracy : 0.9975559633215354
Test Accuracy : 0.9973068838675775
Train Sensitivity : 0.9937230207375283
Test Sensitivity : 0.9931703190839456
Train AUC : 0.9968201521653646
Test AUC : 0.9965127831849161
Train F1-Score : 0.9967845009579157
Test F1-Score : 0.9964559906331283
Top 10 features impacting model and their importance score :-
Variable: ClmCount_Provider Importance: 515
Variable: PerProviderAvg_InscClaimAmtReimbursed Importance: 355
Variable: PerProviderAvg_DeductibleAmtPaid Importance: 270
Variable: PerProviderAvg_OPAnnualReimbursementAmt Importance: 270
Variable: PerProviderAvg_Age Importance: 248
Variable: PerProviderAvg_OPAnnualDeductibleAmt Importance: 242
Variable: PerProviderAvg_IPAnnualDeductibleAmt Importance: 238
Variable: PerProviderAvg_AdmitForDays Importance: 211
Variable: PerProviderAvg_IPAnnualReimbursementAmt Importance: 208
Variable: PerProviderAvg_NoOfMonths_PartBCov Importance: 205
Observation:
https://github.com/microsoft/LightGBM/tree/master/examples/binary_classification
https://aisel.aisnet.org/cgi/viewcontent.cgi?article=1283&context=amcis2011_submissions
https://link.springer.com/content/pdf/10.1007/s42979-021-00655-z.pdf
https://github.com/mdmazharali786/Medicare_fraud_app
https://mdmazharali786-medicare-fraud-app-fraud-pred-app-lw6rji.streamlitapp.com/