This is the second and last part of my series on Anomaly Detection using Machine Learning. If you haven't already, I recommend you read my first article, which will introduce you to Anomaly Detection and its applications in the business world.

In this article, I will take you through a case study focused on Credit Card Fraud Detection. It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. So the main task is to identify fraudulent credit card transactions by using machine learning. We are going to use a Python library called PyOD, which is specifically developed for anomaly detection purposes.

What is the PyOD Library?

PyOD is a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data. It has around 20 outlier detection algorithms (supervised and unsupervised). PyOD is developed with a comprehensive API to support multiple techniques, and you can take a look at the official documentation of PyOD for details.

If you are an anomaly detection professional or you want to learn more about anomaly detection, then I recommend you try using the PyOD toolkit.

Features of PyOD

PyOD has useful features such as:

- Unified APIs, detailed documentation, and interactive examples across various algorithms.
- Advanced models, including Neural Networks/Deep Learning and Outlier Ensembles.
- Optimized performance with JIT and parallelization when possible, using numba and joblib.
- Compatible with both Python 2 & 3 (support for Python 2 ended in January 2020).

Installing PyOD in Python

Let's first install PyOD on our machines.

    pip install pyod            # normal install
    pip install --pre pyod      # pre-release version for new features

Alternatively, you could clone the repository and install from its setup.py file.

    git clone https://github.com/yzhao062/pyod.git
    cd pyod
    pip install .
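After installing, it can be handy to confirm which packages are actually importable in the current environment. The following is a minimal sketch that relies only on the Python standard library. The helper name report_packages is hypothetical, and the package list simply mirrors the libraries named in this article.

```python
import importlib.util


def report_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    # find_spec returns None when a top-level package is not installed
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Package names mentioned in this article; adjust the list to your setup
    packages = ["pyod", "numba", "joblib", "keras"]
    for name, found in report_packages(packages).items():
        print("{:<8} {}".format(name, "installed" if found else "missing"))
```

A package reported as missing can then be installed with pip before you run the rest of the case study.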
If you plan to use neural-network-based models in PyOD, you have to install Keras and other libraries manually on your machine.

Credit Card Fraud Detection Case Study

The dataset we will use contains transactions made by credit cards in September 2013 by European cardholders. The dataset has been collected and analyzed during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

Now let us see how we can use the PyOD library in this case study. We will start by importing important packages such as pandas, numpy, sklearn and pyod.

    # Import important packages
    import pandas as pd
    import numpy as np
    import scipy
    import sklearn
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor

    # Importing KNN module from PyOD
    from pyod.models.knn import KNN
    from pyod.models.ocsvm import OCSVM

    # Import the utility function for model evaluation
    from pyod.utils.data import evaluate_print
    from pyod.utils.example import visualize
    from sklearn.preprocessing import StandardScaler
    # cf_matrix is assumed to be a local helper file (cf_matrix.py) next to the
    # notebook; it is not used in the rest of this article
    from cf_matrix import make_confusion_matrix
    %matplotlib inline   # Jupyter notebook magic; omit in a plain script
    import warnings
    warnings.filterwarnings('ignore')

    # set seed
    np.random.seed(123)

The dataset for this case study can be downloaded as a CSV file. Let's load the dataset.

    # Load the dataset from csv file by using pandas
    data = pd.read_csv("creditcard.csv")

Check columns in the dataset.

    # show columns
    data.columns

The dataset contains 31 columns. Only three of them are directly interpretable: Time, Amount, and Class (fraud or not fraud). The remaining 28 columns were transformed using PCA dimensionality reduction in order to protect user identities.
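Before going further, it can help to verify that a loaded file really has the layout this article describes. This is a small sketch in plain Python. The function names expected_columns and missing_columns are made up for illustration, and the column names assume the layout described in this article (Time, V1 to V28, Amount, Class).

```python
def expected_columns():
    """Column names assumed from the article: Time, V1..V28, Amount, Class."""
    pca_columns = ["V{}".format(i) for i in range(1, 29)]
    return ["Time"] + pca_columns + ["Amount", "Class"]


def missing_columns(actual_columns):
    """Return the expected columns that are absent from actual_columns."""
    actual = set(actual_columns)
    return [name for name in expected_columns() if name not in actual]


# Example with a deliberately incomplete header (no pandas needed here)
header = ["Time", "V1", "V2", "Amount", "Class"]
print(len(expected_columns()))        # 31
print(missing_columns(header)[:3])    # ['V3', 'V4', 'V5']
```

With the real data you would call missing_columns(data.columns) and expect an empty list.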
    # print the shape of the data
    data.shape

Output:

    (284807, 31)

The dataset contains 284,807 rows and 31 columns.

    # show the first five rows
    data.head()

You can see all transformed columns are named from V1 to V28. Let's check if we have any missing values in our dataset.

    # check missing data
    data.isnull().sum()

We don't have any missing values in our dataset. Our target column, Class, contains two classes: fraud, labeled as 1, and not fraud, labeled as 0.

    # determine the proportion of fraud cases in our file
    data.Class.value_counts(normalize=True)

In this dataset, only 0.173% of the transactions (492 in total) are fraud, while 99.827% (284,315 in total) are valid.

We can observe whether variables in the dataset are correlated with each other by using the heatmap plot implemented in the seaborn library.

    # find the correlation between the variables
    corr = data.corr()
    fig = plt.figure(figsize=(30, 20))
    sns.heatmap(corr, vmax=.8, square=True, annot=True)

The above correlation graph shows that the V11 variable has a comparatively strong positive correlation with the Class variable, while the V17 variable has a comparatively strong negative correlation with the Class variable.

Because we have far more valid transactions than fraud cases, we will use only the first 10,000 valid cases together with all 492 fraud cases to create our models.

    # use sample of the dataset
    positive = data[data["Class"] == 1]
    negative = data[data["Class"] == 0]
    print("positive:{}".format(len(positive)))
    print("negative:{}".format(len(negative)))
    new_data = pd.concat([positive, negative[:10000]])
    # shuffling our dataset
    new_data = new_data.sample(frac=1, random_state=42)
    new_data.shape

Output:

    positive:492
    negative:284315
    (10492, 31)

Now we have a total of 10,492 rows. We will standardize the Amount variable by using the StandardScaler method from sklearn. StandardScaler transforms the data so that it has a mean of 0 and a standard deviation of 1. This rescales the values but does not change the shape of their distribution, so the result is not necessarily normally distributed.
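The standardization described above boils down to subtracting the mean and dividing by the standard deviation. The sketch below shows that arithmetic in plain Python on made-up amounts. The function name standardize is hypothetical, and the population standard deviation (pstdev) is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (population std)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # All values identical: avoid dividing by zero, return zeros
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


amounts = [10.0, 20.0, 30.0, 40.0, 1000.0]   # made-up transaction amounts
scaled = standardize(amounts)
print([round(z, 3) for z in scaled])
```

The large amount is still far away from the others after scaling, which illustrates that standardizing changes the scale of the data and not the shape of its distribution.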
    # Normalising the amount column.
    new_data['Amount'] = StandardScaler().fit_transform(new_data['Amount'].values.reshape(-1, 1))

Separate the dataset into independent variables and the target variable (the Class variable). NB: we are not going to use the Time variable in this article.

    # split into independent variables and target variable
    X = new_data.drop(['Time', 'Class'], axis=1)
    y = new_data['Class']
    # show the shape of x and y
    print("X shape: {}".format(X.shape))
    print("y shape: {}".format(y.shape))

Output:

    X shape: (10492, 29)
    y shape: (10492,)

Split the dataset into train and test sets. We will use 20% of the dataset for the test set and the rest will be the train set.

    # split the data into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

Creating Models

We will create two outlier detectors from the PyOD library: the k-Nearest Neighbors detector and the One-class SVM detector.

1. k-Nearest Neighbors Detector

In the kNN detector, for any observation, its distance to its kth nearest neighbor could be viewed as the outlying score. PyOD supports three kNN detectors:

- Largest: Uses the distance of the kth neighbor as the outlier score.
- Mean: Uses the average of all k neighbors as the outlier score.
- Median: Uses the median of the distance to k neighbors as the outlier score.

    # create the KNN model
    clf_knn = KNN(contamination=0.172, n_neighbors=5, n_jobs=-1)
    clf_knn.fit(X_train)

The parameters we passed into KNN() are:

- contamination: The proportion of anomalies expected in the data, set here to 0.172. Note that this is well above the actual fraud share of our sample (492 / 10,492 ≈ 0.047), so the detector will flag more points than there are fraud cases.
- n_neighbors: Number of neighbors to consider for measuring the proximity.
- n_jobs: Number of parallel jobs to run; -1 uses all available processors.

After training our KNN detector model, we can get the prediction labels on the training data and then get the outlier scores of the training data. The higher the scores are, the more abnormal the observation. This indicates the overall abnormality in the data. These features make PyOD a great utility for anomaly detection tasks.
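The three scoring variants listed above, and the role of contamination, can be illustrated on a tiny one-dimensional example. This is a from-scratch sketch in plain Python, not PyOD's implementation. The helper names knn_scores and labels_from_contamination are hypothetical, distances are simple absolute differences, and the threshold rule (flag the top fraction of scores) is a simplification of what the library does.

```python
from statistics import fmean, median


def knn_scores(points, k, method="largest"):
    """Outlier score per point from the distances to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        if method == "largest":
            scores.append(nearest[-1])      # distance to the kth neighbor
        elif method == "mean":
            scores.append(fmean(nearest))   # average of the k distances
        elif method == "median":
            scores.append(median(nearest))  # median of the k distances
        else:
            raise ValueError("unknown method: {}".format(method))
    return scores


def labels_from_contamination(scores, contamination):
    """Label the highest-scoring fraction of points as outliers (1)."""
    n_outliers = max(1, round(contamination * len(scores)))
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    return [1 if s >= threshold else 0 for s in scores]


points = [1.0, 1.1, 1.2, 1.3, 1.4, 9.0]    # 9.0 is the obvious outlier
for method in ("largest", "mean", "median"):
    print(method, [round(s, 2) for s in knn_scores(points, k=2, method=method)])

# With contamination=0.2 only the single highest score is flagged
print(labels_from_contamination(knn_scores(points, k=2), contamination=0.2))
# [0, 0, 0, 0, 0, 1]
```

Raising the contamination value lowers the threshold, so more points are labeled as outliers even though their scores have not changed.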
    # Get the prediction labels of the training data
    y_train_pred = clf_knn.labels_  # binary labels (0: inliers, 1: outliers)
    # Outlier scores
    y_train_scores = clf_knn.decision_scores_

We can evaluate KNN() with respect to the training data. PyOD provides a handy function for this task called evaluate_print(). The default metrics include ROC and Precision @ n. We will pass the class name, the y_train values and y_train_scores (outlier scores as returned by a fitted model).

    # Evaluate on the training data
    evaluate_print('KNN', y_train, y_train_scores)

Output:

    KNN ROC:0.9566, precision @ rank n:0.5482

We see that the KNN() model has a good performance on the training data. Let's plot the confusion matrix for the train set.

    import scikitplot as skplt
    # plot the confusion matrix in the train set
    skplt.metrics.plot_confusion_matrix(y_train, y_train_pred, normalize=False, title="Confusion Matrix on Train Set")
    plt.show()

372 of the 394 fraud cases in the train set were predicted correctly, and only 22 were incorrectly predicted as valid cases.

We will use decision_function to predict anomaly scores of the test set using the fitted detector (KNN detector) and evaluate the results.

    # outlier scores
    y_test_scores = clf_knn.decision_function(X_test)
    # Evaluate on the test data
    evaluate_print('KNN', y_test, y_test_scores)

Output:

    KNN ROC:0.9393, precision @ rank n:0.5408

Our KNN() model continues to perform well on the test set. Let's plot the confusion matrix for the test set.

    # plot the confusion matrix in the test set
    y_preds = clf_knn.predict(X_test)
    skplt.metrics.plot_confusion_matrix(y_test, y_preds, normalize=False, title="Confusion Matrix on Test Set")
    plt.show()

87 of the 98 fraud cases in the test set were predicted correctly, and only 11 were incorrectly predicted as valid cases.

2. One-class SVM detector

This is an unsupervised outlier detection algorithm and a wrapper of the scikit-learn one-class SVM class with more functionalities. Let's create a One-class SVM model.
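Both detectors in this article are judged by the two numbers that evaluate_print reports, so it is worth spelling out what they measure. The sketch below computes them in plain Python on toy data. It is a simplified illustration and not PyOD's own code: the function names are hypothetical, precision @ rank n is taken here as the share of true outliers among the n highest scores (n being the number of true outliers), and ties may be handled differently by the library.

```python
def precision_at_rank_n(y_true, scores):
    """Fraction of true outliers among the n highest-scoring points,
    where n is the number of true outliers in y_true."""
    n = sum(y_true)
    if n == 0:
        raise ValueError("y_true contains no outliers")
    # Indices sorted by score, highest first
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(y_true[i] for i in ranked[:n]) / n


def roc_auc(y_true, scores):
    """Probability that a random outlier scores higher than a random inlier
    (ties count as one half). Quadratic time, written for clarity only."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one inlier")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = fraud) and made-up outlier scores
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.1, 0.4, 0.9, 0.2, 0.3, 0.8, 0.05, 0.7]
print(round(precision_at_rank_n(y_true, scores), 4))   # 0.6667
print(round(roc_auc(y_true, scores), 4))               # 0.8
```

Both metrics depend only on the ranking of the outlier scores, so they are unaffected by the contamination value, which only decides where the binary labels are cut.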
    # create the OCSVM model
    clf_ocsvm = OCSVM(contamination=0.172)
    clf_ocsvm.fit(X_train)

After training our OCSVM detector model, we can get the prediction labels on the training data and then get the outlier scores of the training data.

    # Get the prediction labels of the training data
    y_train_pred = clf_ocsvm.labels_  # binary labels (0: inliers, 1: outliers)
    clf_name = 'OCSVM'
    # Outlier scores
    y_train_scores = clf_ocsvm.decision_scores_
    # Evaluate on the training data
    evaluate_print(clf_name, y_train, y_train_scores)

Output:

    OCSVM ROC:0.9651, precision @ rank n:0.7132

The OCSVM model performs better than the KNN model on the train set. Let's plot the confusion matrix for the train set.

    # plot the confusion matrix in the train set
    skplt.metrics.plot_confusion_matrix(y_train, y_train_pred, normalize=False, title="Confusion Matrix on Train Set")
    plt.show()

373 of the 394 fraud cases in the train set were predicted correctly, and only 21 were incorrectly predicted as valid cases.

We will use decision_function to predict anomaly scores of the test set using the fitted detector (OCSVM detector) and evaluate the results.

    # outlier scores
    y_test_scores = clf_ocsvm.decision_function(X_test)
    # Evaluate on the test data
    evaluate_print(clf_name, y_test, y_test_scores)

Output:

    OCSVM ROC:0.9571, precision @ rank n:0.6633

Our OCSVM model continues to perform well on the test set. Let's plot the confusion matrix for the test set.

    # plot the confusion matrix in the test set
    y_preds = clf_ocsvm.predict(X_test)
    skplt.metrics.plot_confusion_matrix(y_test, y_preds, normalize=False, title="Confusion Matrix on Test Set")
    plt.show()

92 of the 98 fraud cases in the test set were predicted correctly, and only 6 were incorrectly predicted as valid cases.

In general, when we compare these two models, the OCSVM model performs better than the KNN model. There is more that can be done to increase the performance of the best model (OCSVM) for detecting fraud transactions.
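The comparison can be made concrete by turning the confusion-matrix counts quoted in this article into fraud recall, that is, the share of actual fraud cases each detector caught. The snippet is a small worked calculation. The function name fraud_recall is hypothetical and the counts are the ones reported in the text.

```python
def fraud_recall(caught, missed):
    """Share of actual fraud cases that the detector flagged."""
    return caught / (caught + missed)


# (caught, missed) fraud counts as reported in this article
results = {
    ("KNN", "train"): (372, 22),
    ("KNN", "test"): (87, 11),
    ("OCSVM", "train"): (373, 21),
    ("OCSVM", "test"): (92, 6),
}

for (model, split), (caught, missed) in results.items():
    recall = fraud_recall(caught, missed)
    print("{:<6} {:<5} recall = {:.3f}".format(model, split, recall))
```

This covers only the fraud side of the confusion matrix. The number of valid transactions wrongly flagged as fraud is not quoted in the text, so precision cannot be derived from these counts alone.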
You can also try other detector algorithms found in the PyOD documentation.

Conclusion

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. As a business owner, you can avoid serious headaches and unwanted publicity by recognizing potentially fraudulent use of credit cards in your payment environment.

The source code for this article is available on GitHub: https://github.com/Davisy/Credit-Card-Fraud-Detection-using-PYOD-Library

If you learned something new or enjoyed reading this article, please share it so that others can see it. I look forward to hearing about your experience using the PyOD library as well. I can also be reached on Twitter @Davis_McDavid.

Also published at https://medium.com/analytics-vidhya/introduction-to-anomaly-detection-using-machine-learning-with-a-case-study-part-two-f78243f74d2f