Imbalanced data refers to classification problems where one class outnumbers the other by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-level classification. For example, extreme imbalance can be seen in banking or financial data, where the majority of credit card transactions are legitimate and very few are fraudulent.
With an imbalanced dataset, a learning algorithm sees too few minority-class examples to obtain the information required to predict that class accurately. So, it is recommended to balance the dataset before training a classifier. In this blog, let us discuss tackling imbalanced classification problems using R.
A credit card transaction dataset with 284K transactions, of which 492 are fraudulent, and 31 columns is used as the source file. For the sample dataset, refer to the References section. The columns are:
- Time – Time (in seconds) elapsed between each transaction and the first transaction in the dataset.
- V1-V28 – Principal component variables obtained with PCA.
- Amount – Transaction amount.
- Class – Dependent (response) variable, with value 1 for fraud and 0 for good transactions.
The analysis covers the following steps:
- Performing exploratory data analysis
  - Checking imbalance data
  - Checking number of transactions by hour
  - Checking mean using PCA variables
- Partitioning data
- Building model on training set
- Applying sampling methods to balance dataset
Performing Exploratory Data Analysis
Exploratory data analysis is carried out using R to summarize and visualize significant characteristics of the dataset.
Checking Imbalance Data
To find the imbalance in the dependent variable, perform the following:
- Group the data based on the Class value using the group_by function from the dplyr package.
- Use ggplot to show the percentage of each class category.
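The two steps above can be sketched roughly as follows. This is a minimal illustration on a toy data frame, not the original analysis code: the `transactions` data frame and its 995/5 class split are made up for the example, and only the `Class` column name comes from the dataset description.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the credit card data: 995 good (0) and 5 fraud (1) rows.
transactions <- data.frame(Class = c(rep(0, 995), rep(1, 5)))

# Count the rows in each class and express the counts as percentages.
class_summary <- transactions %>%
  group_by(Class) %>%
  summarise(n = n()) %>%
  mutate(pct = 100 * n / sum(n))

# Bar chart of the percentage share of each class.
class_plot <- ggplot(class_summary, aes(x = factor(Class), y = pct)) +
  geom_col() +
  labs(x = "Class (0 = good, 1 = fraud)", y = "Share of transactions (%)")

print(class_summary)
```

Calling `print(class_plot)` draws the chart. On the real data, the fraud bar would be barely visible, which is exactly the imbalance being checked.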
Checking Number of Transactions by Hour
To check the number of transactions by day and hour, normalize the time by day and categorize the transactions into four quarters according to the time of day.
The resulting graph covers two days of transactions. It shows that most of the fraudulent transactions occurred between hours 13 and 18.
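One way to derive the day, hour, and quarter from the Time column is sketched below. The toy data is randomly generated, and the quarter boundaries (00-05, 06-11, 12-17, 18-23) are an assumption made for this example, since the exact cut points are not stated here.

```r
# Toy stand-in: 1,000 transactions spread over two days, Time in seconds.
set.seed(42)
transactions <- data.frame(
  Time  = sort(runif(1000, min = 0, max = 2 * 86400 - 1)),
  Class = rbinom(1000, 1, 0.05)
)

seconds_per_day <- 86400

# Day number (1 or 2) and hour of day (0 to 23) for each transaction.
transactions$day  <- transactions$Time %/% seconds_per_day + 1
transactions$hour <- (transactions$Time %% seconds_per_day) %/% 3600

# Four quarters of the day; these boundaries are assumed for illustration.
transactions$quarter <- cut(
  transactions$hour,
  breaks = c(0, 6, 12, 18, 24),
  right  = FALSE,
  labels = c("00-05", "06-11", "12-17", "18-23")
)

# Number of good and fraud transactions in each hour.
hourly_counts <- table(hour = transactions$hour, Class = transactions$Class)
print(hourly_counts)
```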
Checking Mean using PCA Variables
To find data anomalies, take the mean of each variable from V1 to V28 and check the variation.
In the resulting plot, the blue points mark the variables with the most variation.
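A rough sketch of this check is given below, assuming the means are compared between the good and fraud classes (the text does not spell out the grouping, so treat that as an assumption). The data is random noise, so no variable stands out here; on the real data, the largest gaps would point to the variables worth a closer look.

```r
# Toy stand-in: 500 rows with columns V1..V28 and a rare positive class.
set.seed(1)
n <- 500
pca_cols <- paste0("V", 1:28)
transactions <- as.data.frame(
  matrix(rnorm(n * 28), ncol = 28, dimnames = list(NULL, pca_cols))
)
transactions$Class <- rbinom(n, 1, 0.1)

# Mean of every PCA variable within each class (one row per class).
class_means <- aggregate(
  transactions[pca_cols],
  by  = list(Class = transactions$Class),
  FUN = mean
)

# Gap between the fraud mean and the good mean for each variable.
mean_gap <- unlist(class_means[class_means$Class == 1, pca_cols]) -
  unlist(class_means[class_means$Class == 0, pca_cols])

# Variables with the largest absolute gap come first.
print(head(sort(abs(mean_gap), decreasing = TRUE)))
```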
Partitioning Data
In predictive modeling, the data needs to be partitioned into a training set (80% of the data) and a testing set (20% of the data). After partitioning the data, feature scaling is applied to standardize the range of the independent variables.
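The split and scaling can be sketched in base R as follows. This is an illustration on made-up data using a plain random split; a stratified split, which keeps the fraud ratio the same in both sets, may be preferable on the real data. Scaling the test set with the training set's mean and standard deviation is a design choice made here so that no test information leaks into training.

```r
# Toy stand-in for the transaction data.
set.seed(123)
n <- 1000
transactions <- data.frame(
  Amount = rexp(n, rate = 1 / 80),
  V1     = rnorm(n),
  Class  = rbinom(n, 1, 0.05)
)

# Random 80/20 split into training and testing sets.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- transactions[train_idx, ]
test  <- transactions[-train_idx, ]

# Standardize the independent variables using training-set statistics only.
feature_cols <- setdiff(names(transactions), "Class")
train_scaled <- scale(train[feature_cols])
test_scaled  <- scale(
  test[feature_cols],
  center = attr(train_scaled, "scaled:center"),
  scale  = attr(train_scaled, "scaled:scale")
)
train[feature_cols] <- as.data.frame(train_scaled)
test[feature_cols]  <- as.data.frame(test_scaled)

print(dim(train))
print(dim(test))
```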
Building Model on Training Set
To build a model on the training set, perform the following:
- Apply a logistic classifier on the training set.
- Predict the test set.
- Check the predicted output on the imbalanced data.
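These three steps might look like the sketch below, which fits a logistic regression with base R's `glm`. The toy data, the single predictor `x`, and the 0.5 cutoff are assumptions made for the example.

```r
# Toy imbalanced data: a rare positive class loosely tied to one predictor.
set.seed(7)
n <- 2000
x <- rnorm(n)
toy <- data.frame(x = x, Class = rbinom(n, 1, plogis(-4 + 1.5 * x)))

train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit the logistic classifier on the training set.
fit <- glm(Class ~ x, data = train, family = binomial)

# Predict fraud probabilities for the test set and apply a 0.5 cutoff.
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

# Confusion matrix with both classes always present as rows and columns.
conf_mat <- table(
  Actual    = factor(test$Class, levels = 0:1),
  Predicted = factor(pred, levels = 0:1)
)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

On data this skewed, the accuracy is high even when few of the rare positives are caught, so the bottom row of the confusion matrix deserves more attention than the headline number.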
Using the confusion matrix, the test result shows 99.9% accuracy, which is due to the overwhelming share of class 0 (good) records. So, let us neglect this accuracy. Using the ROC curve, the test result shows a score (area under the curve) of 78%, which is very low.
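A quick piece of arithmetic shows why accuracy alone is misleading here. The sketch below uses the approximate dataset size quoted earlier (284K, rounded) and a hypothetical "model" that labels every transaction as good.

```r
# Approximate figures from the dataset description (284K is rounded).
n_total <- 284000
n_fraud <- 492

# A trivial model that predicts "good" for every transaction is right on
# every good transaction and wrong on every fraud.
baseline_accuracy <- (n_total - n_fraud) / n_total

# It catches none of the fraud cases, so its recall on fraud is zero.
baseline_recall <- 0 / n_fraud

print(round(100 * baseline_accuracy, 2))
print(baseline_recall)
```

The trivial model scores about 99.8% accuracy while detecting no fraud at all, which is why a metric such as the ROC score is a better guide on this data.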
Applying Sampling Methods to Balance Dataset
Different sampling methods are used to balance the training data, and the model is then applied to the balanced data. First, check the number of good and fraud transactions in the training set.
There are 227K good and 394 fraud transactions.
In R, the Random Over-Sampling Examples (ROSE) and DMwR packages are used to quickly perform sampling strategies. The ROSE package generates artificial data based on sampling methods and a smoothed bootstrap approach. It also provides well-defined accuracy functions to quickly evaluate the results.
The different types of sampling methods are:
Oversampling
The method = "over" option instructs the algorithm to perform oversampling. As the original training set had 227K good observations, the minority class is oversampled with replacement until it also reaches 227K, giving a total of 454K samples.
Undersampling
This method works like the oversampling method in reverse: the majority class is sampled without replacement until the good transactions are equal in number to the fraud transactions. As most of the good transactions are discarded, significant information is lost from this sample. This can be attained using method = "under".
Both (Oversampling and Undersampling)
This method is a combination of the oversampling and undersampling methods. The majority class is undersampled without replacement and the minority class is oversampled with replacement. This can be attained using method = "both".
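The three options described above can be compared side by side with the `ovun.sample` function from the ROSE package. The sketch below runs on a toy training set of 950 good and 50 fraud rows; the data, the seeds, and the target sizes are made up for the example and would be replaced by the real training-set counts.

```r
library(ROSE)

# Toy stand-in for the training set: 950 good (0) and 50 fraud (1) rows.
set.seed(10)
train <- data.frame(
  Amount = rexp(1000, rate = 1 / 80),
  V1     = rnorm(1000),
  Class  = factor(c(rep(0, 950), rep(1, 50)))
)

# Oversampling: N is the total size wanted, so 2 * 950 brings the minority
# class up to the size of the majority class.
over <- ovun.sample(Class ~ ., data = train, method = "over",
                    N = 2 * 950, seed = 1)$data

# Undersampling: 2 * 50 shrinks the majority class to the minority size.
under <- ovun.sample(Class ~ ., data = train, method = "under",
                     N = 2 * 50, seed = 1)$data

# Both: keep the original size and aim for roughly equal class shares.
both <- ovun.sample(Class ~ ., data = train, method = "both",
                    p = 0.5, N = 1000, seed = 1)$data

print(table(over$Class))
print(table(under$Class))
print(table(both$Class))
```

With method = "both", the class counts are only approximately equal, because each sampled row is drawn from the minority class with probability `p`.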
ROSE Sampling
The ROSE sampling method generates data synthetically and provides a better estimate of the original data.
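A minimal sketch of ROSE sampling with the `ROSE` function from the package of the same name is shown below. The toy data and the seed are assumptions for the example. With the default settings used here, the output has as many rows as the input and roughly equal class shares.

```r
library(ROSE)

# Toy stand-in for the training set: 950 good (0) and 50 fraud (1) rows.
set.seed(10)
train <- data.frame(
  Amount = rexp(1000, rate = 1 / 80),
  V1     = rnorm(1000),
  Class  = factor(c(rep(0, 950), rep(1, 50)))
)

# Generate a synthetic, roughly balanced sample with a smoothed bootstrap.
rose_train <- ROSE(Class ~ ., data = train, seed = 1)$data

print(table(rose_train$Class))
```

Because the rows are drawn from a smoothed estimate of each class rather than copied, the generated values are new, and a variable such as Amount can come out slightly outside its original range.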
Synthetic Minority Over-Sampling Technique (SMOTE) Sampling
This method avoids the overfitting that occurs when exact replicas of minority instances are added to the main dataset.
For example, a subset of data from the minority class is taken, and new, similar synthetic instances are created and added to the original dataset.
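The core idea of SMOTE can be sketched in a few lines of base R. This is a simplified illustration, not the implementation in the DMwR package: the helper `smote_sketch`, its arguments, and the toy data are hypothetical. Each synthetic point is placed at a random position on the line between a minority instance and one of its nearest minority neighbours.

```r
# Hypothetical helper: create n_new synthetic rows from minority-class rows.
smote_sketch <- function(minority, n_new, k = 5) {
  minority <- as.matrix(minority)
  n <- nrow(minority)
  stopifnot(n >= 2)
  k <- min(k, n - 1)
  dists <- as.matrix(dist(minority))  # pairwise Euclidean distances
  synthetic <- matrix(
    NA_real_, nrow = n_new, ncol = ncol(minority),
    dimnames = list(NULL, colnames(minority))
  )
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)
    # The closest entry is the point itself (or an exact duplicate), so skip it.
    neighbours <- order(dists[base, ])[2:(k + 1)]
    nb <- neighbours[sample.int(length(neighbours), 1)]
    gap <- runif(1)  # random position between the two points
    synthetic[i, ] <- minority[base, ] + gap * (minority[nb, ] - minority[base, ])
  }
  as.data.frame(synthetic)
}

# Toy minority class with 20 rows and two numeric features.
set.seed(99)
minority <- data.frame(V1 = rnorm(20, mean = 2), V2 = rnorm(20, mean = -1))
synthetic <- smote_sketch(minority, n_new = 40)

print(head(synthetic))
```

Since every synthetic row lies between two real minority rows, the new instances resemble the originals without being exact copies.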
The number of records in each class after applying the sampling techniques is shown below:
A logistic classifier model is fitted on each balanced training set and used to predict the test data. The confusion matrix accuracy is neglected because the test data is imbalanced. Instead, the roc.curve function, which is built into the ROSE package, is used to capture the ROC metric.
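The comparison described above might be organized as in the sketch below. The toy data, the predictor names `x1` and `x2`, and the seeds are assumptions, and SMOTE is left out to keep the example short. Predicted probabilities are passed to `roc.curve`, which is one possible choice; passing hard class labels would give a different score.

```r
library(ROSE)

# Toy imbalanced data with two predictors and a rare positive class.
set.seed(2024)
n  <- 3000
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- data.frame(
  x1 = x1, x2 = x2,
  Class = factor(rbinom(n, 1, plogis(-4 + 1.5 * x1 + x2)))
)
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

n_major <- max(table(train$Class))
n_minor <- min(table(train$Class))

# One training set per sampling strategy, plus the untouched original.
train_sets <- list(
  original = train,
  over  = ovun.sample(Class ~ ., data = train, method = "over",
                      N = 2 * n_major, seed = 1)$data,
  under = ovun.sample(Class ~ ., data = train, method = "under",
                      N = 2 * n_minor, seed = 1)$data,
  both  = ovun.sample(Class ~ ., data = train, method = "both",
                      p = 0.5, N = nrow(train), seed = 1)$data,
  rose  = ROSE(Class ~ ., data = train, seed = 1)$data
)

# Fit a logistic model on each set and score it on the same test set.
auc <- sapply(train_sets, function(d) {
  fit  <- glm(Class ~ ., data = d, family = binomial)
  prob <- predict(fit, newdata = test, type = "response")
  roc.curve(test$Class, prob, plotit = FALSE)$auc
})

print(round(auc, 3))
```

The test set is left imbalanced on purpose, so the scores reflect how each model would behave on data that looks like the original.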
In this blog, the highest accuracy is obtained using the SMOTE method. As there is not much variation among these sampling methods, combining them with a more robust algorithm such as random forest or boosting can provide exceptionally high accuracy.
When dealing with an imbalanced dataset, experiment with all these methods to find the sampling method best suited to your dataset. For better results, advanced sampling methods that combine synthetic sampling with boosting can be used.
These sampling methods can be implemented in the same way in Python too. For the Python code, check the References section below.
References
- Sample Credit Card Transaction Data:
- Associated R and Python Code in GitHub: