Bank loans carry risk for both the lending bank and the borrower. Analyzing that risk requires understanding both the kind of risk involved and its level. Banks therefore need to assess each customer's loan eligibility so that they can target eligible customers specifically.
Banks want to automate the loan eligibility process in real time, based on customer details such as gender, marital status, age, occupation, income, and debts provided in the online application form. As the number of transactions in the banking sector grows rapidly and large volumes of data become available, customer behavior can be analyzed more readily and the risks around loans can be reduced. It is therefore important to be able to predict the loan type and loan amount from the banks' data.
In this blog post, we discuss how a Naive Bayes classification model in R can be used to predict loan approval.
The customer loan dataset contains 100+ unique customer records, with each customer represented by a single row. The dataset includes the following groups of variables, which are called predictors or independent variables:
- Customer Demographics (state, gender, age, race, marital status, occupation)
- Customer Financials (income, debts, credit score)
- Loan Product (loan type)
Data preprocessing involves data cleansing and data preparation. As part of data cleansing, check for missing values.
The missing-value check shows that this dataset has no missing values.
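A missing-value check of this kind can be sketched in base R as follows. The data frame `customer_loan` and its column names are assumptions made for illustration, not the actual dataset.

```r
# Toy stand-in for the customer loan data; the column names are assumptions.
customer_loan <- data.frame(
  applicantId = c("A001", "A002", "A003", "A004"),
  age         = c(34, 45, 29, 52),
  income      = c(5200, 6100, 3900, 7000),
  debts       = c(1500, 2600, 900, 3500),
  stringsAsFactors = FALSE
)

# Count missing values per column; all zeros means there is nothing to impute.
missing_per_column <- colSums(is.na(customer_loan))
print(missing_per_column)

# Single yes/no answer for the whole data frame.
has_missing <- anyNA(customer_loan)
print(has_missing)
```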
Debt-To-Income (DTI) Ratio
The Debt-To-Income ratio is defined as the ratio of all your monthly debt payments to your gross monthly income. Lenders look at this ratio when deciding whether to lend money or extend credit.
A low DTI indicates a good balance between debt and income. Lenders prefer a low DTI, generally below 36%; the lower it is, the better the chances of getting the loan or credit you seek.
DTI = (debts / income) * 100
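As a small worked example of the formula, the following sketch computes DTI for a few made-up applicants and flags those under the 36% guideline. The figures and vector names are illustrative only.

```r
# Made-up monthly figures for four applicants (illustrative values only).
income <- c(5200, 6100, 3900, 7000)  # gross monthly income
debts  <- c(1500, 2600, 900, 3500)   # total monthly debt payments

# DTI expressed as a percentage, following the formula above.
dti <- (debts / income) * 100
print(round(dti, 1))

# Flag applicants whose DTI is under the 36% guideline.
below_guideline <- dti < 36
print(below_guideline)
```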
A response (dependent) variable, loan_decision_status, is required to predict loan approval or denial. It is derived from the loan_decision_type field.
Loan status falls into one of three categories: ‘Approved’, ‘Denied’, and ‘Withdrawn’. Here, ‘Withdrawn’ means that the customer withdrew the loan, for whatever reason, after the bank approved it. So ‘Approved’ and ‘Withdrawn’ are coded as ‘1’ and ‘Denied’ as ‘0’.
Let us try to predict whether a loan will be approved (1) or denied (0) and classify it accordingly.
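The mapping from the three status categories to a binary response could look like the sketch below. The sample values are invented; only the field names come from the text above.

```r
# Invented sample of decision types, using the three categories named above.
loan_decision_type <- c("Approved", "Denied", "Withdrawn", "Approved", "Denied")

# 'Approved' and 'Withdrawn' become 1; 'Denied' becomes 0.
loan_decision_status <- ifelse(loan_decision_type == "Denied", 0, 1)
print(loan_decision_status)

# Cross-check the mapping against the original labels.
print(table(loan_decision_type, loan_decision_status))
```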
- Convert the loan_decision_status field to a factor.
- Exclude applicantId, state, and race from further processing, as these fields are not expected to affect the prediction. Also exclude income, debts, and loan_decision_type, because DTI and loan_decision_status are already included.
- Encode the categorical variables (gender, marital status, occupation, loan type) as factors.
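The three preparation steps above can be sketched in base R roughly as follows. The toy rows and the exact column spellings (for example `marital_status` and `credit_score`) are assumptions; adjust them to match the real dataset.

```r
# Toy rows standing in for the real data; column spellings are assumptions.
customer_loan <- data.frame(
  applicantId          = c("A001", "A002", "A003"),
  state                = c("OH", "TX", "CA"),
  gender               = c("Male", "Female", "Female"),
  age                  = c(34, 45, 29),
  race                 = c("A", "B", "C"),
  marital_status       = c("Married", "Single", "Divorced"),
  occupation           = c("IT", "Accountant", "Business"),
  credit_score         = c(710, 640, 780),
  income               = c(5200, 6100, 3900),
  debts                = c(1500, 2600, 900),
  loan_type            = c("Personal", "Auto", "Credit"),
  loan_decision_type   = c("Approved", "Denied", "Withdrawn"),
  dti                  = c(28.8, 42.6, 23.1),
  loan_decision_status = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Drop identifier-like fields and fields already captured by dti / status.
drop_cols <- c("applicantId", "state", "race",
               "income", "debts", "loan_decision_type")
model_data <- customer_loan[, !(names(customer_loan) %in% drop_cols)]

# Encode the categorical predictors as factors.
cat_cols <- c("gender", "marital_status", "occupation", "loan_type")
model_data[cat_cols] <- lapply(model_data[cat_cols], factor)

# Convert the response to a factor with fixed levels 0 (denied), 1 (approved).
model_data$loan_decision_status <- factor(model_data$loan_decision_status,
                                          levels = c(0, 1))
str(model_data)
```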
Partitioning the Data
- In predictive modeling, the data needs to be partitioned into training and test sets. Here, 70% of the data is used for training and 30% for testing.
- After splitting the data, apply feature scaling to standardize the range of the independent variables.
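A minimal sketch of the 70/30 split and the scaling step, using base R on simulated data, is shown below. The column names and the seed are arbitrary, and reusing the training-set mean and standard deviation for the test set is one reasonable choice rather than necessarily what the original code did.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Simulated stand-in for the prepared data (numeric predictors only).
n <- 100
model_data <- data.frame(
  age          = round(runif(n, 21, 65)),
  credit_score = round(runif(n, 300, 850)),
  dti          = runif(n, 5, 60),
  loan_decision_status = factor(sample(c(0, 1), n, replace = TRUE),
                                levels = c(0, 1))
)

# 70% of the rows for training, the remaining 30% for testing.
train_idx    <- sample(seq_len(n), size = floor(0.7 * n))
training_set <- model_data[train_idx, ]
test_set     <- model_data[-train_idx, ]

# Standardize the predictors using statistics from the training set only,
# then apply the same centering and scaling to the test set.
num_cols     <- c("age", "credit_score", "dti")
train_center <- colMeans(training_set[num_cols])
train_sd     <- apply(training_set[num_cols], 2, sd)
training_set[num_cols] <- scale(training_set[num_cols],
                                center = train_center, scale = train_sd)
test_set[num_cols]     <- scale(test_set[num_cols],
                                center = train_center, scale = train_sd)

print(dim(training_set))
print(dim(test_set))
```

Scaling the test set with the training statistics keeps information from the test rows out of the preprocessing step.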
Dimensionality Reduction using PCA
Because the customer data has more than two independent variables, it is difficult to plot: a chart that shows how a Machine Learning model separates the classes needs exactly two dimensions.
To reduce dimensions, perform the following:
- Apply Principal Component Analysis (PCA) to the customer dataset, excluding the dependent variable, and reduce it to two dimensions.
- Before applying PCA, install and load the caret package.
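The reduction to two components can be sketched as below. To stay dependency-free, this sketch swaps in base R's `prcomp` instead of caret; with caret, `preProcess(x, method = "pca", pcaComp = 2)` followed by `predict()` plays the same role. The simulated predictors are assumptions.

```r
set.seed(42)  # arbitrary seed for the simulated data

# Simulated numeric predictors; the dependent variable is deliberately absent.
make_x <- function(n) {
  data.frame(
    age          = rnorm(n, 40, 10),
    credit_score = rnorm(n, 650, 80),
    dti          = rnorm(n, 30, 10),
    gender       = sample(c(1, 2), n, replace = TRUE)
  )
}
training_x <- make_x(70)
test_x     <- make_x(30)

# Fit PCA on the training predictors only (centered and scaled).
pca <- prcomp(training_x, center = TRUE, scale. = TRUE)

# Project both sets onto the first two principal components.
training_pca <- as.data.frame(predict(pca, training_x)[, 1:2])
test_pca     <- as.data.frame(predict(pca, test_x)[, 1:2])

print(head(training_pca))
print(summary(pca)$importance[, 1:2])  # variance explained by PC1 and PC2
```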
Naive Bayes Classification
Several models can be run on the customer dataset and compared on performance and error rate in order to choose the best one. In this blog post, the Naive Bayes classification model in R is used.
To apply Naive Bayes classification model, perform the following:
- Install and load the e1071 package before running Naive Bayes.
- Evaluate the model built on the training set against the test set.
- Use accuracy and error rate to understand how the model behaves on the test set.
- Determine the best model using these measures.
- Use a confusion matrix (misclassification table) to describe the performance of the classification model on the test data. This table cross-tabulates the actual values against the predicted values, giving the counts of correctly and wrongly classified customers.
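The training and prediction steps listed above can be sketched with `naiveBayes()` from e1071. The data here is simulated (two components named `PC1` and `PC2` plus a binary status), so the output says nothing about the real dataset; it only shows the shape of the workflow.

```r
library(e1071)  # provides naiveBayes(); install.packages("e1071") if missing

set.seed(7)  # arbitrary seed for the simulated data

# Simulated two-component data in which PC1 carries some signal.
n      <- 200
status <- factor(sample(c(0, 1), n, replace = TRUE), levels = c(0, 1))
loans  <- data.frame(
  PC1 = rnorm(n, mean = ifelse(status == 1, 1, -1)),
  PC2 = rnorm(n),
  loan_decision_status = status
)

# 70/30 split into training and test sets.
train_idx    <- sample(seq_len(n), size = floor(0.7 * n))
training_set <- loans[train_idx, ]
test_set     <- loans[-train_idx, ]

# Fit the classifier on the training predictors and response.
classifier <- naiveBayes(x = training_set[c("PC1", "PC2")],
                         y = training_set$loan_decision_status)

# Predict the class (0 or 1) for each test-set customer.
y_pred <- predict(classifier, newdata = test_set[c("PC1", "PC2")])
print(head(y_pred))
```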
For the Naive Bayes classifier, the accuracy on the test set is 71% and the error rate is 29%. The accuracy might be improved by trying other classification models or by parameter tuning.
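To make the relationship between the confusion matrix, accuracy, and error rate concrete, here is a small sketch on ten hand-written labels. The vectors are invented and do not reproduce the 71% figure.

```r
# Ten invented actual / predicted labels (1 = approved, 0 = denied).
actual    <- factor(c(1, 0, 1, 1, 0, 1, 0, 1, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 1, 1, 0, 1, 0, 0), levels = c(0, 1))

# Confusion matrix: rows are actual classes, columns are predicted classes.
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

# Accuracy is the share of observations on the diagonal; error is the rest.
accuracy   <- sum(diag(cm)) / sum(cm)
error_rate <- 1 - accuracy
print(accuracy)
print(error_rate)
```

Here 7 of the 10 labels match, so the accuracy is 0.7 and the error rate is 0.3.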
Visualizing Test Set Results
The chart shows red and green points, which are observations from the test set. Each point represents one customer's loan status (approved or denied). Green points have loan_decision_status=1 (Approved) and red points have loan_decision_status=0 (Denied).
The goal is for the green and red points to fall in the matching prediction region. Points in the green prediction region are predicted to be approved, and points in the red prediction region are predicted to be denied. The smooth curve between the two regions is called the prediction boundary. The few green points in the red region and the few red points in the green region are wrong predictions.
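A chart of this kind can be drawn by predicting the class on a fine grid over the two components and coloring the grid by prediction. The sketch below does this with base graphics on simulated data; the colors, grid step, and variable names are arbitrary choices, not taken from the original code.

```r
library(e1071)  # provides naiveBayes()

set.seed(7)  # arbitrary seed for the simulated data
n      <- 200
status <- factor(sample(c(0, 1), n, replace = TRUE), levels = c(0, 1))
loans  <- data.frame(PC1 = rnorm(n, mean = ifelse(status == 1, 1, -1)),
                     PC2 = rnorm(n),
                     loan_decision_status = status)
train_idx  <- sample(seq_len(n), size = floor(0.7 * n))
test_set   <- loans[-train_idx, ]
classifier <- naiveBayes(x = loans[train_idx, c("PC1", "PC2")],
                         y = loans$loan_decision_status[train_idx])

# Build a grid covering the test points and predict a class for every cell.
x1 <- seq(min(test_set$PC1) - 1, max(test_set$PC1) + 1, by = 0.05)
x2 <- seq(min(test_set$PC2) - 1, max(test_set$PC2) + 1, by = 0.05)
grid_set <- expand.grid(PC1 = x1, PC2 = x2)
y_grid   <- predict(classifier, newdata = grid_set)

# Empty frame, then prediction regions, then the actual test observations.
plot(test_set$PC1, test_set$PC2, type = "n",
     main = "Naive Bayes (test set)", xlab = "PC1", ylab = "PC2")
points(grid_set, pch = ".",
       col = ifelse(y_grid == 1, "springgreen3", "tomato"))
points(test_set$PC1, test_set$PC2, pch = 21,
       bg = ifelse(test_set$loan_decision_status == 1, "green4", "red3"))

# Draw the prediction boundary where the predicted class changes.
z <- matrix(as.numeric(y_grid), nrow = length(x1), ncol = length(x2))
contour(x1, x2, z, levels = 1.5, drawlabels = FALSE, add = TRUE)
```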
- The customer loan dataset and the associated R code can be downloaded from the GitHub location: