Predict Lending Club Loan Default Using Seahorse and SparkR

Predict Lending Club Loan Default Using Seahorse and SparkR

Overview

Data scientists are using Python and R to solve data problems due to the ready availability of these packages. These languages are often limited as the data is processed on a single machine, where the movement of data from the development environment to production environment is time-consuming and requires extensive re-engineering. To overcome this problem, Spark provides a powerful, unified engine that is both fast (100x faster than Hadoop for large-scale data processing) and easy to use by the data scientists and data engineers. It is simple, scalable, and easy to integrate with other tools.

Seahorse, a scalable data analytics workbench, allows the data scientists to visually build Spark applications. It allows the data scientists to perform data preparation, data transformation, data modeling, data training, data analysis, and data visualization collaboratively. Seahorse has built-in operations to allow the data scientists to customize parameter values. In this blog, let us discuss predicting loan default of Lending Club. Lending Club is the world’s largest online marketplace to connect borrowers and investors.

Pre-requisites

  • VirtualBox (version 5.0.10)
  • Vagrant (version 1.8.1)
  • Google Chrome (60.0.3112.113)

Data Description

Loan data of Lending Club, from 2007-2011, with 40K records is used as the source file. Each loan has more than 100 characteristics of loan and borrower.

select

Use Case

  • Analyze loan data of Lending Club.
  • Predict loan default in Lending Club dataset by building data model using Logistic Regression.

Loan status falls under two categories such as Charged Off (default loan) and Fully Paid (desirable loan). Lending Club defines Charged Off loans as loans that are non-collectable and the lender has no hope of recovering money.

Synopsis

  • Read Data from Source
  • Prepare Data
  • Train and Evaluate Data Model
  • Visualize Data

Workflow Operations

In Seahorse, all the machine learning processes are made as operations. R Transformation operations are used to clean and prepare the data. The operations used for Lending Club loan data analysis are as follows:

  • Input / Output – Read DataFrame
  • Action – Fit, Transform, Evaluate
  • Set Operation – Split
  • Filtering – Filter Columns, Handle Missing Values
  • Transformation – SQL Transformation, R Transformation
  • Feature Conversion – String Indexer, One Hot Encoder, Assemble Vector
  • Machine Learning – Logistic Regression from Classification, Binary Classification Evaluator from Evaluation

select

Reading Data from Source

Seahorse supports three different file formats such as CSV, Parquet, and JSON from different types of data sources such as HDFS, Database, Local, and Google Spreadsheets. Read DataFrame operation is used to read the files from the data sources and upload it into Seahorse library.

select

Preparing Data

To prepare the data for analysis, perform the following:

  • Remove irrelevant data (loan ID, URL, and so on), poorly documented data (average current balance), and less important features (payment plan, home state) from the source data.
  • Use Filter Columns operation to select 17 key features from the dataset as shown in the below diagram:

select

  • Use R Transformation operation to write any custom function in R.
  • Convert string columns into numeric columns by removing special characters and duplicate data.
    For Example, convert int_rate and revol_util columns into numeric by removing special characters (%).
  • Derive new features from the date columns by applying feature engineering.
    For example, derive issue_month and issue_year from issue_d feature and for earliest_cr_line feature.
The derived features are shown in the below diagram:

select

After Preprocessing

After preprocessing, perform the following:

  • Use Handle Missing Values operation to find the rows with missing values and to handle them with the selected strategy such as remove row, remove column, custom value, and mode.
    For example, provide custom values for NAs and empty string.
  • Select numeric and string columns from the DataFrame and select remove row as strategy as shown in the below diagram:

select

  • Use String Indexer to map the categorical features into numbers.
  • Choose the columns from the DataFrame using name, index, or type.
  • Select string type columns from the DataFrame and apply string indexer to those columns.
    For example, after the String Indexer execution, Fully Paid will become 0 and Charge Off will become 1 in loan_status column.
  • Use One Hot Encoder operation to convert categorical values into numbers in a fixed range of values.
    A vector will be produced in each column corresponding to one possible value of the feature.

select

  • Use Assemble Vector operation to group all relevant columns together and to form a column with a single vector of all the features.
    For example, the loan_status column is prediction variable and all other columns are features.
  • Use excluding mode to select all the columns other than the prediction variable.

select

Training and Evaluating Data Model

To split the dataset into training set and validation set using Split operation based on the split ratio, perform the following:

  • Use 0.7 as a split ratio to split 70 percentage of data in the training set and 30 percentage of data in the validation set.

select

  • Use Logistic Regression and Fit operations to perform model training.
  • Use Fit operation to fit on estimator so as to produce Transformer.
  • In Fit Operation, select features columns and prediction variable.
  • Select maximum iterations and threshold value for the model.
    Fit operation provides prediction variable with predicted values and confidence score in raw prediction and probability columns.

select

  • Use Evaluate action with Binary Classification Evaluator to find the performance of the model.
  • Find AUC, F-Score, and Recall values from the Binary Classification Evaluator and select AUC as a metric for the model.

select

  • Use custom functions (R or Python Transformation) to find the confusion matrix of the model and derive the metrics for that model.
  • Use SQL Transformation to write custom Spark SQL query and to get correctly predicted values and wrongly predicted values from the DataFrame.

select

select

Visualizing Data

DataFrame Report

In DataFrame Report, every column has some plots based on the datatype.

select

Int_rate Column Visualization

For Continuous features, the bar chart is used for data visualization as shown in the below diagram:

select

Grade Column Visualization

For Discrete features, the pie chart is used for data visualization as shown in the below diagram:

select

To create a custom plot like the combination of two column values, use custom operations such as R, Python, SQL Transformation or Python or R Notebook.

References

2027 Views 1 Views Today
  • harjit singh

    Hello
    Thanks For Sharing This Much Information! it’s Very Useful
    Thanks For Taking the Time to Share your Info!

    http://taxiappdeveloper.com/