Predict Bad Loans with H2O Flow AutoML

Overview

Machine learning algorithms play a key role in predicting loan outcomes for banks and other lenders. The main challenge is to choose models and algorithms that estimate the probability of loan default accurately enough to support sound financial decisions by both investors and borrowers. H2O Flow is a web-based interactive computational environment that combines text, code execution, and rich media in a single document.

H2O’s AutoML, an easy-to-use interface that also serves advanced users, automates the machine learning workflow, including training a large set of models. Stacked Ensembles are then built from those models to produce a highly predictive ensemble model, which is usually the top performer on the AutoML Leaderboard. In this blog, let us predict bad loans in order to help borrowers make financial decisions and investors choose the best investment strategy.

Pre-requisites

  • Install Python 2.7 or 3.5+
  • Install H2O Flow with the following packages:
    • pip install requests
    • pip install tabulate
    • pip install scikit-learn
    • pip install colorama
    • pip install future
    • pip install http://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/2/Python/h2o-3.14.0.2-py2.py3-none-any.whl
  • After installing H2O successfully, check the cluster connection using h2o.init().
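The connection check can be sketched in Python as follows. This is a minimal sketch that assumes the h2o package from the list above is installed and that Java is available; the helper name check_cluster is made up for illustration.

```python
def check_cluster():
    """Start (or attach to) a local H2O cluster and report whether it is up."""
    import h2o  # imported lazily so this file loads even before H2O is installed

    h2o.init()  # connects to a running cluster, or starts a local one
    return h2o.cluster().is_running()
```

Calling check_cluster() should print the cluster status table and return True once the cluster is reachable. With default settings, H2O Flow is then served from the same cluster at http://localhost:54321.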

Data Description

The source file is Lending Club loan data from 2007-2011, with 163K rows and 15 columns. Lending Club is a peer-to-peer lending platform for both investors and borrowers.

[Figure: Sample Dataset]

Dataset Variables

  1. loan_amnt
  2. term
  3. int_rate
  4. addr_state
  5. dti
  6. revol_util
  7. delinq_2yrs
  8. emp_length
  9. annual_inc
  10. home_ownership
  11. purpose
  12. total_acc
  13. longest_credit_length
  14. verification_status
  15. bad_loan (dependent variable)

Use Case

  • Analyze Lending Club’s loan data.
  • Predict bad loans in the dataset by using the distributed random forest model and the stacked ensembles in AutoML, based on whether the borrower's loan amount is approved or rejected.

Based on the percentage of bad loans, investors can decide whether or not to finance a borrower's new loan. For example, a loan is considered rejected if bad_loan is 1.

Synopsis

  • Import data from source
  • View parsing data
  • View job details and dataset summary
  • Visualize labels
  • Impute data
  • Split Data
  • Run AutoML
  • View Leaderboard
  • Compute Variable Importance
  • View Output

Importing Data from Source

To import the data from the source, perform the following:

  • Open H2O Flow.
  • Click Data –> Import Files to import the source files into H2O Flow.

After importing the files, a summary displays the results of the import.

Viewing Parsing Data

After the files are imported successfully, click Parse these files to parse them and to view the details of the source data.

The parse setup lists the column names and data types of all features. Data types are assigned by default and can be changed if required. For example, in our use case, the data type of the response column (bad_loan) is changed from numeric to factor (Enum). After making all changes, click Parse.
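The same import and parse step can be sketched with the H2O Python API. This is a minimal sketch, assuming a cluster has been started with h2o.init(). The helper name load_loans and the file path you pass to it are placeholders, not values from this blog.

```python
# Column types to override at parse time; every other column keeps the
# type that H2O guesses by default.
COL_TYPES = {"bad_loan": "enum"}


def load_loans(path):
    """Import and parse the loan CSV, forcing the response column to Enum."""
    import h2o  # imported lazily; requires a cluster started with h2o.init()

    return h2o.import_file(path, col_types=COL_TYPES)
```

Parsing bad_loan as Enum makes H2O treat the problem as classification rather than regression, which is what the Flow type change above achieves.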

Viewing Job Details and Dataset Summary

After you click Parse, the job details are displayed. Click View to see the summary of the DataFrame.

[Figure: Loan Dataset Summary]

In the above summary, the input columns show multiple label values. The data of each column can be visualized by clicking its column name.

Visualizing Labels

In this section, let us visualize the data of the loan amount (loan_amnt) and employment length (emp_length) columns.

[Figure: Loan Amount Data]

[Figure: Employment Length Data]

Imputing Data

Missing values in a column are imputed in place, with the aggregate (for example, the mean or median) computed on the “na.rm’d” vector, that is, on the column with its missing values removed.

To impute the data, perform the following:

  • Choose the attribute with missing values.
  • Click Impute.

  • Specify the following details:
    • Frame
    • Column
    • Method
    • Combine Method

After the column is imputed with the median values, the summary of the column is displayed.
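To make the median imputation concrete, here is a small plain-Python sketch of the idea. It is not H2O's implementation: the function name impute_median is made up for illustration, and it returns a copy, whereas H2O imputes the frame in place.

```python
from statistics import median


def impute_median(values):
    """Return a copy of `values` with None entries replaced by the median.

    The median is computed on the non-missing entries only, which mirrors
    computing the aggregate on the "na.rm'd" vector.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]
```

In the H2O Python API, the corresponding call on a parsed frame would be along the lines of frame.impute("column_name", method="median"), where the column name is whichever attribute has missing values.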

Splitting Data

To split the dataset into a training set (70%) and a test set (30%), perform the following:

  • Click Assist Me and then Split Frame (or open the Data drop-down and select Split Frame) to split the DataFrame.
    Flow automatically adjusts the ratio values so that they sum to one. If unsupported values are entered, an error is displayed.
  • Click Create to view the split frames.
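The ratio handling and the 70/30 split can be sketched in Python as follows. This is an illustration of the idea only, not Flow's actual validation code. The names complete_ratios and split_loans and the seed value are assumptions, and frame is assumed to be an H2OFrame.

```python
def complete_ratios(ratios):
    """Append the final split ratio so that all ratios sum to one.

    For example, [0.7] describes a 70% training split and implies a 30%
    test split. Raises ValueError for values outside (0, 1) or a total >= 1.
    """
    if not ratios or any(r <= 0 or r >= 1 for r in ratios):
        raise ValueError("each ratio must lie strictly between 0 and 1")
    total = sum(ratios)
    if total >= 1:
        raise ValueError("ratios must sum to less than one")
    # Round away floating-point noise such as 0.30000000000000004.
    return list(ratios) + [round(1 - total, 10)]


def split_loans(frame, seed=1234):
    """Split an H2OFrame into 70% training and 30% test frames."""
    # split_frame takes every ratio except the last one, which it infers
    # as the remainder.
    train, test = frame.split_frame(ratios=[0.7], seed=seed)
    return train, test
```

Note that H2O's split is probabilistic, so the resulting frames are close to, but not exactly, 70% and 30% of the rows.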

Running AutoML

To run AutoML, perform the following:

  • Select Model –> Run AutoML.

  • Provide the following details:
    • Training Frame – Select the dataset used to build the model.
    • Response Column – Select the column to be used as the dependent variable. Required only for GLM, GBM, DL, DRF, and Naïve Bayes (classification model).
    • Fold Column – (Optional in AutoML) Select the column that holds the cross-validation fold index assignment for each observation.
    • Weight Column – Select the column of per-row observation weights. Weights do not increase the data size. During training, rows with higher weights matter more because of the larger loss function pre-factor.
    • Validation Frame – (Optional) Select the dataset used to evaluate the model accuracy.
    • Leaderboard Frame – Specify the frame used to score the models for the Leaderboard when configuring the AutoML run. If not specified, the Leaderboard frame is created from the Training Frame. The models with the best results are displayed on the Leaderboard.
    • Max Models – (AutoML) Specify the maximum number of models to be built in an AutoML run.
    • Max Runtime Secs – Controls the maximum execution time of the AutoML run (the default is 3600 seconds).
    • Stopping Rounds – Stops training when the simple moving average of the stopping_metric does not improve for the specified number of training rounds. Specify 0 to disable this feature.
    • Stopping Tolerance – Specify the relative tolerance by which the stopping metric must improve; training stops when the improvement is smaller.
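For readers who prefer the Python API, the same AutoML run can be sketched as below. The setting values are illustrative assumptions, not the ones used for the results in this blog. The helper name run_automl is made up, and train and test are assumed to be the H2OFrames produced by the split.

```python
# Settings that mirror the Run AutoML form; the values are illustrative
# assumptions only.
AUTOML_SETTINGS = {
    "max_models": 10,
    "max_runtime_secs": 3600,
    "stopping_rounds": 3,
    "stopping_tolerance": 0.001,
    "seed": 1234,
}


def run_automl(train, test, response="bad_loan"):
    """Train AutoML on `train` and rank the models on `test`."""
    from h2o.automl import H2OAutoML  # needs a cluster started via h2o.init()

    # Every column except the response is used as a predictor.
    predictors = [name for name in train.columns if name != response]
    aml = H2OAutoML(**AUTOML_SETTINGS)
    aml.train(x=predictors, y=response, training_frame=train,
              leaderboard_frame=test)
    return aml
```

After training, aml.leaderboard holds the ranked models and aml.leader is the top model on that list.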

Viewing Leaderboard

The Leaderboard displays the models with the best results first.

[Figure: Model]

[Figure: ROC Curve – Training Metrics]

Computing Variable Importance

The importance of each variable affecting the model is computed in a way that depends on the algorithm, and the variables are listed from most to least important.
The percentage importance of all variables sums to 100, and the scaled importance expresses each variable's importance relative to the most important variable.
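The relationship between the raw, scaled, and percentage importance values can be sketched in plain Python. This is an illustration under the assumption that the raw importances are non-negative numbers. The function name scale_importance is made up, and percentages are expressed out of 100 to follow the text.

```python
def scale_importance(relative):
    """Map raw importances to (name, scaled, percentage), most important first.

    `relative` maps a variable name to its non-negative raw importance.
    scaled     = raw / largest raw value  (the top variable gets 1.0)
    percentage = 100 * raw / sum of all raw values
    """
    top = max(relative.values())
    total = sum(relative.values())
    if top <= 0:
        raise ValueError("at least one importance must be positive")
    ranked = sorted(relative.items(), key=lambda item: item[1], reverse=True)
    return [(name, raw / top, 100 * raw / total) for name, raw in ranked]
```

In the H2O Python API, model.varimp() returns comparable rows for models that support variable importance: the variable name followed by its relative, scaled, and percentage importance.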

Viewing Output

[Figure: Predicted Model of Loan Dataset]

[Figure: ROC Curve]
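The ROC curve is usually summarized by the area under it (AUC). The plain-Python sketch below shows one way to compute AUC from labels and predicted scores. It is meant for small examples only, since it compares every bad loan with every good loan, and the function name auc is chosen for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen bad loan (label 1) gets a
    higher score than a randomly chosen good loan (label 0); ties count half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking. With the H2O Python API, the same metric is available through model.model_performance(test).auc().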

[Figure: Prediction Scores]
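Prediction scores become approve or reject decisions once a cut-off is applied to the predicted probability of a bad loan. The sketch below assumes p_bad is a list of such probabilities. The function name classify and the default cut-off of 0.5 are placeholders for illustration, not the cut-off derived in this blog.

```python
def classify(p_bad, cutoff=0.5):
    """Label each loan 1 (bad, reject) or 0 (good, approve) from P(bad_loan = 1)."""
    if not 0 <= cutoff <= 1:
        raise ValueError("cutoff must be a probability")
    return [1 if p >= cutoff else 0 for p in p_bad]
```

Lowering the cut-off rejects more loans and catches more defaults at the cost of turning away some good borrowers; raising it does the opposite.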

Conclusion

In this blog, AutoML, the distributed random forest model, and the stacked ensembles are used to build and test the best model for predicting loan default. The data is analyzed to obtain the cut-off value. Investors use this cut-off value to decide on the best strategy for loan investment and to determine which applicants receive loans.
