Table of Contents
- 1 Overview
- 2 Pre-requisites
- 3 Data Description
- 4 Use Case
- 5 Importing Data from Source
- 6 Viewing Parsing Data
- 7 Viewing Job Details and Dataset Summary
- 8 Visualizing Labels
- 9 Imputing Data
- 10 Splitting Data
- 11 Running AutoML
- 12 Viewing Leaderboard
- 13 Computing Variable Importance
- 14 Viewing Output
- 15 Conclusion
- 16 References
Overview
Machine learning algorithms play a key role in predicting loan outcomes, such as the probability of default, from a bank's loan data. The main challenge is choosing models and algorithms that predict the probability of loan default accurately enough for investors and borrowers to make sound financial decisions. H2O Flow is a web-based interactive computational environment that combines text, code execution, and rich media in a single document.
H2O’s AutoML, an easy-to-use interface for advanced users, automates the machine learning workflow, including training a large set of models. Stacked Ensembles are used to produce a top-performing, highly predictive ensemble model on the AutoML Leaderboard. In this blog, let us predict bad loans in order to help borrowers make financial decisions and investors choose the best investment strategy.
Pre-requisites
- Install Python 2.7 or 3.5+.
- Install H2O with the following packages:
  - pip install requests
  - pip install tabulate
  - pip install scikit-learn
  - pip install colorama
  - pip install future
  - pip install http://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/2/Python/h2o-3.14.0.2-py2.py3-none-any.whl
- After H2O is installed successfully, check the cluster connection using h2o.init().
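The connection check can also be scripted. The following is a minimal sketch, assuming the h2o package installed above and a local Java runtime; the helper name `check_cluster` is illustrative and not part of H2O.

```python
def check_cluster():
    """Start (or connect to) a local H2O cluster and report whether it is up."""
    import h2o  # imported lazily; requires the h2o package installed above

    h2o.init()               # connects to a running local cluster or starts one
    cluster = h2o.cluster()  # handle to the connected cluster
    cluster.show_status()    # prints version, node count, memory, and health
    return cluster.is_running()
```

Calling `check_cluster()` should print the cluster status table and return True when the cluster is reachable. Once the cluster is up, H2O Flow is normally served from the same local address in a browser.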
Data Description
The source file is Lending Club loan data from 2007 to 2011, with 163K rows and 15 columns. Lending Club is a peer-to-peer lending platform for both investors and borrowers.
- Dependent variable – bad loan (the response column)
Use Case
- Analyze Lending Club’s loan data.
- Predict bad loans in the dataset, based on the approval or rejection of the borrower's loan amount, by using the distributed random forest model and the stacked ensembles in AutoML.
Based on the percentage of bad loans, investors can easily decide whether or not to finance a borrower's new loans. For example, a loan is considered rejected if the bad loan value is 1.
- Import data from source
- View parsing data
- View job details and dataset summary
- Visualize labels
- Impute data
- Split data
- Run AutoML
- View leaderboard
- Compute variable importance
- View output
Importing Data from Source
To import the data from the source, perform the following:
- Open H2O Flow.
- Click Data -> Import Files to import the source files into H2O Flow, as shown in the diagram below:
After the files are imported, a summary of the import results is displayed.
Viewing Parsing Data
After the files are imported successfully, click Parse these files to parse them and to view the details of the source data, as shown in the diagram below:
The parse setup lists the column names and data types of all features. The data types are assigned by default and can be changed if required. For example, in our use case, the data type of the response column (bad loan) is changed from numeric to factor (Enum). After making all changes, click Parse.
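The import and parse steps have a rough equivalent in the H2O Python client. This is a sketch only: the file path `loan.csv`, the column name `bad_loan`, and the helper name `load_loans` are assumptions made for illustration, since the post does not give the exact file or column names.

```python
def load_loans(path="loan.csv", response="bad_loan"):
    """Import a CSV into H2O and convert the response column to a factor."""
    import h2o  # imported lazily; requires the h2o package

    h2o.init()
    loans = h2o.import_file(path)                  # import and parse in one call
    loans[response] = loans[response].asfactor()   # numeric -> factor (Enum)
    loans.describe()                               # column types and summary
    return loans
```

Converting the response to a factor is what makes H2O treat the problem as classification rather than regression.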
Viewing Job Details and Dataset Summary
After you click Parse, you can view the job details. Click View to view the summary of the DataFrame.
Loan Dataset Summary
Visualizing Labels
The above summary shows that the input columns contain multiple label values. The data for each label can be visualized by clicking the corresponding column name.
In this section, let us visualize the data of the loan amount and employee length columns.
Loan Amount Data
Employee Length Data
Imputing Data
Missing values in a column are imputed in place. The aggregate used as the fill value (for example, the median) is computed on the column after its NA values are removed (the "na.rm'd" vector).
To impute the data, perform the following:
- Choose the attribute with missing values.
- Click Impute, as shown in the diagram below:
- Specify the following details:
- Combine Method
After the column is successfully imputed with the median value, the summary of the column is displayed, as shown in the diagram below:
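To make the idea concrete, here is a small pure-Python sketch of median imputation on a column with the missing values removed before the aggregate is computed. It is an illustration, not H2O's implementation; the `combine_method` argument mirrors the Combine Method option only loosely, and the sample values are made up. In the H2O Python client, the corresponding call is `H2OFrame.impute` with `method="median"`.

```python
import math


def _is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def median(values, combine_method="interpolate"):
    """Median of the non-missing values.

    combine_method decides how the two middle values are combined when the
    number of non-missing values is even: "low", "high", or their midpoint.
    """
    present = sorted(v for v in values if not _is_missing(v))
    if not present:
        raise ValueError("column has no non-missing values")
    mid = len(present) // 2
    if len(present) % 2 == 1:
        return present[mid]
    low, high = present[mid - 1], present[mid]
    if combine_method == "low":
        return low
    if combine_method == "high":
        return high
    return (low + high) / 2.0  # midpoint of the two middle values


def impute_median(values, combine_method="interpolate"):
    """Return a copy of the column with missing entries replaced by the median."""
    fill = median(values, combine_method)
    return [fill if _is_missing(v) else v for v in values]


emp_length = [10, None, 3, 1, None, 6]  # made-up values with two gaps
print(impute_median(emp_length))        # [10, 4.5, 3, 1, 4.5, 6]
```

The non-missing values are 1, 3, 6, and 10, so the two middle values are 3 and 6 and the midpoint 4.5 fills both gaps.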
Splitting Data
To split the dataset into a training set (70%) and a test set (30%), perform the following:
- Click Assist Me and then Split Frame (or click the Data drop-down and select Split Frame) to split the DataFrame.
Flow automatically adjusts the ratios so that they sum to one. If you enter an unsupported value, an error is displayed.
- Click Create to view the split frames.
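The sketch below illustrates a 70/30 split in plain Python by assigning each row to the training set with probability 0.7. It is an illustration under stated assumptions rather than H2O's code: the function name, the seed, and the toy rows are made up. A split done this way is approximate, so the resulting sizes are close to 70% and 30% but not exact. In the H2O Python client, the corresponding call is `frame.split_frame(ratios=[0.7], seed=...)`.

```python
import random


def split_rows(rows, train_ratio=0.7, seed=42):
    """Randomly assign each row to train (probability train_ratio) or test."""
    rng = random.Random(seed)  # a fixed seed keeps the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_ratio:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))  # stand-in for 1,000 loan records
train, test = split_rows(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, and rerunning with the same seed reproduces the same split.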
Running AutoML
To run AutoML, perform the following:
- Select Model -> Run AutoML, as shown in the diagram below:
- Provide the following details, as shown in the diagram below:
  - Training Frame – Select the dataset used to build the model.
  - Response Column – Select the column to be used as the dependent variable. Required only for GLM, GBM, DL, DRF, and Naïve Bayes (classification models).
  - Fold Column – (Optional in AutoML) Select the column that holds the cross-validation fold index assignment for each observation.
  - Weight Column – Weights are per-row observation weights and do not increase the data size. During training, rows with higher weights matter more because of the larger loss function pre-factor.
  - Validation Frame – (Optional) Select the dataset used to evaluate the model accuracy.
  - Leaderboard Frame – Specify the Leaderboard frame when configuring the AutoML run. If it is not specified, the Leaderboard frame is created from the Training Frame. The models with the best results are displayed on the Leaderboard.
  - Max Models – (AutoML) Specify the maximum number of models to be built in an AutoML run.
  - Max Runtime Secs – Controls the execution time of the AutoML run (the default is 3600 seconds).
  - Stopping Rounds – Stops training when the stopping_metric, tracked as a simple moving average, does not improve for the specified number of training rounds. Specify 0 to disable this feature.
  - Stopping Tolerance – Specify the tolerance by which the stopping metric must improve; if the improvement is smaller, training stops.
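The same settings can be passed to AutoML from the H2O Python client. The following is a sketch, assuming `train` and `test` are H2OFrames produced by the split step and that the response column is named `bad_loan`; the helper name `run_automl` and the values chosen for Max Models, Stopping Rounds, and Stopping Tolerance are illustrative and not taken from the post.

```python
def run_automl(train, test, response="bad_loan", max_models=10):
    """Train AutoML on H2OFrames and return the AutoML object."""
    from h2o.automl import H2OAutoML  # imported lazily; requires h2o

    predictors = [c for c in train.columns if c != response]
    aml = H2OAutoML(
        max_models=max_models,     # Max Models (illustrative value)
        max_runtime_secs=3600,     # Max Runtime Secs (the default noted above)
        stopping_rounds=3,         # Stopping Rounds; 0 disables early stopping
        stopping_tolerance=0.001,  # Stopping Tolerance (illustrative value)
        seed=1,                    # for reproducibility
    )
    aml.train(
        x=predictors,
        y=response,                # Response Column
        training_frame=train,      # Training Frame
        leaderboard_frame=test,    # Leaderboard Frame
    )
    print(aml.leaderboard)         # ranked models, best first
    return aml
```

After training, `aml.leader` holds the top model on the Leaderboard. The Fold Column, Weight Column, and Validation Frame options correspond to the `fold_column`, `weights_column`, and `validation_frame` arguments of `train`.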
Viewing Leaderboard
The Leaderboard lists the models with the best results first, as shown in the diagram below:
ROC Curve – Training Metrics
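For a binary response such as bad loan, models are commonly compared by the area under the ROC curve (AUC). The sketch below shows, in plain Python, how ROC points and AUC can be derived from predicted probabilities. The labels and scores are made up for illustration, and the code is not H2O's implementation.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs from sweeping the cut-off over each distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / float(negatives), tp / float(positives)))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 0, 1, 0, 0, 1]               # made-up: 1 = bad loan
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]   # made-up predicted P(bad loan)
print(auc(roc_points(labels, scores)))
```

For this toy data the AUC is 5/9 (about 0.556): of the nine (bad loan, good loan) pairs, five have the bad loan scored higher. A model that ranks every bad loan above every good loan would reach an AUC of 1.0.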
Computing Variable Importance
The relative importance of every variable affecting the model is computed in an algorithm-dependent way, and the variables are listed in order from most to least important.
The percentage importance of all variables is scaled to 100. The scaled importance values of the variables are shown in the diagram below:
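The arithmetic behind such a table can be sketched as follows. The code assumes the common convention that scaled importance divides each relative importance by the largest one and that percentage importance divides by the total; the variable names and numbers are invented for illustration and are not results from the loan dataset.

```python
def importance_table(relative):
    """Rank variables and derive scaled and percentage importance.

    relative maps a variable name to its raw (relative) importance.
    Returns rows of (name, relative, scaled, percentage), most important first.
    """
    top = float(max(relative.values()))
    total = float(sum(relative.values()))
    ranked = sorted(relative.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, value, value / top, 100.0 * value / total)
        for name, value in ranked
    ]


# Made-up relative importances for three hypothetical columns.
relative = {"loan_amnt": 400.0, "emp_length": 100.0, "other_feature": 500.0}
for row in importance_table(relative):
    print(row)
```

In this toy example the top variable has a scaled importance of 1.0, and the percentage values (50, 40, and 10) add up to 100.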
Viewing Output
Predicted Model of Loan Dataset
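A binary classification model typically outputs a probability for each class, and a cut-off turns the probability of a bad loan into the 0 or 1 flag described in the use case. The following plain-Python sketch shows that step. The probabilities and the cut-off values are made up, since the post does not state the cut-off that was obtained.

```python
def flag_bad_loans(p_bad, cutoff=0.5):
    """Return 1 (bad loan, reject) where P(bad loan) >= cutoff, else 0."""
    return [1 if p >= cutoff else 0 for p in p_bad]


p_bad = [0.05, 0.20, 0.35, 0.60, 0.90]   # made-up predicted probabilities
print(flag_bad_loans(p_bad, cutoff=0.5))  # [0, 0, 0, 1, 1]
print(flag_bad_loans(p_bad, cutoff=0.3))  # [0, 0, 1, 1, 1]
```

Lowering the cut-off flags more applicants as bad loans, which is the more cautious choice for an investor; raising it approves more loans at higher risk.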
Conclusion
In this blog, AutoML, the distributed random forest model, and the stacked ensembles are used to build and test the best model for predicting loan default. The data is analyzed to obtain the cut-off value. Investors use this cut-off value to choose the best investment strategy and to determine which applicants get loans.