Loan Application Metrics Using Apache Pig and Tez

Loan Application Metrics Using Apache Pig and Tez

Overview

In this blog, Pig is used to calculate and find loan application metrics using loan, application, and applicant mock data. This mock data is prepared with specific criteria in mind. The fields are included in each dataset as per the standard data model used in finance industry in United States (US).

Usually, applicants applying for loan belong to different states. The overall summary of loan application metrics by state will help the decision makers to find out performance (increase or decrease) of the branches by state. Few interesting calculations like average Depts. to Income (DTI), weighted average interest rate, and others by state were used.

About Apache Pig

Apache Pig is a platform used for analyzing large datasets consisting of a high-level language for expressing data analysis programs. It is used for performing Extract, Transform, and Load (ETL) operations. It is an interactive or script-based, execution environment supporting Pig Latin, a language used to express data flows. The Pig Latin language supports loading and processing of input data with a series of operators that transform the input data and produce the desired output.

Pre-requisites

To calculate and find loan application metrics by state using Pig, perform the following:

  • Setup Hortonworks (HDP) cluster with required services (HDFS, YARN, Pig, Hive, Tez, and others)
  • Create loan, application, and applicant Hive partition tables
  • Load data into those partition tables
  • Create loan metrics by state output partition table

Note: Datasets used in this use case, Hive DDL statements, and Pig scripts are available in GitHub. Please find GitHub location in the Reference section at the end of the article.

Use Case

In this use case, Pig is used to calculate and find loan application metrics. Pig scripts are executed in Tez execution engine instead of old MapReduce. The following calculations are included in the Pig scripts and are grouped by states, loan decision type, and loan type.

  • Average DTI of applicants
  • Total loan approved amount
  • Loan approved count
  • Total requested amount by applicants
  • Weighted average interest rate
  • Average applicant annual income
  • Average approved loan

Loan Application Metrics by State

Input Hive Tables

Partition tables are used as both input and output. The “loan_data” table structure of input table looks similar to the one below:

select

The “loan_data” table partition by “orgination_year” is derived from “orgination_date” field.

Note: Find the Reference section for Hive DDL statements GitHub location.

Extract, Transform, and Load (ETL) Using Pig

Pig script includes UDF functions and is used to invoke multiple Pig scripts. For simplicity, single Pig script is used to find loan application metrics.

Extract

Pig “Load” command and HCatLoader() are used to extract input data from partitioned Hive table.

select

The input data is stored in the variables for further processing.

Transform

In this phase, various operations such as filtering, joining datasets, grouping (by state, loan decision type, and loan type), and applying aggregate operations using Pig operators and derived calculations are performed. Filtering needed records will restrict the number of records to process and calculate. Two types of filters applied are:

  • Filtered only “Approved” records from loan data
  • Filtered only “loan” records from application data

select

Weighted average interest rate calculation is performed using the following steps:

  • Multiply each loan amount by its interest rate to obtain the per loan weight factor
  • Add the per loan weight factors together to obtain the total per loan weight factor
  • Add the loan amounts together to obtain the total loan amount
  • Divide the total per loan weight factor by the total loan amount. Multiply this by 100 to express it as a percentage and to obtain the actual interest rate.

Let us discuss them in detail. Multiply each loan amount by its interest rate to obtain the per loan weight factor.

select

DTI formula: (depts/income) * 100.

select

As required fields needed for calculations are in different variables, the loan, application, and applicant data and store are joined into a single variable.

select

The combined results are grouped by state, loan decision type, and loan type to provide metrics based on these combinations to gain more granularity. The aggregate operators used based on base value, the formula used for weighted rate, and average approved loan are explained in the below code:

select

Load

The output result is stored into the resulted Hive table using Pig “Store” command and HCatStorer ().

select

Execute Pig Script

As HCatalog functions are used in this use case, the following argument should be supplied while executing Pig scripts. The output result is stored into the resulted Hive table using Pig “Store” command and HCatStorer ().

select

The Pig scripts are executed using Tez execution engine, which constructs Direct Acyclic Graph (DAG) execution plan to execute the script in an optimized manner.

Understanding Tez DAG

The Apache Tez project aims at building an application framework that allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN. As execution will be tracked and listed in the Tez UI, all the tasks are executed using “Tez”. The Tez UI provides inventory of all queries executed with search and filter options.

select

Apache Tez models data processing as a dataflow graph. The vertices in the graph represents processing of data and the edges represents movement of data between the processing. Thus, the user logic that analyses and modifies the data sits in the vertices. Edges determine the consumer of the data, how the data is transferred, and the dependency between the producer and consumer vertices.

select

Vertex Swimlane: It is a new feature in HDP 2.5. It helps to quickly pinpoint the long-running parts of the query and to optimize them.

select

Conclusion

Pig has an option to dump the processed results and to validate the results before storing it into Hive table.

select

Once the results are verified, the final output will be stored into the partitioned Hive table.

select

The weighted average combines the interest rates into a single interest rate for yielding a combined cost same as the cost of the original separate loans. Let us sort by weighted rate descending to find out the loan in the state that produces more interest.

select

References

1478 Views 1 Views Today