Table of Content
In this blog, Pig is used to calculate and find loan application metrics using loan, application, and applicant mock data. This mock data is prepared with specific criteria in mind. The fields are included in each dataset as per the standard data model used in finance industry in United States (US).
Usually, applicants applying for loan belong to different states. The overall summary of loan application metrics by state will help the decision makers to find out performance (increase or decrease) of the branches by state. Few interesting calculations like average Depts. to Income (DTI), weighted average interest rate, and others by state were used.
About Apache Pig
Apache Pig is a platform used for analyzing large datasets consisting of a high-level language for expressing data analysis programs. It is used for performing Extract, Transform, and Load (ETL) operations. It is an interactive or script-based, execution environment supporting Pig Latin, a language used to express data flows. The Pig Latin language supports loading and processing of input data with a series of operators that transform the input data and produce the desired output.
To calculate and find loan application metrics by state using Pig, perform the following:
- Setup Hortonworks (HDP) cluster with required services (HDFS, YARN, Pig, Hive, Tez, and others)
- Create loan, application, and applicant Hive partition tables
- Load data into those partition tables
- Create loan metrics by state output partition table
Note: Datasets used in this use case, Hive DDL statements, and Pig scripts are available in GitHub. Please find GitHub location in the Reference section at the end of the article.
In this use case, Pig is used to calculate and find loan application metrics. Pig scripts are executed in Tez execution engine instead of old MapReduce. The following calculations are included in the Pig scripts and are grouped by states, loan decision type, and loan type.
- Average DTI of applicants
- Total loan approved amount
- Loan approved count
- Total requested amount by applicants
- Weighted average interest rate
- Average applicant annual income
- Average approved loan
Loan Application Metrics by State
Input Hive Tables
Partition tables are used as both input and output. The “loan_data” table structure of input table looks similar to the one below:
The “loan_data” table partition by “orgination_year” is derived from “orgination_date” field.
Note: Find the Reference section for Hive DDL statements GitHub location.
Extract, Transform, and Load (ETL) Using Pig
Pig script includes UDF functions and is used to invoke multiple Pig scripts. For simplicity, single Pig script is used to find loan application metrics.
Pig “Load” command and HCatLoader() are used to extract input data from partitioned Hive table.
The input data is stored in the variables for further processing.
In this phase, various operations such as filtering, joining datasets, grouping (by state, loan decision type, and loan type), and applying aggregate operations using Pig operators and derived calculations are performed. Filtering needed records will restrict the number of records to process and calculate. Two types of filters applied are:
- Filtered only “Approved” records from loan data
- Filtered only “loan” records from application data
Weighted average interest rate calculation is performed using the following steps:
- Multiply each loan amount by its interest rate to obtain the per loan weight factor
- Add the per loan weight factors together to obtain the total per loan weight factor
- Add the loan amounts together to obtain the total loan amount
- Divide the total per loan weight factor by the total loan amount. Multiply this by 100 to express it as a percentage and to obtain the actual interest rate.
Let us discuss them in detail. Multiply each loan amount by its interest rate to obtain the per loan weight factor.
DTI formula: (depts/income) * 100.
As required fields needed for calculations are in different variables, the loan, application, and applicant data and store are joined into a single variable.
The combined results are grouped by state, loan decision type, and loan type to provide metrics based on these combinations to gain more granularity. The aggregate operators used based on base value, the formula used for weighted rate, and average approved loan are explained in the below code:
The output result is stored into the resulted Hive table using Pig “Store” command and HCatStorer ().
Execute Pig Script
As HCatalog functions are used in this use case, the following argument should be supplied while executing Pig scripts. The output result is stored into the resulted Hive table using Pig “Store” command and HCatStorer ().
The Pig scripts are executed using Tez execution engine, which constructs Direct Acyclic Graph (DAG) execution plan to execute the script in an optimized manner.
Understanding Tez DAG
The Apache Tez project aims at building an application framework that allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN. As execution will be tracked and listed in the Tez UI, all the tasks are executed using “Tez”. The Tez UI provides inventory of all queries executed with search and filter options.
Apache Tez models data processing as a dataflow graph. The vertices in the graph represents processing of data and the edges represents movement of data between the processing. Thus, the user logic that analyses and modifies the data sits in the vertices. Edges determine the consumer of the data, how the data is transferred, and the dependency between the producer and consumer vertices.
Vertex Swimlane: It is a new feature in HDP 2.5. It helps to quickly pinpoint the long-running parts of the query and to optimize them.
Pig has an option to dump the processed results and to validate the results before storing it into Hive table.
Once the results are verified, the final output will be stored into the partitioned Hive table.
The weighted average combines the interest rates into a single interest rate for yielding a combined cost same as the cost of the original separate loans. Let us sort by weighted rate descending to find out the loan in the state that produces more interest.
- Datasets used in this use case, Hive DDL statements, and Pig scripts are available in the
GitHub location: https://github.com/treselle-systems/loan_application_metrics_by_state_using_pig
- Hive vs Pig
- Hadoop Pig and Pig Latin for Big Data
- About Tez
- Data Processing API in Apache Tez
- Weighted Interest Rate