Table of Contents
- 1 Overview
- 2 About Apache Falcon and Apache Atlas
- 3 Pre-requisites
- 4 Use Case
- 5 Defining and Processing Data Pipeline using Apache Falcon
- 6 Preparing Input Hive Tables
- 7 Defining Feed
- 8 Defining Process
- 9 Executing Data Pipeline
- 10 Monitoring Feed and Processing in Oozie
- 11 Visualizing Data Pipeline in Falcon
- 12 Lineage in Atlas
- 13 Conclusion
- 14 References
Overview
In this blog article, Apache Falcon is used to centrally define data pipelines; these definitions are then used to auto-generate workflows in Apache Oozie.
Because Apache Falcon dataflows are synced with Apache Atlas through Kafka topics, Atlas can manage Falcon metadata. Atlas provides Falcon feed lineage and shows the details of each output table and its source tables.
About Apache Falcon and Apache Atlas
Apache Falcon is a framework used to simplify data pipeline processing and management on Hadoop clusters. It is more likely a scheduling and execution engine for HDP components like Hive, Spark, HDFS DistCp, and Sqoop to move data around and/or process data along the way. It is a much improved Oozie and supports late data handling and retry policies.
Apache Atlas facilitates easy exchange of metadata to share a common metadata store using HDP components like Hive metastore, Kafka topics, Falcon repo, HBase table, and so on. This single view on metadata provides powerful searching capabilities on top of full text search (based on Apache Solr).
Pre-requisites
- In Falcon, ensure that the Falcon cluster entities are created on HDFS, as these clusters are used to execute feeds and processes.
- In Falcon, the Hive tables must be partitioned, as Hive feeds support only partitioned tables.
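A Falcon cluster entity is itself defined in XML and points at the HDFS, YARN, Oozie, messaging, and Hive metastore endpoints, plus staging and working directories that must exist on HDFS. A minimal sketch, in which the cluster name, all endpoint hosts, ports, and versions are illustrative assumptions for a typical HDP setup:

```xml
<cluster name="primaryCluster" description="Primary HDP cluster" colo="local"
         xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- all hostnames, ports, and versions below are placeholders -->
    <interface type="readonly"  endpoint="hftp://namenode:50070"            version="2.7.3"/>
    <interface type="write"     endpoint="hdfs://namenode:8020"             version="2.7.3"/>
    <interface type="execute"   endpoint="resourcemanager:8050"             version="2.7.3"/>
    <interface type="workflow"  endpoint="http://oozie:11000/oozie/"        version="4.2.0"/>
    <interface type="messaging" endpoint="tcp://activemq:61616?daemon=true" version="5.1.6"/>
    <!-- registry interface (Hive metastore) is required for Hive table feeds -->
    <interface type="registry"  endpoint="thrift://hivemetastore:9083"      version="1.2.1"/>
  </interfaces>
  <locations>
    <!-- these HDFS directories must be created before submitting the entity -->
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
</cluster>
```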
Use Case
Loan application and account data are used to populate the loan application transactional table. For more details, refer to our blog “Hive Streaming with Kafka and Storm with Atlas”.
The “cu_loan_application_raw” partitioned table is created from the transactional table. You can also create a temporary table and load the data into the partitioned table, as was done for the “cu_account_raw” table.
The Falcon data pipeline is used to:
- Define the Hive input and HDFS output feeds.
- Define a Pig execution engine process to transform the loan application and account mock data.
- Define a Hive execution engine process to create a transient table from the output of the Pig transformation.
- Execute the data pipeline to ingest, transform, and store data into HDFS and create the Hive table.
Atlas is used to track the data pipeline feed lineage and to understand how data is populated into the output Hive table.
Defining and Processing Data Pipeline using Apache Falcon
The Falcon data pipeline can be created using feed and process. While defining the feed, provide properties such as frequency, time zone, retention, late arrival, and others required for the feed.
While defining the process, provide properties such as frequency, time zone, input and output feeds, workflow engine, retry policy, and Access Control List (ACL).
Preparing Input Hive Tables
Create both the “cu_loan_application_raw” and “cu_account_raw” tables and populate them with data.
Note: Both tables are partitioned by insert_date, which is in YYYY-MM-DD format.
The datasets used in this use case, the Falcon feed and process XML files, and the Hive DDL statements are available on GitHub. You can find the GitHub location in the References section at the end of the article.
Defining Feed
In this use case, two Hive input feeds and one HDFS output feed are used. These feeds are used in the Falcon processes as the input and output for the execution engines.
The feeds are created with the following names and are tagged as “LoanApplication”:
- LoanApplicationRaw – loan application table input feed.
- AccountRaw – account table input feed.
- LoanApplicationTransient – HDFS output Feed.
The Falcon UI for the LoanApplicationRaw feed is as follows:
LoanApplicationRaw Feed XML Definition:
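The actual definition is available in the GitHub repository listed in the References section. A minimal sketch of what such a Hive table feed looks like — the cluster name, database, validity window, and retention limit below are illustrative assumptions — is:

```xml
<feed name="LoanApplicationRaw" description="Loan application table input feed"
      xmlns="uri:falcon:feed:0.1">
  <tags>group=LoanApplication</tags>
  <frequency>minutes(10)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <!-- "primaryCluster" is an assumed Falcon cluster entity name -->
    <cluster name="primaryCluster" type="source">
      <validity start="2017-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <!-- Hive feeds reference a partitioned table via a catalog URI;
       the partition expression after '#' selects the insert_date partition -->
  <table uri="catalog:default:cu_loan_application_raw#insert_date=${YEAR}-${MONTH}-${DAY}"/>
  <ACL owner="falcon" group="users" permission="0755"/>
  <schema location="hcat" provider="hcat"/>
</feed>
```

The HDFS output feed is defined similarly, except that it uses `<locations>` with an HDFS path instead of the `<table>` element.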
Note: The remaining feeds used in this use case are created using the Falcon web UI as described above.
Defining Process
In a Falcon process, the input and output are taken as feeds to ingest and store the data as per the feed definitions. The process invokes the execution engine and passes parameters to it as per the feed and process definitions.
The processes are created with the following names and are tagged as “LoanApplication”:
- LoanApplicationPigProcess – takes both the “LoanApplicationRaw” and “AccountRaw” feeds as inputs to ingest the data, invokes the Pig script to perform the transformation, and stores the output into HDFS as per the “LoanApplicationTransient” feed.
- LoanApplicationHiveProcess – takes “LoanApplicationTransient”, the output of the “LoanApplicationPigProcess” process, as input and invokes the Hive script to create a table whose data location is the HDFS output feed path.
The Falcon UI for LoanApplicationPigProcess is as follows:
LoanApplicationPigProcess XML Definition:
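The actual definition is available in the GitHub repository listed in the References section. A minimal sketch of such a process definition — the cluster name, validity window, input instance range, Pig script path, and retry policy are illustrative assumptions — is:

```xml
<process name="LoanApplicationPigProcess" xmlns="uri:falcon:process:0.1">
  <tags>group=LoanApplication</tags>
  <clusters>
    <!-- "primaryCluster" is an assumed Falcon cluster entity name -->
    <cluster name="primaryCluster">
      <validity start="2017-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>minutes(10)</frequency>
  <timezone>UTC</timezone>
  <inputs>
    <!-- both input feeds are passed to the Pig script as parameters -->
    <input name="loan_input"    feed="LoanApplicationRaw" start="today(0,0)" end="now(0,0)"/>
    <input name="account_input" feed="AccountRaw"         start="today(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="LoanApplicationTransient" instance="now(0,0)"/>
  </outputs>
  <!-- the Pig execution engine; the script path is a placeholder -->
  <workflow engine="pig" path="/apps/falcon/scripts/loan_transform.pig"/>
  <retry policy="periodic" delay="minutes(5)" attempts="3"/>
  <ACL owner="falcon" group="users" permission="0755"/>
</process>
```

LoanApplicationHiveProcess follows the same structure, with `engine="hive"` in the workflow element and the Hive script path instead of the Pig script.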
Note: LoanApplicationHiveProcess is created using the Falcon web UI as described above.
Using the tags applied while creating the feeds and processes, you can search for and filter those feeds and processes in the Falcon dashboard.
Executing Data Pipeline
In Falcon, feeds and processes can be executed independently, as there is no required execution order. You can use the “Schedule” option in the Falcon UI to execute all the feeds and processes.
Monitoring Feed and Processing in Oozie
All the feeds and processes executed in Falcon are listed in the Oozie dashboard.
Drill down into the concerned feed or process to find its status in Oozie.
Successful execution of the Pig process is shown in the below diagram:
Visualizing Data Pipeline in Falcon
The “LoanApplicationRaw” and “AccountRaw” feeds are used as inputs for LoanApplicationPigProcess, which produces “LoanApplicationTransient” as its output, as shown in the diagram below.
As both the feed and process frequencies are defined as 10 minutes, each feed and process is executed every 10 minutes.
The “LoanApplicationPigProcess” is as follows:
Lineage in Atlas
Loan Application Transient Feed Lineage – The “LoanApplicationTransient” feed is defined with an HDFS path and is the output of “LoanApplicationPigProcess”. The data lineage of the “LoanApplicationTransient” feed is as follows:
- The “cu_account_raw” Hive table is defined in the “AccountRaw” feed.
- The “cu_loan_application_raw” Hive table is defined in the “LoanApplicationRaw” feed.
- “LoanApplicationPigProcess” invokes the Pig script to perform the transformation and stores the final output into the HDFS path defined in “LoanApplicationTransient”.
Loan Application Raw Feed Lineage – The “cu_loan_application_raw” Hive table is defined in the “LoanApplicationRaw” feed, which is used as an input for “LoanApplicationPigProcess”. The output of “LoanApplicationPigProcess” is “LoanApplicationTransient”.
As per the “LoanApplicationTransient” feed, the HDFS output contains the transformed data generated by the Pig script defined in “LoanApplicationPigProcess”.
The final output table is created by “LoanApplicationHiveProcess”, and its data is populated from the transformed data in the “LoanApplicationTransient” feed's HDFS path.
References
- The datasets used in this use case, the Falcon feed and process XML files, and the Hive DDL statements are available at the GitHub location: https://github.com/treselle-systems/falcon_data_pipeline
- Difference between Apache Atlas and Apache Falcon:
- Introduction to Apache Falcon:
- About Apache Atlas: