Table of Contents
- 1 Overview
- 2 Pre-requisites
- 3 Use Case
- 4 Job Description
- 5 Creating DAG Folder and Restarting Airflow Webserver
- 6 Scheduling Jobs
- 7 Monitoring Jobs
- 8 Data Profiling
- 9 Email Notification
- 10 Conclusion
- 11 References
Airflow is an open-source platform for programmatically orchestrating workflows as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes workflows and data processing pipelines on a defined schedule. The Airflow user interface makes it easy to visualize pipelines running in production, monitor the progress of workflows, and troubleshoot issues when needed. Rich command line utilities make it easy to perform complex surgeries on DAGs.
In this blog, let us discuss scheduling and executing Talend jobs with Airflow.
- Airflow 1.7 or above
For more details about Airflow installation, refer to https://airflow.incubator.apache.org/installation.html
- Python 2.7
- Talend Open Studio (Big Data or Data Integration)
Schedule and execute Talend ETL jobs with Airflow.
- Author Talend jobs
- Schedule Talend jobs
- Monitor workflows in Web UI
Talend ETL jobs are created by:
- Joining the applicant_loan_info and loan_info tables on application_id as shown in the below diagram:
- Loading matched data into loan_application_analysis table.
- Applying a filter on LoanDecisionType field in loan_application_analysis table to segregate values as Approved, Denied, and Withdrawn as shown in the below diagram:
- Applying another filter on the above segregated values to segregate LoanType as Personal, Auto, Credit, and Home.
The created Talend job is built and moved to the server location. A DAG file named Loan_Application_Analysis.py is created with the corresponding paths of the scripts, so that the flow can be executed as and when required.
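Such a DAG file can be sketched roughly as follows. This is a minimal, illustrative sketch, not the original post's code: the launcher script path, schedule, and email address below are assumptions you would replace with your own (a Talend build produces a shell launcher script for each job).

```python
# Loan_Application_Analysis.py - minimal DAG sketch (assumes Airflow 1.7+).
# The Talend launcher path and email address are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2017, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'email': ['analyst@example.com'],   # illustrative address
    'email_on_failure': True,
}

dag = DAG(
    dag_id='Loan_Application_Analysis',
    default_args=default_args,
    schedule_interval='@daily',         # run once a day
)

# BashOperator invokes the shell launcher produced by the Talend build.
# The trailing space matters: in Airflow 1.x, a bash_command ending in
# ".sh" is otherwise treated as a Jinja template file to be loaded.
run_talend_job = BashOperator(
    task_id='run_loan_application_analysis',
    bash_command='/home/ubuntu/talend_jobs/loan_application_analysis_run.sh ',
    dag=dag,
)
```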
Creating DAG Folder and Restarting Airflow Webserver
After installing Airflow, perform the following:
- Create a DAG folder (/home/ubuntu/airflow/dags) in the Airflow path.
- Move all the .py files into the DAG folder.
- Restart the Airflow webserver and scheduler using the below commands to view this DAG in the UI list:

```shell
# Log in to the AIRFLOW_HOME path, e.g. /home/ubuntu/airflow
cd /home/ubuntu/airflow

# Restart the webserver
airflow webserver

# Restart the scheduler
airflow scheduler
```
The created Talend jobs can be scheduled using the Airflow scheduler. For the code, see the References section.
Note: The job can be manually triggered by clicking Run button under Links column as shown below:
Both the auto scheduled and manually triggered jobs can be viewed in the UI as follows:
On executing the jobs, upstream or downstream processes will be started as created in the DAG.
On clicking a particular DAG, the corresponding status of the job, such as success, failure, retry, or queued, can be visualized in different ways in the UI.
The statuses of the jobs are represented in a graphical format as shown below:
The statuses of the jobs along with their execution dates are represented in a tree format as shown below:
The statuses of the jobs along with their execution dates are represented in a Gantt chart as shown below:
Viewing Task Duration
On clicking the Task Duration tab, you can view the task duration of the whole process or of individual DAGs in a graphical format as shown below:
Viewing Task Instances
On clicking Browse –> Task Instances, you can view the instances on which the tasks are running as shown below:
On clicking Browse –> Jobs, you can view the details such as start time, end time, executors, and so on of the jobs as shown in the below diagram:
On clicking Browse –> View Log, you can view the details of the logs as shown in the below diagram:
Airflow provides a simple SQL query interface to query the data and a chart UI to visualize the tasks.
To profile your data, click Admin –> Connections to select the database connection type as shown in the below diagram:
Ad Hoc Query
To write and query the data, click Data Profiling –> Ad Hoc Query.
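An ad hoc query runs directly against Airflow's metadata database. As an illustrative sketch (the table and column names follow Airflow's metadata schema; the DAG id matches the one created earlier), a query summarizing task states might look like:

```sql
-- Count task instances per state for the Loan_Application_Analysis DAG
SELECT state, COUNT(*) AS num_tasks
FROM task_instance
WHERE dag_id = 'Loan_Application_Analysis'
GROUP BY state;
```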
Different types of visualizations can be created for task duration, task status, and so on using charts.
To generate charts such as bar, line, area, and so on for a particular DAG using SQL query, click Data Profiling –> Charts –> DAG_id as shown in the below diagram:
All the DAGs are graphically represented as shown in the below diagram:
Email notification can be set to report job status using options such as email_on_failure and email_on_retry.
To enable the notification, perform the following:
- Configure settings in airflow.cfg file in airflow_home path as shown below:
- In your Gmail account, turn on Settings –> Allow less secure apps to receive email alerts from Airflow.
Note: You may get an authentication error if the email settings are not configured properly. To overcome this, confirm the login attempt in Gmail's device review by selecting "Yes, that was me".
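The relevant airflow.cfg settings look roughly like the fragment below; the host and port match Gmail's SMTP service, while the account, password, and sender address are placeholders you would replace with your own:

```ini
[email]
email_backend = airflow.utils.email.send_email_smtp

[smtp]
smtp_host = smtp.gmail.com
smtp_starttls = True
smtp_ssl = False
smtp_user = your_account@gmail.com
smtp_password = your_password
smtp_port = 587
smtp_mail_from = your_account@gmail.com
```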
A job failure email is shown below:
On clicking the Log link in the email, you are redirected to the Logs page.
In this blog, we discussed authoring, scheduling, and monitoring workflows from the web UI, and triggering Talend jobs on demand directly from the web UI using the BashOperator. You can also transfer data from one database to another using the generic_transfer operator.
Hooks can be used to connect to external systems such as MySQL, Hive, S3, Oracle, Pig, and Redshift. Airflow also provides many other operators, such as docker_operator, hive_operator, hive_to_samba_operator, http_operator, jdbc_operator, mssql_to_hive, pig_operator, postgres_operator, presto_to_mysql, redshift_to_s3_operator, s3_file_transform_operator, and s3_to_hive_operator.
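As a small illustration of a hook, a MySQL connection defined under Admin –> Connections can be queried from inside a task. This is a hedged sketch, not code from the original post: the connection id mysql_default and the query are assumptions, and it requires an Airflow install with the MySQL extras.

```python
# Minimal hook sketch: read rows from MySQL inside a task callable.
# Assumes a connection named 'mysql_default' exists in Admin -> Connections.
from airflow.hooks.mysql_hook import MySqlHook

def fetch_loan_decisions():
    hook = MySqlHook(mysql_conn_id='mysql_default')
    rows = hook.get_records(
        "SELECT LoanDecisionType, COUNT(*) "
        "FROM loan_application_analysis GROUP BY LoanDecisionType"
    )
    for decision, count in rows:
        print(decision, count)
```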
- Apache Airflow (incubating) Documentation: https://airflow.incubator.apache.org/
- Code in GitHub: https://github.com/treselle-systems/airflow_to_manage_talend_etl_jobs