Table of Content
- 1 Overview
- 2 Pre-requisites
- 3 Data Description
- 4 Use Case
- 5 Connecting Data Source with Talend DQM
- 6 Creating Analysis and Data Quality Statistics
- 7 Applying Static Rules
- 8 Performing Correlation Analysis
- 9 Identifying Data Duplicates Using Match Analysis
- 10 References
Data Quality is the process of examining data in different data sources according to predefined business goals. It helps to improve the quality of the data and collect statistics and information about the data. It helps business users in making more informed decisions with the quality data. In this blog, let us discuss Data Quality Statistics (DQS) using Talend Data Quality Management (DQM).
Download and install Talend data quality tool from the following link:
Loan applicant dataset, with basic applicant details such as applicant ID, gender, age, marital status, and so on, is used as the source data.
Sample Data Source in MySQL
Perform column and table level quality statistics on the input data source.
- Connect data source with Talend DQM
- Create analysis and data quality statistics
- Simple statistics
- Pattern matching statistics
- Text statistics
- Pattern frequency statistics
- Apply static rules
- Perform Correlation Analysis
- Identify data duplicates using Match Analysis
Connecting Data Source with Talend DQM
To connect Talend DQM with the database, perform the following:
- Open Talend Open Studio for Data Quality.
- In the left panel, click Metadata –> DB connections –> Create DB Connection to create a database connection to import the source data from the database for collecting statistics.
- Provide the required credentials to create metadata for MySQL DB connection as shown in the below diagram:
Creating Analysis and Data Quality Statistics
On successfully connecting Talend with MySQL, perform analysis on the following levels:
To collect the data quality statistics with the applicant dataset on column level, perform the following:
- Create analysis as shown in the below diagram:
- Select columns for performing the data quality statistics as shown in the below diagram:
- Select quality indicator for the selected columns to run analysis and view the analysis results.
The quality statistics based on the above-selected indicators in Talend DQM are:
- Simple Statistics
- Pattern Matching Statistics
- Text Statistics
- Pattern Frequency Statistics
The simple statistics on the applicant ID column, with Row Count, Null Count, Distinct Count, Unique Count, Duplicate Count, and Blank Count, is shown in the below diagram:
Pattern Matching Statistics
The pattern matching statistics is used to analyze the format of several types of data such as date in different formats, phone number patterns in different countries, zip codes, and so on. It provides both matching and non-matching patterns. Matched and unmatched patterns of the phone numbers with the countries are shown in the below diagram:
Matched and unmatched patterns of the applicant last name starting with uppercase are shown in the below diagram:
Matched and unmatched patterns of the US state codes in the applicant data are shown in the below diagram:
Matched and unmatched patterns of the date matching with the date of birth of the applicant are shown in the below diagram:
The text statistics is used to check the data with default length such as phone number (for example, 10 digits in India). The text statistics on the applicants’ phone number is shown in the below diagram:
Pattern Frequency Statistics
The pattern frequency statistics is used to check the pattern formats of the data source. The patterns of the phone numbers are shown in the below diagram:
Applying Static Rules
Business Rule statistics, called as table level statistics, is used to apply static rules and predefined business rules in the table columns.
Few static rules created are:
- Approved loan amount should not be greater than requested loan amount.
- Age column record is valid based on date of birth column in table.
- Gender column should have valid data like Male or Female.
The business rule analysis performed using the above static rules is shown in the below diagram:
Performing Correlation Analysis
The correlation analysis is used to explore the relationships and correlations in the data. It is used to highlight weak relationships between the data to find potential incorrect relationships. The correlation analysis between the cities and the states is shown in the below diagram:
Identifying Data Duplicates Using Match Analysis
The match analysis is used to assess the number of duplicates in the data. It estimates the number of groups of similar data on a table-set or a column-set basis. The match analysis, with column sets such as state and gender, is shown in the below diagrams: