Data Quality Metrics using Talend Data Quality Management

Data Quality Metrics using Talend Data Quality Management

Overview

Data Quality is the process of examining data in different data sources according to predefined business goals. It helps to improve the quality of the data and collect statistics and information about the data. It helps business users in making more informed decisions with the quality data. In this blog, let us discuss Data Quality Statistics (DQS) using Talend Data Quality Management (DQM).

Pre-requisites

Download and install Talend data quality tool from the following link:
https://www.talend.com/products/data-quality/

Data Description

Loan applicant dataset, with basic applicant details such as applicant ID, gender, age, marital status, and so on, is used as the source data.

Sample Data Source in MySQL

sample-data-source-in-mysql

Use Case

Perform column and table level quality statistics on the input data source.

Synopsis

  • Connect data source with Talend DQM
  • Create analysis and data quality statistics
    • Simple statistics
    • Pattern matching statistics
    • Text statistics
    • Pattern frequency statistics
  • Apply static rules
  • Perform Correlation Analysis
  • Identify data duplicates using Match Analysis

Connecting Data Source with Talend DQM

To connect Talend DQM with the database, perform the following:

  • Open Talend Open Studio for Data Quality.
  • In the left panel, click Metadata –> DB connections –> Create DB Connection to create a database connection to import the source data from the database for collecting statistics.

select

  • Provide the required credentials to create metadata for MySQL DB connection as shown in the below diagram:

select

Creating Analysis and Data Quality Statistics

On successfully connecting Talend with MySQL, perform analysis on the following levels:

  • Column
  • Table

To collect the data quality statistics with the applicant dataset on column level, perform the following:

  • Create analysis as shown in the below diagram:

select

  • Select columns for performing the data quality statistics as shown in the below diagram:

select

  • Select quality indicator for the selected columns to run analysis and view the analysis results.

select

The quality statistics based on the above-selected indicators in Talend DQM are:

  • Simple Statistics
  • Pattern Matching Statistics
  • Text Statistics
  • Pattern Frequency Statistics

Simple Statistics

The simple statistics on the applicant ID column, with Row Count, Null Count, Distinct Count, Unique Count, Duplicate Count, and Blank Count, is shown in the below diagram:

simple-statistics

Pattern Matching Statistics

The pattern matching statistics is used to analyze the format of several types of data such as date in different formats, phone number patterns in different countries, zip codes, and so on. It provides both matching and non-matching patterns. Matched and unmatched patterns of the phone numbers with the countries are shown in the below diagram:

pattern-matching-statistics

Matched and unmatched patterns of the applicant last name starting with uppercase are shown in the below diagram:

pattern-matching-statistics1

Matched and unmatched patterns of the US state codes in the applicant data are shown in the below diagram:

pattern-matching-statistics2

Matched and unmatched patterns of the date matching with the date of birth of the applicant are shown in the below diagram:

pattern-matching-statistics3

Text Statistics

The text statistics is used to check the data with default length such as phone number (for example, 10 digits in India). The text statistics on the applicants’ phone number is shown in the below diagram:

text-statistics

Pattern Frequency Statistics

The pattern frequency statistics is used to check the pattern formats of the data source. The patterns of the phone numbers are shown in the below diagram:

pattern-frequency-statistics

Applying Static Rules

Business Rule statistics, called as table level statistics, is used to apply static rules and predefined business rules in the table columns.

Few static rules created are:

  • Approved loan amount should not be greater than requested loan amount.
  • Age column record is valid based on date of birth column in table.
  • Gender column should have valid data like Male or Female.

The business rule analysis performed using the above static rules is shown in the below diagram:

applying-static-rules

Performing Correlation Analysis

The correlation analysis is used to explore the relationships and correlations in the data. It is used to highlight weak relationships between the data to find potential incorrect relationships. The correlation analysis between the cities and the states is shown in the below diagram:

select

Identifying Data Duplicates Using Match Analysis

The match analysis is used to assess the number of duplicates in the data. It estimates the number of groups of similar data on a table-set or a column-set basis. The match analysis, with column sets such as state and gender, is shown in the below diagrams:

match-analysis

match-analysis1

References

1379 Views 1 Views Today