Data Quality Checks with StreamSets using Drift Rules

Overview

In the world of big data, data drift (unexpected changes in the structure or semantics of incoming data) has emerged as a critical technical challenge for data scientists and engineers trying to unleash the power of data. It delays businesses from gaining real-time, actionable insights and making informed decisions.

StreamSets is used not only for big data ingestion but also for analyzing streaming data in real time. It can identify null or bad records in the source data and filter them out so that downstream results are accurate, which helps businesses make quick and well-informed decisions.

In this blog, let us discuss how to check data quality using data rules and data drift rules in StreamSets.

Prerequisites

  • Install Java 1.8
  • Install streamsets-datacollector-2.6.0.1

Use Case

Create a dataflow pipeline to check the quality of the source data and load it into HDFS using StreamSets.

Data Description

Network data from outdoor field sensors is used as the source file. Additional fields, dummy data, empty data, and duplicate data were added to it.

The dataset has a total record count of 600K.

Sample data

{"ambient_temperature":"16.70","datetime":"Wed Aug 30 18:42:45 IST 2017","humidity":"76.4517","lat":36.17,"lng":-119.7462,"photo_sensor":"1003.3","radiation_level":"201","sensor_id":"c6698873b4f14b995c9e66ad0d8f29e3","sensor_name":"California","sensor_uuid":"probe-2a2515fc","timestamp":1504098765}

Synopsis

  • Read data from local file system
  • Configure data drift rules and alerts
  • Convert data types
  • Configure data rules and alerts
  • Derive fields
  • Load data into HDFS
  • Get alerts during data quality checks
  • Visualize data in motion

Reading Data from Local File System

To read data from the local file system, perform the following:

  • Create a new pipeline.
  • Configure “Directory” origin to read files from a directory.
  • Set “Batch Size (recs)” to 1 to read records one by one, which makes the data easier to analyze and the results easier to verify.
  • Set “Data Format” as JSON.
  • Select “JSON content” as Multiple JSON objects.
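
With “JSON content” set to “Multiple JSON objects”, the origin expects one JSON object per line in each source file. For instance, a file could contain records like the following (these two shortened records are illustrative, not taken from the actual dataset):

{"ambient_temperature":"16.70","humidity":"76.4517","photo_sensor":"1003.3","sensor_name":"California"}
{"ambient_temperature":"17.20","humidity":"80.1033","photo_sensor":"998.7","sensor_name":"California"}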

[Image: Reading data from the local file system]

Configuring Data Drift Rules and Alerts

To configure data drift rules and alerts, perform the following:

  • Gather details about data drift as data passes between two stages.
  • Provide meters and alerts based on those details.
  • Create data drift rules to indicate changes in the data structure.
  • Click “Add” to add the conditions on the links between the stages.

A few of the conditions applied are:

    • Alerts when field names vary between two subsequent JSON records.
      Function: drift:names(<field path>, <ignore when missing>)
      For example: ${drift:names('/', false)}
    • Alerts when the number of fields varies between two subsequent JSON records.
      Function: drift:size(<field path>, <ignore when missing>)
      For example: ${drift:size('/', false)}
    • Alerts when the data type of the specified field changes (for example, Double to String or String to Integer); with <ignore when missing> set to false, a missing field also raises the alert.
      Function: drift:type(<field path>, <ignore when missing>)
      For example: ${drift:type('/photo_sensor', false)}
    • Alerts when the order of fields varies between two subsequent JSON records.
      Function: drift:order(<field path>, <ignore when missing>)
      For example: ${drift:order('/', false)}
    • Alerts when a string field is empty.
      For example: ${record:value('/photo_sensor') == ""}
  • Click “Activate” to activate all the rules.
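
To see how these rules behave, consider two consecutive records (illustrative, not from the dataset):

{"humidity":"76.4517","photo_sensor":"1003.3"}
{"humidity":"76.9821","light_sensor":"1001.1","radiation_level":"204"}

The second record renames photo_sensor to light_sensor and adds radiation_level, so both ${drift:names('/', false)} and ${drift:size('/', false)} evaluate to true and raise their alerts.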

[Image: Configuring data drift rules and alerts]

Converting Data Types

To analyze the data and apply data rules, convert fields with a String data type into a numeric type such as Double or Integer.
For example, convert the String value of “humidity” in the source data ("humidity":"76.4517") into a Double ("humidity":76.4517).
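
For instance, a processor such as the “Field Type Converter” can apply this conversion, changing the record as follows (illustrative values):

Before: {"humidity":"76.4517"}
After: {"humidity":76.4517}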

[Image: Converting data types]

Configuring Data Rules and Alerts

To configure data rules and alerts, perform the following:

  • Click “Add” to add the data rule conditions on the links between the stages, just as with data drift rules.
  • Apply data rules to attributes.
    For example, to raise an alert when humidity falls outside its expected range: ${record:value('/humidity') < 66.2353 or record:value('/humidity') > 92.4165}
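
With this condition active, a record such as {"humidity":95.2} or {"humidity":60.1} raises the alert, while an in-range record such as {"humidity":76.4517} passes through without triggering it.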

[Image: Configuring data rules and alerts]

Deriving Fields

To derive a new field, use the “Expression Evaluator” processor and add an expression language statement in “Field Expression”.

For example, if the derived field is “/prediction”, the expression used is shown below:

[Image: Deriving fields]
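
The exact expression from the post appears in the screenshot above. As a sketch, “/prediction” could be derived from the converted humidity value with a conditional expression like this (the threshold and labels here are hypothetical, not taken from the original pipeline):

${record:value('/humidity') > 92.4165 ? 'ABNORMAL' : 'NORMAL'}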

Use the “Stream Selector” processor to split records based on different conditions.

[Image: Splitting records with Stream Selector]
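
As an illustration, the Stream Selector could route records using conditions like the following (the “/prediction” labels continue the hypothetical sketch above; records matching no condition go to the default stream):

Condition 1: ${record:value('/prediction') == 'ABNORMAL'}
Default stream: all remaining records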

Loading Data into HDFS

To load data into HDFS, perform the following:

  • Configure “Hadoop FS” destination processor.
  • Select data format as “JSON”.

Note: The hadoop-conf directory (/var/lib/sdc-resources/hadoop-conf) contains the core-site.xml and hdfs-site.xml files. The sdc-resources directory is created when StreamSets is installed.
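
As a reference, a minimal core-site.xml in that directory only needs to point at the NameNode; the host and port below are placeholders for an actual cluster:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>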

[Image: Loading data into HDFS]

Getting Alerts During Data Quality Checks

Alerts While Data Is in Motion

[Image: Alerts while data is in motion]

Alert Summary on Detecting Data Anomalies

[Image: Alert summary on detecting data anomalies]

Visualizing Data in Motion

Record Summary Statistics

[Image: Record summary statistics]

Record Count In/Out Statistics

[Image: Record count in/out statistics]
