Sensor Data Quality Management using PySpark & Seaborn

Overview

Data Quality Management (DQM) is the process of analyzing, defining, monitoring, and continuously improving the quality of data. A few data quality dimensions widely used by data practitioners are Accuracy, Completeness, Consistency, Timeliness, and Validity. Various DQM rules are configured to apply DQM to existing data. These rules clean up, repair, and standardize incoming data, and identify and correct invalid data.

In this blog, let us check the data for required values, validate data types, and detect integrity violations. DQM is applied to correct the data by providing default values, formatting numbers and dates, and removing missing values, null values, non-relevant values, duplicates, out-of-bounds values, referential integrity violations, and value integrity violations.

Prerequisites

Install the following Python packages:

  • PySpark
  • XGBoost
  • Pandas
  • Matplotlib
  • Seaborn
  • NumPy
  • sklearn

Data Description

Sensor data from the PubNub source is used as the source file.

  • Total Record Count: 6K
  • File Types: JSON and CSV
  • # of Columns: 11
  • # of Duplicate Records: 3.5K
  • # of NA Values:
    • Ambient Temperature: 3370
    • Humidity: 345
    • Sensor IDs: 12

Sample Dataset


Use Case

Perform data quality management on sensor data using PySpark, the Python API for Apache Spark.

Data Quality Management Process


Synopsis

  • Data Integrity
  • Data Profiling
  • Data Cleansing
  • Data Transformation

Data Integrity

Data integrity is the process of guaranteeing the quality of the data in the database.

  • Analyzed input sensor data with
    • 11 columns
    • 6K records
  • Validated source metadata
  • Populated relationships for an entity

Data Profiling

Data profiling is the process of examining and analyzing enterprise metadata to discover patterns, entity relationships, data structures, and business rules. It provides statistics or informative summaries of the data to assess data issues and quality.

A few data profiling analyses include:

  • Completeness Analysis – Analyze frequency of attribute population versus blank or null values.
  • Uniqueness Analysis – Analyze and find unique or distinct values and duplicate values for a given attribute across all records.
  • Values Distribution Analysis – Analyze and find the distribution of records across different values of a given attribute.
  • Range Analysis – Analyze and find minimum, maximum, median, and average values of a given attribute.
  • Pattern Analysis – Analyze and find character patterns and pattern frequency.

Generating Profile Reports

To generate profile reports, use either Pandas profiling or PySpark data profiling with the below commands:

Pandas Profiling
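
A minimal sketch of report generation with the pandas-profiling package; the input file name is an assumption:

    # generate an HTML profile report from a Pandas DataFrame
    import pandas as pd
    import pandas_profiling

    df = pd.read_csv("sensor_data.csv")  # hypothetical source file
    profile = pandas_profiling.ProfileReport(df)
    profile.to_file("pandas_profile_report.html")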

PySpark Profiling
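
A sketch using the spark-df-profiling package, which mirrors the pandas-profiling API for Spark DataFrames; the file and app names are assumptions:

    # generate an HTML profile report from a Spark DataFrame
    import spark_df_profiling
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SensorDQM").getOrCreate()
    sensor_df = spark.read.json("sensor_data.json")  # hypothetical source file
    report = spark_df_profiling.ProfileReport(sensor_df)
    report.to_file("spark_profile_report.html")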

The profile report provides the following details:

  • Essentials – type, unique values, missing values
  • Quantile Statistics – minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive Statistics – mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram

Profile Report Overview


The sample profile report for a single attribute (ambient temperature) is as follows:

Ambient Temperature – Statistics


Ambient Temperature – Histogram


Ambient Temperature – Extreme Values


To view the complete profile report, see the Reference section.

Data Cleansing

Data cleansing is the process of identifying incomplete, incorrect, inaccurate, duplicate, or irrelevant data and modifying, replacing, or deleting the dirty data.


  • Analyzed the number of null (NaN) values in the dataset using the below command:
    df.isnull().sum()

The per-column null counts correspond to the NA values listed in the Data Description section (ambient temperature: 3370, humidity: 345, sensor IDs: 12).

  • Deleted NaN values in String type columns using the command sketched after this list.
  • Imputed missing values using one of the below methods:
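
A minimal sketch of the deletion step, assuming the string-typed columns are sensor_id and sensor_name (hypothetical names):

    # drop records whose string-typed columns contain NaN values
    df = df.dropna(subset=["sensor_id", "sensor_name"])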

Method 1 – Impute package

Imputation is defined as the process of replacing the missing data with substituted values using any of the following options:

  • most_frequent: Columns of dtype object (string) are imputed with the most frequent value in the column, as the mean or median cannot be computed for this data type.
  • Mean: The ratio of the sum of the elements to the number of elements in the list.
  • Median: The middle value of the sorted list, or the average of the two middle values when the count is even.

Note: If the missing values in the records are negligible, ignore those records.

In our use case, the most_frequent strategy is used to substitute the missing values using the below command:
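
A sketch using scikit-learn's imputer; the original snippet is not shown in this extract, so treat the details as assumptions (newer scikit-learn releases expose this as SimpleImputer):

    # impute missing values with the most frequent value in each column
    import numpy as np
    from sklearn.impute import SimpleImputer

    imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
    df[["ambient_temperature", "humidity"]] = imputer.fit_transform(
        df[["ambient_temperature", "humidity"]]
    )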

Method 2 – Linear Regression model

To replace the missing data with substituted values using a Linear Regression model, use the below commands:
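
A sketch of regression-based imputation, assuming the predictor columns are already numeric and free of missing values; the feature list is hypothetical:

    # predict missing ambient_temperature values from other sensor readings
    from sklearn.linear_model import LinearRegression

    features = ["humidity", "dewpoint", "photosensor"]  # hypothetical predictors
    known = df[df["ambient_temperature"].notnull()]
    missing = df[df["ambient_temperature"].isnull()]

    model = LinearRegression()
    model.fit(known[features], known["ambient_temperature"])
    df.loc[df["ambient_temperature"].isnull(), "ambient_temperature"] = \
        model.predict(missing[features])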

Data Transformation

Data transformation deals with converting data from the source format into the required destination format.


  • Converted attributes such as ambient_temperature and humidity from object type to float type using the below commands.
  • Converted the non-numeric sensor_name and sensor_id values into numeric data using the below commands.
  • Based on the above transformations, found feature importance using XGBoost's built-in function, as sketched below.
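
A sketch of these transformation steps; the column names follow the attributes mentioned in this post and should be treated as assumptions:

    # convert string-typed numeric attributes to float
    import matplotlib.pyplot as plt
    import xgboost as xgb
    from sklearn.preprocessing import LabelEncoder

    df["ambient_temperature"] = df["ambient_temperature"].astype(float)
    df["humidity"] = df["humidity"].astype(float)

    # encode the non-numeric identifiers as numeric codes
    encoder = LabelEncoder()
    df["sensor_name"] = encoder.fit_transform(df["sensor_name"].astype(str))
    df["sensor_id"] = encoder.fit_transform(df["sensor_id"].astype(str))

    # rank feature importance with an XGBoost regressor, assuming all
    # remaining columns are numeric after the conversions above
    features = [c for c in df.columns if c != "ambient_temperature"]
    model = xgb.XGBRegressor()
    model.fit(df[features], df["ambient_temperature"])
    xgb.plot_importance(model)
    plt.show()
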
Feature Importance Chart


From the feature importance chart, it is evident that the photosensor feature has the highest importance and the lat (latitude) feature has the lowest importance.

Correlation Analysis

Performed correlation analysis to explore data relationships, highlight weak relationships, and find potentially incorrect ones. The correlation analysis between the sensor data variables is shown in the below diagram:
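
A minimal sketch of how such a correlation heatmap can be produced with Pandas and Seaborn:

    # plot pairwise correlations between the numeric sensor attributes
    import matplotlib.pyplot as plt
    import seaborn as sns

    corr = df.corr()
    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Sensor Data Correlation Analysis")
    plt.show()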


From the correlation analysis, it is evident that ambient_temperature is highly correlated with dewpoint and humidity, while latitude & longitude are negatively correlated.

Reference
