Data Normalization and Filtration Using Drools

Overview

Drools is a business rule engine used to implement an expert system through a rule-based approach. Here it is used to convert both structured and unstructured data into transient data by applying the normalization and filtering business logic in DRL (Drools Rule Language) files.

In this blog, let us discuss normalizing and filtering data using Drools.

Pre-requisite

Download and install the following:

  • Java (JDK)
  • Drools (KIE libraries)
  • MS SQL Server

Use Case

Oil Well Drilling datasets from two US states, Arkansas (AR) and Oklahoma (OK), are taken as the input data and are processed based on API numbers. The processing:

  • Filters out invalid drill type records
  • Removes null date values
  • Normalizes API numbers to the correct number of digits
  • Formats dates to the required format
  • Removes duplicate well information by keeping only the maximum modified date value

Data Description

The Oil Well Drilling datasets contain raw information about wells and their formation details, drill types, and production dates. The Arkansas dataset has 6,040 records and the Oklahoma dataset has 2,559 records.

The raw data contains invalid values such as nulls, invalid dates, invalid drill types, and duplicate or invalid well information with modified dates. This raw data is loaded from the source into MS SQL for further filtering and normalization.

Arkansas Dataset

Null values for date_of_1st_prod

Invalid values for initial_production

Incorrect Digits in Well API Numbers

Oklahoma Dataset

Duplicate well data

Invalid date values in test date

  • MS SQL is used to transform the input data into transient data.
  • Java Database Connectivity (JDBC) is used for the interaction between Java and MS SQL: to read the input and to write the transformed output back after the data is processed.
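A sketch of the read side of this JDBC interaction is shown below. The connection URL, credentials, table, and column names are illustrative assumptions (only date_of_1st_prod and gas_vol appear in the article); it requires the Microsoft JDBC driver and a live SQL Server instance:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Read the source rows; each row becomes a fact for the rule session.
try (Connection con = DriverManager.getConnection(
            "jdbc:sqlserver://localhost:1433;databaseName=WellsDB",
            "user", "password");
     Statement st = con.createStatement();
     ResultSet rs = st.executeQuery(
            "SELECT well_api_nbr, date_of_1st_prod, gas_vol FROM ar_wells")) {
    while (rs.next()) {
        String api = rs.getString("well_api_nbr");
        // ... build a fact object from the row and insert it into the KieSession
    }
}
```

Writing the transformed output back follows the same pattern with a PreparedStatement and batched inserts.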

KIE File System

KIE File System is used to load DRL files and to reduce the dependency on the KIE module configuration file. It allows the business logic to be changed without redeployment and keeps the DRL files separate from the jar and deployment files.

All the rule files in the given path are loaded into the KIE File System so that the rules can be applied to the inserted facts. Facts are inserted into the KieSession for which the corresponding rules are loaded.
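This loading step can be sketched with the Drools 6+ KIE API as follows; the external rules directory path is an assumption, and the snippet needs the Drools libraries on the classpath:

```java
import java.io.File;
import org.kie.api.KieServices;
import org.kie.api.builder.KieFileSystem;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;
import org.kie.internal.io.ResourceFactory;

public class RuleLoader {
    public static KieSession loadRules(String rulesDir) {
        KieServices ks = KieServices.Factory.get();
        KieFileSystem kfs = ks.newKieFileSystem();
        // Write every DRL file found in the external directory into the
        // in-memory file system, so rule changes need no redeployment.
        for (File drl : new File(rulesDir).listFiles((d, n) -> n.endsWith(".drl"))) {
            kfs.write(ResourceFactory.newFileResource(drl));
        }
        ks.newKieBuilder(kfs).buildAll(); // compile the rules
        KieContainer container =
                ks.newKieContainer(ks.getRepository().getDefaultReleaseId());
        return container.newKieSession();
    }
}
```

Facts are then inserted into the returned KieSession and the rules are run with fireAllRules().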

Applying Rules

Multiple rules were applied to both datasets to remove duplicate and invalid data, normalize values, and filter records.

Arkansas Dataset – Rules Applied

  1. Removing Invalid Values
  2. Setting Initial Production and Unifying Date Format

Oklahoma Dataset – Rules Applied

  1. Applying Date Filter
  2. Getting Max Modify Date Values
  3. Filtering Max Modify Date Values

Applying Rules in Arkansas Dataset

Rule 1: Removing Invalid Values

This rule removes invalid production date and initial production values using the retract keyword.
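A minimal DRL sketch of such a rule is shown below; the fact type (ArkansasWell) and its field names are illustrative assumptions, since the article does not show the actual fact model:

```
rule "Remove Invalid Values"
salience 100
when
    // Match facts whose production date or initial production is invalid.
    $well : ArkansasWell( dateOfFirstProd == null || initialProduction == null )
then
    retract( $well );
end
```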

Arkansas data filtered after removing invalid production date and initial production values is as follows:

Rule 2: Setting Initial Production and Unifying Date Format

This rule formats the gas_vol values (the initial production value is treated as gas_vol) and converts the production date from a string to the prescribed date type using the given date format. Well API numbers are padded with zeros to normalize them to fourteen digits.
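The padding and date-conversion steps can be sketched as plain Java helpers of the kind a DRL action might call. The trailing-zero padding and the MM/dd/yyyy source format are assumptions here; the article does not show the exact patterns it uses:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class WellNormalizer {
    // Pad an API number with trailing zeros until it is 14 digits long.
    static String normalizeApi(String api) {
        StringBuilder sb = new StringBuilder(api);
        while (sb.length() < 14) {
            sb.append('0');
        }
        return sb.toString();
    }

    // Convert a production date from the assumed source format (MM/dd/yyyy)
    // to the ISO yyyy-MM-dd format used by the date_time column.
    static String formatDate(String raw) {
        DateTimeFormatter in = DateTimeFormatter.ofPattern("MM/dd/yyyy");
        LocalDate d = LocalDate.parse(raw, in);
        return d.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) {
        System.out.println(normalizeApi("0300520001")); // 03005200010000
        System.out.println(formatDate("06/15/2012"));   // 2012-06-15
    }
}
```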

API numbers are normalized after applying the normalizing rule:

First production dates are converted into the prescribed format in the DRL file and mapped to the date_time column:

Applying Rules in Oklahoma Dataset

Rule 1: Applying Date Filter

This rule filters the records to the last seven years based on their test date and filters on the horizontal drill type.
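Sketched as a DRL rule, assuming an OklahomaWell fact type and that "H" marks horizontal wells (both assumptions; the date literal uses Drools' default dd-MMM-yyyy format):

```
rule "Apply Date Filter"
when
    // Drop records outside the seven-year window or with a
    // non-horizontal drill type.
    $well : OklahomaWell( testDate < "01-Jan-2010" || drillType != "H" )
then
    retract( $well );
end
```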

Data with a test date before 2010-01-01 is filtered out to remove invalid test date values and is mapped with date_time as mentioned in the data description section:

Rule 2: Getting Max Modify Date Values

This rule gets the max modify date values by grouping on API number into a temporary storage object called MaxValue, using the accumulate element.

Accumulate gets the max modify date by grouping the data by well API numbers.
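A DRL sketch of this accumulate, again with assumed fact and field names (OklahomaWell, MaxValue, modifyDate); the `not` pattern prevents inserting a second MaxValue for the same API number:

```
rule "Get Max Modify Date"
when
    OklahomaWell( $api : apiNumber )
    not MaxValue( apiNumber == $api )
    // Find the latest modify date among all rows sharing this API number.
    accumulate( OklahomaWell( apiNumber == $api, $d : modifyDate ),
                $max : max( $d ) )
then
    insert( new MaxValue( $api, (java.util.Date) $max ) );
end
```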

Rule 3: Filtering Max Modify Date Values

This rule filters on the max modify date values, keeping only the unique records so that data is not replicated.
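One way to sketch this in DRL is to join each well against its stored MaxValue and collect the single matching row into a global result list (names are illustrative assumptions):

```
global java.util.List results;

rule "Filter Max Modify Date Values"
when
    MaxValue( $api : apiNumber, $maxDate : maxModifyDate )
    // Keep only the row whose modify date equals the stored maximum.
    $well : OklahomaWell( apiNumber == $api, modifyDate == $maxDate )
then
    results.add( $well );
end
```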

After removing the duplicate well information by grouping the API number with max modify date, the number of records is reduced to fifty:

Salience

Salience is used to set the order in which rules are applied, as certain rules need to execute only after others have run.

For example, in the Arkansas dataset, the rule for removing invalid initial production and first production values had to execute before the rule that formats the date and gas_vol.
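Schematically, this ordering is expressed with salience values, where higher salience fires first (conditions and actions elided):

```
rule "Remove Invalid Values"
salience 100
when /* invalid production values */ then /* retract the fact */ end

rule "Set Initial Production and Unify Date Format"
salience 50
when /* remaining valid facts */ then /* format date and gas_vol */ end
```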

Conclusion

Business rules are kept separate from business code by placing the normalization and filtration logic in DRL files. The business logic can thus be changed easily without redeployment.

In this blog, the test date was used to get the last seven years of data. A business analyst can change this date range in the future without modifying code or performing a redeployment.
