Import and Ingest Data into HDFS using Kafka in StreamSets

Overview

StreamSets provides state-of-the-art data ingestion to easily and continuously ingest data from various origins such as relational databases, flat files, AWS, and so on, and to write it to various systems such as HDFS, HBase, Solr, and so on. Its configuration-driven user interface (UI) helps you design pipelines for data ingestion in minutes. Data is routed, transformed, and enriched during ingestion, ready for consumption and delivery to downstream systems.

Kafka acts as an intermediate data store, making it easy to replay ingestion, consume datasets across multiple applications, and perform data analysis. In this blog, let us discuss reading data from different data sources such as Amazon Simple Storage Service (S3) and flat files, and writing it into HDFS using Kafka in StreamSets.

Prerequisites

  • Install Java 1.8
  • Install streamsets-datacollector-2.6.0.1

Use Case

Import and ingest data from different data sources into HDFS using Kafka in StreamSets.

Data Description

Network data from outdoor field sensors is used as the source file. Additional fields, dummy data, empty data, and duplicate data were added to the source file.

The dataset has a total record count of 600K, including 3.5K duplicate records.

Sample data
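
Purely as an illustration, one record might look like the following; apart from sensor_id and sensor_uuid, which the deduplication step later relies on, the field names and values here are hypothetical:

    {
      "sensor_id": 1042,
      "sensor_uuid": "d8f7a3c2-5b1e-4e9a-9f27-1c6d0b2a7e55",
      "sensor_name": "field_sensor_42",
      "timestamp": "2017-08-21T10:15:32Z",
      "temperature": 27.4,
      "humidity": 58.2,
      "ambient_light": 312
    }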

Synopsis

  • Read data from the local file system and produce it to Kafka
  • Read data from Amazon S3 and produce it to Kafka
  • Consume the streaming data from Kafka
  • Remove duplicate records
  • Persist the data into HDFS
  • View the data loading statistics

Reading Data from Local File System and Producing Data to Kafka

To read data from the local file system and produce it to Kafka, perform the following (an equivalent stand-alone producer is sketched after the list):

  • Create a new pipeline.
  • Configure the File Directory origin to read files from a directory.
  • Set the data format to JSON and the JSON content to Multiple JSON objects.
  • Use the Kafka Producer processor to produce data into Kafka.
    Note: If no Kafka processors are available, install the Apache Kafka package and restart SDC.
  • Produce the data under the topic sensor_data.
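
The same producer flow can be sketched outside StreamSets to make the logic concrete. The following is a minimal illustration, not the pipeline itself, using the kafka-python client; the broker address localhost:9092 and the input directory /data/sensors are assumptions:

    import glob
    import json

    from kafka import KafkaProducer

    # Assumed broker address; point this at your Kafka cluster.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Read "multiple JSON objects" (one object per line) from local files
    # and produce each record to the sensor_data topic.
    for path in glob.glob("/data/sensors/*.json"):
        with open(path) as f:
            for line in f:
                if line.strip():
                    producer.send("sensor_data", json.loads(line))

    producer.flush()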

[Screenshots: reading data from the local file system]

Reading Data from Amazon S3 and Producing Data to Kafka

To read data from Amazon S3 and produce it to Kafka, perform the following (see the script sketch after the list):

  • Create another pipeline.
  • Use the Amazon S3 origin to read data from S3.
    Note: If no Amazon S3 processors are available, install the Amazon Web Services 1.11.123 package available under Package Manager.
  • Configure the origin by providing the Access Key ID, Secret Access Key, Region, and Bucket name.
  • Set the data format to JSON.
  • Produce the data under the same Kafka topic, sensor_data.
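
As with the local-file pipeline, the S3 pipeline can be illustrated with a short script. The sketch below uses boto3 together with kafka-python; the bucket name, prefix, region, and broker address are placeholders, and credentials are assumed to come from the standard AWS credential chain:

    import json

    import boto3
    from kafka import KafkaProducer

    s3 = boto3.client("s3", region_name="us-east-1")  # assumed region
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # List objects under an assumed bucket/prefix and produce each JSON line
    # to the same sensor_data topic used by the local-file pipeline.
    response = s3.list_objects_v2(Bucket="sensor-bucket", Prefix="incoming/")
    for item in response.get("Contents", []):
        body = s3.get_object(Bucket="sensor-bucket", Key=item["Key"])["Body"]
        for line in body.iter_lines():
            if line.strip():
                producer.send("sensor_data", json.loads(line))

    producer.flush()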

[Screenshots: reading data from Amazon S3]

Consuming Streaming Data from Kafka

To consume the streaming data from Kafka, perform the following (an equivalent consumer is sketched after the list):

  • Create a new pipeline.
  • Use the Kafka Consumer origin to consume the data produced to Kafka.
  • Configure the origin by providing the following details:
    • Broker URI
    • ZooKeeper URI
    • Topic – set the topic name to sensor_data (the same topic used in the previous two sections)
  • Set the data format to JSON.
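
For reference, consuming the same topic outside StreamSets looks roughly like the following with kafka-python; the broker address and consumer group are assumptions:

    import json

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "sensor_data",
        bootstrap_servers="localhost:9092",  # assumed broker
        group_id="sensor-data-readers",      # assumed consumer group
        auto_offset_reset="earliest",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    # Print a couple of fields from each consumed record.
    for message in consumer:
        record = message.value
        print(record.get("sensor_id"), record.get("sensor_uuid"))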

[Screenshot: consuming streaming data from Kafka]

Removing Duplicate Records

To remove duplicate records using the Record Deduplicator processor, perform the following (the comparison idea is sketched in code after the list):

  • Under the Deduplication tab, provide the following fields to compare and find duplicates:
    • Max. Records to Compare
    • Time to Compare
    • Compare
    • Fields to Compare
      For example, find duplicates based on sensor_id and sensor_uuid.
  • Move the duplicate records to Trash.
  • Store the unique records in HDFS.
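
Conceptually, the Record Deduplicator keeps a bounded window of recently seen records and compares each incoming record against it. Below is a minimal Python sketch of that idea, keyed on the sensor_id and sensor_uuid fields mentioned above (the window size is an assumption, not the processor's default):

    from collections import OrderedDict

    MAX_RECORDS_TO_COMPARE = 100_000  # assumed window size

    seen = OrderedDict()

    def is_duplicate(record):
        """Return True if a record with the same key appeared recently."""
        key = (record.get("sensor_id"), record.get("sensor_uuid"))
        if key in seen:
            return True
        seen[key] = True
        if len(seen) > MAX_RECORDS_TO_COMPARE:
            seen.popitem(last=False)  # evict the oldest key
        return False

    # In the pipeline, duplicates are routed to Trash and unique records to HDFS.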

[Screenshot: removing duplicate records]

Persisting Data into HDFS

To load data into HDFS, perform the following (a script equivalent is sketched after the note):

  • Configure the Hadoop FS destination processor from the HDP 2.6 stage library.
  • Select the data format as JSON.
    Note: The core-site.xml and hdfs-site.xml files are placed in the hadoop-conf directory (/var/lib/sdc-resources/hadoop-conf). The sdc-resources directory is created when StreamSets is installed.
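
Equivalently, writing the unique records to HDFS from a script can be sketched with the hdfs Python package over WebHDFS; the NameNode URL, user, and target path below are assumptions:

    import json

    from hdfs import InsecureClient

    # Assumed WebHDFS endpoint and user; match your cluster's configuration.
    client = InsecureClient("http://namenode:50070", user="sdc")

    def persist(records, path="/user/sdc/sensor_data/part-00000.json"):
        """Write records to HDFS as newline-delimited JSON."""
        data = "".join(json.dumps(r) + "\n" for r in records)
        client.write(path, data=data, encoding="utf-8", overwrite=True)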

[Screenshot: persisting data into HDFS]

Viewing Data Loading Statistics

The data loading statistics after removing duplicates from the different sources are as follows:

[Screenshots: data loading statistics]
