Drill Data with Apache Drill – Part 2

Drill Data with Apache Drill – Part 2

Overview

This is second part about drilling data with Apache Drill. Apache Drill is an open source low latency SQL on Hadoop query engine for larger datasets. The latest version of Apache Drill is 1.10 with CTTAS, Web console, and JDBC connection. It can be integrated with several data sources such as CSV, JSON, TSV, PSV, Avro, and Parquet and can be operated using single query.

In this blog, let us discuss about configuring and connecting different data sources, and querying data from the data sources with Apache Drill. To know more, refer our previous blog post on Drill Data with Apache Drill.

Use Case

Persist data files in different data sources such as MySQL, HDFS, and Hive, query them on-the-fly, export query output in different file formats such as CSV, JSON, and TSV, and load the result sets into HDFS location.

Data Description

To demonstrate Apache Drill’s capability, financial stock data downloaded in parts in different formats such as JSON, CSV, and TSV is loaded into different data sources.

The files used are as follows:

  • energy_overview.tsv – MySQL
  • energy_technical.json – HDFS
  • stock_market_exchange.csv – Hive

To get a better understanding about the data files, refer our previous blog post on Drill Data with Apache Drill.

Pre-requisites

Install Apache Drill 1.10.0 in Distributed Mode on Linux.

Note: Drill requires JDK 1.7 and above.

Synopsis

  • Configure and connect different data sources such as MySQL, HDFS, and Hive.
  • Query the data sources on the fly.
  • Export query output in different file formats such as CSV, JSON, and TSV.
  • Store the output file in HDFS location.

Configuring Different Data Sources

RDBMS Storage Plugins

Apache Drill is tested with MySQL’s mysql-connector-java-5.1.37-bin.jar driver.

To connect Apache Drill with MySQL, configure RDBMS storage plugin in Apache Drill console as shown in the below diagram:

select

Hive Storage Plugins

To connect Apache Drill with Hive, enable the existing Hive plugin and update the configuration in Apache Drill console as shown in the below diagram:

select

File Storage Plugin

To connect Apache Drill with HDFS, replace file:/// to hdfs://hostname:port/ in DFS storage plugin as shown in the below diagram:

select

Using Apache Drill in Distributed Mode

Start a Drillbit on each node in a cluster to use Apache Drill in the distributed mode.

Starting Drill

To start Apache Drill, use the below command:

Stopping Drill

To stop Apache Drill, use the below command:

Verifying Drill Setup

To verify Apache Drill setup, use the below command:

select

To check Drillbits running on the cluster, use the below command:

select

Querying Data Sources with Apache Drill

Apache Drill’s self-describing data exploration behavior allows users to query different data files from diverse data stores.

Simple Join

It is used to join all the 3 files and retrieve name, ticker, MarketCap, sector, subsector, volume, and moving averages.

select

Total Volume Based on Subsector

select

Minimum Volume for Tickers

select

Subsector with ATR > 2

select

Tickers with w52high Between -20 & -100 with Their Last Price

The query can also be executed in Apache Drill web console as shown below:

select

The query output is as follows:

select

 

Loading Output Files in HDFS

To load the result sets as files in HDFS, configure workspace in DFS storage plugin by changing the configuration as shown in the below diagram:

select

The configured workspaces are shown in database list as shown in the below diagram:

select

 

Exporting Output Files

The result sets of the join queries are exported into CSV file format by using csvOut workspace and JSON file format by using jsonOut workspace in HDFS storage plugin.

Switch to csvOut Workspace

To switch to csvOut workspace, use the below command:

Change File Storage Format to CSV

To change file storage format to CSV, use the below command:

Exporting Output File into CSV Format

select The output files are exported and stored in the location:

/user/tsldp/csv_output/total_resulset/total_volume.csv

select

References

1556 Views 1 Views Today