Drill Data with Apache Drill

Drill Data with Apache Drill

Introduction

Apache Drill is a low-latency distributed query engine for large-scale datasets, including structured and semi-structured/nested data. Inspired by Google’s Dremel, Drill is designed to scale to several thousands of nodes and query petabytes of data at interactive speeds that BI/Analytics environments require. Apache Drill includes a distributed execution environment, purpose built for large-scale data processing. At the core of Apache Drill is the “Drillbit” service which is responsible for accepting requests from the client, processing the queries, and returning results to the client. When a Drillbit runs on each data node in a cluster, Drill can maximize data locality during query execution without moving data over the network or between nodes.

Drill Architecture:

drill_architecture

User: Providing interfaces such as a command line interface (CLI), a REST interface, JDBC/ODBC, etc., for human or applicationdriven interaction.
Processing: Allowing for pluggable query languages as well as the query planner, execution, and storage engines.
Data sources: Pluggable data sources either local or in a cluster setup, providing in-situ data processing.

Note that Apache Drill is not a database but rather a query layer that works with a number of underlying data sources. It is primarily designed to do full table scans of relevant data as opposed to, say, maintaining indices. Not unlike the MapReduce part of Hadoop provides a framework for parallel processing, Apache Drill provides for a flexible query execution framework, enabling a number of use cases from quick aggregation of statistics to explorative data analysis.

Apache Drill can query data residing in different file formats (CSV, TSV, JSON, PARQUET, AVRO) and in different data sources (Hive, HBASE, HDFS, S3, MongoDB, Cassandra, and others). Drill provides a unified query layer that can interact with different file formats in different data sources thus avoiding any ETL necessary to bring data in different places to one location.

Use Case

This use case is based on financial stock data that’s been downloaded in parts to multiple files in different formats such as JSON, CSV, and TSV to demonstrate how Apache Drill can easily infer the schema of these files and enable us to perform joins and aggregations easily. The next blog in this series will persist these files in different data stores such as S3, MongoDB, and HDFS and demonstrate Drill’s capability on querying them on the fly.

What we want to do:

  • Prerequisites
  • About Data Files
  • Self Describing Data Exploration

Solution

Prerequisites

  • Install Apache Drill in Embedded Mode: We will be using Drill 0.7 version and installing it on Linux is very easy. Follow the instructions on at the below link to download and install.

Note: Drill requires JDK 1.7 and above.

  •  Verify Drill Set up:
    verify_drill_setup

About Data Files

  • Understanding Data Files:

The required data files are tailor made to demonstrate the power of Apache Drill and attached in this post. Copy over these files and store it in a separate directory “data”

    • energy_overview.tsv: Tab separated file that contains basic details of a ticker along with its name, sector and sub-sector. A quick look at the file is as follows
      energy_overview
    • energy_technical.json: A JSON file that contains technical information about the stocks present in energy_overview.tsv. These information include ATR (Average True Range), different standard moving average ranges, RSI (relative strength index), 52 week high/low percentages. Please note that SMA and W52High/Low values are in percentages.
      energy_technical
    • stock_data.csv: A comma separated value of stock and other basic information of tickers in energy_overview file. The headers are removed so that Drill doesn’t consider them as data. This file contains information about the below columns:

Status,Name,Symbol,LastPrice,Change,ChangePercent,Timestamp,MarketCap,Volume,ChangeYTD,ChangePercentYTD,High,Low,Open

stock_data

Self Describing Data Exploration

Apache Drill can seamlessly infer the schema from different files in different format and enable us to join these files with ANSI SQL. Apache Drill’s self describing data exploration behavior becomes more powerful when these data files that are of different formats persisted in different distributed data stores such as Hive, HBase, HDFS, S3, MongoDB, Cassandra, and local file systems.

  • Drill’s Query Semantics: Drill uses specific semantics in the From clause of a query to find out where each data files are located. Drill can be configured with specific prefix to indicate different data stores that includes local file system, HDFS, MongoDB, Hive, HBase, Cassandra, and etc.
    drill’s_query_semantics
  • Basic Queries: Show first 5 rows from all data files
    basic_queries
  • Simple Join: Joins all 3 files with ticker symbol and retrieves ticker, name, sector, subsector, MarketCap, Volume, and moving averages. JSON data have built-in schema available but to reference data in TSV and CSV, column positions can be used as in array.
    simple_join
  • Sum of Volume Traded by each Subsector: Drill allows us to perform casting to different data types as we see fit.
    sum_of_volume
  • Top 10 companies in the order of high trading volume:
    Top_10_companies
  • Top companies that closed higher than their open:
    top_companies_that_closed_higher_than_their_open
  • Top companies that are overbought in recent time:
    top_companies_that_are_overbought_in_recent_time
  • Top companies that are oversold in recent time:
    top_companies_that_are_oversold_in_recent_time
  • Top 10 companies that are trading 20% closer to their 52 week high:
    top_10_companies_that_are_trading_20_closer_to_their_52_week_high
  • Top 10 companies that are trading 80% lower to their 52 week high:
    top_10_companies_that_are_trading_80_lower_to_their_52_week_high

Conclusion

  • Why Drill: Get started fast, schema-free JSON model, Query complex, semi-structured data in-situ, Real SQL – not “SQL-like”, leverage standard BI tools, access multiple data sources, custom user defined functions, high performance, and scales from a single laptop to a 1000-node cluster.
  • Drill can be installed in embedded or distributed mode depending on the need.
  • No need to define schema upfront or perform ETL activity to run analytics on different datasets persisted in different datastores.

References

4426 Views 3 Views Today