Apache Drill vs Amazon Athena – A Comparison on Data Partitioning

Apache Drill vs Amazon Athena – A Comparison on Data Partitioning

Overview

Big data exploration in almost all fields has led to the development of multiple big data technologies such as Hadoop (Hive, HDFS, Pig, HBase), NoSQL databases (MongoDB), and so on for accessing, exploring, and reporting huge volume of data. Amazon Athena, a serverless, interactive query service, is used to easily analyze big data using standard SQL in Amazon S3. Apache Drill, a schema-free, low-latency SQL query engine, enables self-service data exploration on big data.

In this blog, let us compare data partitioning in Apache Drill and AWS Athena and the distinct features of both.

Dataset Description

A sample dataset, containing census data of a particular country in the USA, is used in this use case. For sample dataset, consider Reference section.

Partitioning Data

In this section, let us discuss data partitioning based on male and female fertility rate in a predefined age group in Apache Drill and Athena.

Partitioning Data in Apache Drill

To perform data partition in Drill, perform the following:

  • Change data storage format to Parquet using the following command:
select

  • Create table and partition data using the following command:
The table created is as shown below:

select

The time taken to create a table is as shown below:

select

You can check the data loaded into the database using the following command:

The time taken to select the required data in a table is as shown below:

select

  • Get total count of male and female fertility data using the following command:
The count of males and females in a country is shown below:

select

The file size after partitioning data using Apache drill is as shown below:

select

Partitioning Data in Athena

Athena uses Hive data partitioning and provides improved query performance by reducing the amount of data scanned.

In Athena, data partitioning can be done in two separate ways as follows:

  • With already partitioned data stored on Amazon S3 and accessed on Athena.
  • With unpartitioned data.

In both methods, specify the partitioned column in create statement.

To perform data partition in Athena, perform the following:

  • Create table using the below query:
select

  •  Add partitions to the catalog by using the below command:
select

  • Check partitioned data using the below query:
select

Data Partition Comparison between Apache Drill and Amazon Athena

The time taken to perform create partition and select partition is as follows:

select

Distinct Features of Drill and Athena

select

Conclusion

In Apache Drill, data partitioning concepts can be applied directly. In Athena, we need to convert the files into Parquet format using EMR to perform data partitioning. A separate storage is not required in Athena as you can query the data directly from Amazon S3.

References

4541 Views 7 Views Today
  • Gati Dash

    So can we say Apache Drill is more efficient until and unless we have a specific usecase for Athena??]