Data Analysis Using Apache Hive and Apache Pig

Overview

Apache Hive, an open-source data warehouse system, is used with Apache Pig to load and transform unstructured, structured, or semi-structured data for data analysis and better business insights. Pig, a standard ETL scripting language, is used to export and import data into Apache Hive and to process large numbers of datasets. Pig is well suited to ETL data pipelines and iterative processing.

In this blog, let us discuss loading and storing data in Hive with a Pig relation using HCatalog.

Pre-requisites

Download and configure the following:

  • Apache Hadoop
  • Apache Hive (which includes HCatalog)
  • Apache Pig

Use Case

In this blog, let us discuss the following use case:

  • Loading unstructured data into Hive
  • Processing, transforming, and analyzing data in Pig
  • Loading structured data into a different table in Hive using Pig

Data Description

Two cricket data files with Indian Premier League (IPL) data from 2008 to 2016 are used as the data source. The files are as follows:

  • matches.csv – Provides details about each match played
  • deliveries.csv – Provides details about consolidated deliveries of all the matches

These files are extracted and loaded into Hive. The data is further processed, transformed, and analyzed to find the winner of each season and the top 5 batsmen with the most runs in each season and overall.

Synopsis

  • Create database and database tables in Hive
  • Import data into Hive tables
  • Call Hive SQL in Shell Script
  • View database architecture
  • Load and store Hive data into Pig relation
  • Call Pig script in Shell Script
  • Apply Pivot concept in Hive SQL
  • View Output

Creating Database and Database Tables in Hive

To create the database and database tables in Hive, save the following query as a SQL file (database_table_creation.sql):

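A minimal sketch of such a file is given below; the column lists are assumptions based on the publicly available IPL matches.csv and deliveries.csv layouts, not necessarily the exact schema used here.

    -- database_table_creation.sql (sketch; column lists are assumed)
    CREATE DATABASE IF NOT EXISTS ipl_stats;
    USE ipl_stats;

    -- One row per match ("date" is a reserved word in Hive, so that column is renamed)
    CREATE TABLE IF NOT EXISTS matches (
        id INT, season INT, city STRING, match_date STRING,
        team1 STRING, team2 STRING, toss_winner STRING, toss_decision STRING,
        result STRING, dl_applied INT, winner STRING,
        win_by_runs INT, win_by_wickets INT, player_of_match STRING, venue STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    TBLPROPERTIES ('skip.header.line.count'='1');

    -- One row per ball bowled ("over" is a reserved word, so it is escaped with backticks)
    CREATE TABLE IF NOT EXISTS deliveries (
        match_id INT, inning INT, batting_team STRING, bowling_team STRING,
        `over` INT, ball INT, batsman STRING, non_striker STRING, bowler STRING,
        batsman_runs INT, extra_runs INT, total_runs INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    TBLPROPERTIES ('skip.header.line.count'='1');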

Importing Data into Hive Tables

To load data from both CSV files into Hive, save the following query as a SQL file (data_loading.sql):

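A sketch of the load step, assuming the CSV files sit on the local file system (the paths shown are placeholders):

    -- data_loading.sql (sketch; file paths are placeholders)
    USE ipl_stats;
    LOAD DATA LOCAL INPATH '/home/hduser/ipl/matches.csv' OVERWRITE INTO TABLE matches;
    LOAD DATA LOCAL INPATH '/home/hduser/ipl/deliveries.csv' OVERWRITE INTO TABLE deliveries;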

Calling Hive SQL in Shell Script

To automatically create the database and database tables and to import data into Hive, call both SQL files (database_table_creation.sql and data_loading.sql) from a shell script.

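One way to wire this up, assuming the SQL files sit in the same directory as the script:

    #!/bin/bash
    # Create the database and tables, then load the CSV data into Hive.
    hive -f database_table_creation.sql
    hive -f data_loading.sql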

Viewing Database Architecture

The created database (ipl_stats) and its tables can be verified in Hive, for example with SHOW DATABASES, SHOW TABLES, and DESCRIBE.

The raw matches.csv data is loaded into the Hive table ipl_stats.matches.

The raw deliveries.csv data is loaded into the Hive table ipl_stats.deliveries.

Loading and Storing Hive Data into Pig Relation

To load data from Hive into a Pig relation, process and transform it, and store it back, save the following script as a Pig file (most_run.pig):

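A sketch of what such a script could look like, reading both Hive tables through HCatLoader, computing the runs scored by each batsman in each season, and writing the result back with HCatStorer; the join keys and the target table name (ipl_stats.most_run) are assumptions.

    -- most_run.pig (sketch)
    -- Read the Hive tables through HCatalog
    deliveries = LOAD 'ipl_stats.deliveries' USING org.apache.hive.hcatalog.pig.HCatLoader();
    matches    = LOAD 'ipl_stats.matches'    USING org.apache.hive.hcatalog.pig.HCatLoader();

    -- Attach the season to every delivery
    joined = JOIN deliveries BY match_id, matches BY id;

    -- Total runs scored by each batsman in each season
    grouped = GROUP joined BY (matches::season, deliveries::batsman);
    runs = FOREACH grouped GENERATE FLATTEN(group) AS (season, batsman),
                                    SUM(joined.deliveries::batsman_runs) AS total_runs;

    -- Write back to Hive; the target table must already exist with a matching schema
    STORE runs INTO 'ipl_stats.most_run' USING org.apache.hive.hcatalog.pig.HCatStorer();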

Note: Create the target Hive table (see most_run.sql below) before running the Pig script.

To create the table into which the processed data is written back, save the following script as a SQL file (most_run.sql):

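A sketch of the target table definition; the column names and types are assumptions and must match the schema of the relation stored by the Pig script.

    -- most_run.sql (sketch; schema must match the Pig relation being stored)
    USE ipl_stats;
    CREATE TABLE IF NOT EXISTS most_run (
        season INT,
        batsman STRING,
        total_runs BIGINT);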

Calling Pig Script in Shell Script

To automate the ETL process, call both files (most_run.sql and most_run.pig) from a shell script.

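A sketch of such a script; Pig is started with -useHCatalog so that the HCatalog jars are on its classpath and HCatLoader/HCatStorer can be resolved.

    #!/bin/bash
    # Create the target Hive table first, then run the Pig job through HCatalog.
    hive -f most_run.sql
    pig -useHCatalog -f most_run.pig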

The processed data written back to Hive by the Pig script can then be inspected by querying the target table.

Applying Pivot Concept in Hive SQL

As the data loaded into Hive is stored as rows, the SQL pivot concept is used to convert rows into columns for better clarity and insights. A User Defined Aggregation Function (UDAF) is used to perform the pivot in Hive. In this use case, the pivot is applied only to the season and run fields.

To use the collect UDAF, add the Brickhouse JAR file to the Hive classpath.
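For example (the JAR path and version are placeholders):

    -- Register the Brickhouse collect UDAF in the Hive session
    ADD JAR /usr/local/hive/lib/brickhouse-0.7.1.jar;
    CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';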

Before applying the pivot, the data for the top 5 run scorers holds one row per batsman per season.

After applying the pivot, each of the top 5 run scorers has a single row, with the seasons converted into columns.

Viewing Output

Viewing Winners of a Season

To view winners of each season, use the following Hive SQL query:

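A sketch of such a query, assuming the final of a season is its last recorded match (the highest match id in that season):

    -- Winner of each season = winner of the season's last match (assumed to be the final)
    SELECT m.season, m.winner
    FROM ipl_stats.matches m
    JOIN (SELECT season, MAX(id) AS final_id
          FROM ipl_stats.matches
          GROUP BY season) f
      ON m.id = f.final_id
    ORDER BY m.season;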

Viewing the Top 5 Batsmen by Runs Scored

To view the top 5 batsmen by runs scored, use the following Hive SQL query:

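A sketch of such a query against the table populated by the Pig script (table and column names are assumptions):

    -- Top 5 batsmen by total runs across all seasons
    SELECT batsman, SUM(total_runs) AS runs
    FROM ipl_stats.most_run
    GROUP BY batsman
    ORDER BY runs DESC
    LIMIT 5;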

The top 5 batsmen by runs scored can then be charted graphically in MS Excel from the query output.

Viewing Year-wise Runs of Top 5 Batsmen

To view year-wise runs of the top 5 batsmen, use the following Hive SQL query:

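A sketch showing the pivot pattern with the Brickhouse collect UDAF, which folds the season and run rows of each batsman into a single map; restricting the output to the top 5 batsmen would additionally require filtering on the result of the previous query.

    -- Pivot: one row per batsman, with a season -> runs map built by collect()
    SELECT batsman, collect(season, total_runs) AS runs_by_season
    FROM ipl_stats.most_run
    GROUP BY batsman;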

The year-wise runs of the top 5 batsmen can then be charted graphically in MS Excel from the query output.

Comments

  • Thabrez syed

I got this error when running the Hive to Pig HCatLoader step:

    hduser@thabresh-Aspire-E1-470G:/home/project/scripts/shell_scripts$ ./most_run.sh
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/local/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

    Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
    OK
    Time taken: 1.79 seconds
    OK
    Time taken: 0.88 seconds
    ls: cannot access ‘/usr/local/hive/hcatalog/share/hcatalog/hcatalog-core-*.jar’: No such file or directory
    ls: cannot access ‘/usr/local/hive/hcatalog/share/hcatalog/hcatalog-*.jar’: No such file or directory
    ls: cannot access ‘/usr/local/hive/hcatalog/lib/hbase-storage-handler-*.jar’: No such file or directory
    ls: cannot access ‘/usr/local/hive/hcatalog/share/hcatalog/hcatalog-pig-adapter-*.jar’: No such file or directory
    2018-03-27 22:26:12,193 [main] INFO org.apache.pig.Main – Apache Pig version 0.12.2-SNAPSHOT (r: unknown) compiled Jun 25 2014, 12:00:33
    2018-03-27 22:26:12,193 [main] INFO org.apache.pig.Main – Logging error messages to: /home/project/scripts/shell_scripts/pig_1522169772191.log
    2018-03-27 22:26:12,999 [main] WARN org.apache.hadoop.util.NativeCodeLoader – Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    2018-03-27 22:26:13,498 [main] INFO org.apache.pig.impl.util.Utils – Default bootup file /home/hduser/.pigbootup not found
    2018-03-27 22:26:13,775 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2018-03-27 22:26:13,776 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
    2018-03-27 22:26:13,776 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://localhost:9000
    2018-03-27 22:26:14,804 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
    2018-03-27 22:26:14,861 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
    2018-03-27 22:26:14,909 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
    2018-03-27 22:26:14,977 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
    2018-03-27 22:26:15,028 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
    2018-03-27 22:26:15,072 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
    2018-03-27 22:26:15,116 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
    2018-03-27 22:26:15,831 [main] ERROR org.apache.pig.PigServer – exception during parsing: Error during parsing. Could not resolve org.apache.hive.hcatalog.pig.HCatLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    Failed to parse: Pig script failed to parse:
    pig script failed to validate: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve org.apache.hive.hcatalog.pig.HCatLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:196)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1678)
    at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1411)
    at org.apache.pig.PigServer.parseAndBuild(PigServer.java:344)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:369)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:355)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:607)
    at org.apache.pig.Main.main(Main.java:156)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
    Caused by:
    pig script failed to validate: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve org.apache.hive.hcatalog.pig.HCatLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.parser.LogicalPlanBuilder.validateFuncSpec(LogicalPlanBuilder.java:1299)
    at org.apache.pig.parser.LogicalPlanBuilder.buildFuncSpec(LogicalPlanBuilder.java:1284)
    at org.apache.pig.parser.LogicalPlanGenerator.func_clause(LogicalPlanGenerator.java:5158)
    at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3515)
    at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
    at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
    at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
    at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
    … 17 more
    Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve org.apache.hive.hcatalog.pig.HCatLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:653)
    at org.apache.pig.parser.LogicalPlanBuilder.validateFuncSpec(LogicalPlanBuilder.java:1296)
    … 25 more
    2018-03-27 22:26:15,899 [main] ERROR org.apache.pig.tools.grunt.Grunt – ERROR 1070: Could not resolve org.apache.hive.hcatalog.pig.HCatLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    Details at logfile: /home/project/scripts/shell_scripts/pig_1522169772191.log

Please help me to solve this error. What can I do now?