Crime Analysis with Zeppelin, R & Spark


Introduction

Apache Zeppelin is a web-based notebook that enables interactive data analytics, bringing data ingestion, data discovery, and data visualization together in one place. Zeppelin's interpreter concept allows any language or data-processing backend to be plugged into Zeppelin. Currently, Zeppelin supports many interpreters, such as Spark (Scala, Python, R, SparkSQL), Hive, and JDBC. Zeppelin can also be configured against an existing Spark ecosystem and share the SparkContext across Scala, Python, and R.

We have implemented SFO crime analysis with plain R, Shiny & R, and OpenRefine in the past; this time we use Zeppelin & R. We will also briefly show how Spark can be used with R on the same crime dataset. We suggest revisiting our previous blog posts, as this one reuses much of that material in a different environment. Please take a look at the References section for links to the previous posts.

Use Case

This use case is based on San Francisco incidents derived from the SFPD (San Francisco Police Department) Crime Incident Reporting system for calendar year 2013. The dataset contains close to 130K records, each with the type of crime, the date and time of the incident, the day of the week, and the latitude and longitude of the location. We will analyze this dataset and extract some meaningful insights.

What we want to do:

  • Prerequisites
  • Download Crime Incident Dataset
  • Data Extraction & Exploration
  • Data Manipulation
  • Data Visualization
  • Simple Spark & R Exploration

Solution

Prerequisites

  • Install Apache Zeppelin with Spark

Installation is a bit involved and out of scope for this post. However, there are many articles and blog posts that show how to set up Zeppelin with the built-in Spark modules or integrate it with an existing Spark environment. Below are some installation references.

http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/

https://github.com/apache/incubator-zeppelin

http://blog.cloudera.com/blog/2015/07/how-to-install-apache-zeppelin-on-cdh/

Download Crime Incident Dataset

  • Download Dataset: This use case is based on San Francisco incidents derived from the SFPD (San Francisco Police Department) Crime Incident Reporting system for calendar year 2013. Download the dataset by clicking the link below and unzip it into your working directory on the machine where Zeppelin is running.

SFPD dataset: SFPD_Incidents.zip

  • Understanding the Dataset: This dataset contains the following columns. We will drop some columns, rename a few, and add a few extra columns that will help our analysis.

IncidntNum – The incident number. Unfortunately, some incident numbers are duplicated.
Category – The category of the crime, e.g. Robbery, Fraud, Theft, Arson, and so on. This use case combines similar crimes to keep the number of categories to a handful.
Descript – A description of the crime. We won't need this, so the column will be dropped.
DayOfWeek – The day of the week on which the incident happened.
Date – The date of the incident.
Time – The hour and minute of the incident.
PdDistrict – The police district the incident location belongs to. We won't need this, so it will be dropped.
Resolution – What happened to the culprits in the incident.
Address – The street address of the incident.
X – The longitude of the location. This column will be renamed to longitude.
Y – The latitude of the location. This column will be renamed to latitude.
Location – Comma-separated latitude and longitude. We won't need this, so it will be dropped.

Basic Setup

[Screenshot: basic setup]

  • Install Packages: This use case requires the "chron" library for date and time manipulation and "ggplot2" for visualization.
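Since the setup paragraph survives only as a screenshot, here is a minimal base-R sketch of what it likely does in a %r paragraph. The file name SFPD_Incidents.csv is an assumption; an inline two-row CSV stands in for the real file so the snippet runs on its own.

```r
# One-time installs for the notebook (commented out here):
# install.packages(c("chron", "ggplot2"))

# In the notebook you would read the unzipped file, e.g.:
# sfo <- read.csv("SFPD_Incidents.csv", stringsAsFactors = FALSE)

# Inline stand-in so this sketch is self-contained:
csv_text <- "IncidntNum,Category,DayOfWeek,Date,Time
110123456,ROBBERY,Friday,04/12/2013,18:30
110123457,LARCENY/THEFT,Monday,07/01/2013,02:15"
sfo <- read.csv(text = csv_text, stringsAsFactors = FALSE)
nrow(sfo)   # 2 rows in the stand-in
```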

Structure of the Data

[Screenshot: structure of the data]

Summary of the Data

[Screenshot: summary of the data]
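The two screenshots above show the output of R's structure and summary inspection calls. A self-contained sketch (the notebook runs these on the full ~130K-row data frame; a tiny stand-in is used here):

```r
# Stand-in data frame for the SFPD dataset (illustrative rows, not real data)
sfo <- data.frame(IncidntNum = c(110123456, 110123457),
                  Category   = c("ROBBERY", "LARCENY/THEFT"),
                  Date       = c("04/12/2013", "07/01/2013"),
                  stringsAsFactors = FALSE)

str(sfo)      # compact structure: column names, types, first few values
summary(sfo)  # per-column summaries (class/length for character columns)
```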

Data Manipulation Step 1 – Order & Remove Duplicates

[Screenshot: data manipulation – order & remove duplicates]
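The exact code in the screenshot is not recoverable, but given the step title and the note above that some incident numbers are duplicated, a plausible base-R sketch is:

```r
# Stand-in with a duplicated incident number
sfo <- data.frame(IncidntNum = c(110000002, 110000001, 110000001),
                  Category   = c("ROBBERY", "ASSAULT", "ASSAULT"),
                  stringsAsFactors = FALSE)

sfo <- sfo[order(sfo$IncidntNum), ]         # sort by incident number
sfo <- sfo[!duplicated(sfo$IncidntNum), ]   # keep the first row per incident
nrow(sfo)   # 2 (one duplicate removed)
```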

Data Manipulation Step 2

[Screenshot: data manipulation step 2]
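Step 2's code is likewise only in the screenshot, but the column notes earlier say that Descript, PdDistrict, and Location are dropped and X/Y renamed to longitude/latitude. A base-R sketch of exactly that:

```r
# One-row stand-in with the original column layout
sfo <- data.frame(IncidntNum = 110000001, Category = "ROBBERY",
                  Descript = "GRAND THEFT", PdDistrict = "MISSION",
                  X = -122.42, Y = 37.77, Location = "(37.77, -122.42)",
                  stringsAsFactors = FALSE)

# Drop the columns we will not use
sfo <- sfo[, !(names(sfo) %in% c("Descript", "PdDistrict", "Location"))]
# Rename the coordinate columns
names(sfo)[names(sfo) == "X"] <- "longitude"
names(sfo)[names(sfo) == "Y"] <- "latitude"
names(sfo)   # "IncidntNum" "Category" "longitude" "latitude"
```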

Data Manipulation Step 3

[Screenshot: data manipulation step 3]
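Step 3's contents are also only in the screenshot. Given the "Crimes by Time of the Day" chart later, one plausible sketch is deriving the hour from the Time column (the post uses the chron package for date/time work; a base-R equivalent is shown here):

```r
sfo <- data.frame(Time = c("18:30", "02:15", "18:05"), stringsAsFactors = FALSE)

# Hour of day (0-23) from an "HH:MM" string
sfo$Hour <- as.integer(substr(sfo$Time, 1, 2))
sfo$Hour   # 18 2 18
```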

Data Manipulation Step 4 – Find Month of the event

[Screenshot: data manipulation step 4]
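A base-R sketch of finding the month of each event; the MM/DD/YYYY date format is an assumption about the file:

```r
sfo <- data.frame(Date = c("04/12/2013", "07/01/2013"), stringsAsFactors = FALSE)

# Month name from the Date string (MM/DD/YYYY format assumed)
sfo$Month <- months(as.Date(sfo$Date, format = "%m/%d/%Y"))
sfo$Month   # e.g. "April" "July" in an English locale
```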

Data Manipulation Step 5 – Group Similar Crimes

[Screenshot: data manipulation step 5]
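A sketch of collapsing similar crimes into one category. The grouping below is hypothetical; the post's actual category mapping is in the screenshot.

```r
sfo <- data.frame(Category = c("LARCENY/THEFT", "VEHICLE THEFT", "ROBBERY"),
                  stringsAsFactors = FALSE)

# Hypothetical grouping: fold theft-like categories into a single "THEFT"
theft_like <- c("LARCENY/THEFT", "VEHICLE THEFT", "BURGLARY", "STOLEN PROPERTY")
sfo$Category[sfo$Category %in% theft_like] <- "THEFT"
table(sfo$Category)   # ROBBERY: 1, THEFT: 2
```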

Data Visualization

  • Crimes by each Category:

[Screenshot: crimes by category]
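A sketch of the ggplot2 bar chart behind this screenshot, built from category frequencies; the plotting step is guarded so the snippet still runs where ggplot2 is not installed:

```r
sfo <- data.frame(Category = c("THEFT", "THEFT", "ASSAULT"), stringsAsFactors = FALSE)

counts <- as.data.frame(table(sfo$Category))   # frequency per category
names(counts) <- c("Category", "Incidents")

# Bar chart of incidents per category
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(counts, aes(x = Category, y = Incidents)) +
          geom_bar(stat = "identity"))
}
```

The time-of-day, day-of-week, and month charts below follow the same pattern, swapping the x column for the derived Hour, DayOfWeek, or Month field.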

  • Crimes by Time of the Day:

[Screenshot: crimes by time of the day]

  • Crimes by Day of the Week:

[Screenshot: crimes by day of the week]

  • Crimes by Month of the Year:

[Screenshot: crimes by month of the year]

Simple Spark & R Exploration

Please note that we use a different interpreter (%spark.r) instead of (%r) to tell Zeppelin to execute these paragraphs as Spark jobs rather than as ordinary R programs.

  • Spark Dataframe Setup:

Read the CSV file, much as with R's read.csv, and create a Spark DataFrame to explore. Caching the Spark DataFrame makes subsequent analysis faster, since Spark's lazy evaluation would otherwise re-run the whole chain of transformations for each action.

[Screenshot: Spark DataFrame setup]
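A sketch of what the %spark.r paragraph likely looks like, using the SparkR 1.x API that matches this era of Zeppelin. The file path and the use of the spark-csv package are assumptions, and this needs a live Spark session (with Zeppelin's injected sqlContext) to run:

```r
# %spark.r -- requires a running Spark + SparkR session
sfoDF <- read.df(sqlContext, "SFPD_Incidents.csv",
                 source = "com.databricks.spark.csv", header = "true")
cache(sfoDF)                        # keep the DataFrame in memory for repeated queries
registerTempTable(sfoDF, "SFODF")   # expose it to the %sql interpreter as SFODF
```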

  • SparkSQL – Group by Day of Week:

Please note that we use the SparkSQL interpreter (%sql) to work on the temp table (SFODF) that was registered in the previous section.

[Screenshot: SparkSQL group by day of week]
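The query in the screenshot is not recoverable, but a %sql paragraph over the SFODF temp table would look roughly like this (column names are assumptions from the dataset description):

```sql
%sql
SELECT DayOfWeek, COUNT(*) AS incidents
FROM SFODF
GROUP BY DayOfWeek
ORDER BY incidents DESC
```

Zeppelin renders the result with built-in table/bar/pie chart toggles; the crime-category variant below swaps DayOfWeek for Category.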

  • SparkSQL – Group by Crime Category:

[Screenshot: SparkSQL group by crime category]

  • SparkSQL – Group by Parameter:

Zeppelin supports form parameters so that users can view reports with different inputs. In this example, we provide two group-by categories (District and Resolution); choosing one from the dropdown updates the pie chart.

By Resolution:

[Screenshot: SparkSQL group by parameter – resolution]

By District:

[Screenshot: SparkSQL group by parameter – district]
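A sketch of the parameterized query using Zeppelin's dynamic select-form syntax, ${name=default,option1|option2}; the exact column names used for the dropdown are assumptions:

```sql
%sql
SELECT ${groupBy=Resolution,Resolution|PdDistrict} AS grp, COUNT(*) AS incidents
FROM SFODF
GROUP BY ${groupBy=Resolution,Resolution|PdDistrict}
```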


Zeppelin Dashboard

[Screenshot: Zeppelin dashboard]

Conclusion

  • Zeppelin provides built-in Apache Spark integration with Scala and PySpark. With a bit of custom integration, SparkR is also possible.
  • Automatic SparkContext and SQLContext injection, and sharing of the SparkContext across Scala, Python, and SparkR, come in very handy.
  • Zeppelin can dynamically create input forms in your notebook, as shown in "SparkSQL – Group by Parameter".

References
