Crime Analysis with Zeppelin, R & Spark



Apache Zeppelin is a web-based notebook that enables interactive data analytics, bringing data ingestion, data discovery, and data visualization together in one place. Zeppelin's interpreter concept allows any language or data-processing backend to be plugged in. Currently, Zeppelin supports many interpreters, such as Spark (Scala, Python, R, SparkSQL), Hive, JDBC, and others. Zeppelin can also be configured against an existing Spark ecosystem and share the SparkContext across Scala, Python, and R.

We have implemented SFO Crime Analysis with plain R, with Shiny & R, and with OpenRefine in the past; this time we use Zeppelin & R. We will also briefly show how Spark with R can be used on the same crime dataset. We suggest reviewing our previous blog posts, as this one reuses much of that material in a different environment. Please see the Reference section for links to the previous posts.

Use Case

This use case is based on San Francisco incidents derived from the SFPD (San Francisco Police Department) Crime Incident Reporting system for calendar year 2013. The dataset contains close to 130,000 records, each with the type of crime, the date and time of the incident, the day of the week, and the latitude and longitude of the location. We will analyze this dataset and extract some meaningful insights.

What we want to do:

  • Prerequisites
  • Download Crime Incident Dataset
  • Data Extraction & Exploration
  • Data Manipulation
  • Data Visualization
  • Simple Spark & R Exploration



Prerequisites

  • Install Apache Zeppelin with Spark

The installation steps are a bit involved and out of the scope of this post. However, there are many articles and blogs that show how to set up Zeppelin with its built-in Spark modules or integrate it with an existing Spark environment. Below are some installation references.

Download Crime Incident Dataset

  • Download Dataset: Download the dataset using the link below and unzip it into your working directory on the machine where Zeppelin is running.

SFPD dataset:

  • Understanding Dataset: The dataset contains the following columns. We will drop some columns, rename a few, and add a few extra columns to help our analysis.

      • IncidntNum: The incident number. Unfortunately, some of these incident numbers are duplicated.
      • Category: The category of the crime, e.g., Robbery, Fraud, Theft, Arson, and so on. This use case combines similar crimes to keep the number of categories manageable.
      • Descript: A description of the crime. We won't need this, so the column will be dropped.
      • DayOfWeek: The day of the week on which the incident happened.
      • Date: The date of the incident.
      • Time: The time of the incident (hour and minute).
      • PdDistrict: The police district the incident location falls in. We won't need this, so it will be dropped.
      • Resolution: What happened to the culprits in the incident.
      • Address: The street address of the incident.
      • X: The longitude of the location. This column will be renamed to longitude.
      • Y: The latitude of the location. This column will be renamed to latitude.
      • Location: The latitude and longitude as a comma-separated pair. We won't need this, so it will be dropped.

Basic Setup


  • Install Packages: This use case requires the "chron" library for some date and time manipulation, and "ggplot2" for visualization.
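The original setup paragraph was not preserved; a minimal sketch of a Zeppelin `%r` paragraph for the two packages named above (the CRAN mirror is an assumption) might look like:

```r
%r
# Install once, then load on every run
install.packages(c("chron", "ggplot2"), repos = "https://cran.r-project.org")
library(chron)    # date/time manipulation
library(ggplot2)  # visualization
```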

Structure of the Data
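The original code for this step was not preserved; assuming the unzipped CSV sits in the working directory (the file name `sfpd_incidents_2013.csv` and the variable name `sfo` are placeholders), a sketch:

```r
%r
# Read the unzipped CSV and inspect its structure
sfo <- read.csv("sfpd_incidents_2013.csv", stringsAsFactors = FALSE)
str(sfo)   # column names, types, and a preview of the values
```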


Summary of the Data
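Continuing the sketch above (same hypothetical `sfo` data frame), a quick statistical overview:

```r
%r
# Per-column summaries: counts for character columns, quartiles for numerics
summary(sfo)
```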


Data Manipulation Step 1 – Order & Remove Duplicates
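The original code was lost; based on the step title and the note above that some incident numbers are duplicated, one plausible sketch:

```r
%r
# Order by incident number, then keep only the first row for each IncidntNum
sfo <- sfo[order(sfo$IncidntNum), ]
sfo <- sfo[!duplicated(sfo$IncidntNum), ]
nrow(sfo)  # row count after de-duplication
```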


Data Manipulation Step 2
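The text does not say what Step 2 does. Based on the column notes earlier (drop Descript, PdDistrict, and Location; rename X and Y), one plausible sketch:

```r
%r
# Drop columns we will not use, and rename the coordinate columns
sfo <- sfo[, !(names(sfo) %in% c("Descript", "PdDistrict", "Location"))]
names(sfo)[names(sfo) == "X"] <- "longitude"
names(sfo)[names(sfo) == "Y"] <- "latitude"
```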


Data Manipulation Step 3
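Step 3 is also unlabeled. Since the chron package was installed for date and time manipulation and a later chart groups crimes by time of day, a plausible sketch that derives the hour from the Time column (assumed to be in "HH:MM" form):

```r
%r
# Derive the hour of day from the Time column using chron
sfo$hour <- hours(times(paste0(sfo$Time, ":00")))
```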


Data Manipulation Step 4 – Find Month of the event
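A sketch matching the step title (the Date format "%m/%d/%Y" is an assumption about the SFPD export):

```r
%r
# Parse the Date column and extract the month name for each incident
sfo$month <- months(as.Date(sfo$Date, format = "%m/%d/%Y"))
```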


Data Manipulation Step 5 – Group Similar Crimes
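The original mapping of similar crimes was not preserved; the category values below are purely illustrative of the recoding technique:

```r
%r
# Collapse similar categories into broader groups (mapping is illustrative only)
sfo$Category[sfo$Category %in% c("LARCENY/THEFT", "VEHICLE THEFT")] <- "THEFT"
sfo$Category[sfo$Category %in% c("FORGERY/COUNTERFEITING", "BAD CHECKS")] <- "FRAUD"
table(sfo$Category)  # check the reduced set of categories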


Data Visualization

  • Crimes by each Category:
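The original plot code was lost; a ggplot2 sketch against the hypothetical `sfo` data frame from the earlier steps:

```r
%r
ggplot(sfo, aes(x = Category)) +
  geom_bar() +
  coord_flip() +   # horizontal bars keep long category labels readable
  labs(title = "Crimes by Category", x = "Category", y = "Number of incidents")
```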


  • Crimes by Time of the Day:
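A sketch using the `hour` column assumed to have been derived in Step 3:

```r
%r
ggplot(sfo, aes(x = hour)) +
  geom_bar() +
  labs(title = "Crimes by Time of the Day", x = "Hour of day", y = "Number of incidents")
```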


  • Crimes by Day of the Week:
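A sketch for the weekday chart; the factor conversion keeps the x-axis in calendar order rather than alphabetical:

```r
%r
sfo$DayOfWeek <- factor(sfo$DayOfWeek,
                        levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                   "Friday", "Saturday", "Sunday"))
ggplot(sfo, aes(x = DayOfWeek)) +
  geom_bar() +
  labs(title = "Crimes by Day of the Week", x = "Day", y = "Number of incidents")
```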


  • Crimes by Month of the Year:
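A sketch using the `month` column assumed to have been derived in Step 4 (`month.name` is R's built-in vector of month names, used here to force calendar order):

```r
%r
sfo$month <- factor(sfo$month, levels = month.name)
ggplot(sfo, aes(x = month)) +
  geom_bar() +
  labs(title = "Crimes by Month of the Year", x = "Month", y = "Number of incidents")
```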


Simple Spark & R Exploration

Please note that we use a different interpreter (%spark.r) instead of (%r) to tell Zeppelin to execute these paragraphs as Spark jobs rather than ordinary R programs.

  • Spark Dataframe Setup:

Read the CSV file, much as with R's CSV read, and create a Spark DataFrame to explore. Caching the Spark DataFrame helps subsequent analyses run faster, since Spark's laziness would otherwise force it to re-execute the whole chain of transformations on every action.
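The original `%spark.r` paragraph was lost; a sketch of the steps just described (the path is a placeholder, and depending on your Spark version `read.df` may need a `sqlContext` as its first argument and `registerTempTable` may be replaced by `createOrReplaceTempView`):

```r
%spark.r
# Read the CSV into a Spark DataFrame, cache it, and register it for %sql use
sfodf <- read.df("/path/to/sfpd_incidents_2013.csv",
                 source = "csv", header = "true", inferSchema = "true")
cache(sfodf)                        # avoid re-reading the file on every action
registerTempTable(sfodf, "SFODF")   # makes the data visible to %sql paragraphs
```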


  • SparkSQL – Group by Day of Week:

Please note that we use the SparkSQL interpreter (%sql) to work on the temp table (SFODF) that was registered in the previous section.
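The original query was not preserved; a sketch of a `%sql` paragraph against the SFODF temp table (Zeppelin renders the result as a table or chart):

```sql
%sql
SELECT DayOfWeek, COUNT(*) AS incidents
FROM SFODF
GROUP BY DayOfWeek
ORDER BY incidents DESC
```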


  • SparkSQL – Group by Crime Category:
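A sketch of the corresponding `%sql` paragraph for the category breakdown:

```sql
%sql
SELECT Category, COUNT(*) AS incidents
FROM SFODF
GROUP BY Category
ORDER BY incidents DESC
```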


  • SparkSQL – Group by Parameter:

Zeppelin supports form parameters so that users can view reports with different inputs. In this example, we provide two group-by columns (District and Resolution); choosing one from the dropdown updates the pie chart.
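A sketch using Zeppelin's dynamic-form syntax `${name=default,option1|option2}`, which renders a dropdown above the paragraph (column names follow the dataset description above):

```sql
%sql
SELECT ${groupBy=PdDistrict,PdDistrict|Resolution}, COUNT(*) AS incidents
FROM SFODF
GROUP BY ${groupBy=PdDistrict,PdDistrict|Resolution}
```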

By Resolution:


By District:



Zeppelin Dashboard



  • Zeppelin provides built-in Apache Spark integration with Scala and PySpark. With a bit of custom integration, SparkR is also possible.
  • Automatic SparkContext and SQLContext injection, and sharing of the SparkContext across Scala, Python, and SparkR, come in very handy.
  • Zeppelin can dynamically create input forms in your notebook, as shown in SparkSQL – Group by Parameter.

