SFO City Crime Analysis with OpenRefine

SFO City Crime Analysis with OpenRefine

Introduction

This blog deals with the analysis of San Francisco Incidents Dataset and extraction of some meaningful insights using the OpenRefine tool. OpenRefine (formerly Google Refine) is a powerful tool which is meant for working with messy data, cleansing and transforming it from one format to other, like CSV, TSV, XML, HTML table, ODF spreadsheet, and Excel. Please refer to our SFO City Crime Analysis with R blog, where we have cleansed the San Francisco Incidents dataset using R programming.

In OpenRefine, the following operations can be performed without any complications :

  • Import data in various formats like TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents
  • Export datasets into other formats (CSV, TSV, XML, HTML table, ODF spreadsheet, Excel) in a matter of seconds
  • Apply basic and advanced cell transformations
  • Easily clustering the cell data
  • Fetch Cell data from external APIs or URLs
  • Filter and partition data easily using regular expressions
  • Perform advanced data operations with the General Refine Expression Language (GREL)

Use Case

Let’s reuse the same SFPD (San Francisco Police Department) Crime Incident Reporting system dataset, for the calendar year 2013  as discussed in the SFO City Crime Analysis with R blog. This SFPD crime incident dataset contains close to 130K records which are classified into the type of crime, date and time of the incident, day of the week, latitude and longitude of the incident.

In this use case we are going to perform cleansing operations using the OpenRefine tool. We will analyze this dataset and extract some meaningful insights with the help of OpenRefine tool.

What we need to do:

  • Prerequisites
  • Download the Crime Incident Dataset
  • Data Extraction & Exploration
  • Data Manipulation
  • Export cleansed datasets and Data Visualization
  • How to use Macros

Solution

Prerequisite

  • Install JDK 1.6+
  • Download OpenRefine tool server stable version
  • After the download process, extract the zip file. Run the OpenRefine Server in the command prompt
    CMD

Download The Crime Incident Dataset

  • To download and understand the SFPD Crime Incident Reporting system dataset for the calendar year 2013, follow the steps explained in the section given below, from our previous blog “SFO City Crime Analysis with R”.
    • Download Crime Incident Dataset (Ignore Install Packages sub topic).

Data Extraction & Exploration

  • Open ‘OpenRefine‘ (http://127.0.0.1:3333/) in the browser.
  • From the OpenRefine home page, select Create Project -> choose file “sfpd_incident_2013.csv” , from the downloaded location and click on Next. It will lead to the project preview page.
  • On the Project preview page, rename the project as “sfpd_incident_2013”, and if possible, configure the other parsing options as well. For this use case, no other task, apart from renaming the project needs to be done. Next, click on the “Create Project >>” button.
  • After clicking on the Create Project button, the csv file gets uploaded into the heap memory for a while, and then gets stored in the workspace location.
  • Once the loading process completes, the project will open up in the browser as shown below.
    create

Data Manipulation

Data Manipulation 1: Remove Duplicates – based on IncidntNum column

  • To sort the IncidentNum in ascending order:

manip

  • To find and remove the duplicates rows based on IncidentNum:

dup

  • After removing the blank records, we can see the unique values by clicking on false facet.facet

Data Manipulation 2: Create new incident_time column with the format (HH:mm:ss) based on existing the Time column

hhmm

Data Manipulation 3: Create the new incident_date column with the format (yyyy-MM-dd) based on existing Date column

yymm

Data Manipulation 4: Create new incident_date_time column by merging both incident_date and incident_time columns

MERG

Data Manipulation 5: Create Address column based on PdDistrict and location columns

d5

Data Manipulation 6: Rename X and Y as longitude and latitude

d6

Data Manipulation 7: Grouping the incidents based on incident_time column and create a new column incident_time_tag

d7

Data Manipulation 8: Grouping the Category column and create new column crime_category

d8

Data Manipulation 9: Remove unwanted columns

d9

Export cleansed datasets and Data Visualization

  • Export Data: OpenRefine provides a method to export the cleansed datasets in various file formats like TSV, CSV, HTML table etc.
  • To export, select Export -> comma-separated value (If we want to export into other formats, select the corresponding option from the drop down menu).

exp

  • Data Visualization: OpenRefine provides Scatterplot facet option to visualize data. This option is useful to plot the visuals for columns which have only numeric values. For this use case, we plot visuals for the latitude and longitude columns.
  • To plot the visual, Select IncidntNum Facet -> Scatterplot facet.
    Lang vs Lat

How to use Macros

  • In this blog, we have done the cleansing operations for only SFPD Crime Incident Report for the calendar year 2013.
  • We can perform similar multiple operations for several other years (e.g. SFPD Crime Report for 2011, 2012, 2014) by applying a single macro. We need not repeat all the steps.
  • After completing data manipulation, we can easily extract the macro for all above steps and apply for other datasets which are similar to this dataset.
  • We applied those macros on new datasets to manipulate data at once, which help us avoid repetition of steps.

To extract the macro:

  • Click on Undo/Redo Tab -> Extract… on the current project page.
  • Select macros by choosing the checkbox and copy the macros from the clipboard and put them into use.

3.6

To apply the macro:

  • Goto New Project which is similar to this dataset
  • Click on Undo/Redo Tab -> Apply
  • Paste the macro in the text area and click on Perform Operations.

newpj

Conclusion

OpenRefine tool is especially created for a Non-Programmer and End-Users for cleansing their datasets according to their requirements. OpenRefine is one of the popular tools that is capable of performing Statistical Analysis, Text analysis, Classification, Clustering of data and for exporting data into other formats.
There are lots of public datasets, ranging across domains such as City Management Ethics, Transportation, Health and social service that can be used for data manipulation with the OpenRefine tool.

References

OpenRefine:
http://openrefine.org/
http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial
http://casci.umd.edu/wp-content/uploads/2013/12/OpenRefine-tutorial-v1.5.pdf

Prepare SQL using OpenRefine:
http://googlerefine.blogspot.ca/2014/04/prepare-sql-query-using-openrefine.html

GREL Functions:
https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions

OpenRefine Documentation :
Developer: https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Developers
Users: https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users

 

3445 Views 1 Views Today