Table of Contents
- 1 Overview
- 2 Pre-requisites
- 3 Dataset Description
- 4 Use Case
- 5 Accessing Data
- 6 Preparing Data
- 7 Performing Exploratory Data Analysis
- 8 Building Machine Learning Model
- 9 Validating Model
- 10 Executing Model
- 11 Conclusion
- 12 References
Overview
Deep Learning (DL) and Machine Learning (ML) are now widely used to analyze data and make predictions from it. One application is crime prediction, which helps in crime prevention and enhances public safety. An autoencoder, a simple 3-layer neural network, is used for dimensionality reduction and for extracting key features from the data.
Data Engineers spend much time building an analytic model with proper validation metrics in order to improve the performance of the model. Data Analysts spend much time building data pipelines as part of Big Data Analytics. Machine Learning models are developed in these pipelines, each with its own functionalities and features. Once passed through the analytical pipeline, these models are easily deployed for real-time processing.
This blog is part one of a two-part series of Crime Analysis using H2O Autoencoders. In this blog, let us discuss building the analytical pipeline and applying Deep Learning to predict the arrest status of the crimes happening in Los Angeles (LA).
Pre-requisites
Install the following in R:
- H2O from the below repository:
Command to install – install.packages("h2o", type="source", repos="https://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/2/R")
Dataset Description
The Los Angeles crime dataset for 2016-2017, with 224K records and 27 attributes, is used as the source file. This dataset is an open data resource for governments, non-profit organizations, and NGOs.
Use Case
- Predict the arrest status of the crimes happening in Los Angeles.
- Build an analytical pipeline.
- Analyze the performance of Autoencoders.
- Build deep learning and machine learning models.
- Apply required mechanisms to increase the performance of the models.
The analytical pipeline consists of the following steps:
- Access data
- Prepare data
  - Clean data
  - Preprocess data
- Perform Exploratory Data Analysis (EDA)
- Build Machine Learning model
  - Initialize H2O cluster
  - Impute data
  - Train model
- Validate model
- Execute model
  - Pre-trained supervised model
Accessing Data
The crime dataset is obtained from https://dev.socrata.com/ and imported into the database. The Socrata APIs provide rich query functionality through a query language called “Socrata Query Language” or “SoQL”.
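As a rough illustration of SoQL, the sketch below builds a SODA request URL in base R. The host data.lacity.org, the placeholder resource ID xxxx-xxxx, the column name date_occ, and the helper build_soql_url are assumptions made for illustration and are not details given in this blog. $where and $limit are standard SoQL parameters.

```r
# Build a SODA API URL with SoQL parameters (hypothetical helper).
build_soql_url <- function(host, resource_id, where, limit = 1000) {
  query <- c("$where" = where, "$limit" = as.character(limit))
  # Percent-encode the parameter values so the URL contains no spaces or quotes.
  values <- vapply(query, utils::URLencode, character(1), reserved = TRUE)
  paste0("https://", host, "/resource/", resource_id, ".json?",
         paste0(names(query), "=", values, collapse = "&"))
}

url <- build_soql_url(
  host = "data.lacity.org",                      # assumed host
  resource_id = "xxxx-xxxx",                     # placeholder dataset ID
  where = "date_occ >= '2016-01-01T00:00:00'",   # assumed column name
  limit = 5
)
print(url)
```

The resulting URL could then be read with a JSON reader such as jsonlite::fromJSON(url), once the placeholder ID is replaced with a real dataset ID.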
The data structure is as follows:
Preparing Data
In this section, let us discuss data preparation for building a model.
Data cleansing is performed to find NA values in the dataset. These NA values should be either removed or imputed to get the desired data.
To get the count of NA values and view the results, use the below commands:
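A minimal sketch of such commands in base R, assuming the data is held in a data frame named crime. The toy values and the area_id column are made up for illustration; the other column names come from this blog.

```r
# Toy stand-in for the crime data frame.
crime <- data.frame(
  crm_cd_2       = c(NA, NA, 998, NA),
  cross_street   = c(NA, "MAIN ST", NA, "5TH ST"),
  weapon_used_cd = c(400, NA, 102, 500),
  area_id        = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Count NA values per column and sort from most to least missing.
na_counts <- sort(colSums(is.na(crime)), decreasing = TRUE)
print(na_counts)

# View the counts as a bar chart.
barplot(na_counts, las = 2, main = "Total Number of NA Values for Each Column")
```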
Total Number of NA Values for Each Column
From the above diagram, it is evident that the attributes crm_cd_2, crm_cd_3, crm_cd_4, cross_street, premis_cd, and weapon_used_cd are repeated, so they are removed from the dataset.
Data preprocessing, such as data type conversion, date conversion, derivation of month, year, and week from the date field, and derivation of new attributes, is performed on the dataset. The date attribute is converted from a factor to a POSIXct object. The lubridate package is used to get fields such as the month, year, and week from this object. The chron package is used along with the time attribute to derive the crime time interval (Morning, Afternoon, Midnight, and so on).
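The preprocessing described above can be sketched in base R, without lubridate or chron, so that it runs without extra packages. The column names date_occ and time_occ, their formats, the cut points, and the Evening label are assumptions for illustration.

```r
# Toy records; column names and formats are assumed.
crime <- data.frame(
  date_occ = factor(c("2016-01-15", "2016-07-04", "2017-12-31")),
  time_occ = c("0130", "1345", "2210"),
  stringsAsFactors = FALSE
)

# Factor -> POSIXct (go through character so the factor codes are not used).
crime$date_occ <- as.POSIXct(as.character(crime$date_occ),
                             format = "%Y-%m-%d", tz = "UTC")

# Derive year, month, and week of the year.
# lubridate::year(), month(), and week() are the package equivalents.
crime$year  <- as.integer(format(crime$date_occ, "%Y"))
crime$month <- as.integer(format(crime$date_occ, "%m"))
crime$week  <- as.POSIXlt(crime$date_occ)$yday %/% 7 + 1

# Bucket the hour of day into time intervals; the cut points are illustrative.
hour <- as.integer(substr(crime$time_occ, 1, 2))
crime$interval <- cut(hour, breaks = c(0, 6, 12, 18, 24),
                      labels = c("Midnight", "Morning", "Afternoon", "Evening"),
                      right = FALSE)
print(crime)
```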
Performing Exploratory Data Analysis
EDA is performed on the crime dataset to gain better and more useful insights into the data.
Top 20 Crimes in Los Angeles
Month with Highest Crimes
Area with Highest Crime Percentage
Top 10 Descent Groups Getting Affected
Top 10 Frequently Used Weapons for Crime
Safest Living Places in Los Angeles
Building Machine Learning Model
In this section, let us discuss building the best Machine Learning model for our dataset using Machine Learning algorithms.
Initializing H2O Cluster
Before imputing the data, start an H2O cluster on port 12345 using h2o.init(). This cluster is accessed using http://localhost:12345/flow/index.html#.
In H2O, data imputation is performed using h2o.impute() to fill the NA values using default methods such as mean, median, and mode. The method is chosen based on the data type of each column. For example, factor or categorical columns are imputed using the mode.
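The two steps above might look like the following sketch. The helper names start_cluster and impute_all are hypothetical, and the explicit per-column loop that uses the mean for non-factor columns is an assumption. Only the port and the mode-for-factors rule come from this blog. The functions are defined but not called, so the snippet can be sourced without a running cluster.

```r
# Start (or connect to) a local H2O cluster on the port used in this blog.
start_cluster <- function(port = 12345) {
  h2o::h2o.init(ip = "localhost", port = port, nthreads = -1)
}

# Impute each column in place: mode for factors, mean for other columns.
# Assumes `frame` is an H2OFrame with only numeric and factor columns.
impute_all <- function(frame) {
  for (col in h2o::h2o.colnames(frame)) {
    method <- if (h2o::h2o.isfactor(frame[[col]])) "mode" else "mean"
    h2o::h2o.impute(frame, column = col, method = method)
  }
  invisible(frame)
}

# Usage (requires the h2o package and Java):
# start_cluster()
# crime_hex <- h2o::as.h2o(crime)
# impute_all(crime_hex)
```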
The dependent variable is derived by grouping the status codes of the crimes. The arrest status codes are grouped into Not Arrested and Arrested.
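A small base R sketch of this grouping follows. The status codes shown and the choice of which codes count as arrests are assumptions for illustration. The actual codes should be taken from the dataset's attribute description.

```r
# Example status codes (assumed values).
status <- c("AA", "IC", "JA", "AO", "IC")

# Codes assumed to denote an arrest.
arrest_codes <- c("AA", "JA")

# Collapse the codes into a two-level factor for use as the dependent variable.
arrest_status <- factor(
  ifelse(status %in% arrest_codes, "Arrested", "Not Arrested"),
  levels = c("Not Arrested", "Arrested")
)
print(table(arrest_status))
```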
The dataset is split into Train, Test, and Validation frames based on certain ratios specified using h2o.splitFrame(). Each frame is assigned to a separate variable using h2o.assign().
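The split could be written roughly as below. The function name split_crime_frame, the 60/20/20 ratios, the seed, and the frame keys are assumptions, since this blog does not state them.

```r
# Split an H2OFrame into train, validation, and test frames.
# Two ratios produce three frames; the remainder (here 20%) is the third.
split_crime_frame <- function(crime_hex, ratios = c(0.6, 0.2), seed = 42) {
  parts <- h2o::h2o.splitFrame(crime_hex, ratios = ratios, seed = seed)
  list(
    train = h2o::h2o.assign(parts[[1]], "train_hex"),
    valid = h2o::h2o.assign(parts[[2]], "valid_hex"),
    test  = h2o::h2o.assign(parts[[3]], "test_hex")
  )
}

# Usage (requires a running H2O cluster):
# frames <- split_crime_frame(crime_hex)
# train_hex <- frames$train
```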
To train the model, perform the following:
- Take the data pertaining to the year 2016 as the training set.
- Take the data pertaining to the year 2017 as the test set.
- Apply Deep Learning to the model.
- Perform Unsupervised classification to predict the arrest status of the crimes.
- Let the autoencoder model learn the patterns of the input data irrespective of the given class labels.
- Let the model learn the status behavior based on the features.
Function Used to Apply Deep Learning to Our Data: h2o.deeplearning
@param x – features for our model.
@param training_frame – dataset on which the model is trained.
@param model_id – string that identifies our model for saving and loading.
@param seed – for reproducibility.
@param hidden – sizes of the hidden layers, one value per layer.
@param epochs – number of passes the model makes over our dataset.
@param activation – a string representing the activation function to be used.
@params stopping_rounds, stopping_metric – control early stopping, which is evaluated on the validation frame when one is provided.
@param export_weights_and_biases – logical representing whether the weights and biases of the model should be exported.
@param autoencoder – logical representing whether autoencoders should be applied or not.
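Put together, a call using the parameters listed above might look like this sketch. The wrapper name train_autoencoder, the model ID, the layer sizes, the epoch count, the activation, and the seed are illustrative assumptions, not the values used for the results in this blog.

```r
# Train an autoencoder on the given feature columns of an H2OFrame.
train_autoencoder <- function(train_hex, features) {
  h2o::h2o.deeplearning(
    x = features,                       # names of the predictor columns
    training_frame = train_hex,
    model_id = "crime_autoencoder",     # hypothetical model ID
    autoencoder = TRUE,                 # no response column is needed
    hidden = c(10, 2, 10),              # three hidden layers; sizes are illustrative
    epochs = 100,
    activation = "Tanh",
    seed = 42,
    reproducible = TRUE,                # slower, but makes the seed effective
    ignore_const_cols = FALSE,
    export_weights_and_biases = TRUE
  )
}

# Usage (requires a running H2O cluster):
# features <- setdiff(h2o::h2o.colnames(train_hex), "arrest_status")
# autoencoder <- train_autoencoder(train_hex, features)
```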
The above diagram shows the summary of our Autoencoders model and its performance for our training set.
A classification problem is encountered because a Gaussian distribution is applied to our model instead of a Binomial classification.
As the above results are not satisfactory, the dimensionality of our model is reduced to get better results. The features of one of the hidden layers are extracted and the results are plotted to classify the arrest status using the deep features function (h2o.deepfeatures) of the H2O package.
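The extraction and plotting step can be sketched as follows. The helper names, the label column name arrest_status, and the choice of layer 2 are assumptions. The plot helper also assumes the chosen layer has at least two units.

```r
# Extract the activations of one hidden layer and return them with the label.
deep_features_df <- function(autoencoder, frame, layer = 2,
                             label_col = "arrest_status") {
  feats <- h2o::h2o.deepfeatures(autoencoder, frame, layer = layer)
  as.data.frame(h2o::h2o.cbind(feats, frame[, label_col]))
}

# Scatter plot of the first two deep features, colored by arrest status.
plot_deep_features <- function(df, label_col = "arrest_status") {
  plot(df[[1]], df[[2]],
       col = as.integer(factor(df[[label_col]])),
       xlab = names(df)[1], ylab = names(df)[2])
}

# Usage (requires a running H2O cluster and a trained autoencoder):
# df <- deep_features_df(autoencoder, train_hex)
# plot_deep_features(df)
```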
From the above results, the arrest status of the crimes cannot be determined exactly.
So, dimensionality reduction with our autoencoder model alone is not sufficient to identify the arrest status in this dataset. The reduced-dimension representation from one of our hidden layers is used as features for model training. Supervised classification is applied to the extracted features and the results are tested.
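One way to train a supervised model on the extracted features is sketched below. The function name, the label column arrest_status, the layer index, and the network settings are assumptions for illustration.

```r
# Train a supervised classifier on deep features from the autoencoder.
train_on_deep_features <- function(autoencoder, train_hex, valid_hex,
                                   layer = 2, label_col = "arrest_status") {
  # Replace the original predictors with the hidden-layer activations.
  to_features <- function(frame) {
    feats <- h2o::h2o.deepfeatures(autoencoder, frame, layer = layer)
    h2o::h2o.cbind(feats, frame[, label_col])
  }
  train_feats <- to_features(train_hex)
  valid_feats <- to_features(valid_hex)

  h2o::h2o.deeplearning(
    x = setdiff(h2o::h2o.colnames(train_feats), label_col),
    y = label_col,                      # must be a factor for classification
    training_frame = train_feats,
    validation_frame = valid_feats,
    hidden = c(10, 10),                 # illustrative sizes
    epochs = 100,
    seed = 42
  )
}
```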
Validating Model
To validate the performance of our model, the validation parameters set while building the model are used to plot the ROC curve and get the AUC value on our validation frame. A detailed overview of our model is obtained using the summary() function.
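A short sketch of this validation step, assuming a trained binomial model and a validation frame that contains the label column. The wrapper name validate_model is hypothetical.

```r
# Plot the ROC curve, print the model summary, and return the AUC.
validate_model <- function(model, valid_hex) {
  perf <- h2o::h2o.performance(model, newdata = valid_hex)
  plot(perf, type = "roc")   # ROC curve on the validation frame
  summary(model)             # detailed overview of the model
  h2o::h2o.auc(perf)
}

# Usage (requires a running H2O cluster):
# auc <- validate_model(model, valid_hex)
```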
Executing Model
To predict the arrest status of the crimes, perform the following:
- Apply the deep features to the dataset.
- Use our model to predict the arrest status.
- Plot the ROC curve with AUC values based on Sensitivity and Specificity.
- Group the results based on the predicted and actual values with the total number of classes and their frequencies.
- Decide the performance of our model on the arrest status of the crimes.
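The grouping and AUC steps above can be illustrated in base R on made-up values. The labels, probabilities, and the 0.5 threshold are toy assumptions and are not results from this blog. The AUC is computed with the rank-sum formula.

```r
lv <- c("Not Arrested", "Arrested")
actual <- factor(c("Arrested", "Not Arrested", "Arrested",
                   "Not Arrested", "Arrested", "Not Arrested"), levels = lv)
p_arrested <- c(0.9, 0.2, 0.6, 0.7, 0.8, 0.1)   # toy predicted probabilities

# Predicted class at a 0.5 threshold, then actual vs predicted frequencies.
predicted <- factor(ifelse(p_arrested >= 0.5, "Arrested", "Not Arrested"),
                    levels = lv)
confusion <- table(actual = actual, predicted = predicted)
print(as.data.frame(confusion))

# Sensitivity and specificity from the confusion table.
sensitivity <- confusion["Arrested", "Arrested"] / sum(confusion["Arrested", ])
specificity <- confusion["Not Arrested", "Not Arrested"] /
  sum(confusion["Not Arrested", ])

# AUC as the probability that an arrested case is ranked above a non-arrested one.
pos <- actual == "Arrested"
auc <- (sum(rank(p_arrested)[pos]) - sum(pos) * (sum(pos) + 1) / 2) /
  (sum(pos) * sum(!pos))
print(c(sensitivity = sensitivity, specificity = specificity, auc = auc))
```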
From the above diagram, the predicted number of Not Arrested cases is 28 and the predicted number of Arrested cases is 150. As these numbers are low, this model will cause a slight problem in maintaining the historical records when used in real time.
Pre-trained Supervised Model
The autoencoder model is used as a pre-training input for a supervised model and its weights are used for model fitting. The same training and validation sets are used for the supervised model. A parameter called pretrained_autoencoder is added in our model along with the autoencoder model name.
This pre-trained model is used to predict the results of our new data and to find the probability of classes for our new data.
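A sketch of the pre-trained model and its use for prediction is given below. The function names, model IDs, label column, and network settings are assumptions. The hidden layer sizes are assumed to need to match those of the autoencoder being reused.

```r
# Train a supervised model initialized from a previously trained autoencoder.
train_pretrained <- function(autoencoder_id, train_hex, valid_hex, features,
                             label_col = "arrest_status") {
  h2o::h2o.deeplearning(
    x = features,
    y = label_col,
    training_frame = train_hex,
    validation_frame = valid_hex,
    model_id = "crime_pretrained",            # hypothetical model ID
    pretrained_autoencoder = autoencoder_id,  # ID of the autoencoder model
    hidden = c(10, 2, 10),                    # assumed to match the autoencoder
    epochs = 100,
    activation = "Tanh",
    seed = 42,
    reproducible = TRUE,
    ignore_const_cols = FALSE
  )
}

# Predict classes and class probabilities for new data.
predict_status <- function(model, new_hex) {
  as.data.frame(h2o::h2o.predict(model, new_hex))
}

# Usage (requires a running H2O cluster):
# model <- train_pretrained("crime_autoencoder", train_hex, valid_hex, features)
# head(predict_status(model, test_hex))
```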
The results are grouped based on the actual and predicted values and the performance of our model is decided based on the arrest status of the crimes.
From the above results, it is evident that the results differ only slightly from our previous results obtained with the dimensionality representation. Let us plot the ROC curves and AUC values to compare both the results.
Conclusion
In this blog, we discussed creating the analytical pipeline for the Los Angeles crime dataset, applying the Autoencoders to the dataset, performing both Unsupervised and Supervised Classifications, extracting the dimensionality representation of our model, and applying the Supervised model.
In our next blog on Crime Analysis Using H2O Autoencoders – Part 2, let us discuss deploying the model by converting it into POJO/MOJO objects with the help of H2O functions.
References
- Sample Dataset
- Sample Dataset in JSON Format
- Sample Dataset Attribute Description
- City of Los Angeles Dataset API
- Queries using SODA
- Building deep neural nets with h2o
- H2o Deep Learning
- Sample Dataset in GitHub