Embrace Relationships with Neo4J, R & Java


Introduction

Graphs are everywhere, used by everyone, for everything. Neo4j is one of the most popular graph databases and can be used to make recommendations, model social networks, find paths, uncover fraud, manage networks, and so on. A graph database can store any kind of data using nodes (graph data records), relationships (connections between nodes), and properties (named data values).

A graph database suits connected data that is hard to model in relational or other NoSQL databases, which lack first-class relationships and efficient multi-depth traversals. Graph databases embrace relationships, which naturally form paths, and querying or traversing the graph involves following those paths. Because the data model is fundamentally path-oriented, most path-based operations align closely with the way the data is laid out, which makes them very efficient.
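To make the idea of following paths concrete, here is a minimal in-memory sketch in Java. It is only an illustration of nodes, relationships, properties, and a depth-limited traversal; it is not how Neo4j stores or queries data. The class name TinyGraph, the node ids, and the relationship types ANSWERED, ANSWER_TO, and TAGGED are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative property graph; NOT Neo4j's storage format or API. */
class TinyGraph {
    // Node id -> properties (named data values).
    private final Map<String, Map<String, String>> nodes = new HashMap<>();
    // Node id -> outgoing relationships as {type, target id} pairs.
    private final Map<String, List<String[]>> outgoing = new HashMap<>();

    void addNode(String id, String key, String value) {
        nodes.computeIfAbsent(id, k -> new HashMap<>()).put(key, value);
    }

    String property(String id, String key) {
        return nodes.getOrDefault(id, Collections.<String, String>emptyMap()).get(key);
    }

    void relate(String from, String type, String to) {
        outgoing.computeIfAbsent(from, k -> new ArrayList<>()).add(new String[] {type, to});
    }

    /** Ids reachable from start by following at most maxDepth relationships. */
    Set<String> reachable(String start, int maxDepth) {
        Set<String> visited = new LinkedHashSet<>();
        visited.add(start);
        List<String> frontier = Collections.singletonList(start);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String id : frontier) {
                for (String[] rel : outgoing.getOrDefault(id, Collections.<String[]>emptyList())) {
                    if (visited.add(rel[1])) {
                        next.add(rel[1]); // first time this node is reached
                    }
                }
            }
            frontier = next;
        }
        visited.remove(start); // report only the nodes reached, not the start
        return visited;
    }

    public static void main(String[] args) {
        TinyGraph g = new TinyGraph();
        g.addNode("user_1", "name", "Trevor");
        g.addNode("answer_1", "score", "5");
        g.addNode("question_1", "views", "250");
        g.addNode("lang_java", "name", "Java");
        // Hypothetical relationship types, for illustration only.
        g.relate("user_1", "ANSWERED", "answer_1");
        g.relate("answer_1", "ANSWER_TO", "question_1");
        g.relate("question_1", "TAGGED", "lang_java");

        System.out.println(g.reachable("user_1", 1)); // [answer_1]
        System.out.println(g.reachable("user_1", 3)); // [answer_1, question_1, lang_java]
    }
}
```

Each hop only looks at the relationships stored for the current node, which is the property that makes deep traversals cheap in a graph model.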

Use Case

This use case is based on a modified version of the StackOverflow dataset. It shows a network of programming languages, the questions that refer to these languages, and the users who asked and answered these questions. Connecting these nodes with relationships in a Neo4j graph database reveals insights that are hard to obtain from a common relational database or other NoSQL databases.

What we want to do:

  • Prerequisites
  • Download StackOverflow Dataset
  • Data Manipulation with R
  • Create Nodes and Relationship file with Java
  • Create GraphDB with Batch Importer
  • Visualize Graph with Neo4j

Solution

Prerequisites

  • Download and Install Neo4j: We will be using Neo4j 2.x, which is very easy to install on Windows. Follow the instructions at the link below to download and install it.

Note: Neo4j 2.x requires JDK 1.7 or above.

http://www.neo4j.org/download/windows

  • Download and Install RStudio: We will use R to manipulate the StackOverflow dataset, which is available in RData format. This includes filtering rows, altering and dropping columns, and other steps. It is done to show the power of R for data manipulation; the same can be done in other programming languages as well. Download the open source edition of RStudio from the link below.

http://www.rstudio.com/products/rstudio/#Desk

Download StackOverflow Dataset

  • Download Dataset: This use case is based on a modified version of the StackOverflow dataset, which is rather old and is available in both CSV and RData format. Use the links below to download it. The first link describes the various fields and the second link downloads the RData file.

http://www.ics.uci.edu/~duboisc/StackOverflow

http://www.ics.uci.edu/~duboisc/StackOverflow/answers.Rdata

  • Understanding Dataset:

We will be mostly interested in the following fields which will be used to create nodes and relationships in Neo4j.

qid: Unique question id
i: User id of questioner
qs: Score of the question
tags: a comma-separated list of the tags associated with the question that refers to programming languages
qvc: Number of views of this question
aid: Unique answer id
j: User id of answerer
as: Score of the answer
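As a rough sketch of how one row with these fields could be represented in Java: the class name AnswerRow, the field names on it, and the sample values in main are made up for illustration, and the row is assumed to have already been split into fields by a CSV parser. Columns are looked up by header name, so no particular column order is assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One row of the dataset, restricted to the fields listed above (hypothetical class). */
class AnswerRow {
    final long qid;            // unique question id
    final long questionerId;   // i: user id of questioner
    final int questionScore;   // qs
    final List<String> tags;   // tags, split on commas
    final long views;          // qvc
    final long aid;            // unique answer id
    final long answererId;     // j: user id of answerer
    final int answerScore;     // as

    private AnswerRow(long qid, long questionerId, int questionScore, List<String> tags,
                      long views, long aid, long answererId, int answerScore) {
        this.qid = qid;
        this.questionerId = questionerId;
        this.questionScore = questionScore;
        this.tags = tags;
        this.views = views;
        this.aid = aid;
        this.answererId = answererId;
        this.answerScore = answerScore;
    }

    /** Builds a row from already-split CSV fields, matching columns by header name. */
    static AnswerRow from(String[] header, String[] fields) {
        Map<String, String> byName = new HashMap<>();
        for (int k = 0; k < header.length; k++) {
            byName.put(header[k], fields[k].trim());
        }
        return new AnswerRow(
                Long.parseLong(byName.get("qid")),
                Long.parseLong(byName.get("i")),
                Integer.parseInt(byName.get("qs")),
                splitTags(byName.get("tags")),
                Long.parseLong(byName.get("qvc")),
                Long.parseLong(byName.get("aid")),
                Long.parseLong(byName.get("j")),
                Integer.parseInt(byName.get("as")));
    }

    /** Splits the comma-separated tag list, dropping blanks. */
    static List<String> splitTags(String raw) {
        List<String> result = new ArrayList<>();
        for (String tag : raw.split(",")) {
            String trimmed = tag.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample values, not taken from the real dataset.
        String[] header = {"qid", "i", "qs", "tags", "qvc", "aid", "j", "as"};
        String[] fields = {"1", "101", "3", "php, java", "250", "2", "202", "5"};
        AnswerRow row = AnswerRow.from(header, fields);
        System.out.println(row.qid + " tagged " + row.tags); // 1 tagged [php, java]
    }
}
```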

 

Data Manipulation with R

We will reshape the dataset to fit our needs and see the power of data manipulation with R. The actual RData file contains around 250K rows, but this use case trims it down to keep the example small and interesting.

  • Open RStudio and Set Working Directory: Open RStudio and set the working directory to the folder where the RData file was downloaded.
  • Load and Perform Data Manipulation:
    [Screenshots: R data manipulation script and console output]

Note: Ignore the warning message

Create Nodes and Relationship file with Java

We will write a Java program that takes the finaldata.csv file generated by the R program above and creates multiple node files and a single relationship file that contains the relations between the nodes. Our node and relationship structure is as follows:

Nodes: question_nodes, answer_nodes, user_nodes, lang_nodes
Relationships: The following are the relationships

  • Details about Java Program: This Java program is self-explanatory and simply creates the node and relationship files in the CSV format needed by the Neo4j Batch Importer program. A few things to keep in mind about the Java program:
    • The format of Nodes file is as follows:

    • The format of Relationship file is as follows:

    • lang_nodes is created manually, as it is static. All other node files and the relationship file are generated programmatically.

    • finaldata.csv is renamed to sodata.csv (optional)
    • The dataset doesn’t come with the names of questioners and answerers, so we downloaded some fictional names and associated them with the user ids. This makes more sense when we view them in the Neo4j graphical interface. A file of around 1,500 fictional names was created from http://homepage.net/name_generator/ and stored as “random_names.txt”.

  • Java Program to Create Nodes & Relationships:

Note: The program below depends only on the OpenCSV library, which can be downloaded from http://sourceforge.net/projects/opencsv/

    • Output of the Program: 

Run the above program from the command line or within Eclipse to create question_nodes.csv, answer_nodes.csv, user_nodes.csv, and rels.csv. A zip file of the nodes and relationship files is also available for download, so they can be run through Batch Importer directly to create the graph DB.
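The following is a rough sketch of the kind of generation described in this section, not the original program. It assigns fictional names to user ids and emits user node rows and relationship rows. Several things here are assumptions: the class name NodeRelSketch, the tab-separated layout, the “uid_” prefix, and the relationship types ASKED, ANSWERED, and ANSWER_TO are all hypothetical (only the “qid_” and “aid_” prefixes match the node names seen in the graph). The exact header and column layout expected by Batch Importer is described in its readme and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: builds user node rows and relationship rows in memory. */
class NodeRelSketch {
    private final List<String> names;  // fictional names, e.g. loaded from random_names.txt
    private final Map<String, String> userNames = new LinkedHashMap<>(); // user id -> name
    private final Set<String> relRows = new LinkedHashSet<>();           // de-duplicated rows

    NodeRelSketch(List<String> names) {
        this.names = names;
    }

    /** Registers the user on first sight and returns its node id. */
    private String userNode(String userId) {
        if (!userNames.containsKey(userId)) {
            // Names are reused cyclically if there are more users than names.
            userNames.put(userId, names.get(userNames.size() % names.size()));
        }
        return "uid_" + userId;
    }

    /** Adds the relationships implied by one (question, answer) row. */
    void addRow(String qid, String askerId, String aid, String answererId) {
        String question = "qid_" + qid;
        String answer = "aid_" + aid;
        // Relationship type names are assumptions made for this sketch.
        relRows.add(String.join("\t", userNode(askerId), question, "ASKED"));
        relRows.add(String.join("\t", userNode(answererId), answer, "ANSWERED"));
        relRows.add(String.join("\t", answer, question, "ANSWER_TO"));
    }

    /** One row per user: node id, then the fictional name. */
    List<String> userRows() {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : userNames.entrySet()) {
            rows.add("uid_" + e.getKey() + "\t" + e.getValue());
        }
        return rows;
    }

    /** One row per relationship: start node, end node, type. */
    List<String> relRows() {
        return new ArrayList<>(relRows);
    }

    public static void main(String[] args) {
        NodeRelSketch sketch = new NodeRelSketch(Arrays.asList("Trevor", "Audrey"));
        // Made-up ids: question 1 asked by user 101 has two answers from user 202.
        sketch.addRow("1", "101", "2", "202");
        sketch.addRow("1", "101", "3", "202");
        sketch.userRows().forEach(System.out::println);
        sketch.relRows().forEach(System.out::println);
    }
}
```

Because a question with several answers appears on several input rows, the relationship rows are kept in a set so that the same user-to-question row is written only once.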

Create GraphDB with Batch Importer

  • Download and Set up Batch Importer: Batch Importer is a separate library that creates the graph database data files that Neo4j needs. Its input is configured in the batch.properties file, which indicates the files to use as nodes and relationships. More details about Batch Importer can be found in the readme at https://github.com/jexp/batch-import/tree/20

Download Link: https://dl.dropboxusercontent.com/u/14493611/batch_importer_20.zip

Note: Unzip it to the location where the node and relationship files were created by the Java program.

[Screenshot: sample folder structure]

    • Create batch.properties: Create the batch.properties file. Each property is explained in more detail on the Batch Importer site. The most important properties are the ones that define the node and relationship input files.
    • Execute Batch Importer: Run import.bat from the Batch Importer directory, passing batch.properties and the name of the graph DB file to create.

Visualize Graph with Neo4j

  • Copy graph.db file: Create a new directory “data” under the root of the Neo4j installation directory and copy graph.db into it. This is optional, but keeping graph.db in the same location as Neo4j is recommended.
  • Start Neo4j: Run the “neo4j-community” file in the bin directory of Neo4j to start it. You will be prompted to choose the location of the graph.db file.
  • Visualize Graphs:
    • Launch Neo4j Web Console: http://localhost:7474/browser/
  • Navigate to Graphs: Click on the bubbles at the top left and choose “*”.
  • Customize Graph Attributes: Double-click on the “Java” node and choose “name” as the caption.
  • Explore Graphs: The exploration below shows the following:

Tracing the orange line shows that the user Trevor, who answered a Java question (aid_853052), also asked a PHP question (qid_865476). Tracing the red line shows that the user Audrey answered two Java questions (aid_853030 and aid_892379). Working with a graph database is a lot of fun, because there are so many ways to traverse the data. Note that the user names are fictional and do not refer to real users.


Conclusion

  • Neo4j is one of the best graph databases around and comes with the powerful Cypher Query Language (CQL), which lets us traverse nodes via their relationships and filter on node properties. We will cover CQL in our next blog post, based on this graph data.
  • R is very handy for data manipulation: it lets us quickly cleanse, transform, and alter the data to fit our needs.
  • Neo4j also comes with a REST API for adding nodes and relationships dynamically to an existing graph DB.
