
Introduction
Graphs are everywhere, used by everyone, for everything. Neo4j is one of the most popular graph databases and can be used to make recommendations, get social, find paths, uncover fraud, manage networks, and so on. A graph database can store any kind of data using nodes (graph data records), relationships (which connect nodes), and properties (named data values).
A graph database suits connected data, which is hard to model and query in relational or other NoSQL databases because they lack first-class relationships and efficient multi-depth traversals. Graph databases embrace relationships, which naturally form paths, and querying or traversing the graph means following those paths. Because the data model is fundamentally path-oriented, most path-based operations align closely with the way the data is laid out, which makes them extremely efficient.
Use Case
This use case is based on a modified version of the StackOverflow dataset. It models a network of programming languages, the questions that refer to those languages, and the users who asked and answered the questions. Connecting these nodes with relationships in a Neo4j graph database lets us find deeper insights that are hard to obtain with a common relational database or other NoSQL databases.
What we want to do:
- Prerequisites
- Download StackOverflow Dataset
- Data Manipulation with R
- Create Nodes & Relationships file with Java
- Create GraphDB with BatchImporter
- Visualize Graph with Neo4J
Solution
Prerequisites
- Download and Install Neo4j: We will be using Neo4j 2.x, which is very easy to install on Windows. Follow the instructions at the link below to download and install it.
Note: Neo4j 2.x requires JDK 1.7 and above.
http://www.neo4j.org/download/windows
- Download and Install RStudio: We will use R to perform some data manipulation on the StackOverflow dataset, which is available in RData format. This includes filtering rows, altering columns, and dropping columns. It is done in R to show the power of R for data manipulation, but the same can be done in other programming languages as well. Download the open source edition of RStudio from the link below.
http://www.rstudio.com/products/rstudio/#Desk
Download StackOverflow Dataset
- Download Dataset: This use case is based on a modified version of the StackOverflow dataset, which is rather old and available in both CSV and RData format. Follow the links below to download the dataset. The first link describes the various fields and the second link downloads the RData file.
http://www.ics.uci.edu/~duboisc/StackOverflow
http://www.ics.uci.edu/~duboisc/StackOverflow/answers.Rdata
- Understanding Dataset:
We will be mostly interested in the following fields which will be used to create nodes and relationships in Neo4j.
qid: Unique question id
i: User id of the questioner
qs: Score of the question
tags: Comma-separated list of the tags associated with the question, which refer to programming languages
qvc: Number of views of the question
aid: Unique answer id
j: User id of the answerer
as: Score of the answer
Data Manipulation with R
We will reshape the dataset to fit our needs and appreciate the power of data manipulation with R. The actual RData file contains around 250 K rows, but this use case performs the following manipulation to keep it interesting and small.
- Open RStudio and Set Working Directory: Open RStudio and set the working directory to the folder where the RData file was downloaded (for example with setwd()).
- Load and Perform Data Manipulation:
    # Load the answers.Rdata file that was downloaded
    load("answers.Rdata")

    # The data is available in the "data" object; take a quick look with head
    head(data)

    # Load the stringr library for string manipulation
    library(stringr)

    # Create a new column Match that is TRUE when the tags contain one of the
    # programming languages we are interested in. str_detect does a substring
    # match, so tags such as "javascript" and "cakephp" also match.
    data$Match <- str_detect(string = data$tags, pattern = "(java|mysql|linux|python|django|php|jquery)")

    # Create a new column length that holds the number of comma-separated tags.
    # str_split returns one character vector per row; sapply applies length to each.
    data$length <- sapply(str_split(data$tags, ","), length)

    # The data object now contains 2 new columns: Match and length
    head(data)

    # Find the number of rows in the data object
    nrow(data)  # This will show 263540 rows

    # Keep the rows where Match is TRUE, there is exactly one tag, and both the
    # question score and the answer score are greater than zero
    newdata <- subset(data, Match & length == 1 & qs > 0 & as > 0)

    # The row count has gone down significantly, to 1668
    nrow(newdata)

    # The sample rows show that the tags column has only one programming language
    head(newdata)

    # Drop the columns that are not needed anymore (qt, at, Match, and length)
    drops <- c("qt", "at", "Match", "length")
    finaldata <- newdata[, !(names(newdata) %in% drops)]
    head(finaldata)

    # Order the finaldata object by question id
    finaldata <- finaldata[order(finaldata$qid), ]

    # Write finaldata to a CSV file that will be used to create nodes and relationships.
    # write.csv always writes comma-separated output, so no sep argument is needed
    # (passing sep only produces a warning).
    write.csv(finaldata, "finaldata.csv", row.names = FALSE)
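As noted earlier, the same filtering can be done in other languages. The following is a minimal Java sketch of the row filter applied by the subset() call in the R code, not part of the original workflow. The class name RowFilter and the method name keep are made up for illustration; the pattern and the conditions are taken from the R code.

```java
import java.util.regex.Pattern;

class RowFilter {
    // Same pattern as in the R code; find() does a substring match, like str_detect.
    private static final Pattern LANGS =
            Pattern.compile("(java|mysql|linux|python|django|php|jquery)");

    /** Returns true when a row would survive the subset() call in the R code. */
    static boolean keep(String tags, int questionScore, int answerScore) {
        boolean match = LANGS.matcher(tags).find();
        // A limit of -1 keeps trailing empty strings, so "java," counts as two tags.
        int tagCount = tags.split(",", -1).length;
        return match && tagCount == 1 && questionScore > 0 && answerScore > 0;
    }

    public static void main(String[] args) {
        System.out.println(keep("java", 3, 1));      // single matching tag
        System.out.println(keep("java,php", 3, 1));  // two tags
        System.out.println(keep("ruby", 3, 1));      // language not in the pattern
    }
}
```

Because the match is a substring match, a tag such as "javascript" also passes the filter, which is consistent with the language list used later in this post.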
Create Nodes and Relationship file with Java
We will write a Java program that takes the finaldata.csv file generated by the above R program and creates multiple node files and a single relationship file that contains the relations between the nodes. Our nodes and relationship structure is as follows:
Nodes: question_nodes, answer_nodes, user_nodes, lang_nodes
Relationships: The following are the relationships

    // One question refers to one programming language
    Question REFERS Language

    // One question can have multiple answers
    Question HAS_ANSWER Answer

    // One question is asked by one user
    Question ASKED_BY User

    // One answer is answered by one user
    Answer ANSWERED_BY User
- Details about Java Program: This Java program is self-explanatory and simply creates node and relationship files in the CSV format needed by the Neo4j Batch Importer program. A few things to keep in mind about the Java program:
- The format of the Nodes file is as follows:

    // id is the actual id, string is the datatype of the id, and users is the name of the index that we want to create in Neo4j.
    // The first column must have the form somename:datatype:index_name, and further tab-delimited columns may hold more node attributes.
    // This is the format that the Neo4j Batch Importer expects.
    id:string:users    attribute1    attribute2
    qid_123456    4 (views)    10 (score)

- The format of the Relationship file is as follows:

    // The ids of the two nodes and the type of the relationship between them.
    // For example, the question qid_797771 is ASKED_BY the user uid_94691.
    id:string:users    id:string:users    type
    qid_797771    uid_94691    ASKED_BY
    qid_887301    javascript    REFERS
    qid_607386    aid_608425    HAS_ANSWER
    qid_809735    uid_88631    ASKED_BY
    qid_954376    uid_117795    ASKED_BY
- lang_nodes.csv is created manually as it is static. All other node files and the relationship file are generated programmatically. The content of lang_nodes.csv is:

    id:string:users    name
    java    Java
    mysql    MySQL
    linux    Linux
    python    Python
    django    Django
    php    PHP
    jquery    JQuery
    javascript    Javascript
    cakephp    CakePHP
- finaldata.csv is renamed to sodata.csv (optional)
- The dataset doesn’t come with the names of questioners and answerers, so we have downloaded some fictional names and associated them with the user ids. This will make more sense when we view them in the Neo4j graphical interface. A file of around 1,500 fictional names was created from http://homepage.net/name_generator/ and stored as “random_names.txt”, with one name per line:

    Edward MacDonald
    Nicholas Arnold
    Faith Lambert
    Peter White
    Trevor Campbell
- Java Program to Create Nodes & Relationships:
Note: The program below depends only on the OpenCSV library, which can be downloaded from http://sourceforge.net/projects/opencsv/

    package com.treselle.soagrapher;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;
    import java.util.Set;

    import au.com.bytecode.opencsv.CSVReader;

    public class NodeRelationCreator {

        private static final String QUESTION_NODE_FILE = "question_nodes.csv";
        private static final String USER_NODE_FILE = "user_nodes.csv";
        private static final String ANSWER_NODE_FILE = "answer_nodes.csv";
        private static final String RELATIONS_FILE = "rels.csv";
        private static final String INPUT_FILE = "sodata.csv";
        private static final String RANDOM_NAME_FILE = "random_names.txt";

        // stores question id as the key and views, score as map values
        private static Map<String, Map<String, String>> questions = new HashMap<String, Map<String, String>>();

        // stores unique user ids of both questioners and answerers
        private static Set<String> users = new HashSet<String>();

        // stores random names from the file
        private static List<String> randomNames = new ArrayList<String>();

        // stores answer id as the key and score as the map value
        private static Map<String, Map<String, String>> answers = new HashMap<String, Map<String, String>>();

        // stores the relations between nodes. The key is two node ids
        // delimited by :: and the value is the relation type
        private static Map<String, String> relsMap = new HashMap<String, String>();

        private void readFromCSV() throws Exception {
            // Read the comma-delimited CSV and skip the first (header) row
            CSVReader csvReader = new CSVReader(new FileReader(INPUT_FILE), ',', '\"', 1);
            try {
                String[] rows = null;
                while ((rows = csvReader.readNext()) != null) {
                    String questionId = rows[0];
                    String questionUser = rows[1];
                    String questionScore = rows[2];
                    String lang = rows[3];
                    String questionViews = rows[4];
                    String answerId = rows[6];
                    String answerUser = rows[7];
                    String answerScore = rows[8];

                    Map<String, String> questionAttrs = new HashMap<String, String>();
                    questionAttrs.put("views", questionViews);
                    questionAttrs.put("score", questionScore);
                    questions.put("qid_" + questionId, questionAttrs);

                    Map<String, String> answerAttrs = new HashMap<String, String>();
                    answerAttrs.put("score", answerScore);
                    answers.put("aid_" + answerId, answerAttrs);

                    users.add("uid_" + questionUser);
                    users.add("uid_" + answerUser);

                    relsMap.put("qid_" + questionId + "::" + "aid_" + answerId, "HAS_ANSWER");
                    relsMap.put("qid_" + questionId + "::" + "uid_" + questionUser, "ASKED_BY");
                    relsMap.put("aid_" + answerId + "::" + "uid_" + answerUser, "ANSWERED_BY");
                    relsMap.put("qid_" + questionId + "::" + lang, "REFERS");
                }
            } finally {
                csvReader.close();
            }

            this.writeQuestionNodesFile();
            this.writeAnswersNodesFile();
            this.writeUsersNodesFile();
            this.writeRelationsFile();
        }

        private void writeQuestionNodesFile() {
            try (PrintWriter out = new PrintWriter(new FileWriter(QUESTION_NODE_FILE))) {
                out.println("id:string:users\tname\tviews\tscore");
                for (Entry<String, Map<String, String>> entry : questions.entrySet()) {
                    Map<String, String> valueMap = entry.getValue();
                    // the question id doubles as the node name
                    out.print(entry.getKey());
                    out.print("\t" + entry.getKey());
                    out.print("\t" + valueMap.get("views"));
                    out.print("\t" + valueMap.get("score"));
                    out.println();
                }
            } catch (IOException e) {
                System.err.println("Error writeQuestionNodesFile File");
            }
        }

        private void writeAnswersNodesFile() {
            try (PrintWriter out = new PrintWriter(new FileWriter(ANSWER_NODE_FILE))) {
                out.println("id:string:users\tname\tscore");
                for (Entry<String, Map<String, String>> entry : answers.entrySet()) {
                    Map<String, String> valueMap = entry.getValue();
                    // the answer id doubles as the node name
                    out.print(entry.getKey());
                    out.print("\t" + entry.getKey());
                    out.print("\t" + valueMap.get("score"));
                    out.println();
                }
            } catch (IOException e) {
                System.err.println("Error writeAnswersNodesFile File");
            }
        }

        private void writeUsersNodesFile() {
            try (PrintWriter out = new PrintWriter(new FileWriter(USER_NODE_FILE))) {
                out.println("id:string:users\tname");
                int count = 0;
                for (String user : users) {
                    // fall back to the user id if the name file has fewer names than users
                    String name = count < randomNames.size() ? randomNames.get(count) : user;
                    out.print(user);
                    out.print("\t" + name);
                    out.println();
                    count++;
                }
            } catch (IOException e) {
                System.err.println("Error writeUsersNodesFile File");
            }
        }

        private void writeRelationsFile() {
            try (PrintWriter out = new PrintWriter(new FileWriter(RELATIONS_FILE))) {
                out.println("id:string:users\tid:string:users\ttype");
                for (Map.Entry<String, String> entry : relsMap.entrySet()) {
                    String[] splitKeys = entry.getKey().split("::");
                    out.print(splitKeys[0] + "\t");
                    out.print(splitKeys[1] + "\t");
                    out.println(entry.getValue());
                }
            } catch (IOException e) {
                System.err.println("Error writeRelationsFile File");
            }
        }

        private void readRandomNames() {
            try (BufferedReader in = new BufferedReader(new FileReader(RANDOM_NAME_FILE))) {
                String line;
                while ((line = in.readLine()) != null) {
                    randomNames.add(line);
                }
            } catch (IOException e) {
                System.err.println("Error readRandomNames File");
            }
        }

        public static void main(String[] args) {
            try {
                long start = System.currentTimeMillis();
                NodeRelationCreator nodeRelationCreator = new NodeRelationCreator();
                nodeRelationCreator.readRandomNames();
                nodeRelationCreator.readFromCSV();
                long end = System.currentTimeMillis();
                System.out.println("Done Processing in " + (end - start) + " ms");
            } catch (Exception e) {
                System.out.println("Exception in main is " + e.getMessage());
                e.printStackTrace();
            }
        }
    }
- Output of the Program:
Run the above program from the command line or within Eclipse to create question_nodes.csv, answer_nodes.csv, user_nodes.csv, and rels.csv. A zip file with the nodes and relationship files is also provided for download, so that you can run them straight through Batch Importer to create the graph DB.
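Since rels.csv refers to nodes only by id, a quick check that every id in it also appears in one of the node files can catch mistakes before importing. The following is a minimal sketch of such a check, not part of the original program. The class and method names are hypothetical, and it works on lines already read into memory, assuming the tab-delimited layout with a header row described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RelationshipCheck {
    /** Collects the first tab-delimited column of every line after the header. */
    static Set<String> nodeIds(List<String> nodeLines) {
        Set<String> ids = new HashSet<>();
        for (int i = 1; i < nodeLines.size(); i++) {
            ids.add(nodeLines.get(i).split("\t", -1)[0]);
        }
        return ids;
    }

    /** Returns ids used in the first two columns of the relationship lines that are not known node ids. */
    static List<String> danglingIds(Set<String> ids, List<String> relLines) {
        List<String> missing = new ArrayList<>();
        for (int i = 1; i < relLines.size(); i++) {
            String[] cols = relLines.get(i).split("\t", -1);
            for (int c = 0; c < 2 && c < cols.length; c++) {
                if (!ids.contains(cols[c]) && !missing.contains(cols[c])) {
                    missing.add(cols[c]);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Small in-memory sample; the user name is one of the fictional names.
        List<String> nodes = Arrays.asList(
                "id:string:users\tname",
                "qid_797771\tqid_797771",
                "uid_94691\tEdward MacDonald");
        List<String> rels = Arrays.asList(
                "id:string:users\tid:string:users\ttype",
                "qid_797771\tuid_94691\tASKED_BY",
                "qid_887301\tjavascript\tREFERS");
        System.out.println("Ids missing from the node files: "
                + danglingIds(nodeIds(nodes), rels));
    }
}
```

To check the real files, the lines of all four node files could be read (for example with java.nio.file.Files.readAllLines), their ids merged into one set, and the result passed to danglingIds together with the lines of rels.csv.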
Create GraphDB with Batch Importer
- Download and Set up Batch Importer: The Batch Importer program is a separate library that creates the graph DB data file needed by Neo4j. The input to the Batch Importer is configured in the batch.properties file, which indicates what files to use as nodes and relationships. More details about the Batch Importer can be found in the readme at https://github.com/jexp/batch-import/tree/20
Download Link: https://dl.dropboxusercontent.com/u/14493611/batch_importer_20.zip
Note: Unzip it into the location where the nodes and relationship files were created by the Java program.
- Create batch.properties: Create the batch.properties file as shown below. The details of each property are better explained on the Batch Importer site. The batch_import.nodes_files and batch_import.rels_files properties are the most important, as they define the nodes and relationship input files.

    dump_configuration=false
    cache_type=none
    use_memory_mapped_buffers=true
    neostore.propertystore.db.index.keys.mapped_memory=5M
    neostore.propertystore.db.index.mapped_memory=5M
    neostore.nodestore.db.mapped_memory=200M
    neostore.relationshipstore.db.mapped_memory=500M
    neostore.propertystore.db.mapped_memory=200M
    neostore.propertystore.db.strings.mapped_memory=200M
    batch_import.node_index.users=exact
    batch_import.nodes_files=lang_nodes.csv,question_nodes.csv,answer_nodes.csv,user_nodes.csv
    batch_import.rels_files=rels.csv

- Execute Batch Importer: Execute the Batch Importer program with import.bat from within the Batch Importer directory, passing batch.properties and the name of the graph DB file to create. The command below creates the graph.db data file in the same location as your nodes and relationship files:

    batch_importer_20\import.bat batch.properties graph.db

The import prints output like the following:

    Using Existing Configuration File
    Importing 9 Nodes took 0 seconds
    Importing 676 Nodes took 0 seconds
    Importing 1653 Nodes took 0 seconds
    Importing 1491 Nodes took 0 seconds
    Importing 4656 Relationships skipped (2) took 0 seconds
    Total import time: 2 seconds
Visualize Graph with Neo4j
- Copy graph.db file: Create a new directory “data” under the root of the Neo4j installation directory and copy graph.db to the data directory. This is optional, but it is recommended to keep graph.db in the same location as Neo4j.
- Start Neo4j: Execute “neo4j-community” file under bin directory of Neo4j to start Neo4j. You will be prompted to choose the location of the graph.db file.
- Visualize Graphs:
- Navigate to Graphs: Click on the bubbles at the top left and choose “*”.
- Customize Graph Attributes: Double click on the “Java” node and choose “name” as the caption.
- Explore Graphs: The exploration shows the following:
Tracing the orange line shows that the user Trevor, who answered a Java question (aid_853052), also asked a PHP question (qid_865476). Tracing the red line shows that the user Audrey answered two Java questions (aid_853030 and aid_892379). It’s a lot of fun to work with a graph database as the traversals are limitless. BTW, the user names are fictional and do not refer to real users.
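The kind of traversal described above can also be sketched outside Neo4j. The following minimal Java example follows the four relationship types from rels.csv in memory to find users who answered a question about one language and asked a question about another. It is only an illustration of the idea, not how Neo4j executes traversals. The class name, the method name, and the ids qid_1, uid_1, and uid_2 are made up for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AnsweredAndAsked {
    /** Each relationship is {startId, endId, type}, as in the rows of rels.csv. */
    static Set<String> find(List<String[]> rels, String answeredLang, String askedLang) {
        Map<String, String> langOf = new HashMap<>();      // question -> language
        Map<String, String> askerOf = new HashMap<>();     // question -> user
        Map<String, String> questionOf = new HashMap<>();  // answer -> question
        Map<String, String> answererOf = new HashMap<>();  // answer -> user
        for (String[] r : rels) {
            switch (r[2]) {
                case "REFERS":      langOf.put(r[0], r[1]); break;
                case "ASKED_BY":    askerOf.put(r[0], r[1]); break;
                case "HAS_ANSWER":  questionOf.put(r[1], r[0]); break;
                case "ANSWERED_BY": answererOf.put(r[0], r[1]); break;
                default: break;
            }
        }
        // Users who answered at least one question about answeredLang
        Set<String> answered = new HashSet<>();
        for (Map.Entry<String, String> e : answererOf.entrySet()) {
            String question = questionOf.get(e.getKey());
            if (answeredLang.equals(langOf.get(question))) {
                answered.add(e.getValue());
            }
        }
        // Of those, keep the users who also asked a question about askedLang
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : askerOf.entrySet()) {
            if (askedLang.equals(langOf.get(e.getKey())) && answered.contains(e.getValue())) {
                result.add(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rels = Arrays.asList(
                new String[] {"qid_1", "java", "REFERS"},
                new String[] {"qid_1", "uid_2", "ASKED_BY"},
                new String[] {"qid_1", "aid_853052", "HAS_ANSWER"},
                new String[] {"aid_853052", "uid_1", "ANSWERED_BY"},
                new String[] {"qid_865476", "php", "REFERS"},
                new String[] {"qid_865476", "uid_1", "ASKED_BY"});
        System.out.println(find(rels, "java", "php"));
    }
}
```

In the sample data, uid_1 plays the role of Trevor: the user answered the Java question through aid_853052 and asked the PHP question qid_865476.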
Conclusion
- Neo4j is one of the best graph databases around and comes with the powerful Cypher Query Language (CQL), which lets us traverse the nodes via their relationships and filter on node properties as well. We will cover CQL in our next blog post, based on this graph data.
- R is very handy in performing many data manipulation techniques to quickly cleanse, transform, and alter the data to our needs.
- Neo4j also comes with a REST API to add nodes and relationships dynamically to an existing graph DB.
References
- Neo4J: http://www.neo4j.org/
- Neo4J Use Cases: http://www.neo4j.org/learn/use_cases
- R: http://www.r-project.org/
- Neo4J Batch Importer: https://github.com/jexp/batch-import/tree/20
- Files: Click here to download nodes and relationship zip file