Cassandra Data Model for Twitter – Part 1

Introduction

This is the first post in a multi-part series on Apache Cassandra, a distributed key/value store in which one key can map to one or more values. Cassandra stays fast even with millions of records and operations, and it excels at storing time-series data, a major use case in the big data world. The series works through how Cassandra behaves with various data models.

Why Cassandra among NoSQL databases:

  • Cassandra trades off traditional features to support new and emerging use cases.
  • Rich data model with limited ad-hoc querying ability (Cassandra does not support joins, and GROUP BY was not available in the Cassandra 2.0 releases covered here)
  • Cassandra is fault-tolerant, with tunable consistency levels.

The focus of this series is to understand the various Cassandra data models, and how the data is actually stored in Cassandra, through a use case. The series is split into the following parts:

  • Single keyed column family, Super Column family and Collection storage (this post)
  • Time-series data models like Composite Column family, Compound Column family
  • Counter Columns, Dynamic Column family and Expiring Columns

Use Case

Let’s start with a use case on the single keyed data model and develop it across the other data models in this multi-part series. The use case is to store social media user information in a repository: Twitter is the social media source and Cassandra is the store. The use case illustrates the different data models of Cassandra as well as more advanced concepts such as Counter Columns, Compound Column Families, and Dynamic Column Families, which are covered in later parts of the series.

These are the different ways we can retrieve the stored data:

  • Retrieve user details
  • Retrieve follower names as a List
  • Retrieve friend names as a List

What we need to do:

  • Install Cassandra
  • Pre-requisites
  • Create a Java Program
    • To read the information of friends and followers (using Twitter4j)
    • To store the information in Cassandra
  • Design a data model to store user information and the names of friends and followers using a collection.

Solution

Install Cassandra:

Pre-requisites:

  • Twitter4j API – Reads Twitter user information when given credentials: the access token, access token secret, consumer key, and consumer secret. These values are available only after registering an application at https://dev.twitter.com/apps/new

Create a Java Program:

  • Program to read the information of friends and followers
     Friends details Bean
     Tweet Bean
     User details Bean
  • Program to store information in Cassandra

Create Data Model:

Check out our posts on Cassandra – DataStax Java Driver to get a basic understanding of how to use Cassandra and the DataStax Java driver.

  • Query to create a table to store user details

This data model stores data in a single keyed table (static Column Family) with the screen name as key. The process of inserting, retrieving and storing is explained below.
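
The original query is not reproduced here. As a minimal sketch, a single keyed table for user details could be declared roughly as follows. The table name user_details and every column other than the screen name key are assumptions made for illustration, and the statement is held in a plain Java string so that it can be handed to whichever driver is in use.

```java
// Hypothetical schema: the table name and the non-key columns are
// illustrative assumptions, not the original post's definitions.
class UserDetailsSchema {
    // CQL for a single keyed table: screen_name is the only PRIMARY KEY
    // column, so it is also the partition (row) key.
    static final String CREATE_TABLE =
        "CREATE TABLE IF NOT EXISTS user_details ("
      + " screen_name text PRIMARY KEY,"
      + " name text,"
      + " location text,"
      + " followers_count int"
      + ")";

    public static void main(String[] args) {
        System.out.println(CREATE_TABLE);
    }
}
```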

  • Insert into a single keyed column family

We can insert data into a single keyed column family by passing a value for the partition key (PRIMARY KEY/row key). Data cannot be inserted without a value for the partition key.
From a Java point of view, this is like a Hashtable, which does not accept a null key.
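
The Hashtable analogy can be checked directly in plain Java. The sketch below uses only java.util.Hashtable; the class name, key, and values are made up for illustration.

```java
import java.util.Hashtable;

// Demonstrates the analogy only: java.util.Hashtable rejects a null key,
// just as an insert needs a value for the partition key.
class NullKeyDemo {
    static boolean rejectsNullKey() {
        Hashtable<String, String> usersByScreenName = new Hashtable<>();
        usersByScreenName.put("example_user", "Example User"); // key present: accepted
        try {
            usersByScreenName.put(null, "No Key"); // no key: rejected
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("null key rejected: " + rejectsNullKey());
    }
}
```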

  • Retrieve data from a Single keyed column family

The data can be retrieved in two ways:

    • Retrieve all the data
    • Retrieve the data by specifying the row key value.
  • How it stores
    • All values for a row key are appended to a single row, sorted by column key and time of insert. Data with a different row key is stored in a new row.
    • With a single PRIMARY KEY (the row key/partition key in Cassandra), Cassandra stores the data much like an RDBMS, because the column names do not hold any values. If the column names themselves hold values, the column family is called a Dynamic Column Family.
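
One way to picture the two retrieval options and the storage behaviour is a toy in-memory model. This is only a sketch built on Java maps, not how Cassandra is implemented; the class name, method names, and sample data are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy in-memory model of a single keyed column family (illustration only).
class SingleKeyedModel {
    // Row key -> columns, kept sorted by column name.
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    void insert(String rowKey, String column, String value) {
        if (rowKey == null) {
            throw new IllegalArgumentException("row key is required");
        }
        // Same row key: the value joins the existing row. New row key: new row.
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Option 1: retrieve all the data.
    Map<String, SortedMap<String, String>> selectAll() {
        return rows;
    }

    // Option 2: retrieve the data for one row key (empty if the key is unknown).
    SortedMap<String, String> selectByKey(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    static SingleKeyedModel sample() {
        SingleKeyedModel model = new SingleKeyedModel();
        model.insert("alice", "name", "Alice");
        model.insert("alice", "location", "Paris");
        model.insert("bob", "name", "Bob");
        return model;
    }

    public static void main(String[] args) {
        SingleKeyedModel model = sample();
        System.out.println(model.selectAll());
        System.out.println(model.selectByKey("alice"));
    }
}
```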

Partition key – Denotes the first column of the PRIMARY KEY definition
Composite Partition key – Denotes the partition key formed of multiple columns
Clustering Columns – Denotes the remaining columns of the PRIMARY KEY definition
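
The three terms can be illustrated with hypothetical table definitions. The table and column names below are assumptions, and the CQL is kept in Java string constants purely for reference.

```java
// Hypothetical tables showing the three key terms; all names are assumptions.
class KeyDefinitions {
    // Partition key: screen_name. No clustering columns.
    static final String SINGLE_KEY =
        "CREATE TABLE users_by_name (screen_name text, name text,"
      + " PRIMARY KEY (screen_name))";

    // Partition key: screen_name. Clustering column: tweet_id.
    static final String WITH_CLUSTERING =
        "CREATE TABLE tweets_by_user (screen_name text, tweet_id bigint, body text,"
      + " PRIMARY KEY (screen_name, tweet_id))";

    // Composite partition key: (screen_name, tweet_year). Clustering column: tweet_id.
    static final String COMPOSITE_PARTITION =
        "CREATE TABLE tweets_by_user_year (screen_name text, tweet_year int,"
      + " tweet_id bigint, body text,"
      + " PRIMARY KEY ((screen_name, tweet_year), tweet_id))";

    public static void main(String[] args) {
        System.out.println(SINGLE_KEY);
        System.out.println(WITH_CLUSTERING);
        System.out.println(COMPOSITE_PARTITION);
    }
}
```

The extra pair of parentheses in the last definition is what groups several columns into one composite partition key.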

  • Query to create a column family to store the names of a user’s followers/friends as a List:
    A user may have many friends. In a relational database, the canonical approach is a separate friends_name table with a many-to-one relationship to users, which implies a join. In this data model, the column family holds only one row per user and represents multiple followers as a list, unlike an RDBMS, where the number of rows grows in proportion to the number of followers.
  • Insert into a Collection column
    • List:
    • Set:
  • How it stores
    • We can store the follower names as a Collection object (List) in Cassandra because it supports inserting such objects through its pre-defined Collection data types.
    • Inserting and updating Collection data types (Set, Map and List) works differently depending on the data type selected. Their properties broadly match the corresponding Java collections. The CQL documentation has more details on updating collection data types.

Follow the same data model to store friends’ names in a List.
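
The difference between storing follower names as a List and as a Set can be sketched with the matching Java collections. The CQL in the comments is a hypothetical example (the table user_followers and its columns are assumptions), not the original post's query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical CQL for a list column (names are assumptions):
//   CREATE TABLE user_followers (screen_name text PRIMARY KEY, followers list<text>);
//   UPDATE user_followers SET followers = followers + ['carol'] WHERE screen_name = 'alice';
class FollowerCollections {
    // List semantics: keeps insertion order and allows duplicates.
    static List<String> asList(String... names) {
        List<String> followers = new ArrayList<>();
        for (String name : names) {
            followers.add(name);
        }
        return followers;
    }

    // Set semantics: drops duplicates. TreeSet also iterates in sorted order,
    // which mirrors how a CQL set is returned.
    static Set<String> asSet(String... names) {
        Set<String> followers = new TreeSet<>();
        for (String name : names) {
            followers.add(name);
        }
        return followers;
    }

    public static void main(String[] args) {
        System.out.println(asList("bob", "alice", "bob"));
        System.out.println(asSet("bob", "alice", "bob"));
    }
}
```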

  • Super Column Family

Cassandra inserts the data through a combination of keys, much like a composite partition key (a partition key formed of multiple columns). The insertion follows a particular order: the row key, the super column key, the subcolumn key/name, and their respective values. A super column can hold more than one column, but this data model is not advisable for wide rows. It stores the data as a Map of a Map of a Map: an outer Map keyed by row key, an inner Map keyed by super column name/key, and an innermost Map keyed by column name/key.

Map<RowKey, SortedMap<SuperColumnKey, SortedMap<ColumnKey, ColumnValue>>>
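
The nested Map layout above can be mirrored with Java collections to see how a lookup narrows at each level. This is a toy model under assumed names (class, methods, and sample data are all hypothetical), not Cassandra's internal implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy in-memory model of a Super Column Family (illustration only).
class SuperColumnFamilyModel {
    // Row key -> super column key -> column key -> column value.
    private final Map<String, SortedMap<String, SortedMap<String, String>>> rows =
        new HashMap<>();

    void insert(String rowKey, String superColumn, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(superColumn, k -> new TreeMap<>())
            .put(column, value);
    }

    // Lookup by row key.
    SortedMap<String, SortedMap<String, String>> byRow(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    // Lookup by row key and super column key.
    SortedMap<String, String> bySuperColumn(String rowKey, String superColumn) {
        return byRow(rowKey).getOrDefault(superColumn, new TreeMap<>());
    }

    // Lookup by row key, super column key and subcolumn key (null if absent).
    String byColumn(String rowKey, String superColumn, String column) {
        return bySuperColumn(rowKey, superColumn).get(column);
    }

    static SuperColumnFamilyModel sample() {
        SuperColumnFamilyModel model = new SuperColumnFamilyModel();
        model.insert("alice", "friends", "bob", "Bob B");
        model.insert("alice", "followers", "carol", "Carol C");
        model.insert("alice", "followers", "dave", "Dave D");
        return model;
    }

    public static void main(String[] args) {
        SuperColumnFamilyModel model = sample();
        System.out.println(model.byRow("alice"));
        System.out.println(model.bySuperColumn("alice", "followers"));
        System.out.println(model.byColumn("alice", "friends", "bob"));
    }
}
```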

  • The following layout represents a row in a Super Column Family (SCF)
  • Retrieve data from a Super Column Family

The data can be retrieved using the following options:

    • Retrieve all the data
    • Retrieve the data by specifying a row key value.
    • Retrieve the data by a combination of row key and super column key
    • Retrieve the data by a combination of row key, super column key and subcolumn key
  • Warning: Super columns are now deprecated (in CQL 3.1.0 and Cassandra 2.0).
    • Do not use super columns. They are a legacy design from a pre-open source release. The design was structured for a specific use case and does not fit most use cases. Reading a super column loads the entire super column and all of its sub-columns into memory for each read request, which results in severe performance issues.
    • Super columns are not supported in CQL 3, so use composite columns instead. Composite columns provide most of the same benefits as super columns without the performance issues.
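
As a rough picture of why composite columns can stand in for super columns, the sketch below keeps one row's columns in a single sorted map whose keys combine a group name and a column name, and reads back just one group as a range slice. The class, the "group:name" key format, and the sample data are assumptions for illustration (group names are assumed not to contain ':'); in CQL 3 the same idea is expressed with clustering columns.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of one row that uses composite column names (illustration only).
class CompositeColumnModel {
    // Composite column name "group:name" -> value, kept in sorted order.
    private final TreeMap<String, String> columns = new TreeMap<>();

    void put(String group, String name, String value) {
        columns.put(group + ":" + name, value);
    }

    // Read only the columns of one group. ';' is the character right after ':',
    // so the half-open range covers exactly the keys that start with "group:".
    SortedMap<String, String> slice(String group) {
        return columns.subMap(group + ":", group + ";");
    }

    static CompositeColumnModel sample() {
        CompositeColumnModel row = new CompositeColumnModel();
        row.put("followers", "carol", "Carol C");
        row.put("followers", "dave", "Dave D");
        row.put("friends", "bob", "Bob B");
        return row;
    }

    public static void main(String[] args) {
        CompositeColumnModel row = sample();
        System.out.println(row.slice("followers"));
        System.out.println(row.slice("friends"));
    }
}
```

Unlike the super column case, a slice touches only the requested range of columns rather than a whole nested structure.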

Conclusion

  • Cassandra has the potential to handle large amounts of time-series data efficiently.
  • Cassandra also writes very fast, which makes it practical to duplicate data at write time for easier retrieval.
  • Look out for our posts on the advanced data models of Cassandra later in this series.
