Cassandra – Data Model for Twitter – Part 3


Introduction

This is the third post in a multi-part series on Apache Cassandra. It covers counter columns, expiring columns and dynamic data modeling. For a better understanding of the use case, please refer to Cassandra – Data Model for Twitter – Part 1 and Cassandra – Data Model for Twitter – Part 2.

Use Case

Let’s continue with the Twitter use case and count a user’s tweets using counter columns.

These are the different ways we can retrieve the stored data:

  • Get the total number of a user’s friends
  • Get the total number of a user’s followers
  • Get the number of tweets posted by a user per hour, day and month
  • Get the number of tweets posted overall per hour, day and month

What we need to do:

  • Create Java programs for the following:
    • To read information about friends and followers (using Twitter4j)
    • To read new tweets as they come in (using Twitter4j-Stream and Twitter4j)
    • To count tweets in Cassandra
  • Design a data model that counts user tweets using counter columns and allows the data to be retrieved efficiently.

Solution

To read information about friends and followers (using Twitter4j)

  • Refer to ReadUserDetails.java given in the first part of this series.

To read a user’s tweet information

  • Refer to ReadTweets.java given in the previous part of this series.

Insert the following code

  • Include this insertFollowersDetails method in StoreToCassandra.java
  • Include this insertFriendsDetails method in StoreToCassandra.java
  • Include this insertTweetsBasedOnTime method in StoreToCassandra.java
  • Include this insertTweetsBasedOnUser method in StoreToCassandra.java

Design a Data Model to count user’s friends, followers, and tweets

Counter Column family:

  • A counter is a special kind of column used to store a number that is incremented or decremented to count the occurrences of a particular event or process. The user-visible value of a counter column is a 64-bit signed integer.
  • Counter column families must use CounterColumnType as the validator (column value type).
  • A counter column cannot be mixed in with the regular columns of a column family. Currently, counters may only be stored in dedicated column families, so we must create a column family specifically to hold them.
  • We can also use the Rollups minute, Rollups hour and Rollups day data model structures for a counter column family.
  • The replicate_on_write property should always be set to true for counter column families.
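The counter behaviour described above can be modelled with a small in-memory sketch. It does not talk to Cassandra; the class name CounterColumnFamilySketch and its methods are hypothetical and only illustrate that a counter write carries a signed delta rather than a new value, and that the stored value is a 64-bit signed integer (a Java long).

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a counter column family (an illustration, not a Cassandra client):
// row key -> (column name -> 64-bit signed count).
final class CounterColumnFamilySketch {
    private final Map<String, Map<String, Long>> rows = new HashMap<>();

    // Apply a signed delta to a counter column and return the new value.
    // A column that does not exist yet starts from the delta itself.
    long add(String rowKey, String columnName, long delta) {
        Map<String, Long> columns = rows.computeIfAbsent(rowKey, key -> new HashMap<>());
        return columns.merge(columnName, delta, Long::sum);
    }

    // Read the current value. In this sketch an unwritten counter reads as 0.
    long get(String rowKey, String columnName) {
        Map<String, Long> columns = rows.get(rowKey);
        if (columns == null) {
            return 0L;
        }
        Long value = columns.get(columnName);
        return value == null ? 0L : value;
    }

    public static void main(String[] args) {
        CounterColumnFamilySketch counters = new CounterColumnFamilySketch();
        counters.add("user42", "followers", 1);   // a new follower
        counters.add("user42", "followers", 1);   // another new follower
        counters.add("user42", "followers", -1);  // one unfollow
        System.out.println(counters.get("user42", "followers")); // prints 1
    }
}
```

Keeping only counters in this structure mirrors the rule that a counter column family holds nothing but counter columns.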

There are many ways to count/model time series data:

Rollups minute: A new column is appended for every minute. Data is overwritten for the same event, the same hour and the same minute.

[Figure: event-hour]

Rollups hour: A new column is appended for every hour. If the keys are repeated, the data is overwritten.

[Figure: event-day]

Rollups day: A new column is appended for every day. If the keys are repeated, the data is overwritten.

[Figure: event]

The best way to design depends on our use case and access patterns.
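As a rough sketch of the three rollup layouts, the helper below derives a row key and a column name from an event name and a timestamp. Reading the labels event-hour, event-day and event as row keys is an assumption, and so are the class name RollupKeys and the date formats used here; the original layouts may encode time differently.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: each method returns { rowKey, columnName } for one rollup layout.
final class RollupKeys {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");
    private static final DateTimeFormatter DAY_AND_HOUR = DateTimeFormatter.ofPattern("yyyyMMddHH");
    private static final DateTimeFormatter HOUR_OF_DAY = DateTimeFormatter.ofPattern("HH");
    private static final DateTimeFormatter MINUTE_OF_HOUR = DateTimeFormatter.ofPattern("mm");

    // Rollups minute: one row per event and hour, one column per minute.
    static String[] minuteRollup(String event, LocalDateTime time) {
        return new String[] { event + "-" + time.format(DAY_AND_HOUR), time.format(MINUTE_OF_HOUR) };
    }

    // Rollups hour: one row per event and day, one column per hour.
    static String[] hourRollup(String event, LocalDateTime time) {
        return new String[] { event + "-" + time.format(DAY), time.format(HOUR_OF_DAY) };
    }

    // Rollups day: one row per event, one column per day.
    static String[] dayRollup(String event, LocalDateTime time) {
        return new String[] { event, time.format(DAY) };
    }

    public static void main(String[] args) {
        LocalDateTime time = LocalDateTime.of(2013, 5, 17, 9, 5);
        String[] minute = minuteRollup("tweets", time);
        System.out.println(minute[0] + " / " + minute[1]); // prints tweets-2013051709 / 05
    }
}
```

Two writes that land in the same minute produce the same row key and column name, which is why they hit the same cell.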

To count the number of tweets posted in a day, refer to the data model given below:

  • Query to create a counter column
  • Query to increase a counter column
  • We can also decrease the value of a counter column
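The original queries are not reproduced here, so the following is only a sketch of what the three statements could look like, assuming CQL 3 syntax. The table tweet_count_by_day and its columns are hypothetical names. The statements are built as plain strings with ? bind markers so they can be handed to whichever Cassandra client is in use.

```java
// Hypothetical CQL 3 statements for a per-day tweet counter (names are assumptions).
final class CounterQueries {
    // One row per user, one counter cell per day. All non-key columns must be counters.
    static final String CREATE_TABLE =
            "CREATE TABLE tweet_count_by_day ("
            + "user_id text, day text, tweet_count counter, "
            + "PRIMARY KEY (user_id, day))";

    // Counters are changed with UPDATE; the new value is expressed relative to the old one.
    static String increment(long by) {
        requireNonNegative(by);
        return "UPDATE tweet_count_by_day SET tweet_count = tweet_count + " + by
                + " WHERE user_id = ? AND day = ?";
    }

    static String decrement(long by) {
        requireNonNegative(by);
        return "UPDATE tweet_count_by_day SET tweet_count = tweet_count - " + by
                + " WHERE user_id = ? AND day = ?";
    }

    private static void requireNonNegative(long by) {
        if (by < 0) {
            throw new IllegalArgumentException("amount must not be negative: " + by);
        }
    }

    public static void main(String[] args) {
        System.out.println(CREATE_TABLE);
        System.out.println(increment(1));
        System.out.println(decrement(1));
    }
}
```

There is no INSERT for a counter: the first UPDATE on a given user and day creates the cell.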
     

The information is stored in the counter column as shown below:

[Figure: data stored in the counter column]

Expiring Columns:

  • A column can also have an optional expiration period called TTL (time to live). Whenever a column is inserted, the client can specify an optional TTL value, defined in seconds, for the column.
  • TTL columns are marked as deleted (with a tombstone) after the requested amount of time has expired. Once marked with a tombstone, they are automatically removed during the normal compaction and repair processes.
  • To change the TTL of an expiring column, we have to re-insert the column with a new TTL.
  • An expiring column has an additional overhead of 8 bytes in memory and on disk (to record the TTL and expiration time) compared to standard columns. For example, a row inserted with a TTL of 60 will be deleted after 60 seconds.
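The TTL rules above can be illustrated with a small model. This is a sketch, not Cassandra's implementation: the class ExpiringColumnSketch is hypothetical, time is passed in as plain seconds to keep the example deterministic, and the CQL string assumes CQL 3 syntax with a hypothetical tweets table.

```java
// Illustrative model of an expiring column: a value, its write time and its TTL in seconds.
final class ExpiringColumnSketch {
    // Hypothetical CQL 3 insert that asks Cassandra to expire the data after 60 seconds.
    static final String INSERT_WITH_TTL =
            "INSERT INTO tweets (tweet_id, body) VALUES (?, ?) USING TTL 60";

    private final String value;
    private final long writtenAtSeconds;
    private final int ttlSeconds;

    ExpiringColumnSketch(String value, long writtenAtSeconds, int ttlSeconds) {
        this.value = value;
        this.writtenAtSeconds = writtenAtSeconds;
        this.ttlSeconds = ttlSeconds;
    }

    // The column counts as expired once ttlSeconds have elapsed since the write.
    boolean isExpired(long nowSeconds) {
        return nowSeconds >= writtenAtSeconds + ttlSeconds;
    }

    // A TTL cannot be edited in place: the column is written again with the new TTL.
    ExpiringColumnSketch reinsert(long nowSeconds, int newTtlSeconds) {
        return new ExpiringColumnSketch(value, nowSeconds, newTtlSeconds);
    }

    public static void main(String[] args) {
        ExpiringColumnSketch column = new ExpiringColumnSketch("hello", 1000L, 60);
        System.out.println(column.isExpired(1059L)); // prints false
        System.out.println(column.isExpired(1060L)); // prints true
    }
}
```

The model stops at the expiry check; tombstones and their removal during compaction are not represented.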

Dynamic Column Families:

  • In a dynamic column family (or column family with wide rows) each internal row may contain completely different sets of cells. This column family takes advantage of Cassandra’s ability to use arbitrary application-supplied column names to store data.
  • A typical example of such a column family is time series data.
  • In other words, a composite column family is also referred to as a dynamic column family.
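To make the wide-row idea concrete, here is a minimal in-memory sketch in which every row keeps its own sorted set of application-supplied column names. The class WideRowSketch and the timestamp-style column names are assumptions made for illustration; nothing here calls Cassandra.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative model of a dynamic column family: row key -> columns sorted by name.
final class WideRowSketch {
    private final Map<String, TreeMap<String, String>> rows = new HashMap<>();

    // Any column name is accepted; rows do not have to share the same columns.
    void put(String rowKey, String columnName, String value) {
        rows.computeIfAbsent(rowKey, key -> new TreeMap<>()).put(columnName, value);
    }

    // Return the columns of one row whose names fall in [from, to], in sorted order.
    NavigableMap<String, String> slice(String rowKey, String from, String to) {
        TreeMap<String, String> columns = rows.get(rowKey);
        if (columns == null) {
            return new TreeMap<>();
        }
        return columns.subMap(from, true, to, true);
    }

    public static void main(String[] args) {
        WideRowSketch timeline = new WideRowSketch();
        timeline.put("user42", "2013-05-17T09:05", "first tweet");
        timeline.put("user42", "2013-05-17T10:30", "second tweet");
        timeline.put("user42", "2013-05-18T08:00", "third tweet");
        timeline.put("user7", "bio", "a row with a completely different column");
        // All of user42's columns for 17 May, oldest first.
        System.out.println(timeline.slice("user42", "2013-05-17T00:00", "2013-05-17T23:59").keySet());
    }
}
```

Because column names sort as strings, a time series stored this way can be read back as a contiguous range of one row.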

Conclusion

We have covered the different types of data modeling in Cassandra using Twitter as an example. To find out how to improve the read performance of Cassandra, refer to this link: http://www.datastax.com/docs/1.1/operations/tuning

