Cassandra – Data Model for Twitter – Part 2

Introduction

This is the second post in a multi-part series on Composite and Compound data modeling in Apache Cassandra. It assumes a basic understanding of how to interact with Cassandra, which is covered in our Cassandra and DataStax series. Refer to the first post of this series to understand the single data model of Cassandra.

Use Case

Let’s continue with the Twitter use case: storing user information and the user’s tweets in Cassandra, organized by time.

These are the different ways we can retrieve the stored data:

  • Get the details of a user’s friends
  • Get the details of a user’s followers
  • Get all tweets by hour, day, or month. The tweets can also be filtered
    • By start time
    • By end time
    • Between a start time and an end time
  • Get a user’s timeline by hour, day, or month. The tweets can also be filtered
    • By start time
    • By end time
    • Between a start time and an end time

What we need to do:

  • Create Java programs for the following
    • To read information about friends and followers (using Twitter4j)
    • To read new tweets as they arrive (using Twitter4j-Stream and Twitter4j)
    • To store the information in Cassandra
  • Design a data model that stores the details of a user’s friends and followers, and the user’s tweets, so that the data can be retrieved efficiently.

Solution

To read the information of friends and followers (using Twitter4j)

  • Refer to ReadUserDetails.java in the previous post of this series.

To read new tweets as they arrive

 Note: Please refer to the queries.properties file in Cassandra – Data Model for Twitter – Part 1.

To store information in Cassandra

 Design a data model to store the details of a user’s friends and followers, and the user’s tweets

  • Query to create a table to store a user’s followers
    • If we store all follower details in a static Column Family with screen_name as the Row key, each insert overwrites the previous one, and only the last inserted data remains available for that screen_name.
    • We could instead combine follower_screen_name and screen_name into a Compound key, so that each follower is inserted as its own row in a normal Column Family. With this layout, however, we cannot retrieve the followers of a specific user, because a Column Family with a Compound key cannot be queried with only one part of its partition key.
    • If we store the data with follower_screen_name as the Row key, we cannot identify whom that person follows. Cassandra does not support joins, so we store the data with screen_name as the Row key (partition key) and the follower as the Composite key. This lets us query all followers, or one specific follower, of a user by screen name, and we can also view the details of followers alone by allowing filtering.

The stored information is shown below:

[Image: usr (stored follower data)]
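The followers model described above could be expressed in CQL roughly as follows. This is only a sketch: the table name user_followers and the columns follower_name and follower_location are hypothetical, and only screen_name and follower_screen_name come from the text. The statements are kept as Java constants so they can be handed to whichever Cassandra client the application uses.

```java
import java.util.Arrays;
import java.util.List;

/**
 * CQL statements for a followers table (sketch).
 * Assumed names: user_followers, follower_name, follower_location.
 */
final class FollowerSchema {

    // screen_name is the partition key, so all followers of one user share a partition.
    // follower_screen_name is the clustering column: one row per follower, which means
    // inserting a second follower does not overwrite the first.
    static final String CREATE_TABLE =
        "CREATE TABLE user_followers ("
        + "screen_name text, "
        + "follower_screen_name text, "
        + "follower_name text, "
        + "follower_location text, "
        + "PRIMARY KEY (screen_name, follower_screen_name))";

    // All followers of a user: only the partition key is restricted.
    static final String SELECT_ALL_FOLLOWERS =
        "SELECT * FROM user_followers WHERE screen_name = ?";

    // One specific follower of a user: partition key plus clustering column.
    static final String SELECT_ONE_FOLLOWER =
        "SELECT * FROM user_followers WHERE screen_name = ? AND follower_screen_name = ?";

    // Lookup by follower alone skips the partition key, so the query has to
    // scan partitions and must opt in with ALLOW FILTERING.
    static final String SELECT_BY_FOLLOWER =
        "SELECT * FROM user_followers WHERE follower_screen_name = ? ALLOW FILTERING";

    static List<String> allStatements() {
        return Arrays.asList(
            CREATE_TABLE, SELECT_ALL_FOLLOWERS, SELECT_ONE_FOLLOWER, SELECT_BY_FOLLOWER);
    }

    public static void main(String[] args) {
        for (String cql : allStatements()) {
            System.out.println(cql + ";");
        }
    }
}
```

The last query is the "allowing filtering" case mentioned above; it is likely to be much slower than the two queries that restrict screen_name.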

A similar data model is used to store friends’ details, tweet information, and tweet information by hour, day, and month, by changing the Column key.

To store tweet information, use the screen name as the Row key and the published time as the Column key. With this data model we can look up data under any of the conditions listed earlier, since all of the data sits on a single node. However, every query then has to filter its result out of one big row, which slows read performance on every hit.

To speed up retrieval, store the tweets by hour, day, and month, with the hour, day, or month as the Row key of a separate Column Family. Since Cassandra writes data very quickly, we can afford to duplicate the data as needed.

The data stored in the tweets information Column Family is shown below:

[Image: usr1 (stored tweets data)]
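The hour, day, and month Row keys described above have to be derived from each tweet's published time before the tweet is written to the corresponding Column Family. A minimal sketch of that step is shown here; the class name, the key formats, and the choice of UTC are assumptions for illustration, not something the series prescribes.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Derives time-bucket keys from a tweet's published time (sketch, UTC assumed). */
final class TweetBuckets {

    private static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);
    private static final DateTimeFormatter MONTH =
        DateTimeFormatter.ofPattern("yyyy-MM").withZone(ZoneOffset.UTC);

    // Key for the per-hour Column Family, e.g. "2014-01-14-10".
    static String hourBucket(Instant publishedAt) {
        return HOUR.format(publishedAt);
    }

    // Key for the per-day Column Family, e.g. "2014-01-14".
    static String dayBucket(Instant publishedAt) {
        return DAY.format(publishedAt);
    }

    // Key for the per-month Column Family, e.g. "2014-01".
    static String monthBucket(Instant publishedAt) {
        return MONTH.format(publishedAt);
    }

    public static void main(String[] args) {
        Instant publishedAt = Instant.parse("2014-01-14T10:15:30Z");
        System.out.println(hourBucket(publishedAt));
        System.out.println(dayBucket(publishedAt));
        System.out.println(monthBucket(publishedAt));
    }
}
```

Each tweet is then written once per bucket, which is the deliberate duplication the text refers to.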

  • Query to create the Column Family to store the tweets based on day for a user:
    We could store the data with the screen name as the Row key and the published day and time as Composite keys, but retrieving data from such wide rows is slow. For better performance we use both a Compound and a Composite key: the Compound key combines the screen name and the published day, and the published time is the Column key, which makes retrieval faster.

The stored information is shown below:

[Image: usr2 (stored tweets-by-day data)]

A similar data model is used to store tweet information by hour, day, and month, by changing the row and column keys. With the data stored this way, we can query the tweet_day Column Family by minute and hour.
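One possible reading of the tweet_day model, written as CQL held in Java constants, is sketched here. The column names published_day, published_time, and tweet are assumptions, as is the exact shape of the primary key; only the tweet_day name and the idea of keying by screen name, day, and published time come from the text. The helper builds the three filters from the use case: by start time, by end time, or between the two.

```java
/** CQL for a per-day tweets table and its time-range queries (sketch). */
final class TweetDayQueries {

    // (screen_name, published_day) is the compound partition key, so each
    // partition holds one user's tweets for one day. published_time is the
    // clustering column, which keeps tweets sorted by time inside the partition.
    static final String CREATE_TABLE =
        "CREATE TABLE tweet_day ("
        + "screen_name text, "
        + "published_day text, "
        + "published_time timestamp, "
        + "tweet text, "
        + "PRIMARY KEY ((screen_name, published_day), published_time))";

    // Builds a SELECT for one user and one day, optionally bounded by a
    // start time, an end time, or both. Bind values follow the order of the
    // placeholders: screen_name, published_day, then start, then end.
    static String selectTweets(boolean hasStart, boolean hasEnd) {
        StringBuilder cql = new StringBuilder(
            "SELECT * FROM tweet_day WHERE screen_name = ? AND published_day = ?");
        if (hasStart) {
            cql.append(" AND published_time >= ?");
        }
        if (hasEnd) {
            cql.append(" AND published_time <= ?");
        }
        return cql.toString();
    }

    public static void main(String[] args) {
        System.out.println(CREATE_TABLE);
        System.out.println(selectTweets(false, false)); // whole day
        System.out.println(selectTweets(true, false));  // from a start time
        System.out.println(selectTweets(false, true));  // up to an end time
        System.out.println(selectTweets(true, true));   // between two times
    }
}
```

Because every query names the full partition key, each read touches a single day's partition rather than filtering one large row per user.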

  • Challenges:
    • On January 14th, 2014, connections to api.twitter.com will be restricted to TLS/SSL connections only. If the application still uses plaintext HTTP connections, we need to update it to use HTTPS; otherwise the app will stop functioning. The SSL requirement will be enforced on all api.twitter.com URLs, including all steps of OAuth and all REST API resources, so we must use SSL to connect to api.twitter.com. Any well-established HTTP client library can connect to an SSL-enabled server, and the required change is usually a matter of updating a few lines of code or configuration.

We can use SSL to connect to api.twitter.com in two ways:

    • By passing -Dhttp.useSSL=true as a JVM argument.
    • By calling twitter4j.conf.ConfigurationBuilder.setUseSSL(true)

If you don’t use SSL, the following exception is thrown:

  • For the Oracle/Sun JVM:
  • For OpenJDK it throws
  • When passing -Dhttp.useSSL=true as a JVM argument with OpenJDK, if the certificate is missing from the keystore, it throws
    First, check whether the certificate is already in the keystore by running the following command:

keytool -list -keystore "%JAVA_HOME%/jre/lib/security/cacerts"

If the certificate is missing, download it and then add it to the keystore with the following command:

keytool -import -noprompt -trustcacerts -alias <AliasName> -file <certificate> -keystore <KeystoreFile> -storepass <Password>

After importing, run the first command again to check if the certificate is added.

Sun/Oracle information can be found here.

  • The exception below occurs when an application exceeds the rate limit for a given API endpoint.
    The rate limiting model allows a wider range of requests through per-method request limits. There are two initial buckets available for GET requests: 15 calls every 15 minutes and 180 calls every 15 minutes.

    Some application features are simply impossible under rate limiting, especially those that depend on fresh results. If the application needs real-time information, use the Streaming API. Click here for more details.
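To make the two buckets above concrete, the following sketch computes how far apart calls must be spaced to stay within a limit if they are spread evenly over the window. The class and method names are hypothetical, and even spacing is just one simple pacing strategy, not something the Twitter API requires.

```java
import java.time.Duration;

/** Computes even spacing between API calls for a given rate limit (sketch). */
final class RateLimitPacing {

    // Minimum gap between calls so that callsPerWindow calls, spread evenly,
    // fit inside one rate-limit window.
    static Duration minSpacing(int callsPerWindow, Duration window) {
        if (callsPerWindow <= 0) {
            throw new IllegalArgumentException("callsPerWindow must be positive");
        }
        return window.dividedBy(callsPerWindow);
    }

    public static void main(String[] args) {
        Duration window = Duration.ofMinutes(15);
        // 15 calls per 15 minutes: one call per minute.
        System.out.println(minSpacing(15, window).getSeconds() + " s");
        // 180 calls per 15 minutes: one call every 5 seconds.
        System.out.println(minSpacing(180, window).getSeconds() + " s");
    }
}
```

With the 15-call bucket, polled data can be up to a minute stale, which is why the Streaming API is the better fit for real-time use.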

Conclusion

Cassandra performs well when time series data is stored in a Compound and Composite Column Family data model rather than in a Super Column Family.

Look out for our post on the advanced data model of Cassandra later in this series.

