Twitter Analysis with Apache Storm

Introduction

Apache Storm is a free and open-source, distributed, real-time computation system. The following characteristics of Storm make it ideal for real-time data processing workloads:

  • Scalable: Storm scales up to massive numbers of messages per second.
  • Guarantees no data loss: Storm guarantees that every message will be processed.
  • Extremely robust: It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.
  • Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary.
  • Programming language agnostic: Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.

Components of a Storm cluster:

  • Nimbus: The master node runs a daemon called “Nimbus” that is similar to Hadoop’s “JobTracker”. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
  • Zookeeper: ZooKeeper handles coordination between Nimbus and the Supervisors.
  • Supervisor: Each worker node runs a daemon called the “Supervisor”, which communicates with Nimbus through ZooKeeper and starts and stops worker processes based on what Nimbus has assigned to it.

The following abstractions help us understand how Storm processes data:

  • Tuples – an ordered list of elements.
  • Streams – an unbounded sequence of tuples.
  • Spouts – sources of streams in a computation.
  • Bolts – process input streams and produce output streams.
  • Topologies – a directed graph of Spouts and Bolts.

[Figure: Storm data]
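To make the tuple and stream abstractions concrete, here is a minimal, self-contained sketch, assuming the org.apache.storm.tuple classes from Storm 1.x or later: a Fields object names the schema shared by every tuple on a stream, and a Values object is one tuple conforming to that schema (the field names here are illustrative, not from the project code).

    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class TupleSchemaDemo {
        public static void main(String[] args) {
            // A stream's schema is an ordered list of named fields...
            Fields schema = new Fields("tweetId", "screenName", "text");
            // ...and a tuple is an ordered list of values matching that schema.
            Values tuple = new Values(42L, "someUser", "Hello, Storm!");
            System.out.println(schema.toList() + " -> " + tuple);
        }
    }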

Use case

In this use case, we will describe how to process a stream of tweets coming from a Twitter account and write every incoming tweet’s details to a file.

What we need to do:

  • Create a Data Model.
  • Create a Spout that emits a stream of tweets from a Twitter account.
  • Create Bolts that receive the tweets from the Spout and process them.
  • Create a Topology with the Spouts and Bolts.
  • Run the Storm Topology.
  • Scrutinize the output.

Solution

Before solving our use case, let’s get some prerequisites satisfied.

Prerequisites

Creating a new Storm project is just a matter of adding the Storm library and its dependencies to the Java classpath, since we are going to run the Topology in local mode.

Data Model

Storm implements a data flow model in which data flows continuously through a network of transformation entities (see the figure below).

[Figure: Data Model]

In this use case we have created two Bolts (DetailsExtractorBolt and RetweetDetailsExtractorBolt) to process the tweet information. All of this processing could in fact be done in a single Bolt; we use two Bolts instead in order to demonstrate the Storm Joins concept.

Create a Spout

The Spout emits a stream of tweets from the Twitter account: it reads the tweets from the Twitter API and emits them into the Topology.

  • The following class ‘TwitterSpout.java’ emits a stream of tweets from the Twitter account:
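The original listing is not reproduced here; below is a minimal sketch of what such a Spout can look like, assuming Storm 1.x or later (org.apache.storm packages; older releases used backtype.storm) and the Twitter4J streaming client, whose OAuth credentials are read from a twitter4j.properties file. The output field name “tweet” and the use of the public sample stream are illustrative assumptions, not the article’s actual code.

    import java.util.Map;
    import java.util.concurrent.LinkedBlockingQueue;

    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    import twitter4j.Status;
    import twitter4j.StatusAdapter;
    import twitter4j.TwitterStream;
    import twitter4j.TwitterStreamFactory;

    public class TwitterSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private LinkedBlockingQueue<Status> queue;  // buffers tweets between threads
        private TwitterStream twitterStream;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.queue = new LinkedBlockingQueue<>(1000);

            // Twitter4J pushes statuses on its own thread; we just enqueue them.
            twitterStream = new TwitterStreamFactory().getInstance();
            twitterStream.addListener(new StatusAdapter() {
                @Override
                public void onStatus(Status status) {
                    queue.offer(status);
                }
            });
            twitterStream.sample();  // start consuming the public sample stream
        }

        @Override
        public void nextTuple() {
            Status status = queue.poll();
            if (status == null) {
                Utils.sleep(50);     // nothing buffered yet; avoid busy-spinning
            } else {
                collector.emit(new Values(status));  // one tuple per tweet
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("tweet"));  // assumed field name
        }

        @Override
        public void close() {
            twitterStream.shutdown();
        }
    }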

Create Bolts

Bolts receive tweets from the Spout and process them: they extract the required tweet information and write it out to a file.

Here we have created the following bolts:

  • DetailsExtractorBolt.java: This Bolt retrieves the tweet information details and emits tuples (sketched after the class descriptions below).
  • RetweetDetailsExtractorBolt.java: This Bolt retrieves the re-tweet details.
  • FileWriterBolt.java: This Bolt joins the tuples from both Bolts created above and writes all the details to a CSV file.

FileDetails.java: This is a bean class for all tweet details.

TweetDetailsManager.java: This class uses a map to avoid data conflicts between the two Bolts.

ApplicationConstants.java: Declares the application-specific constants.
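As an illustration, a minimal DetailsExtractorBolt consistent with the Spout sketch above could look like the following (the field names are assumptions, not the project’s actual code); RetweetDetailsExtractorBolt would have the same shape but pull the re-tweet fields from the Status object.

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    import twitter4j.Status;

    public class DetailsExtractorBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            // Pull the raw Status emitted by TwitterSpout and extract basic details.
            Status status = (Status) input.getValueByField("tweet");
            collector.emit(new Values(status.getId(),
                                      status.getUser().getScreenName(),
                                      status.getText()));
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // "tweetId" is the common field later used for the fields grouping/join.
            declarer.declare(new Fields("tweetId", "screenName", "text"));
        }
    }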

Create Topology with Spouts and Bolts

In order to perform real-time computations on Storm, we need to create Topologies. A Topology is a graph of computations. Each node in a Topology contains processing logic, and links between nodes indicate how data should be passed between the nodes.

The following concepts explain the logic underlying our Topology:

  • Parallelism: In Storm’s terminology, parallelism is specifically used to describe the so-called parallelism hint, which indicates the initial number of executors (threads) of a component. In our use case we have used parallelism 2, for both DetailsExtractorBolt and RetweetDetailsExtractorBolt.
  • Groupings: In this use case, two types of groupings have been used; they are:
    • Shuffle grouping: Tuples are randomly distributed across the Bolts’ tasks, in a way such that each Bolt is guaranteed to get an equal number of tuples.
    • Fields grouping: The stream is partitioned by the fields specified in the grouping.
  • Storm joins: A streaming join combines two or more data streams based on a common field. In our use case, we need the tuples from both DetailsExtractorBolt and RetweetDetailsExtractorBolt to arrive in the FileWriterBolt; this is achieved using Storm Joins. The resulting wiring is sketched after this list.
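Putting these concepts together, the Topology wiring could look like the following sketch (the component IDs and the join field “tweetId” are assumptions carried over from the earlier sketches):

    import org.apache.storm.generated.StormTopology;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class TopologyWiring {
        public static StormTopology build() {
            TopologyBuilder builder = new TopologyBuilder();

            builder.setSpout("twitter-spout", new TwitterSpout());

            // Parallelism hint of 2: each extractor Bolt starts with two executors.
            // Shuffle grouping spreads the Spout's tuples evenly across them.
            builder.setBolt("details-bolt", new DetailsExtractorBolt(), 2)
                   .shuffleGrouping("twitter-spout");
            builder.setBolt("retweet-details-bolt", new RetweetDetailsExtractorBolt(), 2)
                   .shuffleGrouping("twitter-spout");

            // Fields grouping on the common "tweetId" field routes matching tuples
            // from both extractor Bolts to the same FileWriterBolt task, which is
            // what makes the streaming join possible.
            builder.setBolt("file-writer-bolt", new FileWriterBolt())
                   .fieldsGrouping("details-bolt", new Fields("tweetId"))
                   .fieldsGrouping("retweet-details-bolt", new Fields("tweetId"));

            return builder.createTopology();
        }
    }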

Running a Storm Topology in local mode

In local mode, Storm executes completely in-process by simulating worker nodes with threads. Local mode is useful for testing and developing Topologies.

Execute the MainTopology.java class, which deploys the Topology in local mode.
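A minimal sketch of such a MainTopology class, assuming the TopologyWiring sketch above:

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;

    public class MainTopology {
        public static void main(String[] args) throws Exception {
            Config conf = new Config();

            // Local mode: the "cluster" is simulated with threads in this JVM.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("twitter-analysis", conf, TopologyWiring.build());
            // The Topology now runs until the JVM exits; see the shutdown-hook
            // discussion under Challenges below.
        }
    }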

Scrutinize the output

We will find the user_tweets.csv file in the project, containing the details of every processed tweet.

Challenges

Issue: If we try to get tuples from the streams of both Bolts without identifying the Bolt from which a particular stream was emitted, an exception occurs at runtime.

Solution: To resolve this exception, in the FileWriterBolt, include a condition that identifies the stream’s source Bolt using tuple.getSourceComponent(). (Please refer to the FileWriterBolt.java file for the details; a sketch follows.)
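A sketch of how FileWriterBolt can branch on the source component; the join bookkeeping itself (handled by TweetDetailsManager in the actual project) is elided, and the component IDs match the wiring sketch above:

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    public class FileWriterBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            // Identify the emitting Bolt before reading any fields: the two
            // incoming streams carry different schemas.
            String source = input.getSourceComponent();
            if ("details-bolt".equals(source)) {
                long tweetId = input.getLongByField("tweetId");
                // ... buffer the tweet-details side of the join under tweetId ...
            } else if ("retweet-details-bolt".equals(source)) {
                // ... buffer the re-tweet side; once both sides of a tweetId
                // have arrived, append the joined row to user_tweets.csv ...
            }
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal Bolt: it writes to a file and emits no further stream.
        }
    }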

Issue: If we keep the Topology running by means of the Thread.sleep() method, an exception occurs when the sleep time limit is exceeded.

Solution: In our use case, the Topology has to stay in the running state to receive the real-time stream. Hence, to overcome the above exception, instead of using Thread.sleep() we have used the Runtime.getRuntime().addShutdownHook() method, which registers a hook that is invoked before the JVM shuts down.
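Continuing the MainTopology sketch from earlier, the end of main() could register the hook like this (the cluster variable must be final or effectively final to be captured by the anonymous class):

    // At the end of MainTopology.main(), after submitTopology(...):
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
        @Override
        public void run() {
            // Runs when the JVM shuts down (e.g. on Ctrl+C): stop the Topology
            // cleanly instead of relying on a fixed Thread.sleep() window.
            cluster.killTopology("twitter-analysis");
            cluster.shutdown();
        }
    }));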

Conclusion

We have successfully processed tweet information from a Twitter account using Apache Storm. Storm can process millions of messages per second with even a small cluster, and it makes it easy to write and scale complex real-time computations on a cluster of computers, making it ideal for real-time data processing. In addition, Storm is easy to operate once deployed, and it has been designed to be extremely robust – the cluster will keep on running, month after month.
