Text Normalization with Spark – Part 1

Overview

Numerous methods, such as text mining, Natural Language Processing (NLP), and information retrieval, exist for analyzing unstructured data. Due to the rapid growth of unstructured data in all kinds of businesses, scalable solutions have become the need of the hour. Apache Spark comes with out-of-the-box algorithms for text analytics, and it also supports custom development of algorithms that are not available by default. In this blog post, our main goal is to perform basic text normalization using a simple regular-expression technique with Apache Spark; in the next blog post we will decipher Spark stages, jobs, and DAGs.

Use Case Source

Book: Business at the speed of thought – Bill Gates.txt
This book inspires everybody to take action for their future. It was written by Bill Gates, the great billionaire and head of the huge computer company Microsoft, and was published in 1999. Since it is a book about business written by one of the greatest businessmen, everyone, including me, would guess that the most popular word in this book is “business”. Let us find out using Spark.

Pre-requisites

    • Set up the Hadoop cluster and Spark. We have set up a 4-node cluster using the HDP distribution, with the Spark2 client installed on 3 of the nodes.
    • Install Simple Build Tool (SBT) and use the “sbt assembly” command to create an uber JAR (aka fat JAR or super JAR) with all its dependencies.

sbt assembly
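
For reference, a minimal sbt-assembly configuration along these lines can produce such an uber JAR. The plugin, Scala, and Spark versions below are assumptions; adjust them to match your environment.

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt
name := "MostPopularWords"
version := "1.0"
scalaVersion := "2.11.8"
// Spark is provided by the cluster at runtime, so it is excluded from the uber JAR
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"

Running “sbt assembly” with this setup produces a JAR named MostPopularWords-assembly-1.0.jar under target/scala-2.11/.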

We have created three JARs for this use case, as follows:

    1. MostPopularWords-assembly-1.0.jar – Finds the distinct words and their counts from the raw data.
    2. MostPopularWordsBetter-assembly-1.0.jar – Applies a regular expression to keep only words and remove special characters. The results are also sorted by count to find the most popular words.
    3. NormalizedMostPopularWords-assembly-1.0.jar – In addition to the regular expression, uses an English stop-word list to remove common English words from the output. This produces the final results as expected.

[Screenshot: uber JARs]

  • Upload the JARs to the host machine where the Spark2 client is installed in order to submit the jobs.
  • Upload the input files “Business at the speed of thought – Bill Gates.txt” and “stopwords_en.txt” into HDFS.

Note: In our use case, both input and output reside in HDFS. The Spark jobs are submitted to a YARN cluster.

Text Normalization Steps

The following steps are used to identify the most and least popular words:

  • Find distinct words and their counts from the raw data.
  • Apply a regular expression to keep only words and remove special characters.
  • Apply a stop-word list to filter out common English words.

Find distinct words and their counts from the raw data

This is the classic word-count example: find the distinct words and their counts from the raw input data.
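
A minimal sketch of such a word-count job in Scala is given below. The object name and argument handling are our assumptions, not necessarily the exact source behind MostPopularWords-assembly-1.0.jar.

import org.apache.spark.{SparkConf, SparkContext}

object MostPopularWords {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MostPopularWords")
    val sc = new SparkContext(conf)

    // args(0) = input HDFS path, args(1) = output HDFS path
    val counts = sc.textFile(args(0))
      .flatMap(line => line.split("\\s+")) // split each line on whitespace
      .map(word => (word, 1))              // pair each word with a count of 1
      .reduceByKey(_ + _)                  // sum the counts per word

    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}

The assembly JAR is then submitted to YARN along the lines of (class name assumed): spark-submit --class MostPopularWords --master yarn MostPopularWords-assembly-1.0.jar <input path> <output path>.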

Notes:

  • “/user/tsldp/book/Business_at_the_Speed_of_Thought-Bill_Gates.txt” – Input raw data HDFS path argument.
  • “/user/tsldp/book/most_popular_words” – Output HDFS path argument.

Output:

[Screenshot: most_popular_words result]

Problem: The output contains word variants with different capitalization, punctuation, tab and space delimiters, numeric characters, and so on.

Apply a regular expression to keep only words and remove special characters

Apply a regular expression with Spark's filter function to keep only alphabetic words that are longer than two characters.
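
A sketch of the filtering and sorting added in this step is shown below. The exact regular expression used in MostPopularWordsBetter-assembly-1.0.jar may differ, and lowercasing is our assumption to address the capitalization problem noted earlier.

// args(0) = input HDFS path, args(1) = output HDFS path
val counts = sc.textFile(args(0))
  .flatMap(line => line.split("\\s+"))
  .map(word => word.toLowerCase)                              // normalize capitalization (assumption)
  .filter(word => word.matches("[a-z]+") && word.length > 2)  // keep only alphabetic words longer than 2 characters
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)                            // most popular words first

counts.saveAsTextFile(args(1))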

Notes:

  • “/user/tsldp/book/Business_at_the_Speed_of_Thought-Bill_Gates.txt” – Input raw data HDFS path argument.
  • “/user/tsldp/book/most_popular_words_better” – Output HDFS path argument.

Output:

[Screenshot: most_popular_words_better result]
Problem: The words with the highest counts are “the”, “and”, “that”, and so on, which are common English words. Let us filter those out of the output in the next step using stopwords_en.txt.

Apply a stop-word list to filter out common English words

We have used Spark's “broadcast” shared variable to achieve distributed caching. Broadcast variables are useful when large datasets need to be cached on executors. “stopwords_en.txt” is not a large dataset, but we use it in our use case to demonstrate the feature.

What are Broadcast Variables?
Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors. Without broadcast variables, these variables would be shipped to each executor for every transformation and action, which can cause network overhead. With broadcast variables, they are shipped once to all executors and cached for future reference.
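
A sketch of how the stop-word list can be broadcast and applied is shown below. The argument positions follow the notes further down; the names are our assumptions.

// args(0) = input text, args(1) = stop-word file, args(2) = output path
val stopWords = sc.textFile(args(1)).collect().toSet  // small file, safe to collect to the driver
val stopWordsBC = sc.broadcast(stopWords)             // shipped once to every executor and cached

val counts = sc.textFile(args(0))
  .flatMap(line => line.split("\\s+"))
  .map(_.toLowerCase)
  .filter(word => word.matches("[a-z]+") && word.length > 2)
  .filter(word => !stopWordsBC.value.contains(word))  // drop common English words
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)

counts.saveAsTextFile(args(2))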

[Screenshot: spark-submit for NormalizedMostPopularWords]

Notes:

  • “/user/tsldp/book/Business_at_the_Speed_of_Thought-Bill_Gates.txt” – Input raw data HDFS path argument.
  • “/user/tsldp/book/stopwords_en.txt” – Input stopwords file HDFS path.
  • “/user/tsldp/book/most_popular_words” – Output HDFS path argument.

Output:

[Screenshot: normalized_most_popular_words output]
Preview output in HDFS:

[Screenshot: normalized output preview in HDFS]

Conclusion

The most popular words from the book about business, written by the great businessman Bill Gates, are:

  • information – 242
  • people – 181
  • company – 162
  • business – 158
  • digital – 144

It is very interesting: although the book is all about business, the word “business” is not its most popular word.

All jobs in Spark comprise a series of operators and run on a set of data. All the operators in a job are used to construct a DAG (Directed Acyclic Graph), which is optimized by rearranging and combining operators where possible. We will decipher Spark jobs, stages, and DAGs in detail in the next blog post.
