Natural Language Toolkit

Introduction

This is the second part of our series on building a web crawler in Python with Scrapy. Once crawling is complete, all products and their prices are stored in separate .txt files, one per website and named after the site. In this part we analyze that stored content and display the results, using NLTK and Matplotlib. Please refer to our previous blog post, “Web Crawler – Python with Scrapy”, for a basic understanding.

NLTK – The Natural Language Toolkit (NLTK) is a Python package for natural language processing. Natural language processing is used almost everywhere: in search engines, spell checkers, mobile phones, computer games, and even washing machines. NLTK’s suite of libraries has become one of the most widely used tools for natural language processing in Python.

Matplotlib – A Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.

A sample file of crawled content is shown below:

[Image: sample crawled content]

Use Case

The output (all web names, product names, and brand popularity) is produced from the crawled and stored content.

What we need to do:

  • Create a config file.
  • Read all text files and convert them to a single string.
  • Tokenize the string and locate the required text.
  • Show the output as charts.

Solution

Before solving our use case, let’s get some pre-requisites satisfied.

Pre-requisites:

Create a config file:

  • The configuration file keeps all shared values in one place: strings, numbers, lists, dictionaries, tuples, file names, and so on.
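
As an illustration, such a configuration file could be a small Python module like the sketch below. Every name and value in it (DATA_DIR, FILE_PATTERN, BRANDS, and so on) is an assumption made for this example; the original configuration file is not shown.

```python
"""config.py: shared settings for the analysis scripts (hypothetical sketch)."""
from pathlib import Path

# Directory where the crawler stored one .txt file per website (assumed name).
DATA_DIR = Path("crawled_data")

# Regular expression selecting the files to load (assumed pattern).
FILE_PATTERN = r".*\.txt"

# Brands whose popularity we want to measure (made-up example values).
BRANDS = ("Acme", "Zenith", "Orion")

# Chart settings shared by all plots (assumed values).
CHART_TITLE = "Brand popularity"
CHART_SIZE = (8, 5)  # width, height in inches
CHART_COLORS = {"Acme": "tab:blue", "Zenith": "tab:orange", "Orion": "tab:green"}
```

Other scripts can then use `import config` and read values such as `config.BRANDS`, so a changed path or brand list only needs to be edited in one place.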

Read and convert text files to single string:

  • Read the text files from stored data.
  • Tokenize the text. We will use the following tokenizers:
    • word_tokenize – Shows all web names in the text content.
    • BlanklineTokenizer – Shows all product names in the text content.
    • WhitespaceTokenizer – Shows brand popularity in the text content.
  • Display the output as charts.
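
The reading and tokenizing steps above can be sketched roughly as follows. This is a minimal sketch, not the original code: the sample data, the file layout (one file per site, product entries separated by blank lines), and the brand list are all assumptions. word_tokenize is left out because it needs NLTK's Punkt tokenizer data to be downloaded first; here the web names are read from the file names instead, which is a substitution for illustration only.

```python
import tempfile
from pathlib import Path

from nltk import FreqDist
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import BlanklineTokenizer, WhitespaceTokenizer

# Hypothetical crawled data: one file per website, product entries
# separated by blank lines. The real file layout may differ.
SAMPLE_FILES = {
    "shopone.txt": "Acme Phone X\n$199\n\nZenith Laptop 15\n$899\n",
    "shoptwo.txt": "Acme Tablet 8\n$149\n\nAcme Phone X\n$189\n",
}
BRANDS = {"Acme", "Zenith"}  # assumed brand list, e.g. taken from the config file


def analyze(data_dir, brands):
    """Return (web names, product names, brand counts) for the .txt files in data_dir."""
    corpus = PlaintextCorpusReader(str(data_dir), r".*\.txt")
    file_ids = corpus.fileids()

    # Web names come from the file names, since each file is named after its site.
    web_names = [Path(file_id).stem for file_id in file_ids]

    # Join all files into one string, keeping a blank line between files.
    text = "\n\n".join(corpus.raw(file_id) for file_id in file_ids)

    # BlanklineTokenizer splits on blank lines, giving one chunk per product entry;
    # the first line of each chunk is assumed to be the product name.
    chunks = BlanklineTokenizer().tokenize(text)
    products = [chunk.splitlines()[0] for chunk in chunks]

    # WhitespaceTokenizer splits on spaces and newlines; counting the tokens that
    # match a known brand gives a simple popularity measure.
    tokens = WhitespaceTokenizer().tokenize(text)
    popularity = FreqDist(token for token in tokens if token in brands)
    return web_names, products, popularity


# Write the sample files to a temporary directory and analyze them.
with tempfile.TemporaryDirectory() as tmp:
    for name, content in SAMPLE_FILES.items():
        Path(tmp, name).write_text(content, encoding="utf-8")
    web_names, products, popularity = analyze(tmp, BRANDS)

print(web_names)
print(products)
print(popularity.most_common())
```

FreqDist behaves like a counter, so `popularity.most_common()` returns the brands ordered from most to least frequent, which is a convenient input for the charts.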

A bar chart displays and compares data. It shows how often each category occurs, one bar per category, which lets us compare and interpret the data quickly.

  • Horizontal bar chart
    • Sample output: [Image: horizontal bar graph]

  • Vertical bar chart
    • Sample output: [Image: vertical bar graph]
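
A horizontal and a vertical bar chart could be drawn with Matplotlib along these lines. This is only a sketch: the brand counts are made up, and the function name plot_bar_charts is hypothetical rather than taken from the original code.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical brand counts; real values would come from the tokenizing step.
SAMPLE_POPULARITY = {"Acme": 3, "Zenith": 1, "Orion": 2}


def plot_bar_charts(popularity):
    """Draw the same counts as a horizontal and a vertical bar chart."""
    # Sort brands from most to least frequent so the ranking is easy to read.
    ranked = sorted(popularity.items(), key=lambda item: item[1], reverse=True)
    brands = [brand for brand, _ in ranked]
    counts = [count for _, count in ranked]

    fig, (ax_h, ax_v) = plt.subplots(1, 2, figsize=(10, 4))

    ax_h.barh(brands, counts, color="tab:blue")
    ax_h.invert_yaxis()  # put the most frequent brand at the top
    ax_h.set_xlabel("Occurrences")
    ax_h.set_title("Horizontal bar chart")

    ax_v.bar(brands, counts, color="tab:orange")
    ax_v.set_ylabel("Occurrences")
    ax_v.set_title("Vertical bar chart")

    fig.tight_layout()
    return fig


fig = plot_bar_charts(SAMPLE_POPULARITY)
```

Calling `fig.savefig("brand_bars.png")` (the file name is just an example) writes the figure to disk; in an interactive session, `plt.show()` with an interactive backend displays it instead.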

  • Pie chart
    • Sample output: [Image: pie chart]

  • Dot chart
    • Sample output: [Image: dot graph]
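
The pie and dot charts could be produced in a similar way. Again this is a sketch under assumptions: the counts are invented, plot_pie_and_dot is a hypothetical name, and the dot chart is drawn here as one marker per brand, which may differ from how the original chart was built.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical brand counts; real values would come from the tokenizing step.
SAMPLE_POPULARITY = {"Acme": 3, "Zenith": 1, "Orion": 2}


def plot_pie_and_dot(popularity):
    """Draw brand counts as a pie chart and as a dot chart."""
    brands = list(popularity)
    counts = [popularity[brand] for brand in brands]

    fig, (ax_pie, ax_dot) = plt.subplots(1, 2, figsize=(10, 4))

    # Pie chart: each wedge is a brand's share of all brand mentions.
    ax_pie.pie(counts, labels=brands, autopct="%1.0f%%")
    ax_pie.set_title("Pie chart")

    # Dot chart: one marker per brand, positioned at its count.
    ax_dot.plot(counts, brands, "o")
    ax_dot.set_xlim(left=0)
    ax_dot.grid(axis="x", linestyle=":")
    ax_dot.set_xlabel("Occurrences")
    ax_dot.set_title("Dot chart")

    fig.tight_layout()
    return fig


fig = plot_pie_and_dot(SAMPLE_POPULARITY)
```

A pie chart suits a small number of brands where shares matter; the dot chart stays readable when there are many brands with similar counts.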

Conclusion

  • We used word_tokenize, BlanklineTokenizer, and WhitespaceTokenizer because they are convenient to work with.
  • NLTK’s corpus reader is useful because it loads all the files quickly.
  • Finding text is easy with a tokenizer.
  • NLTK and Matplotlib are both easy to learn.
