Apache Nutch – A Web Crawler Framework

Introduction

This is the first post in a multi-part series on Apache Nutch, an open source web crawler framework written in Java and another popular project built on Apache Lucene. The main objective of the framework is to scrape unstructured data from disparate resources such as RSS, HTML, CSV, and PDF, and structure it for searching. Apache Nutch can also manage recrawling efficiently. Apache Lucene plays an important role in helping Nutch index and search. Apache Nutch is similar in functionality to other crawlers such as Crawl-Anywhere (Java), Scrapy (Python), and Heritrix (Java).

The main differentiators of Apache Nutch include the following:

Extensibility: The Apache Nutch framework lets users extend it with customized functionality through interfaces such as Parse, Index, and ScoringFilter.
Pluggable: Apache Nutch configuration follows a plug-and-play style, which makes it easy to add or remove functionality from the configuration.
Obeys robots.txt rules: Apache Nutch obeys robots.txt rules while crawling a page. Nutch scrapes content from websites using the proper user-agent for robots.txt and will not scrape content from restricted sites.

Nutch Stages:

Nutch processing involves three stages:

  • Sourcing
  • Prepping
  • Loading

Sourcing

In this stage, Nutch scrapes content from disparate resources, fetching and parsing the data from each page or resource.

Prepping

In this stage, Nutch applies logic to fine-tune the parsed content, dedups (deletes duplicate) documents, and adds any additional fields required to each document. Finally, it indexes the documents and produces results.

Loading

In this stage, Nutch loads the index into full-text search engines such as Apache Solr or Elasticsearch, or into any other repository. Apache Solr and Elasticsearch let front-end users search against and integrate with the index created by Nutch.

Important deployment configuration files used in Nutch: 

core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, mapred-default.xml, mapred-site.xml, nutch-default.xml, nutch-site.xml

Configuration File – Description

  • nutch-default.xml – Contains the default properties and values used by the Nutch crawling process; located in the ${nutch_home}/conf directory.
  • nutch-site.xml – Override file for nutch-default.xml. If a property in nutch-default.xml needs a different value, copy that property into nutch-site.xml and change its value there.
  • core-default.xml, hdfs-default.xml, mapred-default.xml – Used for Hadoop configuration. They are located inside hadoop-core-xxx.jar; properties can be copied from these XML files and customized as required in the core-site.xml, hdfs-site.xml, and mapred-site.xml files in the ${nutch_home}/conf directory.
  • mapred-default.xml – Configures the MapReduce implementation (which runs the jobs) used by Nutch.
  • hdfs-default.xml – Configures the Hadoop Distributed File System (HDFS) used by Nutch.

Note: The files with the "site.xml" suffix are supplementary to the corresponding default files; properties and their values can be customized there. This avoids conflicts when using properties with customized values.

Example: nutch-default.xml – nutch-site.xml, core-default.xml – core-site.xml, mapred-default.xml – mapred-site.xml
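As a minimal sketch of this override mechanism, a nutch-site.xml could look like the snippet below. The property used here, db.fetch.interval.default, is one of the standard properties defined in nutch-default.xml; any other property can be overridden the same way.

    <?xml version="1.0"?>
    <configuration>
      <!-- Overrides the value defined in nutch-default.xml:
           recrawl pages every 7 days instead of the default 30 days. -->
      <property>
        <name>db.fetch.interval.default</name>
        <value>604800</value>
      </property>
    </configuration>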

Data Structure

Apache Nutch crawls a page in a multi-step process: inject, generate, fetch, parse, update, and index. Each step runs as a MapReduce job. MapReduce takes some data as input for each of these steps and generates different data as output; for example, crawldb and linkdb are data structures generated by the inject and update steps.

Many articles, blogs, and tutorials explain the Apache Nutch framework.
Please go through the link below to learn more about the data structures in Nutch.

https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-16/nutch-search-engine
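As a rough sketch (assuming the crawl output directory is called "crawler", as in the use case below), the data generated by these steps ends up in a layout like this:

    crawler/
      crawldb/     known URLs with their fetch status and scores (inject, updatedb)
      linkdb/      inverted link graph (invertlinks)
      segments/    one subdirectory per fetch round, holding crawl_generate,
                   crawl_fetch, content, parse_data and parse_text
      indexes/     per-segment indexes created by the index step
      index/       the final merged, deduplicated Lucene index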

This series is broken down into the following multi-part blogs:

  • Apache Nutch – A Web Crawler Framework (this blog)
  • Apache Nutch with Mime Type Hacking
  • Apache Nutch with New Parser (Extensible Functionality)
  • Apache Nutch with AdaptiveFetchSchedule Algorithm

Apache Nutch Framework v1.1

This is the first version of Nutch that can crawl source pages, follow redirected URLs, and also crawl the content of those redirected pages; earlier versions do not have this feature. Apache Nutch v1.1 uses Lucene-based indexing to index documents. This release also comes with major upgrades to the existing libraries on which Nutch depends (Hadoop, Solr, Tika, etc.).

Advantages over Nutch 0.9

  • Nutch 0.9 crawls only the source page content, which limits content quality. Nutch 1.1 can crawl a source page or resource up to n depth levels: it crawls the source content of a page, follows the page's redirect URLs, and crawls the content of those redirected URLs as well, which improves content quality.
  • Upgrade of Apache Lucene from 2.9 to 3.0.4

Use Case

Let's start with a use case: crawling RSS pages with the predefined parser plug-in parse-tika.

What we need to do:

  • Insert source URLs into a file.
  • Add the parse-tika parser plug-in to the plugin.includes property and set up other properties.
  • Scrape the pages and index the documents using the crawl command.

Solution

Before solving our use case, let's get the prerequisites in place. Apache Nutch needs a Linux-like environment to run the Nutch scripts, and the Luke (lukeall) tool to read the index.

Prerequisites:

  • JDK 1.6+
  • Apache Nutch v1.1 setup
  • Cygwin, if working in a Windows environment, to run the Nutch crawl commands (scripts)
  • Luke (lukeall) tool to read the index created by the Nutch crawl command

Nutch v1.1 binary: http://archive.apache.org/dist/nutch/apache-nutch-1.1-bin.zip
Cygwin (Win 32-bit): http://cygwin.com/setup-x86.exe
Cygwin (Win 64-bit): http://cygwin.com/setup-x86_64.exe
Luke (lukeall) tool: https://luke.googlecode.com/files/lukeall-3.5.0.jar

  • Verify:
    • Download and extract the apache-nutch-1.1-bin binary package.
    • Download and install Cygwin. After installation, run {cygwin_home}/cygwin.bat to open a Cygwin shell.
    • Go to ${nutch_home} in the Cygwin prompt and run "bin/nutch" to verify that the Nutch setup is correct. It will list the available commands, roughly as shown below.
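The exact command list varies between versions, but the usage listing printed by a bare "bin/nutch" looks roughly like this (abridged sketch):

    $ bin/nutch
    Usage: nutch COMMAND
    where COMMAND is one of:
      crawl             one-step crawler for intranets
      inject            inject new urls into the database
      generate          generate new segments to fetch from crawl db
      fetch             fetch a segment's pages
      parse             parse a segment's pages
      updatedb          update crawl db from segments after fetching
      invertlinks       create a linkdb from parsed segments
      readdb            read / dump crawl db
      ...
    Most commands print help when invoked w/o parameters.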

Insert source URLs into a file

  • Insert source URLs: Insert any number of source URLs that you want Nutch to crawl (one URL per line) into a file called "seed.txt". For this use case, insert three RSS page URLs, one per line, into the seed.txt file (a hypothetical example is shown below). During the injection process, these URLs are converted to crawldb entries and injected into the crawldb.
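A hypothetical seed.txt is sketched below; the feed URLs are placeholders rather than the actual pages used here, but they follow the same pattern: the first two URLs end with "limit=10/xml" and the third does not.

    http://www.example.com/feeds/news?limit=10/xml
    http://www.example.com/feeds/sports?limit=10/xml
    http://www.example.com/feeds/general.rss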

Challenges:

  • When Nutch reads URLs from the seed.txt file and injects them into the crawldb, it reads the regex patterns from "${nutch_home}/conf/regex-urlfilter.txt" to check whether each URL is allowed to proceed to the crawling process. By default, Nutch ships with a regex pattern that allows all URLs; one such example is shown in the sketch below.
  • However, we need to allow only URLs that end with "limit=10/xml" and restrict the others. For this, we have to add a condition in the regex-urlfilter.txt file, as sketched below.
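As a sketch, the default catch-all rule and the modified rules could look like this. Each line in regex-urlfilter.txt is a rule: a leading "+" accepts URLs matching the regex that follows, a leading "-" rejects them.

    # default rule shipped with Nutch: accept anything else
    +.

    # modified rules for this use case:
    # accept only URLs ending with "limit=10/xml" ...
    +limit=10/xml$
    # ... and reject everything else
    -.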

Note: Nutch will now filter and inject only the first two URLs from seed.txt into the crawling process and skip the last URL.

Add the parse-tika parser plug-in to the plugin.includes property

  • Set the "plugin.includes" property in the nutch-site.xml file to add the functionality/endpoints needed for the crawling process and to avoid conflicts.
  • All plug-ins are activated through the "plugin.includes" property only. Every Nutch job reads the "plugin.includes" property and processes all activated plug-ins. The parse-tika plug-in is activated for the parsing job by adding it to this property, as sketched below.
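A minimal sketch of the property in nutch-site.xml is shown below. The exact default value differs between Nutch versions; the point is that parse-tika appears in the pipe-separated list of activated plug-ins, alongside the other endpoints used in this use case.

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming the plug-ins to activate;
      parse-tika enables the Tika-based parser used for the parsing job.</description>
    </property>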

Other endpoint uses

Endpoint – Description

  • protocol-http – Scrapes HTTP-based source pages
  • protocol-ftp – Accesses the File Transfer Protocol for crawling file resources
  • protocol-httpclient – Scrapes HTTPS-based source URLs and handles page authentication
  • urlfilter-regex – Reads the file name from the "urlfilter.regex.file" property and helps Nutch check each input or seed URL against the regular expressions given in regex-urlfilter.txt. This endpoint extension is essential for the challenges described in this post
  • index-basic – Adds basic fields such as url, content, title, cache, and tstamp to each document when the indexing process starts
  • index-more – Adds further fields such as Last-Modified, Content-Length, and Content-Type to each document when the indexing process starts
  • We can implement one or more endpoints in plugin.includes for a single or multiple crawling processes. Every endpoint implemented in this property is processed at its corresponding stage. Some endpoint types are:
    • Protocol
    • Parsers
    • HtmlParseFilter
    • ScoringFilter
    • URLFilter
    • URLNormalizers
    • IndexingFilters
  • Set Up Other Properties
    • The "http.agent.name" property sets the user-agent name that Nutch uses to obey the robots.txt rules when scraping content from a server. The allowed user-agents vary from one server to another. To check a server's user-agent rules, open the domain with "/robots.txt" appended, as given below (a sample property setting is sketched after the robots.txt example).

http://${domain}/robots.txt

Example: http://wiki.apache.org/robots.txt

    • Open the above URL in a web browser; you can see the robots rules (allows and disallows) for every sub-link, per user-agent.
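A sample setting in nutch-site.xml is sketched below; the agent name is only a placeholder and should be replaced with a user-agent that the target servers' robots.txt rules allow.

    <property>
      <name>http.agent.name</name>
      <value>MyNutchCrawler</value>
      <description>User-agent name sent with HTTP requests and matched
      against the robots.txt rules of the crawled sites.</description>
    </property>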
  • Scrape the pages and index the documents using the crawl command
  • Scrape the pages
    • Make a directory named "crawler" in ${nutch_home} to store the input and output data processed by the MapReduce jobs.
    • Open Cygwin, go to the ${nutch_home} directory, and run the crawl command to start the scrape process (a sketch of the command is given after this list).
    • This command first injects the URLs into crawler/crawldb and generates segments for those URLs in crawler/segments. It then fetches and parses the segments, updates crawler/crawldb with the results, indexes the documents into crawler/indexes, dedups (deletes duplicate) documents, and finally merges the deduped indexes into the crawler/index directory.
    • The process is recorded in the log file ${nutch_home}/logs/hadoop.log and also streams to the Cygwin console. The last line of the console output, "crawl finished: crawler", confirms that the command completed successfully.
    • After the crawl completes successfully, run the Luke (lukeall) jar, then browse to and open the crawler/index directory in that tool. Luke shows details of the index such as the number of documents, the number of fields, and the field values of each document.
    • This use case scrapes the content from the two allowed URLs and creates two documents (one per URL) with fields such as boost, content, title, and segment, using the parse-tika parser for the parsing job. The number of documents created by the crawler can be seen in Luke.
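A sketch of the crawl command used above is given below. The depth and topN values are illustrative, and "urls" is assumed to be the directory containing seed.txt.

    # run from ${nutch_home} in the Cygwin shell
    $ bin/nutch crawl urls -dir crawler -depth 3 -topN 50

    # -dir    output directory for the crawldb, linkdb, segments and index
    # -depth  number of generate/fetch/parse/update rounds to run
    # -topN   maximum number of pages to fetch per round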

Conclusion

  • Apache Nutch can scrape unstructured content from many kinds of resources, such as RSS, HTML, CSV, PDF, XML, and Twitter, and structure it. It also provides crawl management facilities to run any number of scraping processes with different functionality.
  • Apache Nutch can run on Apache Hadoop clusters.
  • Apache Nutch gives us the freedom to add our own functionality to the crawling process.
  • The next post in this series will focus on Apache Nutch with Mime Type Hacking, which deals with mapping a parser plug-in to a particular MIME type for the parse job.

References

http://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-common/core-default.xml
http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
https://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
http://wiki.apache.org/nutch/nutch-default.xml

 
