Apache Nutch with Custom Parser

Introduction

This is the third post in a multi-part series on Apache Nutch. It shows how to create a custom parser plug-in and configure Nutch to use it to scrape content. A basic understanding of Nutch and its configuration is assumed; both are covered in the first and second posts, “Apache Nutch – A Web Crawler Framework” and “Apache Nutch with customized Mime – Type”, respectively.

Use Case

Let’s reuse the RSS feed example from the first post to see how URLs are crawled and documents are indexed with a custom Feed plug-in instead of the pre-defined parse-tika plug-in.

What we need to do:

  • Write a Feed program to parse the content
  • Prepare the custom plug-in’s metadata files (plugin.xml and build.xml) with the required configuration
  • Build the custom parser
  • Reconfigure parse-plugins.xml to use the custom parser
  • Crawl and index the content

Solution

Before solving our use case, let’s satisfy a few pre-requisites. Fortunately, only a couple of jar files are needed to create the custom feed parser plug-in.

Pre-requisites:

  • Download rome.jar

http://repo1.maven.org/maven2/rome/rome/0.9/rome-0.9.jar

  • Download jdom.jar

http://repo1.maven.org/maven2/jdom/jdom/1.0/jdom-1.0.jar

  • Configure other required jars from Nutch library folder

${nutch_home}/lib/nutch1.1.jar
${nutch_home}/lib/hadoop-0.20.2-core.jar
${nutch_home}/lib/commons-logging-1.0.4.jar

Note: Add all of the above jars to the build path in Eclipse, or in whichever environment you prefer, to develop this plug-in.

Write a Feed program to parse the content:

  • ParseFeed.java
  • Compile the Java file, package it as parse-feed.jar, and place it at

 ${nutch_home}/plugins/${plugin_name}/parse-feed.jar

Prepare metadata files:

Prepare the custom plug-in’s metadata files (plugin.xml and build.xml) with the required configuration:

  • plugin.xml
    • This file needs to be located in
      ${nutch_home}/plugins/${plugin_name}/
    • This plugin.xml file tells Nutch about the parse-feed plug-in.
    • The parser plug-in is mapped to two mime-types: application/xml and text/xml.
    • The plug-in id matters because the other configuration files refer to the plug-in by this id.
  • build.xml 
    • Write the Ant build file so that it imports build-plugin.xml, which defines the basic targets shared by all plug-ins.
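
As a rough illustration of the two files described above, a plugin.xml for the parse-feed plug-in might look like the sketch below. The plug-in id parse-feed, the two mime-types, and the jar names come from this post; the package and class name org.example.nutch.parse.feed.ParseFeed, the version, and the provider name are hypothetical placeholders, so substitute the values that match your own ParseFeed.java. The pipe-separated contentType value follows the convention used by bundled parser plug-ins such as parse-html.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of ${nutch_home}/plugins/parse-feed/plugin.xml -->
<plugin
   id="parse-feed"
   name="Feed Parser Plug-in"
   version="1.0.0"
   provider-name="example.org">

   <runtime>
      <!-- The plug-in's own jar, exported so Nutch can load its classes -->
      <library name="parse-feed.jar">
         <export name="*"/>
      </library>
      <!-- Dependency jars from the pre-requisites section -->
      <library name="rome-0.9.jar"/>
      <library name="jdom-1.0.jar"/>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <!-- Registers the class as an implementation of Nutch's Parser extension point.
        The ids and class name below are hypothetical placeholders. -->
   <extension id="org.example.nutch.parse.feed"
              name="Feed Parser"
              point="org.apache.nutch.parse.Parser">
      <implementation id="org.example.nutch.parse.feed.ParseFeed"
                      class="org.example.nutch.parse.feed.ParseFeed">
         <parameter name="contentType" value="application/xml|text/xml"/>
         <parameter name="pathSuffix" value=""/>
      </implementation>
   </extension>
</plugin>
```

A matching build.xml can be very small, since the shared build-plugin.xml supplies the targets. This sketch assumes the plug-in sources live in ${nutch_home}/src/plugin/parse-feed, so that build-plugin.xml sits one directory up:

```xml
<?xml version="1.0"?>
<!-- Sketch of ${nutch_home}/src/plugin/parse-feed/build.xml -->
<project name="parse-feed" default="jar-core">
   <!-- Reuse the targets shared by all plug-ins -->
   <import file="../build-plugin.xml"/>
</project>
```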

Create a Plug-in with Ant:

  • Plug-in Structure: The plug-in source file (ParseFeed.java), plugin.xml, and build.xml must be laid out in the directory structure shown below.

[Figure: plug-in directory structure]

  • Build with Ant: To build the parse-feed parser plug-in, add entries for it to the build.xml file located in ${nutch_home}/src/plugin.
  • Create and copy the parse-feed plug-in  
    • Once all the files have been created, edited, and saved in their locations, run Ant against ${nutch_home}/build.xml from the command prompt to compile the sources and generate parse-feed.jar.
    • This builds all the plug-ins under the ${nutch_home}/build directory and creates the parse-feed plug-in, with its required files and jars, in ${nutch_home}/build/plugins/parse-feed.
    • Copy the parse-feed folder from the ${nutch_home}/build/plugins directory into ${nutch_home}/plugins, so that the plug-in ends up at ${nutch_home}/plugins/parse-feed.
    • The parse-feed directory contains the plug-in’s plugin.xml, parse-feed.jar, and its dependency jars, which are located in the lib folder of the parse-feed directory.
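
For the build step above, the entries added to ${nutch_home}/src/plugin/build.xml typically take the form sketched here. This is a fragment rather than a complete file: the existing entries for the bundled plug-ins are omitted, and the directory name parse-feed is an assumption based on the plug-in name used in this post.

```xml
<!-- Fragment of ${nutch_home}/src/plugin/build.xml; existing entries are omitted -->
<target name="deploy">
   <!-- ... entries for the bundled plug-ins ... -->
   <ant dir="parse-feed" target="deploy"/>
</target>

<target name="clean">
   <!-- ... entries for the bundled plug-ins ... -->
   <ant dir="parse-feed" target="clean"/>
</target>
```

With entries like these in place, running Ant from ${nutch_home} should build parse-feed along with the other plug-ins.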

Reconfigure with custom Parser:

  • parse-plugins.xml
    • By default, the RSS mime-types application/xml and text/xml are mapped to the pre-defined parse-tika parser, so comment out those mappings in the parse-plugins.xml file.
    • Add application/xml and text/xml mime-type mappings that point to the custom parser.
    • Add an alias that maps parse-feed to the extension implementation id declared in the parse-feed plug-in’s plugin.xml file.
    • To learn more about parse-plugins.xml mappings, refer to the second post in this series.
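
The mappings and alias described above might look like the following fragment of conf/parse-plugins.xml. It is a sketch under assumptions: the plug-in id parse-feed and the two mime-types come from this post, while the extension-id is a hypothetical placeholder that must be replaced with the implementation id actually declared in your plugin.xml.

```xml
<!-- Fragment of conf/parse-plugins.xml -->
<parse-plugins>

   <!-- Previously mapped to parse-tika; now routed to the custom parser -->
   <mimeType name="application/xml">
      <plugin id="parse-feed" />
   </mimeType>

   <mimeType name="text/xml">
      <plugin id="parse-feed" />
   </mimeType>

   <aliases>
      <!-- extension-id must match the implementation id in the plug-in's
           plugin.xml; the class name here is a hypothetical placeholder -->
      <alias name="parse-feed"
             extension-id="org.example.nutch.parse.feed.ParseFeed" />
   </aliases>

</parse-plugins>
```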

Crawl and index the content:

  • For Nutch to use the parse-feed plug-in, add the parse-feed id to plugin.includes in conf/nutch-site.xml. Remove the parse-tika parser id if it is present in plugin.includes.
  • Run the Nutch crawl command from the Cygwin command prompt to crawl and index the content of a page.
  • After the crawl completes successfully with the parse-feed parser plug-in, run the luke-all jar tool and open the crawler/index directory in it. Luke shows details of the index, such as the number of documents, the number of fields, and the field values for each document, as shown below:

[Figure: Lucene index toolbar]

  • Compare the result image given above with the result image in the first post (Apache Nutch – A Web Crawler Framework).
  • The document count differs between the two.

[Figure: parser table]

  • The parse-feed plug-in adds one document for each SyndEntry found in the RSS pages, which helps us get accurate content from each URL.
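
For the plugin.includes step at the start of this section, the override in conf/nutch-site.xml might look like the sketch below. Only the presence of parse-feed and the absence of parse-tika come from this post; the remaining plug-in ids in the value are illustrative and should match whatever your installation already lists.

```xml
<!-- Fragment of conf/nutch-site.xml -->
<configuration>
   <property>
      <name>plugin.includes</name>
      <!-- parse-feed is added and parse-tika is left out; the other entries
           are illustrative, not a required set -->
      <value>protocol-http|urlfilter-regex|parse-feed|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   </property>
</configuration>
```

Because the value is a regular expression over plug-in ids, a plug-in that is built and copied into place but not matched here will not be loaded.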

Conclusion

  • Apache Nutch’s plug-in architecture lets us extend Nutch with our own logic through interfaces for parsing, indexing, and scoring (such as ScoringFilter).
  • A later post in this series will focus on Apache Nutch integration with Hadoop MapReduce, where we will see how Nutch executes its jobs.

References

  • Writing plug-in in Nutch:

http://wiki.apache.org/nutch/WritingPluginExample-1.2

  • Piyush Golapkar

    I’m currently using apache-nutch 1.9.
    I have integrated it with Elasticsearch 1.3.2, so the data gets indexed in Elasticsearch.
    Now I need to write a plug-in within Nutch that parses the data before indexing it. It should only take the data as per my requirements. Can you please help me with this?

  • Asmat Ali

    Exactly the same question as Piyush asked, but for Nutch 1.12 and Solr 6.3. Could you please help?