Apache Nutch with customized Mime-Type

Introduction

This is the second post in a multi-part series on Apache Nutch. It explains how Nutch uses mime-types to choose the parser for the content of a page, and how to customize that mapping. Parsing is one of the most important stages in the multi-step crawling process. Nutch ships with several parsers for different kinds of resources, and the parser it picks depends on the mime-type of the resource. Let’s see how Nutch maps mime-types to parsers, and then how to remap a parser to another existing or new mime-type. A basic understanding of how to configure Nutch to scrape resources is assumed; it is covered in the first post of this series.

Nutch scrapes the content using the following process:

  • Inject process – Injects all URLs to be crawled.
  • Fetch process – Fetches the full content of a page or resource.
  • Parse process – Parses the content of a page or resource, but only if the mime-type of the resource matches a parser.
  • Update process – Updates the URLs with the new status of the resources.
  • Index process – Indexes the documents, which contain the data gathered from the parsed content.

Note: If no parser matches the mime-type of a resource, Nutch does not parse its content.

Use Case

Let’s take RSS feeds as the use case. Feeds are served with different mime-types, such as application/xml, text/xml (XML), application/rss+xml (RSS), application/atom+xml (Atom), and application/rdf+xml. By default, Nutch configures the parse-tika parser to support all of the above mime-types. The parse-tika adapter handles all of these mime-types with the help of the Feed API.

What we need to do:

  • Look into the parse-plugins.xml file and its details
  • Change the mime-type mapping in the parse-plugins.xml file

Solution

Look into the parse-plugins.xml file and its details

  • When a crawl runs, Nutch reads the parse-plugins.xml file and selects the parser using the mime-type mapping defined there. The parse-plugins.xml file is located in the ${nutch_home}/conf directory. If a mime-type has no matching mapping at all, Nutch skips parsing for that resource.
  • Each parser plug-in is mapped to the mime-types it handles.
  • The parse-tika plug-in is mapped to the wildcard mime-type (*), so it supports all kinds of resources.
    Note: Because of this wildcard mapping, any mime-type that is not explicitly mapped is handled by parse-tika.
  • The aliases tag provides an alias name for each parser plug-in adapter and its extension ID.
    Example: for the parse-tika plug-in, the alias name is parse-tika and its extension ID is “org.apache.nutch.parse.tika.TikaParser”.
  • Plug-ins used in the parse-plugins.xml file must be available in the ${nutch_home}/plugins directory, each with its own plugin.xml file. For further details about plug-ins, see the next post in this series, Apache Nutch with New Custom Parser.
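
To make the structure described above concrete, here is a trimmed sketch of what a parse-plugins.xml file can look like. It is modeled on the layout of the stock file but is not a verbatim copy. The text/html entry is included only as an assumed example of an explicit mapping, and the exact entries in your Nutch version may differ.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: check conf/parse-plugins.xml in your own install. -->
<parse-plugins>

  <!-- Wildcard mapping: any mime-type without an explicit entry goes to parse-tika. -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- An explicit mapping (assumed example): HTML pages go to parse-html. -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>

  <!-- Aliases tie each plugin id used above to the extension that implements it. -->
  <aliases>
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>

</parse-plugins>
```

The id in each plugin element matches an alias name declared in the aliases block, which is how Nutch gets from a mime-type to a concrete parser extension.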

Change the mime-type mapping in the parse-plugins.xml file

  • Map a parser plug-in to another mime-type
    Suppose another resource should be handled by the parse-feed plug-in. To do that, we add a mapping from that resource’s mime-type (xxxx) to the parse-feed plug-in.
  • Once this mime-type mapping is added to the parse-plugins.xml file, Nutch can successfully fetch and parse the content of resources that have the mime-type (xxxx).
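
As a sketch of the mapping described above, the fragment below adds an entry for the new mime-type. The value xxxx is the same placeholder used in the text and must be replaced with the real mime-type of your resource. The plugin id feed and the FeedParser extension ID are assumptions based on how the feed parser is registered in Nutch 1.x distributions as far as we know; this post calls the plug-in parse-feed, so use whichever id the plugin.xml in your install declares.

```xml
<!-- Fragment to place inside the <parse-plugins> element of conf/parse-plugins.xml. -->

<!-- "xxxx" is a placeholder, not a real mime-type. -->
<mimeType name="xxxx">
  <!-- Assumed plugin id for the feed parser; verify against your plugin.xml. -->
  <plugin id="feed" />
</mimeType>

<!-- Inside the existing <aliases> element, if this alias is not already present. -->
<alias name="feed"
       extension-id="org.apache.nutch.parse.feed.FeedParser" />
```

If more than one plugin element is listed for a mime-type, the order is treated as an order of preference, so the parser you want tried first should come first. The plug-in also has to be enabled through the plugin.includes property for the mapping to take effect.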

Conclusion

  • Apache Nutch lets you change the parser for an already mapped mime-type.
  • Apache Nutch lets you add a new parser for a mime-type, or replace the parser previously mapped to it.
  • The next post in this series will focus on Apache Nutch with a new parser, where Nutch’s extension mechanism is used to build a new customized RSS parser plug-in instead of the parse-feed plug-in adapter.
