
Introduction
This is the third post in a multi-part series on Apache Nutch with a custom parser. Now let’s see how to create a custom parser plug-in and configure it with Nutch to scrape content. It is essential to have a basic understanding of Nutch and its configuration, which is covered in our first and second blogs, “Apache Nutch – A Web Crawler Framework” and “Apache Nutch with customized Mime-Type” respectively.
Use Case
Let’s reuse the RSS feed example discussed in the first blog to understand URL crawling and document indexing using a custom feed plug-in, instead of the pre-defined parse-tika plug-in.
What we need to do:
- Write a Feed program to parse the content
- Prepare the custom plug-in’s metadata files (plugin.xml and build.xml) with the required configuration
- Build the custom parser
- Reconfigure parse-plugins.xml with the custom parser
- Crawl and index the content
Solution
Before solving our use case, let’s get some prerequisites out of the way. Fortunately, only a couple of jar files are needed to create the custom feed parser plug-in.
Pre-requisites:
- Download rome.jar
http://repo1.maven.org/maven2/rome/rome/0.9/rome-0.9.jar
- Download jdom.jar
http://repo1.maven.org/maven2/jdom/jdom/1.0/jdom-1.0.jar
- Configure other required jars from Nutch library folder
${nutch_home}/lib/nutch1.1.jar
${nutch_home}/lib/hadoop-0.20.2-core.jar
${nutch_home}/lib/commons-logging-1.0.4.jar
Note: Configure all of the above jars in Eclipse, or in any preferred environment, to develop this plug-in.
Write a Feed program to parse the content:
- ParseFeed.java
package com.treselle.parse.feed;

import java.io.ByteArrayInputStream;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.StringUtils;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.EncodingDetector;
import org.xml.sax.InputSource;

import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;

public class ParseFeed implements org.apache.nutch.parse.Parser {

    public static final Log LOG = LogFactory.getLog("com.treselle.parse.feed");

    private Configuration conf = null;
    private String defaultEncoding = null;

    @Override
    public Configuration getConf() {
        return this.conf;
    }

    @Override
    public void setConf(Configuration conf) {
        this.defaultEncoding = conf.get("parser.character.encoding.default", "ISO-8859-1");
        this.conf = conf;
    }

    @Override
    public ParseResult getParse(Content content) {
        SyndFeed feed = null;
        ParseResult parseResult = new ParseResult(content.getBaseUrl());

        // Detect the character encoding of the fetched content.
        EncodingDetector detector = new EncodingDetector(this.conf);
        detector.autoDetectClues(content, true);
        String encoding = detector.guessEncoding(content, this.defaultEncoding);

        try {
            // Build the ROME SyndFeed from the raw feed content.
            String feedContent = new String(content.getContent());
            InputSource input = new InputSource(new ByteArrayInputStream(feedContent.getBytes()));
            input.setEncoding(encoding);
            SyndFeedInput feedInput = new SyndFeedInput();
            feed = feedInput.build(input);
        } catch (Exception e) {
            LOG.warn("Parse failed: url: " + content.getUrl() + ", exception: " + StringUtils.stringifyException(e));
            return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
        }

        // Add one parse entry per feed item.
        List<SyndEntry> entries = feed.getEntries();
        String feedLink = feed.getLink();
        for (Iterator<SyndEntry> i = entries.iterator(); i.hasNext();) {
            SyndEntry entry = i.next();
            addToMap(parseResult, feedLink, entry, content);
        }
        return parseResult;
    }

    /**
     * Extracts metadata from each entry (SyndEntry) and adds the result to the ParseResult object.
     *
     * @param parseResult
     * @param feedLink
     * @param entry
     * @param content
     */
    private void addToMap(ParseResult parseResult, String feedLink, SyndEntry entry, Content content) {
        LOG.info("The feedlink is " + feedLink);
        String link = entry.getLink();
        String title = entry.getTitle();
        String description = entry.getDescription().getValue();
        Metadata parseMeta = new Metadata();
        Metadata contentMeta = content.getMetadata();
        LOG.info("entry Link is : " + link);
        LOG.info("entry Title is : " + title);
        LOG.info("entry Description is : " + description);
        LOG.info("entry Updated Date is : " + entry.getUpdatedDate());
        LOG.info("entry published date is : " + entry.getPublishedDate());
        parseMeta.set("title", title);
        parseMeta.set("description", description);
        parseMeta.set("linkurl", link);
        if (entry.getPublishedDate() != null) {
            parseMeta.set("date", entry.getPublishedDate().toString());
        }
        if (entry.getUpdatedDate() != null) {
            parseMeta.set("date", entry.getUpdatedDate().toString());
        }
        LOG.info("Date " + parseMeta.get("date"));
        parseResult.put(link, new ParseText(description),
                new ParseData(ParseStatus.STATUS_SUCCESS, title, new Outlink[0], contentMeta, parseMeta));
    }
}
- Compile the Java file, package it as parse-feed.jar, and place it in the ${nutch_home}/plugins/${plugin_name}/ directory.
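Before wiring the plug-in into Nutch, it can help to smoke-test the parser against a locally saved copy of the feed. The class below is only a rough sketch and is not part of the plug-in: it assumes the Nutch 1.1 Content constructor signature (url, base, bytes, contentType, metadata, conf), that ParseResult can be iterated as (url, Parse) pairs, and it uses a hypothetical feed URL; adjust it to the API of your Nutch version.
package com.treselle.parse.feed;

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

/**
 * Hypothetical helper class (not part of the plug-in) to run ParseFeed
 * against a locally saved RSS file before deploying it into Nutch.
 */
public class ParseFeedSmokeTest {

    public static void main(String[] args) throws Exception {
        // args[0]: path to a local copy of the RSS feed
        File file = new File(args[0]);
        byte[] bytes = new byte[(int) file.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(file));
        in.readFully(bytes);
        in.close();

        // Hypothetical feed URL; only used as the document key.
        String url = "http://example.com/feed/rss";

        Configuration conf = NutchConfiguration.create();
        // Assumption: Nutch 1.1 Content constructor is (url, base, content, contentType, metadata, conf).
        Content content = new Content(url, url, bytes, "application/xml", new Metadata(), conf);

        ParseFeed parser = new ParseFeed();
        parser.setConf(conf);
        ParseResult result = parser.getParse(content);

        // Print what the plug-in extracted for each feed entry.
        for (Map.Entry<Text, Parse> entry : result) {
            Parse parse = entry.getValue();
            System.out.println("URL   : " + entry.getKey());
            System.out.println("Title : " + parse.getData().getTitle());
            System.out.println("Text  : " + parse.getText());
        }
    }
}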
Prepare metadata files:
Prepare the custom plug-in’s metadata files (plugin.xml and build.xml) with the required configuration:
- plugin.xml
<?xml version="1.0"?>
<plugin id="parse-feed" name="Feed Parse Plug-in" version="1.0.0"
   provider-name="nutch.org">
   <runtime>
      <library name="parse-feed.jar">
         <export name="*" />
      </library>
      <library name="rome-0.9.jar" />
      <library name="jdom.jar" />
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints" />
   </requires>
   <extension id="com.treselle.parse.feed" name="Feed Parser"
      point="org.apache.nutch.parse.Parser">
      <implementation id="com.treselle.parse.feed.ParseFeed"
         class="com.treselle.parse.feed.ParseFeed">
         <parameter name="contentType" value="application/xml" />
         <parameter name="contentType" value="text/xml" />
         <parameter name="pathSuffix" value="rss" />
      </implementation>
   </extension>
</plugin>
- This file needs to be located in ${nutch_home}/plugins/${plugin_name}/.
- This plugin.xml file tells Nutch about the parse-feed plug-in.
- The parser plug-in is mapped to two mime-types: application/xml and text/xml.
- The plugin id is important because it is how the plug-in is referenced from other configuration files.
- build.xml
- Create an Ant build file that imports build-plugin.xml, which provides the common build targets shared by all plug-ins.
<?xml version="1.0"?>
<project name="parse-feed" default="jar-core">
   <import file="../build-plugin.xml" />
</project>
- This build file simply imports the shared build-plugin.xml file.
Create a Plug-in with Ant:
- Plug-in Structure: The plug-in source file (ParseFeed.java), plugin.xml, and build.xml must be laid out as in the directory structure sketched below.
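A typical layout for the parse-feed plug-in source tree under ${nutch_home}/src/plugin (a sketch based on standard Nutch 1.x plug-in conventions; the package path follows ParseFeed.java, and the lib folder holds the ROME and JDOM jars):
${nutch_home}/src/plugin/parse-feed/
    plugin.xml
    build.xml
    lib/
        rome-0.9.jar
        jdom-1.0.jar
    src/
        java/
            com/
                treselle/
                    parse/
                        feed/
                            ParseFeed.java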
- Build with ant: To build the parse-feed parser plug-in, add the lines given below to the build.xml file located in ${nutch_home}/src/plugin (if a deploy target already exists there, add only the inner <ant dir="parse-feed" .../> line to it).
<target name="deploy">
   <ant dir="parse-feed" target="deploy"/>
</target>
- Create and copy the parse-feed plug-in.
- Run ant against ${nutch_home}/build.xml from the command prompt once all the files have been created, edited, and saved in their locations; this compiles the plug-in and generates parse-feed.jar.
C:\Nutch\apache-nutch-1.1-bin> ant
- This builds all the plug-ins in the ${nutch_home}/build directory and creates the parse-feed plug-in with the required files and jars in ${nutch_home}/build/plugins/parse-feed.
- Copy the parse-feed folder (plug-in) from the ${nutch_home}/build/plugins directory and paste it into ${nutch_home}/plugins/parse-feed.
- The parse-feed directory contains the plug-in’s plugin.xml, parse-feed.jar, and its dependency jars, which are located in the lib folder of the parse-feed parser directory.
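Putting the previous steps together, the deployed plug-in folder should end up looking roughly like this (a sketch based on the description above; exact jar locations may differ depending on how the build copies dependencies):
${nutch_home}/plugins/parse-feed/
    plugin.xml
    parse-feed.jar
    lib/
        rome-0.9.jar
        jdom-1.0.jar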
Reconfigure with custom Parser:
- parse-plugins.xml
- By default, the RSS mime-types application/xml and text/xml are mapped to the pre-defined parse-tika parser, so comment out those mappings in the parse-plugins.xml file (the default mappings to comment out are sketched after the alias mapping below).
- Add the application/xml and text/xml mime-type mappings for the custom parser as shown below:
<mimeType name="text/xml">
   <plugin id="parse-feed" />
</mimeType>
<mimeType name="application/xml">
   <plugin id="parse-feed" />
</mimeType>
- Add an alias mapping from parse-feed to the actual extension implementation id declared in the parse-feed plug-in’s plugin.xml file:
<aliases>
   <alias name="parse-feed" extension-id="com.treselle.parse.feed.ParseFeed" />
</aliases>
- To learn more about parse-plugins.xml mappings, refer to the second blog in this series.
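For reference, the default parse-tika entries that need to be commented out would look roughly like this (a sketch; the exact defaults depend on the Nutch version):
<!--
<mimeType name="text/xml">
   <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/xml">
   <plugin id="parse-tika" />
</mimeType>
-->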
To crawl and index the content
- If we want Nutch to use this parse-feed plug-in, we have to add the parse-feed id to plugin.includes in conf/nutch-site.xml, as shown below. Remove the parse-tika parser id if it is present in plugin.includes.
<property>
   <name>plugin.includes</name>
   <value>protocol-(http|ftp|httpclient)|urlfilter-regex|parse-feed|index-(basic|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   <description>Regular expression naming plug-in directory names to include. Any plug-in not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plug-in. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>
- Use the command given below in the cygwin command prompt to crawl and index the content of a page:
User@Pc-name /cygdrive/c/Nutch/apache-nutch-1.1-bin
$ bin/nutch crawl seed.txt -dir crawler -depth 1 -topN 5
- After the crawl completes successfully using the parse-feed parser plug-in, run the luke-all jar tool and open the crawler/index directory in it. Luke shows the details of the index, such as the number of documents, the number of fields, and the field values for each document.
- Compare this result with the one shown in the first blog (Apache Nutch – A Web Crawler Framework).
- Notice the change in the document count.
- The parse-feed plug-in adds one document for each SyndEntry found in the RSS pages, which helps us get accurate content for each URL.
Conclusion
- The Apache Nutch plug-in architecture enables us to extend Nutch with our own logic through extension points such as Parser, IndexingFilter, and ScoringFilter.
- The next part of the series will focus on Apache Nutch integration with Hadoop MapReduce, where we will see how Nutch executes all of its jobs.
References
- Writing a plug-in in Nutch:
http://wiki.apache.org/nutch/WritingPluginExample-1.2