How to develop Apache Nutch’s plugin (4) IndexingFilter extension-point

In the previous post, I explained that Nutch’s Language Identification plugin has 3 extensions, on the HtmlParseFilter, IndexingFilter and QueryFilter extension-points. Of these, the IndexingFilter extension is the one that handles the language identification procedure.
So this post looks at how to develop an extension for the IndexingFilter extension-point. Extensions for other extension-points can be developed in the same way.

What’s IndexingFilter

The IndexingFilter extension-point is an interface for adding meta-data to the search index.
It is invoked by the Indexer MapTask on Hadoop. Extensions on other extension-points are likewise invoked by various MapTask jobs (Injector, Generator, Fetcher, Parser and so on).
So debugging a Nutch extension amounts to debugging a Hadoop MapTask.

IndexingFilter is a Java interface, org.apache.nutch.indexer.IndexingFilter, which has 2 methods.

void addIndexBackendOptions(Configuration conf);
To define the fields to add to the search index, and their constraints. Use LuceneWriter.addFieldOptions().
NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks);
To extract meta-data and add it to the NutchDocument object that goes into the search index.

If the extension-point extends org.apache.hadoop.conf.Configurable (most do), it has 2 more methods.

void setConf(Configuration conf);
To store the org.apache.hadoop.conf.Configuration object and initialize the extension. The Configuration object provides access to the parameters set in nutch-(default|site).xml.
Configuration getConf();
To return the stored Configuration object.
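
The setConf()/getConf() contract can be tried out without a Hadoop installation. The following is a minimal sketch, not Nutch code: StubConfiguration is a hypothetical stand-in for org.apache.hadoop.conf.Configuration that mimics only set() and getFloat(), and the parameter name sample.filter.threshold is made up for illustration. It shows an extension keeping the object it receives and reading one parameter with a default value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration.
// It mimics only set() and getFloat(); the real class is what gives
// access to the parameters in nutch-(default|site).xml.
class StubConfiguration {
    private final Map<String, String> values = new HashMap<>();

    void set(String name, String value) {
        values.put(name, value);
    }

    float getFloat(String name, float defaultValue) {
        String raw = values.get(name);
        return raw == null ? defaultValue : Float.parseFloat(raw);
    }
}

// Follows the setConf()/getConf() pattern described above.
class ConfiguredExtension {
    // Hypothetical parameter name, used only in this sketch.
    static final String THRESHOLD_KEY = "sample.filter.threshold";

    private StubConfiguration conf;
    private float threshold;

    public void setConf(StubConfiguration conf) {
        // Keep the object so that getConf() can return it later.
        this.conf = conf;
        // Initialize the extension from it, falling back to a default.
        this.threshold = conf.getFloat(THRESHOLD_KEY, 0.5f);
    }

    public StubConfiguration getConf() {
        return conf;
    }

    public float getThreshold() {
        return threshold;
    }
}
```

With an empty StubConfiguration, getThreshold() falls back to the default; after conf.set(ConfiguredExtension.THRESHOLD_KEY, "0.9") and another setConf(conf) call, it returns the configured value. In a real plugin the value would come from nutch-site.xml instead of a set() call.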

Since these methods are not documented (…), read the sources of the standard Nutch plugins that use the target extension-point.

Skeleton of IndexingFilter

For Nutch plugin development, we need the following libraries from the apache-nutch-1.2-bin.tar.gz archive.

  • nutch-1.2.jar
  • lib/hadoop-0.20.2-core.jar

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Parse;

public class SampleFilter implements IndexingFilter {
	private Configuration conf = null;

	public SampleFilter() {
		// Constructor with no arguments
	}

	public void setConf(Configuration conf) {
		this.conf = conf;
		// TODO: initialize this extension,
		//       e.g. with conf.getFloat([PARAMETER NAME], [DEFAULT VALUE]) or similar methods
	}

	public Configuration getConf() {
		return this.conf;
	}

	// The above is common for most extension-points.

	// The below is for IndexingFilter extension-point.

	public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
			CrawlDatum datum, Inlinks inlinks) throws IndexingException {
		// TODO: Main procedure of this extension
		return doc; // a return value is required for the method to compile
	}

	public void addIndexBackendOptions(Configuration conf) {
		// Definition of meta-data fields to add

		// Example: 
		LuceneWriter.addFieldOptions("lang", LuceneWriter.STORE.YES,
				LuceneWriter.INDEX.UNTOKENIZED, conf);
	}
}
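
To make the shape of filter() concrete, here is a minimal sketch that runs with the JDK alone. It is not Nutch code: StubDocument is a hypothetical stand-in for NutchDocument, the Map argument stands in for the meta-data carried by the Parse argument, and the "language" key and the "unknown" fallback are assumptions made for illustration. It only shows the pattern of taking the document, adding a "lang" field to it, and returning it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for NutchDocument: field name -> list of values.
class StubDocument {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    List<String> get(String name) {
        return fields.getOrDefault(name, new ArrayList<>());
    }
}

class LangFieldSketch {
    // Mirrors the shape of filter(): take the document, add meta-data, return it.
    // parseMeta stands in for the meta-data available from the Parse argument.
    static StubDocument filter(StubDocument doc, Map<String, String> parseMeta) {
        String lang = parseMeta.get("language"); // hypothetical key
        if (lang == null) {
            lang = "unknown"; // assumed fallback when no language is available
        }
        // "lang" matches the field name declared in addIndexBackendOptions().
        doc.add("lang", lang);
        return doc;
    }
}
```

Calling LangFieldSketch.filter() with a map that contains "language" adds that value as the "lang" field; with an empty map the field is set to the fallback value.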
