Should language profiles be bundled or not?

I’m going to add Maven support to language-detection, but I have some trouble deciding how to handle the language profiles…

language-detection keeps the language profiles separate from the jar file because there are a lot of them (the jar without profiles is only around 100KB, while a jar with them is over 1MB).
However, there are some reasonable requests to bundle them in the jar file:

  • With separated profiles, the application has to know the directory where they are installed.
  • Hadoop can’t distribute language profiles that are outside the jar file.
  • I want to register the library in Maven Central, and a profile-bundled jar is more useful there.

Also, a user told me how to generate a jar file containing only selected profiles using Maven, so a developer who wants a slim library can build a jar with just the necessary profiles (or even none), as in the sketch below.
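
As a rough sketch of that idea (this is my illustration, not necessarily what the user suggested): if the profiles were bundled as jar resources, a downstream project could strip the ones it doesn't need while building its own jar with the maven-shade-plugin. The coordinates com.cybozu.labs:langdetect and the profiles/ resource path below are assumptions for the example.

<!-- pom.xml sketch: exclude unneeded bundled profiles when shading the dependency -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <filters>
          <filter>
            <!-- hypothetical coordinates of the profile-bundled library -->
            <artifact>com.cybozu.labs:langdetect</artifact>
            <excludes>
              <!-- assumed resource path; drop all bundled profiles for a slim jar -->
              <exclude>profiles/**</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>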

So I'm now considering bundling all language profiles in the jar shipped with the language-detection library. What do you think?

Hadoop Development Environment with Eclipse

A typical Hadoop application needs a jar file to distribute to the worker nodes, and rebuilding that jar over and over during development is very tedious…
So I set up Eclipse to run a Hadoop application (a MapReduce job) in standalone mode. This makes debugging easier, since you can simply use Run or Debug from Eclipse.

I assume Eclipse is already set up; see here if it isn't yet.
By the way, I don't use Hadoop's Eclipse plugin. I tried it but couldn't get it to work: the one for Hadoop 0.20 fails at 'Run on Hadoop', and the one for Hadoop 0.21 can't even register Hadoop's NameNode.

Create a project for Hadoop

In this article the Hadoop version is 0.20.2, even though the latest release is 0.21.0 at the time of writing.
This is because 0.21 has many troubles and no documentation for the updated API, and Mahout (0.4 and 0.5-SNAPSHOT) supports only Hadoop 0.20.2.

Now import Hadoop as an Eclipse project as follows. This has to be done manually, since Hadoop doesn't provide a Maven project.

  • Create a new Java Project in Eclipse and name it “hadoop-0.20.2”.
  • Download the Hadoop 0.20.2 archive (hadoop-0.20.2.tar.gz) and import it into the above project.
    Open the Archive File import dialog (File > Import > General > Archive File from the Eclipse menu), specify hadoop-0.20.2.tar.gz as the archive file, and set the above project as the “Into folder”.
  • Put the Apache Ant library (ant.jar) into the lib folder of the project.
    Download apache-ant-1.8.2-bin.zip from here, extract ant.jar from it, and copy it into $WORKSPACE/hadoop-0.20.2/hadoop-0.20.2/lib/ .
  • Rewrite $WORKSPACE/hadoop-0.20.2/.classpath to add the source folders and the necessary libraries to the project.
    <?xml version="1.0" encoding="UTF-8"?>
    <classpath>
    	<classpathentry kind="src" path="hadoop-0.20.2/src/core"/>
    	<classpathentry kind="src" path="hadoop-0.20.2/src/mapred"/>
    	<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-1.6"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/commons-logging-1.0.4.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/xmlenc-0.52.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/commons-net-1.4.1.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/kfs-0.2.2.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/jets3t-0.6.1.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/servlet-api-2.5-6.1.14.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/jetty-6.1.14.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/jetty-util-6.1.14.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/commons-codec-1.3.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/log4j-1.2.15.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/commons-cli-1.2.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/ant.jar"/>
    	<classpathentry kind="lib" path="hadoop-0.20.2/lib/commons-httpclient-3.0.1.jar"/>
    	<classpathentry kind="output" path="bin"/>
    </classpath>
    
  • Refresh the project: right-click the hadoop-0.20.2 project and select “Refresh”.

That completes the Hadoop import. Eclipse is already building Hadoop in the background.

Run Word Count Sample

Let's run a sample application on Hadoop in Eclipse to confirm the setup.

  • Create a Java Project for the sample application and name it appropriately (e.g. wordcount).
    Click Next in the Create Java Project dialog (not Finish!) and set up the project and library references.

    • Add the above hadoop-0.20.2 project on the Projects tab.
    • Add hadoop-0.20.2/hadoop-0.20.2/lib/commons-cli-1.2.jar on the Libraries tab.

    If you already clicked Finish, open the Java Build Path dialog from the project Properties instead.

  • Write the classes of the sample application.
    // WordCountDriver.java
    import java.io.IOException;
    import java.util.Date;
    import java.util.Formatter;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
    
    public class WordCountDriver {
    
        public static void main(String[] args) throws IOException,
                InterruptedException, ClassNotFoundException {
            Configuration conf = new Configuration();
            GenericOptionsParser parser = new GenericOptionsParser(conf, args);
            args = parser.getRemainingArgs();
    
            Job job = new Job(conf, "wordcount");
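            // Note: setJarByClass() is not called here; that is why the local run
            // warns "No job jar file set" (harmless in Eclipse standalone mode).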
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
    
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
    
            Formatter formatter = new Formatter();
            String outpath = "Out"
                    + formatter.format("%1$tm%1$td%1$tH%1$tM%1$tS", new Date());
            FileInputFormat.setInputPaths(job, new Path("In"));
            FileOutputFormat.setOutputPath(job, new Path(outpath));
    
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
    
            System.out.println(job.waitForCompletion(true));
        }
    }
    
    // WordCountMapper.java
    import java.io.IOException;
    import java.util.StringTokenizer;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    public class WordCountMapper extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        private Text word = new Text();
        private final static IntWritable one = new IntWritable(1);
    
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
    
    // WordCountReducer.java
    import java.io.IOException;
    
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    
    public class WordCountReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
    
  • Create $WORKSPACE/wordcount/bin/log4j.properties as follows.
    log4j.rootLogger=INFO,console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
    

    This makes Hadoop's logs appear in the Eclipse console.

  • Create some sample text data in $WORKSPACE/wordcount/In (see the sketch after this list).
  • Run the application: click Run > Run from the Eclipse menu or press Ctrl+F11.
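
For the sample input, any plain-text file placed under the In directory will do. A minimal example (the file name sample.txt is arbitrary):

$ cd $WORKSPACE/wordcount
$ mkdir In
$ echo "Hello Hadoop Hello Eclipse" > In/sample.txt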

The run has succeeded if the application writes logs like the following to the Eclipse console.

11/02/18 19:52:39 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/02/18 19:52:39 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
11/02/18 19:52:39 INFO input.FileInputFormat: Total input paths to process : 1
11/02/18 19:52:39 INFO mapred.JobClient: Running job: job_local_0001
11/02/18 19:52:39 INFO input.FileInputFormat: Total input paths to process : 1
11/02/18 19:52:39 INFO mapred.MapTask: io.sort.mb = 100
11/02/18 19:52:39 INFO mapred.MapTask: data buffer = 79691776/99614720
11/02/18 19:52:39 INFO mapred.MapTask: record buffer = 262144/327680
11/02/18 19:52:40 INFO mapred.MapTask: Starting flush of map output
11/02/18 19:52:40 INFO mapred.MapTask: Finished spill 0
11/02/18 19:52:40 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/02/18 19:52:40 INFO mapred.LocalJobRunner: 
11/02/18 19:52:40 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/02/18 19:52:40 INFO mapred.LocalJobRunner: 
11/02/18 19:52:40 INFO mapred.Merger: Merging 1 sorted segments
11/02/18 19:52:40 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 23563 bytes
11/02/18 19:52:40 INFO mapred.LocalJobRunner: 
11/02/18 19:52:40 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
11/02/18 19:52:40 INFO mapred.LocalJobRunner: 
11/02/18 19:52:40 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
11/02/18 19:52:40 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to Out0218195239
11/02/18 19:52:40 INFO mapred.LocalJobRunner: reduce > reduce
11/02/18 19:52:40 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/02/18 19:52:40 INFO mapred.JobClient:  map 100% reduce 100%
11/02/18 19:52:40 INFO mapred.JobClient: Job complete: job_local_0001
11/02/18 19:52:40 INFO mapred.JobClient: Counters: 12
11/02/18 19:52:40 INFO mapred.JobClient:   FileSystemCounters
11/02/18 19:52:40 INFO mapred.JobClient:     FILE_BYTES_READ=73091
11/02/18 19:52:40 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=110186
11/02/18 19:52:40 INFO mapred.JobClient:   Map-Reduce Framework
11/02/18 19:52:40 INFO mapred.JobClient:     Reduce input groups=956
11/02/18 19:52:40 INFO mapred.JobClient:     Combine output records=0
11/02/18 19:52:40 INFO mapred.JobClient:     Map input records=94
11/02/18 19:52:40 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/02/18 19:52:40 INFO mapred.JobClient:     Reduce output records=956
11/02/18 19:52:40 INFO mapred.JobClient:     Spilled Records=4130
11/02/18 19:52:40 INFO mapred.JobClient:     Map output bytes=19431
11/02/18 19:52:40 INFO mapred.JobClient:     Combine input records=0
11/02/18 19:52:40 INFO mapred.JobClient:     Map output records=2065
11/02/18 19:52:40 INFO mapred.JobClient:     Reduce input records=2065
true

The result is output to $WORKSPACE/wordcount/Out**********/part-r-00000 .
Click Run > Debug if you want to run it in debug mode. You can use stepping and breakpoints just like in any other Java application.

If you encounter an OutOfMemory error when running, adjust the heap size option -Xmx in eclipse.ini accordingly (thanks to Rick).

-Xmx384m

Mahout Development Environment with Maven and Eclipse (2)

Sample Codes of “Mahout in Action”

The sample code of “Mahout in Action”, a Mahout book from Manning, is published here. It includes the source code for Chapters 2 to 6.

Now we'll build it in the Eclipse environment constructed in the previous post.

First, generate a Maven project for the sample code in the Eclipse workspace directory.

$ cd C:/Users/shuyo/workspace
$ mvn archetype:create -DgroupId=mia.recommender -DartifactId=recommender

Do the following.

  • Delete the generated skeleton code src/main/App.java and copy the “Mahout in Action” sample code into src/main/java/mia/recommender/ch02 ~ ch06 of the ‘recommender’ project.
  • Convert the Maven project into an Eclipse project.
    $ cd C:/Users/shuyo/workspace/recommender
    $ mvn eclipse:eclipse
    
  • Import the project into Eclipse.
    Open File > Import > General > Existing Projects into Workspace from the Eclipse menu and select the ‘recommender’ project.
  • The ‘recommender’ project is now available in the Eclipse workspace, but all classes have errors because the Mahout libraries are not referenced yet.

    Right-click the ‘recommender’ project, select Properties > Java Build Path > Projects from the pop-up menu, click ‘Add’, and select the Mahout projects below.

    • mahout-core
    • mahout-examples
    • mahout-taste-webapp

    After that, only four errors remain.

    Since these are conflicts with the updated APIs, fixing them requires modifying the code.
    For example, open mia.recommender.ch03.IREvaluatorBooleanPrefIntro2 and press Ctrl+1 on the error line.

    The error says that the code neither catches nor declares the TasteException thrown by NearestNUserNeighborhood’s constructor, so choose whichever solution you like from the pop-up menu (see the sketch after this list). The other errors can be fixed in the same way.

    Classes that have a main() function can be run in Eclipse.
    For example, select mia.recommender.ch02.RecommenderIntro and click Run > Run in Eclipse's menu (or press Ctrl+F11 instead). It then throws an exception: ‘Exception in thread “main” java.io.FileNotFoundException: intro.csv’.
    To make it read the sample data file ‘intro.csv’ in src/mia/recommender/ch02, click Run > Run Configurations in Eclipse's menu and select the RecommenderIntro configuration created by the run above. Then set mia/recommender/ch02 as the Working directory on the Arguments tab (see the figure below): click the “Workspace…” button and select the directory.

    It then outputs a result like “RecommendedItem[item:104, value:4.257081]”.
    If you want to create your own project, repeat the steps starting from the Maven project creation.
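
For reference, a minimal sketch of the try/catch style fix for the TasteException error might look like the following (the class and variable names here are made up for illustration; the actual code in IREvaluatorBooleanPrefIntro2 differs):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class NeighborhoodFixSketch {
    // Sketch: wrap the constructor call in try/catch; alternatively, declare
    // "throws TasteException" on the enclosing method, as Eclipse also offers.
    static UserNeighborhood buildNeighborhood(UserSimilarity similarity, DataModel model) {
        try {
            return new NearestNUserNeighborhood(10, similarity, model);
        } catch (TasteException e) {
            throw new RuntimeException(e);
        }
    }
}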

Mahout Development Environment with Maven and Eclipse (1)

I’m reading the “Mahout in Action” MEAP edition, but it doesn't explain how to set up a Mahout development environment…
So I wrote down the way I did it while testing the “Mahout in Action” sample code.

Install

I worked on Windows 2008 x64.
Install the following packages:

  • Cygwin
  • Java SDK 6u23 x64
  • Eclipse 3.6(helios) SR1 x64
  • Maven 3.0.2
  • (Hadoop 0.21.0)

Hadoop is not used in this article.

Maven

I am not good at Maven… so I read the following documents.

Maven 3 has “Maven 2 Repository”! 😛

Source of Mahout

Use the source distribution of Mahout, not the binary one, because we will reference the sources in Eclipse.

I used Mahout 0.4, but 0.5-SNAPSHOT may be better since Mahout's API is still in flux.

First, start Eclipse and create a workspace; here we use “C:\Users\shuyo\workspace”.
Extract the Mahout source under the workspace; here it is “C:\Users\shuyo\workspace\mahout-distribution-0.4”.
Convert Mahout's Maven project into an Eclipse project with the commands below.

cd C:\Users\shuyo\workspace\mahout-distribution-0.4
mvn eclipse:eclipse

Next, set Eclipse's classpath variable M2_REPO to the Maven 2 local repository.

mvn -Declipse.workspace=<path-to-eclipse-workspace> eclipse:add-maven-repo

However, “Maven – Guide to using Eclipse with Maven 2.x” says “Issue: The command does not work”, so set it in Eclipse directly.

  • Open Window > Preferences > Java > Build Path > Classpath Variables from Eclipse's menu.
  • Press “New” and add a variable with Name “M2_REPO” and Path set to the Maven 2 repository path (the default is .m2/repository in your user directory).

If M2_REPO is not set, errors like the following are reported.

The project cannot be built until build path errors are resolved
Unbound classpath variable: 'M2_REPO/junit/junit/3.8.1/junit-3.8.1.jar' in project '********'
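
(For reference, this happens because mvn eclipse:eclipse generates .classpath entries that reference the variable, along the lines of the following; the junit path is taken from the error above.)

<classpathentry kind="var" path="M2_REPO/junit/junit/3.8.1/junit-3.8.1.jar"/>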

Finally, import the converted Mahout Eclipse projects.

  • Open File > Import > General > Existing Projects into Workspace from the Eclipse menu.
  • Select the project directory C:\Users\shuyo\workspace\mahout-distribution-0.4 and select all projects.

Continued in the next post.

Updated LangDetect Library (4x faster)

I’ve updated LangDetect (Language Detection Library for Java).

This update makes detection 4x faster, based on improved code posted by Elmer Garduno. Many thanks!!


Table: time for 100 detections per test data set (ms). Left: the previous version, right: the updated version.

lang prev updated
ar 937 812
ar 203 79
en 610 31
en 1656 93
fa 156 31
fa 219 62
fa 328 172
fa 344 203
fa 343 156
fa 266 94
fa 406 188
fa 375 156
fa 421 250
fa 547 390
fa 390 234
fa 390 203
fa 187 47
gu 78 31
gu 47 15
it 563 78
it 547 78
it 578 78
it 563 78
ja 110 15
ja 109 15
ja 125 31
zh-cn 78 16
hi 391 16
mr 719 47
ne 640 47
hi 1000 78
mr 296 15
ne 906 62
mr 203 16
mr 187 16
mr 266 47
mr 156 16
nl 1454 94
ru 406 93
tl 391 141
tr 1016 47
zh-tw 203 31
ur 235 141
ur 250 125
ur 265 172
sum 19560 4840

How to develop Apache Nutch’s plugin (5) Sample Code (Language Detection Plugin)

Now, as sample code for a Nutch plugin, let's look at a language detection plugin that uses our LangDetect library.

Of the three extensions that Apache Nutch's language identification plugin provides, we will replace only the IndexingFilter extension (see the previous post).
So we need a way to keep using the other two extensions unchanged.

plugin.xml Sample

The following plugin.xml replaces only the IndexingFilter extension.

<plugin
   id="language-detector"
   name="Language Detection Parser/Filter"
   version="1.0.0"
   provider-name="labs.cybozu.co.jp">

    <runtime>
      <library name="language-identifier.jar">
         <export name="*"/>
      </library>

      <library name="langdetect-nutch.jar"> 
         <export name="*"/>
      </library>                             <!-- ADD -->
      <library name="jsonic-1.2.0.jar" />    <!-- ADD -->
      <library name="langdetect.jar" />      <!-- ADD -->
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.analysis.lang.LanguageParser"
              name="Nutch language Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="LanguageParser"
                      class="org.apache.nutch.analysis.lang.HTMLLanguageParser"/>
   </extension>

   <extension id="com.cybozu.labs.nutch.plugin.LanguageDetectionFilter"
              name="language detection filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="LanguageDetectionFilter"
                      class="com.cybozu.labs.nutch.plugin.LanguageDetectionFilter"/>
   </extension>    <!-- MODIFY -->

   <extension id="org.apache.nutch.analysis.lang.LanguageQueryFilter"
              name="Nutch Language Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="LanguageQueryFilter"
                      class="org.apache.nutch.analysis.lang.LanguageQueryFilter">
        <parameter name="raw-fields" value="lang"/>
      </implementation>
   </extension>

</plugin>

This plugin.xml is based on the one from Nutch's Language Identification plugin.

The changes are:
  • Add the LangDetect library, JSONIC and our plugin to the libraries
  • Replace the IndexingFilter extension

If a library contains the implementation of your extensions, specify <export name="*"/> in its <library> element; this is not necessary for other libraries that are merely used by the extensions.
Each plugin has its own class loader, so extensions with the same class name can probably coexist without conflict.

Code Sample (setConf method)

We need to implement two things for the IndexingFilter extension:

  • the extension initialization in setConf()
  • the language detection procedure in filter()

First, here is a sample implementation of setConf().

    private static final int TEXTSIZE_UPPER_LIMIT_DEFAULT = 10000;
    private Configuration conf = null;
    private LangDetectException cause = null;
    private int textsize_upper_limit;

    /** Initialization using parameters specified in nutch-site.xml */
    public void setConf(Configuration conf) {
        if (this.conf == null) {    /* only once initialization */
            try {
                // Initialization of Language Detection module with profile directory 
                DetectorFactory.loadProfile(conf.get("langdetect.profile.dir"));

                // text size upper limit (cut down size over)
                textsize_upper_limit = conf.getInt("langdetect.textsize", TEXTSIZE_UPPER_LIMIT_DEFAULT);
            } catch (LangDetectException e) {
                // throw exception when filter() is called
                cause = e;
            }
        }
        this.conf = conf;    /* reserve for getConf() */
    }

The Language Detection library needs some initialization parameters, such as the language profile directory path. In Apache Nutch's plugin framework, those parameters can be written in nutch-(default|site).xml, and extensions obtain them via the Configuration object passed to setConf().
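
For example, the two parameters read above might be set in nutch-site.xml like this (the property names match the sample code; the values are just placeholders):

<!-- nutch-site.xml sketch: parameters read in setConf() above -->
<configuration>
  <property>
    <name>langdetect.profile.dir</name>
    <value>/path/to/language/profiles</value>  <!-- placeholder path -->
  </property>
  <property>
    <name>langdetect.textsize</name>
    <value>10000</value>  <!-- upper limit of text length used for detection -->
  </property>
</configuration>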

Another point of the sample above is that it runs the initialization only once.
Normally each map task that invokes an extension runs in its own process, so this constraint would not be needed.
But in Hadoop standalone mode a single process runs multiple map tasks, so if you don't want the initialization to run several times in one process, you need some measure like this.
Strictly speaking, the code above is not correct as concurrent programming, but I consider it good enough for testing; a slightly safer variant is sketched below.
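
If you want to be stricter, a minimal sketch (assuming the same fields as the sample above) is simply to synchronize the initialization:

    /** Sketch of a more thread-safe variant: same logic as above, with the initialization serialized. */
    public synchronized void setConf(Configuration conf) {
        if (this.conf == null) {    /* still only-once initialization */
            try {
                DetectorFactory.loadProfile(conf.get("langdetect.profile.dir"));
                textsize_upper_limit = conf.getInt("langdetect.textsize", TEXTSIZE_UPPER_LIMIT_DEFAULT);
            } catch (LangDetectException e) {
                cause = e;          /* rethrown later when filter() is called */
            }
        }
        this.conf = conf;           /* kept for getConf() */
    }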

Code Sample (filter method)

The filter() method implements the language detection procedure.

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {

        // if meta-tag has language, use it
        String lang = parse.getData().getParseMeta().get(Metadata.LANGUAGE);
        if (lang == null) {

            // estimate language from statistics using LangDetect library
            StringBuilder text = new StringBuilder();
            text.append(parse.getData().getTitle()).append(" ").append(parse.getText());
            try {
                Detector detector = DetectorFactory.create();
                detector.append(text.toString());
                lang = detector.detect();
            } catch (LangDetectException e) {
                throw new IndexingException("Detection failed.", e);
            }
        }
        if (lang == null) lang = "unknown";

        // add language meta-data into document index
        doc.add("lang", lang);
        return doc;
    }

This implementation is straightforward.

The filter method takes several parameters.
The NutchDocument parameter holds the document body and its index data.
The Parse parameter holds intermediate data extracted by, for example, the HTMLLanguageParser extension.

The standard language identification plugin uses the language property from the document's HTTP headers when it is specified, but in my experience the language information in HTTP headers is often wrong, so our LangDetect plugin doesn't use it.

How to develop Apache Nutch’s plugin (4) IndexingFilter extension-point

In the previous post, I showed that Nutch's language identification plugin has three extensions, on HtmlParseFilter, IndexingFilter and QueryFilter; in particular, the IndexingFilter extension performs the language identification itself.
So let's look at how to develop a plugin extension on the IndexingFilter extension-point. Other extension-points can be developed in the same way.

What’s IndexingFilter

The IndexingFilter extension-point is an interface for adding meta-data to the search index.
It is invoked by the indexer map task on Hadoop, and the other extensions are likewise invoked by various map tasks (injector, generator, fetcher, parser and so on).
So debugging a Nutch extension is essentially debugging a Hadoop map task.

IndexingFilter is a Java interface, org.apache.nutch.indexer.IndexingFilter, which has two methods.

void addIndexBackendOptions(Configuration conf);
Defines the fields and their constraints that are added to the search index. Use LuceneWriter.addFieldOptions().
NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks);
Extracts meta-data from the NutchDocument object and adds it to the search index.

If the extension-point extends org.apache.hadoop.conf.Configurable (most do), it has two additional methods.

void setConf(Configuration conf);
Preserves the org.apache.hadoop.conf.Configuration object and initializes the extension. The Configuration object provides access to the parameters in nutch-(default|site).xml .
Configuration getConf();
Returns the preserved Configuration object.

Since there is no documentation for these methods (…), read the sources of the standard Nutch plugins that use the target extension-points.

Skeleton of IndexingFilter

For Nutch plugin development, we need the following libraries from the apache-nutch-1.2-bin.tar.gz archive.

  • nutch-1.2.jar
  • lib/hadoop-0.20.2-core.jar

A skeleton of an IndexingFilter extension looks like the following.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Parse;

public class SampleFilter implements IndexingFilter {
	private Configuration conf = null;

	public SampleFilter() {
		// Constructor with no arguments
	}

	public void setConf(Configuration conf) {
		this.conf = conf;
		// TODO: initialization of this extension
		//       with conf.getFloat([PARAMETER NAME], [DEFAULT VALUE]) or similar methods
	}

	public Configuration getConf() {
		return this.conf;
	}

	// The above is common for most extension-points.

	// The below is for IndexingFilter extension-point.

	public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
			CrawlDatum datum, Inlinks inlinks) throws IndexingException {
		// TODO: Main procedure of this extension
		return doc;    // return the (possibly modified) document
	}

	public void addIndexBackendOptions(Configuration conf) {
		// Definition of meta-data fields to add

		// Example: 
		LuceneWriter.addFieldOptions("lang", LuceneWriter.STORE.YES,
				LuceneWriter.INDEX.UNTOKENIZED, conf);
	}
}

How to develop Apache Nutch’s plugin (3) Research of an example

As a first step toward plugin development, let's look at the structure of an existing plugin.
I want to develop a language detection plugin, so our research target is the ‘language-identifier’ plugin among Nutch's standard plugins.

The ‘language-identifier’ plugin has three extensions, HTMLLanguageParser, LanguageIdentifier and LanguageQueryFilter.

HTMLLanguageParser

This extension extracts various information from the <meta> tags of HTML documents and from HTTP headers.
It is implemented at the HtmlParseFilter extension-point.

LanguageIdentifier

This extension detects the language of a resource from its meta-data or body text and inserts the detected language into Nutch's index.
When the body text is used, the language is estimated with a statistical method.
It is implemented at the IndexingFilter extension-point.

LanguageQueryFilter

This extension adds a language criterion to search queries.
It is implemented at the QueryFilter extension-point.

plugin.xml

The plugin.xml of the ‘language-identifier’ plugin is as follows.

<plugin
   id="language-identifier"
   name="Language Identification Parser/Filter"
   version="1.0.0"
   provider-name="nutch.org">

    <runtime>
      <library name="language-identifier.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.analysis.lang.LanguageParser"
              name="Nutch language Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="LanguageParser"
                      class="org.apache.nutch.analysis.lang.HTMLLanguageParser"/>
   </extension>

   <extension id="org.apache.nutch.analysis.lang"
              name="Nutch language identifier filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="LanguageIdentifier"
                      class="org.apache.nutch.analysis.lang.LanguageIndexingFilter"/>
   </extension>

   <extension id="org.apache.nutch.analysis.lang.LanguageQueryFilter"
              name="Nutch Language Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="LanguageQueryFilter"
                      class="org.apache.nutch.analysis.lang.LanguageQueryFilter">
        <parameter name="raw-fields" value="lang"/>
      </implementation>
   </extension>

</plugin>

The extension@point attribute makes it clear which extension-point each extension belongs to.
The implementation@class attribute is the implementation class name of the extension. The sources of Nutch's standard plugins are published, so we can read them as sample code.

Apache Nutch’s Architecture

Apache Nutch
http://nutch.apache.org/

Apache Nutch is a web search engine which consists of Lucene, Solr, a web crawler, page scoring (PageRank) and a pluggable distributed system.
Nutch's crawler has a language identification plugin.

I want to replace Nutch's LanguageIdentifier with our Language Detection library, but I'm afraid Apache Nutch's documentation is quite poor, so I researched the structure of Nutch myself.

I referred to the Nutch wiki and the following presentations.

Storages

Nutch has three storages (crawl db, link db and segments) plus index/indexes.
They can be put on a distributed file system with Hadoop HDFS, but I leave that aside here.

crawl db
crawl scheduling information
link db
link graph information for page scoring
segments
the resources themselves (web page texts, snippets, intermediate resources)
index
the search index
indexes
per-segment index differences for incremental updates

The nutch_readdb and nutch_segread commands can be used to inspect the link db and the segments.

Luke, Lucene's index administration tool, can be used to inspect the index.

How the Nutch Crawler Works

The nutch_crawl command automatically injects seed URLs, crawls web pages, parses content and generates the index.

Based on the actual behavior of nutch_crawl, Nutch's indexing work proceeds as follows (a command-line sketch follows this list).

  1. inject : add seed URLs to the webdb (a table in the crawl db?) and the link db (as vertexes?)
  2. generate : generate a fetchlist (an intermediate resource in segments?) from the webdb
  3. fetch : get contents (according to the fetchlist?) and store them into segments
  4. parse : parse the contents in segments (Apache Tika, a parser for each MIME type)
  5. update crawldb : add linked URLs extracted by parsing to the webdb?
  6. go back to 2 if the webdb is not empty
  7. update linkdb : update the link db with the outputs of the fetcher (and parser?)
  8. index : generate indexes from each segment
  9. Dedup : delete duplicated documents (example: http://www.cnn.com and cnn.com)
  10. IndexMerger : merge the updated indexes into the index
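
For reference, a one-shot run of this whole pipeline looks roughly like the following (the seed directory name and the numbers are only illustrative):

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50    # 'urls' contains seed URL list files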