How to develop Apache Nutch’s plugin (5) Sample Code (Language Detection Plugin)

As a sample Nutch plugin, let’s look at a Language Detection plugin built on our LangDetect library.

Of the 3 extensions in Apache Nutch’s Language Identification plugin, we will replace only the IndexingFilter extension (see the previous post).
So we need a way to keep using the other 2 extensions unchanged.

plugin.xml Sample

The following plugin.xml replaces only the IndexingFilter extension.

<plugin
   id="language-detector"
   name="Language Detection Parser/Filter"
   version="1.0.0"
   provider-name="labs.cybozu.co.jp">

   <runtime>
      <library name="language-identifier.jar">
         <export name="*"/>
      </library>

      <library name="langdetect-nutch.jar">
         <export name="*"/>
      </library>                             <!-- ADD -->
      <library name="jsonic-1.2.0.jar" />    <!-- ADD -->
      <library name="langdetect.jar" />      <!-- ADD -->
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.analysis.lang.LanguageParser"
              name="Nutch language Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="LanguageParser"
                      class="org.apache.nutch.analysis.lang.HTMLLanguageParser"/>
   </extension>

   <extension id="com.cybozu.labs.nutch.plugin.LanguageDetectionFilter"
              name="language detection filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="LanguageDetectionFilter"
                      class="com.cybozu.labs.nutch.plugin.LanguageDetectionFilter"/>
   </extension>    <!-- MODIFY -->

   <extension id="org.apache.nutch.analysis.lang.LanguageQueryFilter"
              name="Nutch Language Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="LanguageQueryFilter"
                      class="org.apache.nutch.analysis.lang.LanguageQueryFilter">
        <parameter name="raw-fields" value="lang"/>
      </implementation>
   </extension>

</plugin>

This plugin.xml is based on the one in Nutch’s Language Identification plugin.

Changed points:
  • Add the LangDetect library, JSONIC, and our plugin to the libraries
  • Replace the IndexingFilter extension

If a library contains the implementation of your extensions, specify <export name="*"/> in its <library> element. This is not necessary for other libraries that the extensions merely use.
Each plugin has its own class loader, so extensions with the same class name in different plugins can probably be used without conflict.

Code Sample (setConf method)

We need to implement two methods for the IndexingFilter extension.

  • Extension initialization in setConf()
  • The language detection procedure in filter()

First, here is a sample implementation of setConf().

    private static final int TEXTSIZE_UPPER_LIMIT_DEFAULT = 10000;
    private Configuration conf = null;
    private LangDetectException cause = null;
    private int textsize_upper_limit;

    /** Initialization using parameters specified in nutch-site.xml */
    public void setConf(Configuration conf) {
        if (this.conf == null) {    /* initialize only once */
            try {
                // initialize the Language Detection module with the profile directory
                DetectorFactory.loadProfile(conf.get("langdetect.profile.dir"));

                // upper limit of the text size (longer text is cut down)
                textsize_upper_limit = conf.getInt("langdetect.textsize", TEXTSIZE_UPPER_LIMIT_DEFAULT);
            } catch (LangDetectException e) {
                // keep the exception and throw it when filter() is called
                cause = e;
            }
        }
        this.conf = conf;    /* keep for getConf() */
    }

The Language Detection library needs some initial parameters, such as the path of the language profile directory. In Apache Nutch’s plugin framework, such parameters can be written in nutch-(default|site).xml. Extensions obtain them through the Configuration object passed to setConf().
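
The parameter lookup described above can be sketched in plain Java. This is only an illustration under assumptions: java.util.Properties stands in for Hadoop’s Configuration so that the example runs without Nutch on the classpath, and the LangDetectSettings class and its fail-fast check for a missing profile directory are hypothetical, not part of the plugin. Only the two property names and the default of 10000 come from the sample.

```java
import java.util.Properties;

/** Hypothetical helper: reads the two plugin settings from a key/value store. */
final class LangDetectSettings {
    static final int TEXTSIZE_UPPER_LIMIT_DEFAULT = 10000;

    final String profileDir;
    final int textSizeLimit;

    private LangDetectSettings(String profileDir, int textSizeLimit) {
        this.profileDir = profileDir;
        this.textSizeLimit = textSizeLimit;
    }

    /** java.util.Properties is a stand-in for Hadoop's Configuration here. */
    static LangDetectSettings from(Properties props) {
        String dir = props.getProperty("langdetect.profile.dir");
        if (dir == null || dir.trim().isEmpty()) {
            // Fail early with a readable message instead of failing later in the library.
            throw new IllegalStateException("langdetect.profile.dir is not set");
        }
        // Fall back to the default when langdetect.textsize is absent.
        String raw = props.getProperty("langdetect.textsize");
        int limit = (raw == null) ? TEXTSIZE_UPPER_LIMIT_DEFAULT : Integer.parseInt(raw.trim());
        return new LangDetectSettings(dir.trim(), limit);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("langdetect.profile.dir", "/path/to/profiles"); // placeholder path
        LangDetectSettings settings = LangDetectSettings.from(props);
        System.out.println(settings.profileDir + " " + settings.textSizeLimit);
    }
}
```

In the real plugin the same two values come from conf.get() and conf.getInt(), as in the setConf() sample.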

Another point in the sample above is that it runs the initialization only once.
Usually each Map Task that invokes the extensions runs in its own process, so there would be no trouble even without this guard.
But in Hadoop standalone mode a single process runs multiple Map Tasks, so some measure is needed to avoid initializing more than once in the same process.
Strictly speaking, the code above is not correct as concurrent code, but I consider it sufficient for testing.
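
For reference, one possible way to make a once-only initialization safe under concurrent calls is sketched below. It is an assumption-laden illustration, not the plugin’s code: the OnceGuard class is hypothetical, a Runnable stands in for the call to DetectorFactory.loadProfile(), and unchecked exceptions stand in for LangDetectException. A plugin using this idea would presumably keep one guard in a static final field so that it is shared by all extension instances in the process.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: runs an expensive initialization at most once. */
final class OnceGuard {
    private boolean initialized = false;     // guarded by "this"
    private RuntimeException failure = null; // guarded by "this"

    /** Runs loader on the first call only; later calls see the same outcome. */
    synchronized void ensureInitialized(Runnable loader) {
        if (!initialized) {
            initialized = true;
            try {
                loader.run(); // stand-in for DetectorFactory.loadProfile(...)
            } catch (RuntimeException e) {
                failure = e;  // remember it so that every caller can report it
            }
        }
        if (failure != null) {
            throw failure;
        }
    }

    /** Calls one guard from several threads and returns how often the loader ran. */
    static int demo(int threads) {
        OnceGuard guard = new OnceGuard();
        AtomicInteger runs = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> guard.ensureInitialized(runs::incrementAndGet));
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("loader ran " + demo(8) + " time(s)");
    }
}
```

The synchronized method makes the check and the initialization a single atomic step, which is what the plain null check on this.conf in the sample does not guarantee.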

Code Sample (filter method)

The filter() method implements the language detection procedure.

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {

        // report an initialization failure recorded in setConf()
        if (cause != null) {
            throw new IndexingException("Initialization failed.", cause);
        }

        // if meta-tag has language, use it
        String lang = parse.getData().getParseMeta().get(Metadata.LANGUAGE);
        if (lang == null) {

            // estimate language from statistics using LangDetect library
            StringBuilder text = new StringBuilder();
            text.append(parse.getData().getTitle()).append(" ").append(parse.getText());
            try {
                Detector detector = DetectorFactory.create();
                // use at most langdetect.textsize characters of the text
                detector.setMaxTextLength(textsize_upper_limit);
                detector.append(text.toString());
                lang = detector.detect();
            } catch (LangDetectException e) {
                throw new IndexingException("Detection failed.", e);
            }
        }
        if (lang == null) lang = "unknown";

        // add language meta-data into document index
        doc.add("lang", lang);
        return doc;
    }

This implementation is straightforward.

The filter() method takes several parameters.
The NutchDocument parameter holds the document body and its index data.
The Parse parameter holds intermediate data extracted earlier, for example by the HTMLLanguageParser extension.
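
The text preparation step in filter() can also be looked at in isolation. The sketch below is a hypothetical helper (DetectionText is not part of the plugin) that joins the title and the body before detection. It assumes that either value could be null, because StringBuilder.append() would otherwise add the literal text "null", and it cuts the result to an upper limit by hand, independently of any limit the detector applies itself.

```java
/** Hypothetical helper: builds the text that is handed to the detector. */
final class DetectionText {
    /** Joins title and body, skipping nulls, and keeps at most limit characters. */
    static String build(String title, String body, int limit) {
        StringBuilder text = new StringBuilder();
        if (title != null) {
            text.append(title);
        }
        if (body != null) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(body);
        }
        if (limit >= 0 && text.length() > limit) {
            text.setLength(limit); // cut off everything beyond the upper limit
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("Title", "body text", 10000));
        System.out.println(build(null, "body text", 4));
    }
}
```

With the values from the samples, the call would look like build(parse.getData().getTitle(), parse.getText(), textsize_upper_limit).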

The standard language identification plugin takes the language from the document’s HTTP header if it is specified. But in my experience, the language information in HTTP headers is quite often wrong, so our LangDetect plugin does not use it.
