Now, as sample code for a Nutch plugin, let's look at a Language Detection plugin built on our LangDetect library.
Of the 3 extensions in Apache Nutch’s Language Identification plugin, we will replace only the IndexingFilter extension (see the previous post).
So we also need a way to keep using the other 2 extensions unchanged.
plugin.xml Sample
The following plugin.xml replaces only the IndexingFilter extension.
<plugin
   id="language-detector"
   name="Language Detection Parser/Filter"
   version="1.0.0"
   provider-name="labs.cybozu.co.jp">
   <runtime>
      <library name="language-identifier.jar">
         <export name="*"/>
      </library>
      <library name="langdetect-nutch.jar">
         <export name="*"/>
      </library> <!-- ADD -->
      <library name="jsonic-1.2.0.jar" /> <!-- ADD -->
      <library name="langdetect.jar" /> <!-- ADD -->
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="org.apache.nutch.analysis.lang.LanguageParser"
              name="Nutch language Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="LanguageParser"
                      class="org.apache.nutch.analysis.lang.HTMLLanguageParser"/>
   </extension>
   <extension id="com.cybozu.labs.nutch.plugin.LanguageDetectionFilter"
              name="language detection filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="LanguageDetectionFilter"
                      class="com.cybozu.labs.nutch.plugin.LanguageDetectionFilter"/>
   </extension> <!-- MODIFY -->
   <extension id="org.apache.nutch.analysis.lang.LanguageQueryFilter"
              name="Nutch Language Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="LanguageQueryFilter"
                      class="org.apache.nutch.analysis.lang.LanguageQueryFilter">
         <parameter name="raw-fields" value="lang"/>
      </implementation>
   </extension>
</plugin>
This plugin.xml is based on the one in Nutch’s Language Identification plugin.
Changes:
- Add the LangDetect library, JSONIC, and our plugin JAR to the libraries
- Replace the IndexingFilter extension
If a library contains the implementation of your extensions, specify <export name="*"/> inside its <library> element. This is not necessary for other libraries that the extensions merely use.
Each plugin has its own class loader, so extensions with the same class name can probably be used without conflict.
Code Sample (setConf method)
We need to implement two methods for the IndexingFilter extension.
- Extension initialization in setConf()
- The language detection procedure in filter()
First, here is a sample implementation of setConf().
private static final int TEXTSIZE_UPPER_LIMIT_DEFAULT = 10000;
private Configuration conf = null;
private LangDetectException cause = null;
private int textsize_upper_limit;

/** Initialization using parameters specified in nutch-site.xml */
public void setConf(Configuration conf) {
    if (this.conf == null) { /* initialize only once */
        try {
            // Initialization of Language Detection module with profile directory
            DetectorFactory.loadProfile(conf.get("langdetect.profile.dir"));
            // text size upper limit (text beyond this size is cut off)
            textsize_upper_limit = conf.getInt("langdetect.textsize", TEXTSIZE_UPPER_LIMIT_DEFAULT);
        } catch (LangDetectException e) {
            // throw exception when filter() is called
            cause = e;
        }
    }
    this.conf = conf; /* keep for getConf() */
}
The Language Detection library needs some initial parameters, such as the path to the language profile directory. In Apache Nutch’s plugin framework, those parameters can be written in nutch-(default|site).xml. Extensions obtain them via the Configuration object passed to setConf().
Another point of the above sample is that it runs the initialization only once.
Usually each Map Task that invokes the extensions runs in its own process, so nothing goes wrong even without this guard.
But in Hadoop standalone mode a single process runs multiple Map Tasks. So if you do not want the initialization to run more than once in a single process, you need to take some measure like this.
Strictly speaking, however, the above code is not correct as concurrent programming; I intend it for testing only.
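Because the once-only check above reads an instance field without any synchronization, one possible way to make it safe is to guard a static flag with a lock. The following is only a minimal sketch under assumptions, not the plugin's actual code: ProfileInit, loadProfiles(), LOAD_COUNT and runConcurrently() are hypothetical names, and loadProfiles() stands in for DetectorFactory.loadProfile() so that the example runs without Nutch or LangDetect on the classpath.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: run an expensive initialization at most once per
// class loader, even when setConf() is called from several threads.
class ProfileInit {
    private static final Object LOCK = new Object();
    private static boolean initialized = false; // guarded by LOCK
    private static Exception cause = null;      // guarded by LOCK

    // Counts real loads; exists only so the sketch can be checked.
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    // Stand-in (assumption) for DetectorFactory.loadProfile(profileDir).
    private static void loadProfiles(String profileDir) throws Exception {
        if (profileDir == null) {
            throw new IllegalArgumentException("profile directory is not set");
        }
        LOAD_COUNT.incrementAndGet();
    }

    // Safe to call from setConf() any number of times.
    static void initOnce(String profileDir) {
        synchronized (LOCK) {
            if (initialized) {
                return;
            }
            initialized = true;
            try {
                loadProfiles(profileDir);
            } catch (Exception e) {
                cause = e; // keep the failure so filter() can report it later
            }
        }
    }

    // Failure recorded during initialization, or null if there was none.
    static Exception getCause() {
        synchronized (LOCK) {
            return cause;
        }
    }

    // Calls initOnce() from several threads and returns the number of loads.
    static int runConcurrently(int threadCount, String profileDir) {
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> initOnce(profileDir));
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return LOAD_COUNT.get();
    }

    public static void main(String[] args) {
        System.out.println("loads: " + runConcurrently(8, "profiles"));
    }
}
```

Note that static state is shared only within one class loader. Since each Nutch plugin has its own class loader, this guard covers one plugin in one process, which is the case discussed above.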
Code Sample (filter method)
The filter() method implements the language detection procedure.
public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // report an initialization failure recorded in setConf()
    if (cause != null) {
        throw new IndexingException("Initialization failed.", cause);
    }
    // if meta-tag has language, use it
    String lang = parse.getData().getParseMeta().get(Metadata.LANGUAGE);
    if (lang == null) {
        // estimate language from statistics using LangDetect library
        StringBuilder text = new StringBuilder();
        text.append(parse.getData().getTitle()).append(" ").append(parse.getText());
        try {
            Detector detector = DetectorFactory.create();
            detector.append(text.toString());
            lang = detector.detect();
        } catch (LangDetectException e) {
            throw new IndexingException("Detection failed.", e);
        }
    }
    if (lang == null) lang = "unknown";
    // add language meta-data into document index
    doc.add("lang", lang);
    return doc;
}
This implementation is straightforward.
The filter() method takes several parameters.
The NutchDocument parameter holds the document body and its index data.
The Parse parameter holds intermediate data, for example data extracted by the HTMLLanguageParser extension.
The standard language identification plugin takes the Language property from the document’s HTTP header if it is specified. But in my experience, the language information in HTTP headers is quite often wrong. So our LangDetect plugin does not use it.
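One more note on the samples: setConf() reads langdetect.textsize, but the filter() sample shown here does not apply it. A minimal sketch of one way to cut the text down before passing it to the detector could look like this. TextLimiter and truncate() are hypothetical names, not part of the plugin or of the LangDetect library. The sketch also avoids cutting a surrogate pair in half, which a plain substring() would not.

```java
// Hypothetical helper: limit the amount of text handed to the detector.
class TextLimiter {
    // Returns at most `limit` UTF-16 chars of `text`,
    // never ending on a lone high surrogate.
    static String truncate(CharSequence text, int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        if (text.length() <= limit) {
            return text.toString();
        }
        int end = limit;
        // If the cut would separate a surrogate pair, drop the first half too.
        if (end > 0 && Character.isHighSurrogate(text.charAt(end - 1))) {
            end--;
        }
        return text.subSequence(0, end).toString();
    }

    public static void main(String[] args) {
        // Same shape as in filter(): title, a space, then the body text.
        StringBuilder text = new StringBuilder();
        text.append("title").append(" ").append("body text of the page");
        System.out.println(truncate(text, 10));
    }
}
```

In filter(), such a helper would be called with textsize_upper_limit on the combined title and body text just before detector.append().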