How to develop an Apache Nutch plugin (3): Studying an example

Before developing a plugin, let us study the structure of a sample plugin.
I want to develop a language detection plugin, so the target of this study is the ‘language-identifier’ plugin, one of Nutch’s standard plugins.

The ‘language-identifier’ plugin has three extensions: HTMLLanguageParser, LanguageIdentifier and LanguageQueryFilter.


HTMLLanguageParser is a metadata extraction extension which extracts various information from the <meta> tags of HTML and from HTTP headers.
It is implemented at the HtmlParseFilter extension-point.
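
To make the idea concrete, here is a minimal sketch in plain Java of picking a language out of a <meta> tag or an HTTP header. It does not use the Nutch API at all; the class name MetaLanguageSketch, its methods, and the rule of checking the <meta> tag before the header are my own assumptions for illustration, not the plugin's actual code.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class MetaLanguageSketch {
    // Matches <meta http-equiv="Content-Language" content="...">.
    // Assumption: attributes appear in this order; a real parser would walk the DOM.
    private static final Pattern META_LANG = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*[\"']content-language[\"']\\s+content\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    /** Returns a two-letter style language code, or null when nothing is declared. */
    static String detect(String html, String contentLanguageHeader) {
        Matcher m = META_LANG.matcher(html);
        if (m.find()) {
            return normalize(m.group(1));
        }
        if (contentLanguageHeader != null && !contentLanguageHeader.trim().isEmpty()) {
            return normalize(contentLanguageHeader);
        }
        return null;
    }

    /** Keeps only the primary subtag of the first listed language, e.g. "en-US, fr" becomes "en". */
    static String normalize(String value) {
        String first = value.split(",")[0].trim();
        int dash = first.indexOf('-');
        if (dash > 0) {
            first = first.substring(0, dash);
        }
        return first.toLowerCase(Locale.ROOT);
    }
}
```

A call such as MetaLanguageSketch.detect(html, headerValue) returns null when neither source declares a language, which is the case where estimating from the body text becomes necessary.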


LanguageIdentifier is a language insertion extension which detects the language of a resource from its metadata or its body text and inserts the detected language into Nutch’s index.
When the body text is used, the language is estimated with a statistical method.
It is implemented at the IndexingFilter extension-point.
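
One common statistical approach to this kind of estimation is to compare character n-gram frequencies of the text against per-language profiles. The sketch below shows that general idea with trigrams and cosine similarity. It is only an assumption about what such a method can look like; the class NGramLanguageSketch and the tiny training samples are hypothetical, and the plugin's real algorithm should be checked in its source.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration, not the plugin's implementation.
class NGramLanguageSketch {
    // language code -> trigram counts learned from a sample text
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    /** Counts overlapping character trigrams after lowercasing and collapsing whitespace. */
    static Map<String, Integer> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    void train(String lang, String sample) {
        profiles.put(lang, trigrams(sample));
    }

    /** Cosine similarity between two trigram count vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the best-matching language code, or null when nothing matches at all. */
    String identify(String text) {
        Map<String, Integer> target = trigrams(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            double score = cosine(target, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return best;
    }
}
```

With profiles this small the estimate is only reliable on text that resembles the samples; a practical identifier needs much larger training data per language.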


LanguageQueryFilter is a search criteria extension which adds a language condition to a query.
It is implemented at the QueryFilter extension-point.
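
As a rough picture of what a language condition in a query means, the following sketch separates a lang:xx clause from the ordinary search terms. The query syntax, the class LangQuerySketch and the ParsedQuery type are assumptions made for this example; the real extension works on Nutch's own query classes, which are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration, not the Nutch QueryFilter API.
class LangQuerySketch {
    /** A raw query split into an optional language constraint and free-text terms. */
    static final class ParsedQuery {
        final String lang;        // null when the query has no lang: clause
        final List<String> terms; // remaining terms in their original order

        ParsedQuery(String lang, List<String> terms) {
            this.lang = lang;
            this.terms = terms;
        }
    }

    static ParsedQuery parse(String rawQuery) {
        String lang = null;
        List<String> terms = new ArrayList<>();
        for (String token : rawQuery.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for an empty query
            }
            if (token.startsWith("lang:") && token.length() > "lang:".length()) {
                // The value is used as-is (only lowercased), with no further analysis.
                lang = token.substring("lang:".length()).toLowerCase(Locale.ROOT);
            } else {
                terms.add(token);
            }
        }
        return new ParsedQuery(lang, terms);
    }
}
```

For example, parsing "apache nutch lang:en" yields the terms apache and nutch plus the constraint en, which a searcher could then apply to the language field stored in the index.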


The plugin.xml of the ‘language-identifier’ plugin is as follows.

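<plugin
   id="language-identifier"
   name="Language Identification Parser/Filter">

   <runtime>
      <library name="language-identifier.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.analysis.lang.LanguageParser"
              name="Nutch language Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="LanguageParser"
                      class="org.apache.nutch.analysis.lang.HTMLLanguageParser"/>
   </extension>

   <extension id="org.apache.nutch.analysis.lang"
              name="Nutch language identifier filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="LanguageIdentifier"
                      class="org.apache.nutch.analysis.lang.LanguageIndexingFilter"/>
   </extension>

   <extension id="org.apache.nutch.analysis.lang.LanguageQueryFilter"
              name="Nutch Language Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="LanguageQueryFilter"
                      class="org.apache.nutch.analysis.lang.LanguageQueryFilter">
        <parameter name="raw-fields" value="lang"/>
      </implementation>
   </extension>

</plugin>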


The extension@point attribute makes it clear which extension-point each extension belongs to.
The implementation@class attribute is the implementation class name of the extension. The sources of Nutch’s standard plugins are published, so we can read them as sample code.
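
To check this mapping for any plugin, a short program can list each extension's point attribute together with its implementation class. The sketch below uses only the standard Java DOM parser; the class PluginXmlSketch is hypothetical, and the descriptor in main uses made-up org.example names rather than real Nutch identifiers.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for inspecting a plugin descriptor; not part of Nutch.
class PluginXmlSketch {
    /** Maps each extension@point value to its implementation@class value. */
    static Map<String, String> extensionClasses(String pluginXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(pluginXml)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList extensions = doc.getElementsByTagName("extension");
            for (int i = 0; i < extensions.getLength(); i++) {
                Element ext = (Element) extensions.item(i);
                NodeList impls = ext.getElementsByTagName("implementation");
                for (int j = 0; j < impls.getLength(); j++) {
                    Element impl = (Element) impls.item(j);
                    // Simplification: a later entry for the same point overwrites an earlier one.
                    result.put(ext.getAttribute("point"), impl.getAttribute("class"));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse plugin descriptor", e);
        }
    }

    public static void main(String[] args) {
        // Made-up descriptor with hypothetical names, for illustration only.
        String xml =
            "<plugin id=\"sample\" name=\"Sample\">"
          + "<extension id=\"org.example.parser\" point=\"org.example.ParseFilter\">"
          + "<implementation id=\"SampleParser\" class=\"org.example.SampleParser\"/>"
          + "</extension>"
          + "</plugin>";
        extensionClasses(xml).forEach((point, cls) -> System.out.println(point + " -> " + cls));
    }
}
```

Reading the file content of a real plugin.xml into a string and passing it to extensionClasses gives a quick list of which class to open first when studying a plugin's source.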

