How to develop Apache Nutch’s plugin (3) Research of an example

To prepare for plugin development, let us study the structure of a sample plugin.
I want to develop a language detection plugin, so our research target is the ‘language-identifier’ plugin, one of Nutch’s standard plugins.

The ‘language-identifier’ plugin has three extensions: HTMLLanguageParser, LanguageIdentifier and LanguageQueryFilter. (LanguageIdentifier is the implementation id in plugin.xml; its class is LanguageIndexingFilter.)

HTMLLanguageParser

This is a metadata extraction extension which extracts language-related information from the <meta> tags of HTML and from HTTP headers.
It is implemented at the HtmlParseFilter extension-point.
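
To make the idea concrete, here is a minimal sketch of pulling a language hint out of a <meta> tag, with an HTTP header value as fallback. It is not Nutch's code: the class MetaLanguageSketch, its methods, and the order "meta first, header second" are all assumptions made for illustration, and a regex is used only to keep the example short.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Nutch: finds a language hint in raw HTML.
class MetaLanguageSketch {
    // Matches <meta http-equiv="Content-Language" content="...">.
    // Assumption: attribute order is fixed, which real HTML does not guarantee.
    private static final Pattern CONTENT_LANGUAGE = Pattern.compile(
        "<meta\\s+http-equiv=[\"']content-language[\"']\\s+content=[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the language declared in a Content-Language <meta> tag, if any.
    static Optional<String> languageFromMeta(String html) {
        Matcher m = CONTENT_LANGUAGE.matcher(html);
        return m.find()
            ? Optional.of(m.group(1).trim().toLowerCase(Locale.ROOT))
            : Optional.empty();
    }

    // Tries the <meta> tag first, then the HTTP header value (assumed order).
    static Optional<String> detect(String html, String contentLanguageHeader) {
        Optional<String> fromMeta = languageFromMeta(html);
        if (fromMeta.isPresent()) {
            return fromMeta;
        }
        if (contentLanguageHeader == null || contentLanguageHeader.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(contentLanguageHeader.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta http-equiv=\"Content-Language\" content=\"ja\">"
            + "</head></html>";
        System.out.println(detect(html, "en").orElse("unknown")); // prints "ja"
    }
}
```

A real HtmlParseFilter would work on the parsed DOM rather than on a raw string, so treat this only as an outline of the lookup.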

LanguageIdentifier

This is a language insertion extension which detects the language of a resource from its metadata or body text and adds the detected language to Nutch’s index.
When it uses the body text, the language is estimated with a statistical method.
It is implemented at the IndexingFilter extension-point.
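
As a rough illustration of what a statistical estimate from body text can look like, the following toy sketch scores a text against per-language character trigram counts and picks the best match. It is only a sketch under stated assumptions, not the algorithm Nutch ships: the class TrigramLanguageGuesser, the tiny training sentences and the add-one smoothing are all invented for this example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy model, NOT Nutch's LanguageIdentifier.
class TrigramLanguageGuesser {
    // Language code -> trigram counts learned from a sample text.
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    // Counts every 3-character window of the lower-cased text.
    static Map<String, Integer> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    void train(String lang, String sample) {
        profiles.put(lang, trigrams(sample));
    }

    // Returns the language whose profile gives the text the highest
    // log-likelihood; add-one smoothing keeps unseen trigrams above zero.
    String guess(String text) {
        Map<String, Integer> observed = trigrams(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            Map<String, Integer> profile = e.getValue();
            int total = profile.values().stream().mapToInt(Integer::intValue).sum();
            // Vocabulary = trigrams seen in this profile plus one "unseen" bucket.
            double denominator = total + profile.size() + 1.0;
            double score = 0.0;
            for (Map.Entry<String, Integer> t : observed.entrySet()) {
                double p = (profile.getOrDefault(t.getKey(), 0) + 1.0) / denominator;
                score += t.getValue() * Math.log(p);
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TrigramLanguageGuesser guesser = new TrigramLanguageGuesser();
        guesser.train("en", "the quick brown fox jumps over the lazy dog and the cat");
        guesser.train("de", "der schnelle braune fuchs springt auf den faulen hund und die katze");
        System.out.println(guesser.guess("the dog and the cat"));
    }
}
```

With profiles this small the guess is only reliable for text close to the training sentences; a usable detector needs much larger samples per language.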

LanguageQueryFilter

This is a search criteria extension which adds a language condition to search queries.
It is implemented at the QueryFilter extension-point.

plugin.xml

The plugin.xml of the ‘language-identifier’ plugin is as follows.

<plugin
   id="language-identifier"
   name="Language Identification Parser/Filter"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="language-identifier.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.analysis.lang.LanguageParser"
              name="Nutch language Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="LanguageParser"
                      class="org.apache.nutch.analysis.lang.HTMLLanguageParser"/>
   </extension>

   <extension id="org.apache.nutch.analysis.lang"
              name="Nutch language identifier filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="LanguageIdentifier"
                      class="org.apache.nutch.analysis.lang.LanguageIndexingFilter"/>
   </extension>

   <extension id="org.apache.nutch.analysis.lang.LanguageQueryFilter"
              name="Nutch Language Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="LanguageQueryFilter"
                      class="org.apache.nutch.analysis.lang.LanguageQueryFilter">
        <parameter name="raw-fields" value="lang"/>
      </implementation>
   </extension>

</plugin>

The extension@point attribute makes it clear which extension-point each extension belongs to.
The implementation@class attribute gives the implementation class name of the extension. The sources of Nutch’s standard plugins are published, so we can read them as sample code.
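
The mapping described above can also be read out of a plugin.xml programmatically. The sketch below uses only the JDK's DOM parser to list each extension@point with its implementation@class. The class PluginXmlSketch and its method are hypothetical names for this example, and it assumes every extension has at least one implementation element; Nutch's own plugin loader is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, NOT Nutch's plugin loader.
class PluginXmlSketch {
    // Maps each extension's "point" attribute to its first implementation class.
    static Map<String, String> implementationsByPoint(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList extensions = doc.getElementsByTagName("extension");
            for (int i = 0; i < extensions.getLength(); i++) {
                Element ext = (Element) extensions.item(i);
                // Assumption: each <extension> holds at least one <implementation>.
                Element impl = (Element) ext.getElementsByTagName("implementation").item(0);
                result.put(ext.getAttribute("point"), impl.getAttribute("class"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse plugin.xml", e);
        }
    }

    public static void main(String[] args) {
        // Trimmed copy of the listing above: one extension only.
        String xml = "<plugin id=\"language-identifier\">"
            + "<extension id=\"org.apache.nutch.analysis.lang\""
            + " point=\"org.apache.nutch.indexer.IndexingFilter\">"
            + "<implementation id=\"LanguageIdentifier\""
            + " class=\"org.apache.nutch.analysis.lang.LanguageIndexingFilter\"/>"
            + "</extension></plugin>";
        implementationsByPoint(xml).forEach(
            (point, cls) -> System.out.println(point + " -> " + cls));
    }
}
```

Running it against the full plugin.xml would list three pairs, one per extension, which is a quick way to find the class to read for each extension-point.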

