To enable plugins, place them in $NUTCH_HOME/plugins/ and add the plugin names to conf/nutch-site.xml.
$NUTCH_HOME/plugins/
  [plugin name]/
    plugin.xml         plugin information xml file
    [some name].jar    plugin implementation jar file
The “plugin.includes” property in conf/nutch-site.xml lists the enabled plugin names, so set it to the default plugin names from conf/nutch-default.xml plus your own plugin names.
Example: Default enabled plugins
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
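To enable your own plugin as well, the same property can be overridden in conf/nutch-site.xml with the plugin name appended to the default list. The following is a minimal sketch; the name `my-plugin` is a hypothetical placeholder for your plugin's directory name under $NUTCH_HOME/plugins/.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (sketch): overrides the default plugin.includes value -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Default list copied from above, with the hypothetical "my-plugin" appended. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|my-plugin</value>
  </property>
</configuration>
```

The value is treated as a regular expression over plugin names, which is why alternatives are separated with `|` and grouped with parentheses.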
How to prepare plugin.xml
“plugin.xml” declares the implementation jar file name, the extension points, the classes and the parameters.
An extension is an implementation of an extension point, and a plugin can have more than one extension.
The implementation jar file name is written in the plugin/runtime node.
<runtime>
  <library name="language-identifier.jar">
    <export name="*"/>
  </library>
</runtime>
Information about the extensions is written in plugin/extension nodes.
<extension id="org.apache.nutch.analysis.lang.LanguageQueryFilter"
           name="Nutch Language Query Filter"
           point="org.apache.nutch.searcher.QueryFilter">
  <implementation id="LanguageQueryFilter"
                  class="org.apache.nutch.analysis.lang.LanguageQueryFilter">
    <parameter name="raw-fields" value="lang"/>
  </implementation>
</extension>
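To show where the runtime and extension fragments sit relative to each other, here is a sketch of a complete plugin.xml. The root plugin element with its attributes and the requires block follow the layout commonly seen in Nutch's bundled plugins. The concrete id, name, version and provider-name values are illustrative assumptions and are not taken from the text above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the root attribute values below are assumed for illustration. -->
<plugin
   id="language-identifier"
   name="Language Identification Parser/Filter"
   version="1.0.0"
   provider-name="nutch.org">

   <!-- plugin/runtime: names the jar file that holds the implementation -->
   <runtime>
      <library name="language-identifier.jar">
         <export name="*"/>
      </library>
   </runtime>

   <!-- Assumed: plugins usually import the core extension points -->
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <!-- plugin/extension: one node per extension the plugin provides -->
   <extension id="org.apache.nutch.analysis.lang.LanguageQueryFilter"
              name="Nutch Language Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="LanguageQueryFilter"
                      class="org.apache.nutch.analysis.lang.LanguageQueryFilter">
         <parameter name="raw-fields" value="lang"/>
      </implementation>
   </extension>
</plugin>
```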
Each node and attribute of the extension element means the following.
- extension@id: The id of the extension, which normally equals the class name.
- extension@point: The extension-point name, which is also an interface name.
- implementation@id: The id of the extension implementation. (I don’t yet know the difference between extension@id and it…)
- implementation@class: The class name of the extension implementation.
- parameter: Parameters of the extension. They can be referenced by Nutch’s plugin framework, but not by the plugin itself. If you want to handle parameters in a plugin, write them in conf/nutch-site.xml.
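As a sketch of that last point, a plugin-specific setting could be declared in conf/nutch-site.xml as an ordinary property. The property name `myplugin.raw.fields` and its value are hypothetical, chosen only for illustration. The plugin would then look the value up by name through the Nutch configuration object it receives.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (sketch): a setting read by the plugin's own code -->
<configuration>
  <property>
    <!-- Hypothetical name; prefixing it with the plugin name avoids clashes. -->
    <name>myplugin.raw.fields</name>
    <value>lang</value>
    <description>Example setting for a custom plugin.</description>
  </property>
</configuration>
```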