Category Archives: Nutch

Language Detection Plugin for Apache Nutch

I developed a Language Detection plugin for Apache Nutch using our LangDetect library. Download (bundled in the LangDetect library). Setup manual. Compatible with Nutch’s standard language identification plugin. Over 99% accuracy. Supports 49 languages: Afrikaans, Albanian, Arabic, Bengali, … Continue reading

Posted in i18n, NLP, Nutch, Plugin, Search Engine, text analysis | Leave a comment

How to develop Apache Nutch’s plugin (5) Sample Code (Language Detection Plugin)

Now, as sample code for a Nutch plugin, let’s look at a Language Detection plugin built with our LangDetect library. Of the 3 extensions in Apache Nutch’s Language Identification plugin, we will replace only the IndexingFilter extension (see the previous post). So … Continue reading

Posted in Java, Nutch, Plugin, Search Engine | 1 Comment

How to develop Apache Nutch’s plugin (4) IndexingFilter extension-point

In the previous post, I explained that Nutch’s Language Identification plugin has 3 extensions, on HtmlParseFilter, IndexingFilter and QueryFilter. In particular, the IndexingFilter extension handles the language identification procedure. So we’ll look into how to develop an extension plugin on … Continue reading

Posted in Java, Nutch, Search Engine | 2 Comments

How to develop Apache Nutch’s plugin (3) Research of an example

In order to develop plugins, we shall study the structure of a sample plugin. I want to develop a language detection plugin, so our research target is the ‘language-identifier’ plugin among Nutch’s standard plugins. The ‘language-identifier’ plugin has three … Continue reading

Posted in Java, Nutch, Search Engine | 2 Comments

How to develop Apache Nutch’s plugin (2) plugin.xml

To enable plugins, place them in $NUTCH_HOME/plugins/ and add the plugin names to conf/nutch-site.xml. Each $NUTCH_HOME/plugins/[plugin name]/ directory holds plugin.xml (the plugin information XML file) and [some name].jar (the jar file that implements the plugin). The “plugin.includes” property lists the enabled plugin names in … Continue reading

Posted in Nutch, Search Engine | 1 Comment

How to develop Apache Nutch’s plugin (1) extension-points list

I want to replace Apache Nutch’s LanguageIdentifier with our Language Detection library, so I’m summarizing what I have researched while developing. Extension-points: “Extension-points” are interfaces that extend org.apache.nutch.plugin.Pluggable. List of extension-points: Nutch Summarizer (org.apache.nutch.searcher.Summarizer): no documentation. Summarizer of search … Continue reading

Posted in Nutch, Search Engine | 1 Comment

Apache Nutch’s Architecture

Apache Nutch (http://nutch.apache.org/) is a web search engine that consists of Lucene, Solr, a web crawler, page scoring (PageRank) and a pluggable distributed system. Nutch’s crawler has a language identification plugin. I want to replace Nutch’s LanguageIdentifier with our Language … Continue reading

Posted in Java, Nutch, Search Engine | 4 Comments