Category Archives: Nutch
Language Detection Plugin for Apache Nutch
I developed a Language Detection plugin for Apache Nutch with our LangDetect library. Download (bundled in the LangDetect library). Setup manual. Compatible with the standard language identification plugin of Nutch. Over 99% accuracy. Supports 49 languages: Afrikaans, Albanian, Arabic, Bengali, … Continue reading
Posted in i18n, NLP, Nutch, Plugin, Search Engine, text analysis
How to develop Apache Nutch’s plugin (5) Sample Code (Language Detection Plugin)
Now, as sample code for a Nutch plugin, we shall look at a Language Detection plugin built with our LangDetect library. Of the 3 extensions that Apache Nutch’s Language Identification plugin has, we will replace only the IndexingFilter extension (see the previous post). So … Continue reading
Posted in Java, Nutch, Plugin, Search Engine
1 Comment
How to develop Apache Nutch’s plugin (4) IndexingFilter extension-point
In the previous post, I explained that Nutch’s Language Identification plugin has 3 extensions, on HtmlParseFilter, IndexingFilter and QueryFilter. In particular, the IndexingFilter extension handles the language identification procedure. So we’ll look into how to develop an extension plugin on … Continue reading
Posted in Java, Nutch, Search Engine
2 Comments
How to develop Apache Nutch’s plugin (3) Research of an example
To develop plugins, we shall study the structure of a sample plugin. I want to develop a language detection plugin, so our research target is the ‘language-identifier’ plugin among Nutch’s standard plugins. The ‘language-identifier’ plugin has three … Continue reading
Posted in Java, Nutch, Search Engine
2 Comments
How to develop Apache Nutch’s plugin (2) plugin.xml
To enable plugins, place them in $NUTCH_HOME/plugins/ and add the plugin names to conf/nutch-site.xml. $NUTCH_HOME/plugins/ [plugin name]/ contains plugin.xml (the plugin information XML file) and [some name].jar (the jar file implementing the plugin). The “plugin.includes” property holds the enabled plugin names in … Continue reading
Posted in Nutch, Search Engine
1 Comment
How to develop Apache Nutch’s plugin (1) extension-points list
I want to replace Apache Nutch’s LanguageIdentifier with our Language Detection library, so I’m summarizing what I have found while developing. Extension-points: “Extension-points” are interfaces that extend org.apache.nutch.plugin.Pluggable. List of extension-points: Nutch Summarizer (org.apache.nutch.searcher.Summarizer): no documentation. Summarizer of search … Continue reading
Posted in Nutch, Search Engine
1 Comment
Apache Nutch’s Architecture
Apache Nutch http://nutch.apache.org/ Apache Nutch is a web search engine that consists of Lucene, Solr, a web crawler, page scoring (PageRank) and a pluggable distributed system. Nutch’s crawler has a language identification plugin. I want to replace Nutch’s LanguageIdentifier with our Language … Continue reading
Posted in Java, Nutch, Search Engine
2 Comments