Category Archives: Java
Whether language profile should be bundled or not?
I’m going to add Maven support to language-detection, but I’m having some trouble with the language profiles… language-detection keeps its language profiles separate from the jar file because there are so many of them (the jar without profiles is only around 100KB, while the jar with them is … Continue reading
Posted in Java, Language Detection
3 Comments
Hadoop Development Environment with Eclipse
A typical Hadoop application needs a jar file for the distributed nodes. Rebuilding that jar over and over during development is very troublesome… So I’ve set up Eclipse to run a Hadoop application (MapReduce job) in standalone mode. It makes debugging easier, so … Continue reading
Posted in Development, Eclipse, Hadoop, Java
83 Comments
Mahout Development Environment with Maven and Eclipse (2)
Sample Code of “Mahout in Action” The sample code for “Mahout in Action”, a Mahout book from Manning, is published here. It includes the source code for Chapters 2 to 6. Now we’ll build it on the Eclipse … Continue reading
Posted in Eclipse, Java, Machine Learning, Mahout, Maven
21 Comments
Mahout Development Environment with Maven and Eclipse (1)
I’m reading the MEAP edition of “Mahout in Action”, but it doesn’t explain how to set up a development environment for Mahout… So I wrote up how to do it, testing with the sample code from “Mahout in Action”. Install I tested this on Windows … Continue reading
Posted in Development, Eclipse, Java, Machine Learning, Mahout, Maven
10 Comments
Updated LangDetect Library (4x faster)
I’ve updated LangDetect (Language Detection Library for Java). http://code.google.com/p/language-detection/ Download the package “langdetect-01-24-2011.zip” from the Download List. This update makes detection 4x faster, based on improvement code posted by Elmer Garduno. Many thanks!! Table: time for 100 detections on the test data (ms). … Continue reading
Posted in i18n, Java, Language Detection, NLP
4 Comments
How to develop Apache Nutch’s plugin (5) Sample Code (Language Detection Plugin)
Now, as sample code for a Nutch plugin, we’ll look at a Language Detection plugin built with our LangDetect library. Of the 3 extensions in Apache Nutch’s Language Identification plugin, we will replace only the IndexingFilter extension (see the previous post). So … Continue reading
Posted in Java, Nutch, Plugin, Search Engine
1 Comment
How to develop Apache Nutch’s plugin (4) IndexingFilter extension-point
In the previous post, I explained that Nutch’s Language Identification plugin has 3 extensions, on HtmlParseFilter, IndexingFilter and QueryFilter. In particular, the IndexingFilter extension handles the language identification procedure. So we’ll look into how to develop an extension plugin on … Continue reading
Posted in Java, Nutch, Search Engine
2 Comments
How to develop Apache Nutch’s plugin (3) Research of an example
To prepare for plugin development, we’ll study the structure of a sample plugin. I want to develop a language detection plugin, so our research target is the ‘language-identifier’ plugin among Nutch’s standard plugins. The ‘language-identifier’ plugin has three … Continue reading
Posted in Java, Nutch, Search Engine
2 Comments
Apache Nutch’s Architecture
Apache Nutch http://nutch.apache.org/ Apache Nutch is a web search engine which consists of Lucene, Solr, a web crawler, page scoring (PageRank) and a pluggable distributed system. Nutch’s crawler has a language identification plugin. I want to replace Nutch’s LanguageIdentifier with our Language … Continue reading
Posted in Java, Nutch, Search Engine
2 Comments
Language Detection Library for Java
I developed a Language Detection library for Java which can detect 49 languages in a given text (English, Japanese, Chinese, …). http://code.google.com/p/language-detection/ This library achieves over 99% accuracy on a news corpus (see the presentation below). I’ll try to substitute Apache … Continue reading
Posted in i18n, Java, NLP, text analysis
1 Comment