I developed a Language Detection plugin for Apache Nutch with our LangDetect library.
- Download (bundled in the LangDetect library)
- Setup manual
- Compatible with Nutch's standard language identification plugin
- Over 99% accuracy
- Supports 49 languages
- Afrikaans, Albanian, Arabic, Bengali, Bulgarian, Chinese (Simplified/Traditional)
- Croatian, Czech, Danish, Dutch, English, Finnish, French, German, Greek
- Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Kannada
- Korean, Macedonian, Malayalam, Marathi, Nepali, Norwegian, Persian, Polish
- Portuguese, Punjabi, Romanian, Russian, Slovak, Somali, Spanish, Swahili
- Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese
The above is a sample of index entries with their detected languages, crawled from several sites.
To confirm the results yourself, open the Nutch index with Luke after crawling.
I also wrote a series of articles on how to develop a Nutch plugin. Read them if you are interested.
- (1) extension-points list
- (2) plugin.xml
- (3) Research of an example
- (4) IndexingFilter extension-point
- (5) Sample Code (Language Detection Plugin)