Category Archives: text analysis

Latent Dirichlet Allocation in Python

Latent Dirichlet Allocation (LDA) is a topic model for text. In LDA, each document has a topic distribution and each topic has a word distribution. Each word in a document is generated by first drawing a topic from the document's topic distribution and then drawing the word from that topic's word distribution. However … Continue reading
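
The generative story in this excerpt can be sketched in a few lines of Python. This is only a minimal illustration, not the code from the full post: the function names, the toy vocabulary, and the hyperparameter values (`alpha`, `beta`) are all assumptions chosen for the example, and it only generates documents (it does no inference).

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics=2, n_docs=3, doc_len=8,
                    alpha=0.5, beta=0.5, seed=0):
    """Generate toy documents following the LDA generative process.

    alpha and beta are hypothetical hyperparameter values for this sketch.
    """
    rng = random.Random(seed)
    # Each topic has its own distribution over the vocabulary.
    topic_word = [sample_dirichlet(rng, beta, len(vocab))
                  for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Each document has its own distribution over topics.
        doc_topic = sample_dirichlet(rng, alpha, n_topics)
        words = []
        for _ in range(doc_len):
            # Draw a topic for this word position, then a word from that topic.
            z = rng.choices(range(n_topics), weights=doc_topic)[0]
            w = rng.choices(vocab, weights=topic_word[z])[0]
            words.append(w)
        docs.append(words)
    return docs


if __name__ == "__main__":
    toy_vocab = ["topic", "model", "word", "corpus", "java", "nutch"]
    for doc in generate_corpus(toy_vocab):
        print(" ".join(doc))
```

Fitting LDA to real text runs this story in reverse: given only the words, it estimates the per-document topic distributions and the per-topic word distributions.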

Posted in LDA, Machine Learning, NLP, Python, text analysis | 18 Comments

CICLing 2011 retrospective

On Feb 25, I attended the last day of CICLing 2011 (International Conference on Intelligent Text Processing and Computational Linguistics) at Waseda University, Japan. I enjoyed it very much, as this was my first time attending an international conference. Well, I’ll … Continue reading

Posted in NLP, text analysis | Leave a comment

Language Detection Plugin for Apache Nutch

I developed a Language Detection plugin for Apache Nutch with our LangDetect library. Download (bundled in the LangDetect library); setup manual; compatible with the standard language identification plugin of Nutch; over 99% accuracy; supports 49 languages: Afrikaans, Albanian, Arabic, Bengali, … Continue reading

Posted in i18n, NLP, Nutch, Plugin, Search Engine, text analysis | Leave a comment

Language Detection Library for Java

I developed a Language Detection library for Java that can detect 49 languages in a given text (English, Japanese, Chinese, …). http://code.google.com/p/language-detection/ This library has over 99% accuracy on a news corpus (see the presentation below). I’ll try to substitute Apache … Continue reading

Posted in i18n, Java, NLP, text analysis | 3 Comments

Extract body from Project Gutenberg’s text

Project Gutenberg’s texts are convenient for experimenting with text analysis and information retrieval. But their headers and footers are extremely free-format… I tried to write a Ruby script to extract the body of Project Gutenberg’s texts with regular expressions. … Continue reading
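
As a rough illustration of the idea (in Python rather than Ruby, and not the script from the full post), body extraction with regular expressions might look like the sketch below. It assumes the text carries `*** START OF THIS PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ... ***` marker lines; since the headers and footers are so free-format, many files will need other patterns, and the names `START_RE`, `END_RE`, and `extract_body` are made up for this example.

```python
import re

# Assumed marker lines; real files vary widely, so treat these as a starting point.
START_RE = re.compile(
    r"^\*{3}\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*{3}\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_body(text):
    """Return the text between the start and end markers.

    If a marker is missing, fall back to the beginning or end of the text.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


if __name__ == "__main__":
    sample = (
        "Some free-format header\n"
        "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "\n"
        "First line of the body.\n"
        "Second line of the body.\n"
        "\n"
        "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "License text follows.\n"
    )
    print(extract_body(sample))
```

Falling back to the whole text when no marker matches keeps the function safe to run over a mixed collection, at the cost of leaving some headers and footers in place.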

Posted in text analysis | Leave a comment