Category Archives: text analysis
Latent Dirichlet Allocation in Python
Latent Dirichlet Allocation (LDA) is a topic model for text. In LDA, each document has a topic distribution and each topic has a word distribution. Each word in a document is generated by first drawing a topic from the document's topic distribution and then drawing the word from that topic's word distribution. However … Continue reading
Posted in LDA, Machine Learning, NLP, Python, text analysis
10 Comments
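The generative story in the excerpt above can be sketched in a few lines of Python. This is only an illustration of sampling from the model, not the inference code from the post; the function names, the symmetric hyperparameters `alpha` and `beta`, and the toy vocabulary are all assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_length, vocab, num_topics,
                    alpha=0.5, beta=0.5, seed=0):
    """Generate documents following the LDA generative process.

    alpha and beta are symmetric Dirichlet hyperparameters (assumed values).
    """
    rng = random.Random(seed)
    # Each topic has its own distribution over the vocabulary.
    topic_word = [sample_dirichlet([beta] * len(vocab), rng)
                  for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Each document has its own distribution over topics.
        doc_topic = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            # Draw a topic for this word position, then a word from that topic.
            z = rng.choices(range(num_topics), weights=doc_topic)[0]
            w = rng.choices(vocab, weights=topic_word[z])[0]
            words.append(w)
        corpus.append(words)
    return corpus


if __name__ == "__main__":
    toy_vocab = ["topic", "model", "word", "corpus", "language", "text"]
    for doc in generate_corpus(3, 8, toy_vocab, num_topics=2):
        print(" ".join(doc))
```

Fitting LDA means going the other way: given only the words, estimate the hidden topic assignments and the two sets of distributions.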
CICLing 2011 retrospective
On Feb 25, I attended the last day of CICLing 2011 (International Conference on Intelligent Text Processing and Computational Linguistics) at Waseda University, Japan. I enjoyed it very much, as this was my first time attending an international conference. Well, I’ll … Continue reading
Posted in NLP, text analysis
Language Detection Plugin for Apache Nutch
I developed a Language Detection plugin for Apache Nutch using our LangDetect library. Download (bundled in the LangDetect library); Setup manual; Compatible with the standard language identification plugin of Nutch; Over 99% accuracy; Supports 49 languages: Afrikaans, Albanian, Arabic, Bengali, … Continue reading
Posted in i18n, NLP, Nutch, Plugin, Search Engine, text analysis
Language Detection Library for Java
I developed a Language Detection library for Java that can detect 49 languages in a given text (English, Japanese, Chinese, …). http://code.google.com/p/language-detection/ This library achieves over 99% accuracy on a news corpus (see the presentation below). I’ll try to substitute Apache … Continue reading
Posted in i18n, Java, NLP, text analysis
1 Comment
Extract body from Project Gutenberg’s text
Project Gutenberg’s texts are convenient for experimenting with text analysis and information retrieval. But they have a header and a footer whose format is extremely free… I tried to write a Ruby script that extracts the body of Project Gutenberg’s texts with regular expressions. … Continue reading
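As a rough illustration of the idea (in Python rather than the Ruby of the original script, and not the author's actual patterns), the sketch below cuts the text at the `*** START OF ... PROJECT GUTENBERG EBOOK` and `*** END OF ... PROJECT GUTENBERG EBOOK` marker lines that many files carry. The marker wording and the name `extract_body` are assumptions; since the headers and footers are free-format, files without these markers are simply returned unchanged.

```python
import re

# Assumed marker lines; many (not all) Project Gutenberg files use this wording.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_body(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, fall back to the beginning or end of the text.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


if __name__ == "__main__":
    sample = (
        "Some license header\n"
        "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "\n"
        "Chapter 1\n"
        "It was a dark and stormy night.\n"
        "\n"
        "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "Some license footer\n"
    )
    print(extract_body(sample))
```

Older files use other conventions, so a real script would need additional patterns for those cases.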