Category Archives: Development

Language Detection for twitter with 99.1% Accuracy

I released ldig (Language Detection with Infinity-Gram), a newer language detection (language identification) prototype for Twitter. https://github.com/shuyo/ldig It detects tweets in 17 languages with 99.1% accuracy (Czech, Danish, German, English, Spanish, Finnish, French, Indonesian, Italian, … Continue reading

Posted in Language Detection, NLP, Python | 25 Comments

Repository Migration from subversion into git on Google Code Project

I migrated the language-detection repository on Google Code Project from Subversion to Git, because its directory layout has to change substantially for Maven support. (I HATE branching in Subversion! :D) Google Code Project supports Subversion, Git and Mercurial … Continue reading

Posted in Development | Leave a comment

Whether language profile should be bundled or not?

I’m going to add Maven support to language-detection, but I have some trouble with the language profiles… language-detection keeps its language profiles separate from the jar file because there are a lot of them (the jar without profiles is only around 100KB, while the jar with them has … Continue reading

Posted in Java, Language Detection | 4 Comments

Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (3)

In the previous article, I introduced a simple implementation of collapsed Gibbs sampling estimation for Latent Dirichlet Allocation (LDA). However, because each word topic z_mn is initialized to a random topic in this implementation, there are some troubles. First, it needs … Continue reading

Posted in LDA, Python | 6 Comments

Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (2)

Before the iterations of LDA estimation, it is necessary to initialize the parameters. Collapsed Gibbs Sampling (CGS) estimation has the following parameters: z_mn: topic of word n of document m; n_mz: word count of document m with topic z; n_tz … Continue reading

Posted in LDA, Machine Learning, Python | 14 Comments

Latent Dirichlet Allocation in Python

Latent Dirichlet Allocation (LDA) is a language topic model. In LDA, each document has a topic distribution and each topic has a word distribution. Each word is generated from the word distribution of a topic drawn from the document's topic distribution. However … Continue reading

Posted in LDA, Machine Learning, NLP, Python, text analysis | 18 Comments
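
The generative story in the LDA excerpt above can be sketched in a few lines of Python. This is only a minimal illustration, not the code from the linked post: the function names (`sample_dirichlet`, `generate_corpus`), the symmetric Dirichlet priors `alpha` and `beta`, their default values, and the toy vocabulary are all assumptions made for the example.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab, alpha=0.5, beta=0.5, seed=0):
    """Generate (word, topic) pairs following the LDA generative process.

    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    # Each topic has its own word distribution.
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Each document has its own topic distribution.
        theta = sample_dirichlet(alpha, n_topics, rng)
        doc = []
        for _ in range(doc_len):
            # Draw a topic for this word position, then a word from that topic.
            z = rng.choices(range(n_topics), weights=theta)[0]
            w = rng.choices(vocab, weights=phi[z])[0]
            doc.append((w, z))
        corpus.append(doc)
    return corpus


if __name__ == "__main__":
    toy_vocab = ["hadoop", "java", "topic", "model", "tweet"]  # hypothetical vocabulary
    for doc in generate_corpus(n_docs=2, doc_len=6, n_topics=2, vocab=toy_vocab):
        print(doc)
```

Estimation runs this story in reverse: only the words are observed, and the topic assignments and both sets of distributions have to be inferred.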

Hadoop Development Environment with Eclipse

A typical Hadoop application needs a jar file for the distributed nodes. Rebuilding the jar again and again during development is very troublesome… So I’ve enabled Eclipse to run a Hadoop application (MapReduce job) in standalone mode. It makes debugging easier, so … Continue reading

Posted in Development, Eclipse, Hadoop, Java | 89 Comments