Category Archives: Development
Language Detection for twitter with 99.1% Accuracy
I released a newer prototype of Language Detection (Language Identification) with Infinity-Gram (ldig), a language detection prototype for Twitter. https://github.com/shuyo/ldig It detects tweets in 17 languages with 99.1% accuracy (Czech, Danish, German, English, Spanish, Finnish, French, Indonesian, Italian, … Continue reading
Posted in Language Detection, NLP, Python
16 Comments
Repository Migration from subversion into git on Google Code Project
I migrated language-detection’s repository on Google Code Project from Subversion to Git, because its directory layout has to change substantially for Maven support. (I HATE Subversion’s branching!) Google Code Project supports Subversion, Git and Mercurial … Continue reading
Posted in Development
Leave a comment
Whether language profile should be bundled or not?
I’m going to support Maven for language-detection, but I have some trouble with language profiles… language-detection has kept its language profiles separate from the jar file because there are a lot of them (meanwhile, the jar without profiles is only around 100KB, while the jar with them has … Continue reading
Posted in Java, Language Detection
3 Comments
Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (3)
In the previous article, I introduced a simple implementation of collapsed Gibbs sampling estimation for Latent Dirichlet Allocation (LDA). However, because each word topic z_mn is initialized to a random topic in that implementation, there are some troubles. First, it needs … Continue reading
Posted in LDA, Python
4 Comments
Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (2)
Before the iterations of LDA estimation, it is necessary to initialize the parameters. Collapsed Gibbs Sampling (CGS) estimation has the following parameters. z_mn: topic of word n of document m; n_mz: word count of document m with topic z; n_tz … Continue reading
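As a rough illustration of the initialization step, here is a minimal Python sketch. It is not the code from the post. It assumes the random initialization of z_mn mentioned in the previous article's excerpt, and since the excerpt is cut off at n_tz, the meaning given to n_tz (count of word t assigned to topic z) and the extra n_z total are assumptions. The function name init_cgs and its arguments are hypothetical.

```python
import random
from collections import defaultdict


def init_cgs(docs, K, seed=0):
    """Randomly initialize collapsed Gibbs sampling counters (illustrative only).

    docs: list of documents, each a list of word ids
    K:    number of topics
    """
    rng = random.Random(seed)
    z_mn = []                             # z_mn[m][n]: topic of word n of document m
    n_mz = [[0] * K for _ in docs]        # n_mz[m][z]: word count of document m with topic z
    n_tz = defaultdict(lambda: [0] * K)   # assumed: n_tz[t][z] = count of word t with topic z
    n_z = [0] * K                         # assumed: total word count per topic
    for m, doc in enumerate(docs):
        z_m = []
        for t in doc:
            z = rng.randrange(K)          # random initial topic for this word
            z_m.append(z)
            n_mz[m][z] += 1
            n_tz[t][z] += 1
            n_z[z] += 1
        z_mn.append(z_m)
    return z_mn, n_mz, n_tz, n_z
```

Keeping the counters in sync with z_mn from the start means each later sampling step only has to decrement and increment the entries for the word being resampled.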
Posted in LDA, Machine Learning, Python
8 Comments
Latent Dirichlet Allocation in Python
Latent Dirichlet Allocation (LDA) is a language topic model. In LDA, each document has a topic distribution and each topic has a word distribution. Words are generated from the topic-word distribution according to the topics drawn for the document. However … Continue reading
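The generative story can be sketched in a few lines of Python. This is only an illustration using the standard library, not the implementation from the post. The names sample_dirichlet and generate_document, the symmetric prior alpha, and the fixed topic-word table phi are all assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(phi, alpha, length, rng):
    """Generate one document under the LDA generative story.

    phi:    phi[z][t] = probability of word t under topic z (assumed given)
    alpha:  symmetric Dirichlet prior on the document's topic distribution
    length: number of words to generate
    """
    K = len(phi)
    theta = sample_dirichlet([alpha] * K, rng)                # document's topic distribution
    topics = rng.choices(range(K), weights=theta, k=length)   # one topic per word position
    words = [
        rng.choices(range(len(phi[z])), weights=phi[z], k=1)[0]  # word from that topic
        for z in topics
    ]
    return theta, topics, words


rng = random.Random(0)
phi = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]  # two toy topics over a 3-word vocabulary
theta, topics, words = generate_document(phi, alpha=0.5, length=10, rng=rng)
```

Estimation runs this story in reverse: given only the words, it infers the hidden topics and distributions.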
Posted in LDA, Machine Learning, NLP, Python, text analysis
10 Comments
Hadoop Development Environment with Eclipse
A typical Hadoop application needs a jar file for the distributed nodes. It is very troublesome to rebuild the jar repeatedly during development… So I’ve enabled Eclipse to run a Hadoop application (Map-Reduce job) in its standalone mode. It makes debugging easier so … Continue reading
Posted in Development, Eclipse, Hadoop, Java
83 Comments
Mahout Development Environment with Maven and Eclipse (2)
Sample Code of “Mahout in Action” The sample code of “Mahout in Action”, a Mahout book from Manning, is published here. It includes the source code for Chapters 2 to 6. Now, we’ll build it on the Eclipse … Continue reading
Posted in Eclipse, Java, Machine Learning, Mahout, Maven
21 Comments
Mahout Development Environment with Maven and Eclipse (1)
I’m reading the MEAP Edition of “Mahout in Action”, but it doesn’t explain how to set up a development environment for Mahout… So I wrote up how to do that by testing the sample code of “Mahout in Action”. Install I examine based on Windows … Continue reading
Posted in Development, Eclipse, Java, Machine Learning, Mahout, Maven
10 Comments
Updated LangDetect Library (4x faster)
I’ve updated LangDetect (Language Detection Library for Java). http://code.google.com/p/language-detection/ Download the package “langdetect-01-24-2011.zip” from the Download List. This update makes detection 4x faster, based on improvement code posted by Elmer Garduno. Many thanks!! Table: detection time for 100 runs over the test data (ms). … Continue reading
Posted in i18n, Java, Language Detection, NLP
4 Comments