Repository Migration from subversion into git on Google Code Project

I migrated language-detection’s repository on Google Code Project from Subversion to git.
The reason is that its directory layout has to change a lot for Maven support. (I HATE branching in Subversion! :D )

Google Code Project supports Subversion, git and Mercurial as version control systems. The repositories are independent and mutually exclusive, so migrating between them has to be done manually.
So I wrote down the migration process from Subversion to git step by step.

1. Prepare git and git-svn

I used them on Cygwin. But I reckon it is easier on Linux :D

2. Migrate into local git repositories

First, migrate the Subversion repository into local git repositories (because the hosted repositories are mutually exclusive).
The Subversion repository of a Google Code Project holds the history of not only the code (trunk/branches/tags) but also the wiki. With git, the code and the wiki are stored in separate repositories.
So you need to migrate 2 repositories if you didn’t use branches/tags.

Execute the commands below in an appropriate directory.

$ cd your-working-directory
$ git svn clone -s http://your-project-name.googlecode.com/svn/
$ git svn clone http://your-project-name.googlecode.com/svn/wiki/

Replace “your-working-directory” and “your-project-name” with your own names. Note that the second command does not have the -s option (-s tells git-svn to expect the standard trunk/branches/tags layout, which the wiki directory does not have).
If you use branches and tags, you might need several extra steps for tagging in the git repository, but I didn’t look into how because I didn’t use them :P

If you run into an error like the one below with git-svn on Cygwin (it may occur often on Windows 7 x64?),

10192702 [main] perl 9904 C:\cygwin\bin\perl.exe: *** fatal error - unable to re
map C:\cygwin\bin\cygsvn_client-1-0.dll to same address as parent: 0xAB0000 != 0xAE0000

then execute rebaseall from cmd.exe as below.

$ cd c:\cygwin\bin
$ ash rebaseall -v

3. Switch the project’s VCS into git

You can switch to git on the Source tab of the project’s administration console.
This operation does not affect the Subversion repository; it only restricts access to the selected system. So you can switch back to Subversion at any time.

4. Push the local repositories into new ones

Execute the commands below.

$ cd svn
$ git push https://code.google.com/p/your-project-name/ master
$ cd ..
$ cd wiki
$ git push https://code.google.com/p/your-project-name.wiki/ master

Then the git repositories will be available. You should check that the project’s wiki looks the same as before.
I like that the code and wiki repositories are separated.
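
For scripting, steps 2 and 4 could be collected in a small Python helper. This is only a sketch: the function name migration_commands is made up for illustration, it only builds the command lines and runs nothing, and it uses "git -C <dir>" (available in git 1.8.5 and later) in place of the cd commands above.

```python
def migration_commands(project):
    """Return the git commands of steps 2 and 4 as argument lists.

    Nothing is executed here; each list could be passed to subprocess.run().
    """
    svn_root = "http://%s.googlecode.com/svn/" % project
    git_root = "https://code.google.com/p/%s" % project
    return [
        # step 2: clone the code (standard layout) and the wiki (flat layout)
        ["git", "svn", "clone", "-s", svn_root],
        ["git", "svn", "clone", svn_root + "wiki/"],
        # step 4: push each local repository to its new git counterpart
        ["git", "-C", "svn", "push", git_root + "/", "master"],
        ["git", "-C", "wiki", "push", git_root + ".wiki/", "master"],
    ]


for argv in migration_commands("your-project-name"):
    print(" ".join(argv))
```

Step 3 (switching the VCS in the administration console) still has to be done by hand between the clone and push commands.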

Reference:
- http://d.hatena.ne.jp/kkobayashi_a/20090615/p1 (in Japanese)

Posted in Development

Should language profiles be bundled or not?

I’m going to add Maven support to language-detection, but I have some trouble deciding what to do about the language profiles…

language-detection has kept the language profiles separate from the jar file because there are a lot of them (the jar without profiles is only around 100KB, while the jar with them is over 1MB).
However, there are some reasonable requests to bundle them in the jar file.

  • With separate profiles, the application has to know the directory where they are installed.
  • Hadoop can’t distribute language profiles that are outside the jar file.
  • I want to register the library in Maven Central, and a profile-bundled jar is more useful there.

Also, a user told me how to generate a jar file with selected profiles using a Maven template, so a developer who wants a slim library can generate a jar file with only the necessary profiles (or even with no profiles!).

So I am beginning to think that the jar file packaged with the language-detection library can bundle all language profiles. What do you think?

Posted in Java, Language Detection

language-detection now supports 17 language profiles for short messages

language-detection ( http://code.google.com/p/language-detection/ , langdetect) now supports 17 new language profiles generated from a Twitter corpus.
They are published in the trunk of the langdetect repository (and will be packaged sooner or later).

The 17 languages are listed below.

  • cs : Czech
  • da : Danish
  • de : German
  • en : English
  • es : Spanish
  • fi : Finnish
  • fr : French
  • id : Indonesian
  • it : Italian
  • nl : Dutch
  • no : Norwegian
  • pl : Polish
  • pt : Portuguese
  • ro : Romanian
  • sv : Swedish
  • tr : Turkish
  • vi : Vietnamese

These profiles are about 2 points more accurate for short message detection (tweets and so on) than the bundled profiles generated from Wikipedia abstracts.

language  test size  original correct  original accuracy  new correct  new accuracy
cs             4269              4261             100.0%         4258        100.0%
da             5484              5255              96.0%         5188         95.0%
de             9608              8495              88.0%         9020         94.0%
en             9630              8796              91.0%         9188         95.0%
es            10133              9721              96.0%         9943         98.0%
fi             2241              2238             100.0%         2236        100.0%
fr            10067              9719              97.0%         9906         98.0%
id            10184              9869              97.0%        10061         99.0%
it            10167              9844              97.0%         9960         98.0%
nl             9680              8449              87.0%         9399         97.0%
no            10505             10148              97.0%        10015         95.0%
pl             9886              9833              99.0%         9852        100.0%
pt             9456              8720              92.0%         9170         97.0%
ro             4057              3791              93.0%         3993         98.0%
sv             9932              9670              97.0%         9762         98.0%
tr            10309             10145              98.0%        10251         99.0%
vi            10932             10832              99.0%        10832         99.0%
total        146540            139786              95.4%       143034         97.6%

The Twitter corpus (training and test) used for generating the profiles is based on tweets collected via the ‘sample’ method of the Twitter Streaming API, all of which I annotated myself.
The corpus sizes are listed below.

language          training    test
cs : Czech            3514    4342
da : Danish           4007    5645
de : German          44115    9998
en : English         44335   10167
es : Spanish         44976   10296
fi : Finnish          3400    2310
fr : French          44279   10410
id : Indonesian      44912   10395
it : Italian         44183   10562
nl : Dutch           45033   10109
no : Norwegian        4272   10721
pl : Polish           5040   10282
pt : Portuguese      44505    9749
ro : Romanian         3490    4151
sv : Swedish         44330   10232
tr : Turkish         45024   10828
vi : Vietnamese       5029   11065

This corpus was originally built for a new short-text detector, but I confirmed that, used as profiles for langdetect, it gives higher performance on short message detection than the bundled profiles. So the new profiles are published too.
When using these profiles, the text to detect should be converted to lower case. langdetect tends to remove every all-upper-case word as an acronym, while Twitter-like short messages are often written entirely in upper case for emphasis.
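
The lower-casing step can be sketched in a couple of lines of Python. The helper name prepare_short_message is hypothetical and not part of langdetect; the detection itself would still be done by the library afterwards.

```python
def prepare_short_message(text):
    """Lower-case a message so that all-caps words are not dropped as acronyms."""
    return text.lower()


print(prepare_short_message("I LOVE THIS SONG"))
```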

Of course the existing profiles remain bundled as before. They have higher accuracy for news text and so on! :D

The prototype of the language detector for short texts is published at https://github.com/shuyo/ldig .
The name is short for “Language Detection with Infinity-Gram”.
I wrote a presentation about ldig, but so far it is in Japanese only. I’ll translate it into English later…

Posted in Language Detection, NLP

langdetect is updated(added profiles of Estonian / Lithuanian / Latvian / Slovene, and so on)

My language detection library “langdetect” was updated.

The added features are the following:

  • Added 4 language profiles: Estonian, Lithuanian, Latvian and Slovene.
  • Added getLangList() for retrieving the list of loaded language profiles
  • Added support for generating a language profile from plain text

and fixed some bugs.

I also published a test data set for 21 languages based on the Europarl Parallel Corpus ( http://www.statmt.org/europarl/ ) so that anyone can verify the library under the same conditions.

It consists of 1000 sentences (lines) randomly sampled for each language from the Europarl corpus.
Each line has the form “[language code]\t[plain text in UTF-8]” so that it can be used with the batch test tool of the langdetect library.
The Europarl corpus contains some very short sentences (e.g. only 2 words!) that langdetect is not very good at. I left them in for fairness! :D
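
As a rough illustration of this file format, the Python sketch below samples sentences per language and writes them as tab-separated lines. It is not the script that produced the published data set; the function names and the toy sentences are made up.

```python
import random


def sample_test_lines(sentences_by_lang, n, seed=0):
    """Sample up to n sentences per language and format them as test lines."""
    rng = random.Random(seed)
    lines = []
    for lang in sorted(sentences_by_lang):
        sentences = sentences_by_lang[lang]
        for text in rng.sample(sentences, min(n, len(sentences))):
            lines.append("%s\t%s" % (lang, text))
    return lines


def parse_test_line(line):
    """Split one test line back into (language code, text)."""
    lang, text = line.rstrip("\n").split("\t", 1)
    return lang, text


toy = {"en": ["Thank you.", "The sitting is closed."], "fr": ["Merci beaucoup."]}
for line in sample_test_lines(toy, n=1):
    print(parse_test_line(line))
```

Splitting on the first tab only keeps any tab characters inside the sentence itself intact.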

The following is the result with this test data.

bg (985/1000=0.99): {bg=985, ru=4, mk=11}
cs (993/1000=0.99): {sk=6, en=1, cs=993}
da (971/1000=0.97): {da=971, no=28, en=1}
de (998/1000=1.00): {de=998, da=1, af=1}
el (1000/1000=1.00): {el=1000}
en (997/1000=1.00): {fr=1, en=997, nl=1, af=1}
es (995/1000=1.00): {pt=4, en=1, es=995}
et (996/1000=1.00): {de=1, fi=2, et=996, af=1}
fi (998/1000=1.00): {fi=998, et=2}
fr (998/1000=1.00): {it=1, sv=1, fr=998}
hu (999/1000=1.00): {id=1, hu=999}
it (998/1000=1.00): {it=998, es=2}
lt (998/1000=1.00): {lv=2, lt=998}
lv (999/1000=1.00): {pt=1, lv=999}
nl (977/1000=0.98): {de=1, sv=1, nl=977, af=21}
pl (999/1000=1.00): {pl=999, nl=1}
pt (994/1000=0.99): {it=1, hu=1, pt=994, en=1, es=3}
ro (999/1000=1.00): {ro=999, fr=1}
sk (987/1000=0.99): {sl=2, sk=987, ro=1, lt=1, et=1, cs=8}
sl (972/1000=0.97): {hr=27, sl=972, en=1}
sv (990/1000=0.99): {da=2, no=8, sv=990}
total: 20843/21000 = 0.993

This was obtained with the batchtest tool.

java -jar lib/langdetect.jar -d profiles -s 0 --batchtest europarl.test

The random seed (-s) is set to 0 for reproducibility.

Since Danish (da), Dutch (nl) and Slovene (sl) are very similar to Norwegian (no), Afrikaans (af) and Croatian (hr) respectively, their detection accuracies are lower than the others.
(Norwegian, however, has some letters of its own which are not used in Dutch, so its accuracy stays higher.)
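
To pick out such confusions automatically, the result lines can be parsed with a few lines of Python. This is a sketch that assumes the line format shown above; the function names are made up and are not part of langdetect.

```python
import re

LINE = re.compile(r"^(\w+) \((\d+)/(\d+)=[\d.]+\): \{(.*)\}$")


def parse_result_line(line):
    """Parse 'xx (correct/total=acc): {lang=count, ...}' into its parts."""
    m = LINE.match(line.strip())
    lang, correct, total = m.group(1), int(m.group(2)), int(m.group(3))
    detected = {}
    for pair in m.group(4).split(", "):
        key, value = pair.split("=")
        detected[key] = int(value)
    return lang, correct, total, detected


def top_confusion(lang, detected):
    """Return the most frequent wrong answer for lang, or None if there is none."""
    wrong = {k: v for k, v in detected.items() if k != lang}
    return max(wrong, key=wrong.get) if wrong else None


rows = [
    "da (971/1000=0.97): {da=971, no=28, en=1}",
    "nl (977/1000=0.98): {de=1, sv=1, nl=977, af=21}",
    "el (1000/1000=1.00): {el=1000}",
]
for row in rows:
    lang, correct, total, detected = parse_result_line(row)
    print(lang, float(correct) / total, top_confusion(lang, detected))
```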

Posted in Uncategorized

twitter replaces the string ‘\u2028’ with ‘\u2070’

I posted a tweet about Unicode’s line feed codes, including the string ‘\u2028’.
Then it was replaced with ‘\u2070’!

Since not only ‘\u2028’ (LINE SEPARATOR) but also ‘\u2029’ (PARAGRAPH SEPARATOR) is replaced this way, Twitter seems to intend to do something (awful? :P ) with line feed codes.
But ‘\u2028’ here is merely a string of ASCII characters…
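
The difference between the escaped text and the real character can be checked in Python. This sketch only illustrates the distinction and says nothing about what Twitter does internally.

```python
import unicodedata

escaped = r"\u2028"   # six ASCII characters: backslash, u, 2, 0, 2, 8
actual = "\u2028"     # a single character, U+2028

print(len(escaped), all(ord(c) < 128 for c in escaped))
print(len(actual), unicodedata.name(actual))
print(unicodedata.name("\u2029"))
print(unicodedata.name("\u2070"))
```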

Posted in Uncategorized

Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (3)

In the previous article, I introduced a simple implementation of collapsed Gibbs sampling estimation for Latent Dirichlet Allocation (LDA).
However, because each word topic z_mn is initialized to a random topic in that implementation, there are some problems.

First, it needs many iterations before the perplexity begins to decrease.
Second, when the perplexity converges, almost all topics contain stopwords like ‘the’ and ‘of’ with high probabilities.
Moreover, some groups of topics share similar word distributions, so the number of substantially distinct topics is smaller than the topic size parameter K.

Instead of initializing randomly, draw each initial topic from the posterior, updating the counters incrementally in the same way as when sampling a new topic.

Furthermore, since n_zt is only ever used together with beta, n_mz with alpha, and n_z with V * beta, it is more efficient to add the corresponding parameters to these variables beforehand.

A sample initialization code is the following:

import numpy

# docs : documents, each of which is an array of word ids
# K : number of topics
# V : vocabulary size
# alpha, beta : Dirichlet hyperparameters (assumed to be defined beforehand)

z_m_n = [] # topics of words of documents
n_m_z = numpy.zeros((len(docs), K)) + alpha     # word count of each document and topic
n_z_t = numpy.zeros((K, V)) + beta # word count of each topic and vocabulary
n_z = numpy.zeros(K) + V * beta    # word count of each topic

for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        # draw from the posterior
        p_z = n_z_t[:, t] * n_m_z[m] / n_z
        z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))

and a sample inference code is the following.

for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # discount for n-th word t with topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1

        # sampling new topic
        p_z = n_z_t[:, t] * n_m_z[m] / n_z # only this line is changed
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        # preserve the new topic and increase the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1

(The inference code is the same as in the previous version except that the posterior calculation is simplified.)
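
The following small check, with made-up counts and hyperparameters, illustrates why folding alpha, beta and V * beta into the counters beforehand gives the same sampling weights as adding them at every draw.

```python
import numpy

rng = numpy.random.RandomState(0)
K, V, alpha, beta = 3, 5, 0.5, 0.1   # toy sizes and hyperparameters

# made-up raw counts for one document m and one word id t
n_m_z_raw = rng.randint(0, 10, size=K).astype(float)        # document-topic counts
n_z_t_raw = rng.randint(0, 10, size=(K, V)).astype(float)   # topic-word counts
n_z_raw = n_z_t_raw.sum(axis=1)                             # per-topic totals
t = 2

# previous version: smoothing applied at every draw
p_old = (n_z_t_raw[:, t] + beta) * (n_m_z_raw + alpha) / (n_z_raw + V * beta)

# this version: smoothing folded into the counters once, up front
n_m_z = n_m_z_raw + alpha
n_z_t = n_z_t_raw + beta
n_z = n_z_raw + V * beta
p_new = n_z_t[:, t] * n_m_z / n_z

print(numpy.allclose(p_old, p_new))
```

Because the later updates only add or subtract 1, the offsets stay in place for the whole run.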

Posted in LDA, Python

Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (2)

Before iterations of LDA estimation, it is necessary to initialize parameters.
Collapsed Gibbs Sampling (CGS) estimation has the following parameters.

  • z_mn : topic of word n of document m
  • n_mz : word count of document m with topic z
  • n_tz : count of word t with topic z
  • n_z : word count with topic z

The simplest initialization is to assign each word to a random topic and increase the corresponding counters n_mz, n_tz and n_z.

import numpy

# docs : documents, each of which is an array of word ids
# K : number of topics
# V : vocabulary size

z_m_n = [] # topics of words of documents
n_m_z = numpy.zeros((len(docs), K))     # word count of each document and topic
n_z_t = numpy.zeros((K, V)) # word count of each topic and vocabulary
n_z = numpy.zeros(K)        # word count of each topic

for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        z = numpy.random.randint(0, K)
        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))

Then the iterative inference with the full conditional from the previous article is the following. It is repeated until the perplexity becomes stable.

for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # discount for n-th word t with topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1

        # sampling new topic
        p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        # preserve the new topic and increase the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1

It uses numpy.random.multinomial() with the number of experiments set to 1 to draw from the multinomial distribution.
Since this method returns an array in which the k-th element is 1 and the rest are 0, argmax() retrieves the value of k. This is a little wasteful…
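
One possible alternative, shown here only as a sketch, is to draw the topic index directly with numpy.random.choice (NumPy 1.7 or later). The weights below are made up, and I have not measured whether this is actually faster inside the sampling loop.

```python
import numpy

numpy.random.seed(0)
p_z = numpy.array([0.2, 3.0, 0.8])   # made-up unnormalized posterior over 3 topics
p = p_z / p_z.sum()

# the draw used in the loop above: one-hot vector, then argmax
z_onehot = numpy.random.multinomial(1, p).argmax()

# alternative: draw the topic index directly
z_direct = numpy.random.choice(len(p), p=p)

print(z_onehot, z_direct)
```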

The next article will show another efficient initialization.

Posted in LDA, Machine Learning, Python

Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (1)

Latent Dirichlet Allocation (LDA) is a generative model which is used as a language topic model and so on.

Graphical model of LDA (this figure from Wikipedia)

Each random variable means the following

  • θ : document-topic distribution, a multinomial drawn from a Dirichlet distribution
  • φ : topic-word distribution, a multinomial drawn from a Dirichlet distribution
  • Z : word topic, drawn from the document-topic multinomial
  • W : word, drawn from the topic-word multinomial

There are some popular estimation methods for LDA, and collapsed Gibbs sampling (CGS) is one of them.
This method integrates out the random variables other than the word topics {z_mn} and draws each z_mn from its posterior.
The posterior of z_mn is the following:

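\[ p(z_{mn} = z \mid \mathbf{z}_{-mn}, \mathbf{w}) \propto \left( n_{mz}^{-mn} + \alpha \right) \frac{n_{tz}^{-mn} + \beta}{n_{z}^{-mn} + V\beta} \]

where t is the word at position n of document m, n_mz is the word count of document m with topic z, n_tz is the count of word t with topic z, n_z is the word count with topic z, V is the vocabulary size and -mn means “except z_mn.”
The estimation iterates until the perplexity converges or for a fixed number of iterations.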

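\[ \mathrm{perplexity}(\mathbf{w}) = \exp\left( - \frac{1}{\sum_{m} n_{m}} \sum_{m} \sum_{n} \log \sum_{z} \theta_{mz} \, \phi_{z w_{mn}} \right) \]

where

\[ \phi_{zt} = \frac{n_{tz} + \beta}{n_{z} + V\beta}, \qquad \theta_{mz} = \frac{n_{mz} + \alpha}{n_{m} + K\alpha} \]

and n_m is the word count of document m, V is the vocabulary size and K is the number of topics.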
Although perplexity usually decreases as learning progresses, my experiments showed somewhat different tendencies.
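
As a rough sketch, the perplexity could be computed from the counters as follows. It assumes the standard definition, that is, the exponential of the negative average log-likelihood per word with smoothed estimates of the topic-word and document-topic distributions. The function name, the argument layout (raw counts, documents as lists of word ids) and the toy data are my own choices for illustration.

```python
import numpy


def perplexity(docs, n_m_z, n_z_t, n_z, alpha, beta):
    """Perplexity of docs given raw (unsmoothed) topic counts."""
    K, V = n_z_t.shape
    # topic-word distributions, K x V
    phi = (n_z_t + beta) / (n_z[:, numpy.newaxis] + V * beta)
    log_likelihood = 0.0
    n_words = 0
    for m, doc in enumerate(docs):
        # document-topic distribution; n_m is taken as the sum of n_m_z[m]
        theta = (n_m_z[m] + alpha) / (n_m_z[m].sum() + K * alpha)
        for t in doc:
            log_likelihood += numpy.log(numpy.dot(theta, phi[:, t]))
        n_words += len(doc)
    return numpy.exp(-log_likelihood / n_words)


# toy check: with all counts at zero every word has probability 1 / V,
# so the perplexity should be close to V
docs = [[0, 1, 2], [2, 3]]
K, V = 2, 4
print(perplexity(docs, numpy.zeros((len(docs), K)), numpy.zeros((K, V)),
                 numpy.zeros(K), alpha=0.5, beta=0.5))
```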

Continued on the next post.

Posted in LDA, Machine Learning

Latent Dirichlet Allocation in Python

Latent Dirichlet Allocation (LDA) is a language topic model.

In LDA, each document has a topic distribution and each topic has a word distribution.
Each word is generated from the word distribution of a topic drawn from the document’s topic distribution.
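
This generative story can be sketched with numpy as follows. The sizes, hyperparameters and document lengths are toy values chosen only for illustration, and the code is not taken from lda.py.

```python
import numpy

rng = numpy.random.RandomState(0)
K, V, alpha, beta = 3, 8, 0.5, 0.5      # toy sizes and hyperparameters
doc_lengths = [5, 7]

phi = rng.dirichlet([beta] * V, size=K)            # K x V: word distribution of each topic
docs = []
for n_m in doc_lengths:
    theta = rng.dirichlet([alpha] * K)             # topic distribution of this document
    z = rng.choice(K, size=n_m, p=theta)           # topic of each word
    words = [int(rng.choice(V, p=phi[k])) for k in z]   # word drawn from its topic
    docs.append(words)

print(docs)
```

Estimation goes in the opposite direction: only the words are observed, and the topics and distributions have to be inferred.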

Although LDA was originally estimated with variational Bayes (Blei+ 2003), collapsed Gibbs sampling (CGS) is known to give a more precise estimation.
So I tried implementing the CGS estimation of LDA in Python.

It requires Python 2.6, numpy and NLTK.


$ ./lda.py -h
Usage: lda.py [options]

Options:
-h, --help show this help message and exit
-f FILENAME corpus filename
-c CORPUS using range of Brown corpus' files(start:end)
--alpha=ALPHA parameter alpha
--beta=BETA parameter beta
-k K number of topics
-i ITERATION iteration count
-s smart initialize of parameters
--stopwords exclude stop words
--seed=SEED random seed
--df=DF threshold of document freaquency to cut words

$ ./lda.py -c 0:20 -k 10 --alpha=0.5 --beta=0.5 -i 50

This command outputs the perplexity at every iteration and the estimated topic-word distribution (the top 20 words of each topic by probability).
I will explain the main points of this implementation in the next article.
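
Listing the top words of each topic could look like the sketch below. It is not the code of lda.py; the function name, the smoothing with beta and the toy counts are assumptions made for illustration.

```python
import numpy


def top_words(n_z_t, vocabulary, beta, n=20):
    """Return, for each topic, the n most probable (word, probability) pairs."""
    V = n_z_t.shape[1]
    phi = (n_z_t + beta) / (n_z_t.sum(axis=1, keepdims=True) + V * beta)
    result = []
    for k in range(n_z_t.shape[0]):
        order = numpy.argsort(-phi[k])[:n]     # word ids sorted by descending probability
        result.append([(vocabulary[t], phi[k, t]) for t in order])
    return result


counts = numpy.array([[5.0, 0.0, 1.0], [0.0, 4.0, 2.0]])   # made-up topic-word counts
for k, words in enumerate(top_words(counts, ["apple", "bank", "cat"], beta=0.5, n=2)):
    print(k, words)
```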

Posted in LDA, Machine Learning, NLP, Python, text analysis

Minimalist Program respects to Erlangen Program?

I’m learning linguistics, mainly syntax and generative grammar.
The Minimalist Program reminds me of Klein’s Erlangen Program.
Did Chomsky take it into account?

Posted in Linguistics