Latent Dirichlet Allocation (LDA) is a topic model for text.

In LDA, each document has a topic distribution and each topic has a word distribution.

Each word in a document is generated from a topic-word distribution, according to the topic drawn for that word from the document's topic distribution.
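The generative story above can be sketched as a toy simulation (all sizes and hyperparameter values here are made up for illustration):

```python
import numpy as np

# A toy sketch of LDA's generative process: K topics over a vocabulary of
# V words, with symmetric Dirichlet priors alpha (document-topic) and
# beta (topic-word). Sizes and values are made up for illustration.
rng = np.random.default_rng(0)
K, V, alpha, beta = 3, 10, 0.5, 0.5

phi = rng.dirichlet([beta] * V, size=K)   # each topic's word distribution
theta = rng.dirichlet([alpha] * K)        # one document's topic distribution

doc = []
for _ in range(20):                       # generate a 20-word document
    z = rng.choice(K, p=theta)            # draw a topic for this position
    w = rng.choice(V, p=phi[z])           # draw a word from that topic
    doc.append(w)
```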

Although the original LDA paper (Blei et al., 2003) uses Variational Bayes for estimation, the Collapsed Gibbs Sampling (CGS) method is known to give more accurate estimates.

So I tried implementing the CGS estimation of LDA in Python.

- https://github.com/shuyo/iir/blob/master/lda/lda.py
- https://github.com/shuyo/iir/blob/master/lda/vocabulary.py
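The collapsed Gibbs sampler at the heart of the linked code could be sketched roughly as follows (a simplified, self-contained version for illustration; the function name and defaults are my own, not necessarily those of lda.py):

```python
import numpy as np

# A simplified sketch of collapsed Gibbs sampling for LDA, for
# illustration only -- not the actual lda.py implementation. docs is a
# list of documents, each a list of word ids in a vocabulary of size V.
def train_lda_cgs(docs, K, V, alpha, beta, iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    n_m_z = np.zeros((len(docs), K))      # document-topic counts
    n_z_t = np.zeros((K, V))              # topic-word counts
    n_z = np.zeros(K)                     # total words assigned to each topic
    z_mn = []
    for m, doc in enumerate(docs):        # random initial topic assignments
        z_n = []
        for t in doc:
            z = rng.integers(K)
            z_n.append(z)
            n_m_z[m, z] += 1; n_z_t[z, t] += 1; n_z[z] += 1
        z_mn.append(z_n)
    for _ in range(iterations):
        for m, doc in enumerate(docs):
            for n, t in enumerate(doc):
                z = z_mn[m][n]            # remove the current assignment
                n_m_z[m, z] -= 1; n_z_t[z, t] -= 1; n_z[z] -= 1
                # collapsed full conditional for the new topic of word t
                p = (n_z_t[:, t] + beta) / (n_z + V * beta) * (n_m_z[m] + alpha)
                z = rng.choice(K, p=p / p.sum())
                z_mn[m][n] = z            # record the new assignment
                n_m_z[m, z] += 1; n_z_t[z, t] += 1; n_z[z] += 1
    phi = (n_z_t + beta) / (n_z[:, np.newaxis] + V * beta)  # topic-word dists
    return phi, n_m_z
```

Each sweep resamples every word's topic from its collapsed full conditional; the parameters theta and phi are integrated out, which is what makes the sampler "collapsed".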

It requires Python 2.6, numpy and NLTK.

```
$ ./lda.py -h
Usage: lda.py [options]

Options:
  -h, --help     show this help message and exit
  -f FILENAME    corpus filename
  -c CORPUS      range of Brown corpus files to use (start:end)
  --alpha=ALPHA  parameter alpha
  --beta=BETA    parameter beta
  -k K           number of topics
  -i ITERATION   iteration count
  -s             smart initialization of parameters
  --stopwords    exclude stop words
  --seed=SEED    random seed
  --df=DF        document frequency threshold for cutting words
```

`$ ./lda.py -c 0:20 -k 10 --alpha=0.5 --beta=0.5 -i 50`

This command outputs the perplexity at every iteration and the estimated topic-word distributions (the top 20 words per topic by probability).
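For reference, perplexity can be computed from the estimated distributions like this (a generic sketch; not necessarily how lda.py computes it internally):

```python
import numpy as np

# Generic LDA perplexity sketch: theta is a (documents x topics) matrix of
# document-topic distributions, phi a (topics x vocabulary) matrix of
# topic-word distributions. Lower perplexity means a better fit.
def perplexity(docs, theta, phi):
    log_lik, n_words = 0.0, 0
    for m, doc in enumerate(docs):
        for t in doc:
            log_lik += np.log(theta[m] @ phi[:, t])  # p(word | document)
        n_words += len(doc)
    return np.exp(-log_lik / n_words)
```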

I will explain the main point of this implementation in the next article.

This could be very helpful for a project that I'm working on. My question is: how can I test it?

You can check it out and try it. Thanks.

Do you have a sample dataset? I don't know the correct corpus file format. Thanks in advance for your reply.

With the -c option, lda.py uses the Brown corpus from NLTK.

If you have a corpus you want to use, pass it with the -f option.

The format is one document per line, with words separated by spaces.
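For illustration, a corpus file in that format might look like this (the filename and contents are made up):

```python
# Hypothetical example of the corpus format for the -f option:
# one document per line, words separated by spaces.
sample = "the cat sat on the mat\nthe dog barked at the cat\n"
with open("sample_corpus.txt", "w") as f:
    f.write(sample)

# Reading it back: each line becomes one document (a list of words).
with open("sample_corpus.txt") as f:
    docs = [line.split() for line in f if line.strip()]
```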

Oh, I see! Thanks for your reply.

Thanks for this blog post, and the very useful code! I used this Python LDA in a couple of little projects and really enjoyed learning about LDA as well as using the outcomes! It was a little tough to learn how to choose initialization values, but overall this was very accessible! Also, Blei's talk on YouTube was helpful.

How did you choose alpha and beta?

I found this guide:

Appropriate values for ALPHA and BETA depend on the number of topics and the number of words in vocabulary. For most applications, good results can be obtained by setting ALPHA = 50 / T and BETA = 200 / W

But then alpha would be greater than 1.0 in this case; is alpha restricted to the range 0 to 1?

Sorry for my late response.

I suppose Dirichlet parameters greater than 1 have no advantage for topic models.

I always choose parameters as small as possible, e.g. ALPHA=0.01/T and so on.

Say I want LDA to train on the Brown corpus and test on a sample file; how do I pass the parameters?

The -h option shows how to pass the parameters.

It doesn't work for data sets in other languages, such as Arabic.

The code works really well. Thanks.

This code gives me the topic distribution for the whole corpus. Supposing that I have done that, and now I want to attribute topics to a certain (new) document, how do I do it? It's simple, I know, but have you written the code for that as well?

It is troublesome to calculate the topic distribution of a new document with LDA.

One approach is to run Gibbs sampling partially, over the new document only.

I have not written such code, as I have not needed it…
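The partial Gibbs sampling idea could be sketched like this: keep the trained topic-word counts fixed and resample only the new document's topic assignments (a hypothetical fold-in routine, not code from lda.py; the count-array names mirror common LDA implementations):

```python
import numpy as np

def fold_in(n_z_t, n_z, doc, alpha, beta, iterations=20, rng=None):
    """Estimate a new document's topic distribution by Gibbs-sampling its
    topic assignments while keeping the trained topic-word counts fixed.
    n_z_t: (K, V) topic-word counts from training; n_z: (K,) topic totals;
    doc: list of word ids already present in the training vocabulary."""
    if rng is None:
        rng = np.random.default_rng()
    K, V = n_z_t.shape
    n_z_doc = np.zeros(K)                 # topic counts inside the new doc
    z_doc = []
    for w in doc:                         # random initial assignments
        z = rng.integers(K)
        z_doc.append(z)
        n_z_doc[z] += 1
    for _ in range(iterations):
        for i, w in enumerate(doc):
            z = z_doc[i]
            n_z_doc[z] -= 1               # remove the current assignment
            # fixed topic-word part times the doc's own topic counts
            p = (n_z_t[:, w] + beta) / (n_z + V * beta) * (n_z_doc + alpha)
            z = rng.choice(K, p=p / p.sum())
            z_doc[i] = z
            n_z_doc[z] += 1
    return (n_z_doc + alpha) / (len(doc) + K * alpha)
```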

Hello Sir,

I did try the code, but I'm facing a difficulty: it gives the following error.

`error: need corpus filename(-f) or corpus range(-c)`

I tried, but as I'm not very familiar with optparse, I'm not able to tackle the problem. Can you please provide a solution for this?

Thank you.

Hello,

My script outputs its help message with the -h option.

I hope it will be helpful to you.

I have not run the code yet, but from looking at it, it does not seem to output the document-topic distribution for each document. I only see the topic-word distribution. Is this not included?

You can obtain it with `lda.n_m_z / lda.n_m_z.sum(axis=1)[:,numpy.newaxis]` if you want.
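To see what that expression does, here is the same broadcasting trick on a made-up count matrix (n_m_z here is hypothetical data, not output from lda.py):

```python
import numpy as np

# n_m_z: hypothetical (documents x topics) count matrix. Dividing each row
# by its sum yields one topic distribution per document.
n_m_z = np.array([[3.0, 1.0, 0.0],
                  [0.0, 2.0, 2.0]])
theta = n_m_z / n_m_z.sum(axis=1)[:, np.newaxis]
```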

Hi,

Do you know of any similar posts for LDA using Variational Bayes inference?

Thank you