Latent Dirichlet Allocation in Python

Latent Dirichlet Allocation (LDA) is a probabilistic topic model for text.

In LDA, each document has a distribution over topics and each topic has a distribution over words.
Each word in a document is generated by first drawing a topic from the document's topic distribution and then drawing the word from that topic's word distribution.
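
To make this generative story concrete, here is a minimal sketch of it (toy sizes and hyperparameters chosen only for illustration; this is not the code in lda.py):

import numpy as np

K, V, doc_len = 3, 8, 10                  # topics, vocabulary size, document length (toy values)
alpha, beta = 0.5, 0.5
rng = np.random.RandomState(0)

phi = rng.dirichlet([beta] * V, size=K)   # one word distribution per topic
theta = rng.dirichlet([alpha] * K)        # topic distribution of one document

doc = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)            # draw a topic for this word position
    doc.append(rng.choice(V, p=phi[z]))   # draw a word id from that topic
print(doc)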

While LDA's parameters were originally estimated with Variational Bayes (Blei+ 2003), collapsed Gibbs sampling (CGS) is known to be a more precise estimation method.
So I tried implementing CGS estimation for LDA in Python.
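
The core of CGS is the per-word resampling step: integrate out the topic and word distributions, subtract a word's current topic assignment from the count tables, then redraw its topic with probability proportional to (n_kw + beta) / (n_k + V * beta) * (n_dk + alpha). Here is a simplified sketch of one sweep (variable names are mine and need not match those in lda.py):

import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    # docs[d] is a list of word ids; z[d][i] is the current topic of
    # word i in document d; n_dk, n_kw, n_k are the count tables.
    K, V = n_kw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] -= 1               # remove the current assignment
            n_kw[k, w] -= 1
            n_k[k] -= 1
            # full conditional of this word's topic
            p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k                   # add the new assignment back
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1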

It requires Python 2.6, numpy and NLTK.


$ ./lda.py -h
Usage: lda.py [options]

Options:
-h, --help show this help message and exit
-f FILENAME corpus filename
-c CORPUS using range of Brown corpus' files (start:end)
--alpha=ALPHA parameter alpha
--beta=BETA parameter beta
-k K number of topics
-i ITERATION iteration count
-s smart initialization of parameters
--stopwords exclude stop words
--seed=SEED random seed
--df=DF threshold of document frequency to cut words

$ ./lda.py -c 0:20 -k 10 --alpha=0.5 --beta=0.5 -i 50

This command outputs the perplexity at every iteration and the estimated topic-word distributions (the top 20 words by probability).
I will explain the main points of this implementation in the next article.

19 thoughts on “Latent Dirichlet Allocation in Python”

    1. lda.py uses the Brown corpus in NLTK when given the -c option.
      If you have your own corpus, you can use it with the -f option.
      Its format is one document per line, with words separated by spaces (see the example below this thread).

      1. What are the -c and -f options? I'm sorry, I'm a Python beginner and I ran into the same problem. Can you tell me how to use the Brown corpus in NLTK with the -c option, or how to use the -f option?
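
      To illustrate the -f format described above, a hypothetical corpus.txt (contents invented for this example) would hold one document per line with space-separated words:

      the cat sat on the mat
      dogs and cats are popular pets

      and could then be run with, e.g., ./lda.py -f corpus.txt -k 2 -i 20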

  1. Thanks for this blog post, and the very useful code! I used this Python LDA in a couple of little projects and really enjoyed learning about LDA as well as using the outcomes! It was a little tough to learn how to choose initialization values, but overall this was very accessible! Also, Blei's talk on YouTube was helpful.

  2. How did you choose alpha and beta?

    I found this guide:
    Appropriate values for ALPHA and BETA depend on the number of topics and the number of words in vocabulary. For most applications, good results can be obtained by setting ALPHA = 50 / T and BETA = 200 / W

    but the alpha value is > 1.0 in that case. Is alpha restricted to the range 0 to 1?

    1. Sorry for my late response.
      I suppose Dirichlet parameters greater than 1 have no advantage for topic models.
      I always choose parameters as small as possible, e.g. ALPHA=0.01/T and so on (see the arithmetic below).
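
      To see why the questioner's alpha exceeded 1, here is the arithmetic with hypothetical sizes (T = 10 topics, W = 5000 vocabulary words; the numbers are for illustration only):

      T, W = 10, 5000
      print(50.0 / T)     # the guide's heuristic: ALPHA = 5.0, which is > 1
      print(200.0 / W)    # the guide's heuristic: BETA = 0.04
      print(0.01 / T)     # the small-value style above: ALPHA = 0.001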

  3. The code works really well. Thanks.
    This code gives me the topic distribution for the whole corpus. Suppose I have done that and now want to attribute topics to a certain new document; how do I do it? It's simple, I know, but have you written code for that as well?

    1. It is troublesome to calculate the topic distribution of a new document with LDA.
      One way is to run Gibbs sampling partially, over the new document only.
      I have not written such code, as I have not needed it… (a sketch follows below)
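
      As a sketch of that partial ("fold-in") Gibbs sampling idea: keep the trained topic-word counts fixed and resample only the new document's assignments. This is an illustration under those assumptions, not code from lda.py:

      import numpy as np

      def fold_in(new_doc, n_kw, n_k, alpha, beta, iters=50, rng=np.random):
          # n_kw, n_k are topic-word and topic counts from training, held fixed
          K, V = n_kw.shape
          z = rng.randint(K, size=len(new_doc))          # random initial topics
          n_dk = np.bincount(z, minlength=K).astype(float)
          for _ in range(iters):
              for i, w in enumerate(new_doc):
                  n_dk[z[i]] -= 1
                  p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk + alpha)
                  z[i] = rng.choice(K, p=p / p.sum())
                  n_dk[z[i]] += 1
          return (n_dk + alpha) / (len(new_doc) + K * alpha)   # estimated theta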

    2. Hello Sir,
      I tried the code, but I'm facing a difficulty, as it gives the following error:
      error: need corpus filename(-f) or corpus range(-c)
      I tried to fix it, but as I'm not very familiar with optparse, I'm not able to tackle the problem. Could you please provide a solution?
      Thank you.

  4. I have not run the code yet, but looking at it, it does not seem to output the document-topic distribution for each document; I only see the topic-word distribution. Is this not included?
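
    With collapsed Gibbs sampling, the document-topic distributions can be recovered from the sampler's document-topic count table when it finishes. A sketch, assuming a count matrix n_dk like the one maintained during sampling (the name is illustrative):

    import numpy as np

    def doc_topic_dist(n_dk, alpha):
        # posterior mean: theta[d, k] = (n_dk[d, k] + alpha) / (N_d + K * alpha)
        K = n_dk.shape[1]
        return (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)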
