Latent Dirichlet Allocation in Python

Latent Dirichlet Allocation (LDA) is a probabilistic topic model for text.

In LDA, each document has a distribution over topics and each topic has a distribution over words.
Each word in a document is generated by first drawing a topic from the document's topic distribution and then drawing the word from that topic's word distribution.
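
To make this generative story concrete, here is a minimal sketch of it (toy sizes and hyperparameters chosen only for illustration; this is not the code in lda.py):

import numpy as np

K, V, doc_len = 3, 8, 10                  # topics, vocabulary size, document length (toy values)
alpha, beta = 0.5, 0.5
rng = np.random.RandomState(0)

phi = rng.dirichlet([beta] * V, size=K)   # one word distribution per topic
theta = rng.dirichlet([alpha] * K)        # topic distribution of one document

doc = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)            # draw a topic for this word position
    doc.append(rng.choice(V, p=phi[z]))   # draw a word id from that topic
print(doc)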

While LDA's parameters were originally estimated with Variational Bayes (Blei+ 2003), collapsed Gibbs sampling (CGS) is known to be a more precise estimation method.
So I tried implementing CGS estimation for LDA in Python.
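
The core of CGS is the per-word resampling step: integrate out the topic and word distributions, subtract a word's current topic assignment from the count tables, then redraw its topic with probability proportional to (n_kw + beta) / (n_k + V * beta) * (n_dk + alpha). Here is a simplified sketch of one sweep (variable names are mine and need not match those in lda.py):

import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    # docs[d] is a list of word ids; z[d][i] is the current topic of
    # word i in document d; n_dk, n_kw, n_k are the count tables.
    K, V = n_kw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] -= 1               # remove the current assignment
            n_kw[k, w] -= 1
            n_k[k] -= 1
            # full conditional of this word's topic
            p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k                   # add the new assignment back
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1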

It requires Python 2.6, numpy and NLTK.


$ ./lda.py -h
Usage: lda.py [options]

Options:
-h, --help show this help message and exit
-f FILENAME corpus filename
-c CORPUS using range of Brown corpus' files (start:end)
--alpha=ALPHA parameter alpha
--beta=BETA parameter beta
-k K number of topics
-i ITERATION iteration count
-s smart initialization of parameters
--stopwords exclude stop words
--seed=SEED random seed
--df=DF threshold of document frequency to cut words

$ ./lda.py -c 0:20 -k 10 --alpha=0.5 --beta=0.5 -i 50

This command outputs the perplexity at every iteration and the estimated topic-word distributions (the top 20 words by probability).
I will explain the main points of this implementation in the next article.

19 thoughts on “Latent Dirichlet Allocation in Python”

    1. lda.py uses the Brown corpus in NLTK when given the -c option.
      If you have your own corpus, you can use it with the -f option.
      Its format is one document per line, with words separated by spaces (see the example below this thread).

      1. What are the -c and -f options? I'm sorry, I'm a Python beginner and I ran into the same problem. Can you tell me how to use the Brown corpus in NLTK with the -c option, or how to use the -f option?
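
      To illustrate the -f format described above, a hypothetical corpus.txt (contents invented for this example) would hold one document per line with space-separated words:

      the cat sat on the mat
      dogs and cats are popular pets

      and could then be run with, e.g., ./lda.py -f corpus.txt -k 2 -i 20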

  1. Thanks for this blog post, and the very useful code! I used this Python LDA in a couple of little projects and really enjoyed learning about LDA as well as using the outcomes! It was a little tough to learn how to choose initialization values, but overall this was very accessible! Also, Blei's talk on YouTube was helpful.

  2. How did you choose alpha and beta?

    I found this guide:
    Appropriate values for ALPHA and BETA depend on the number of topics and the number of words in vocabulary. For most applications, good results can be obtained by setting ALPHA = 50 / T and BETA = 200 / W

    but the alpha value is > 1.0 in that case. Is alpha restricted to the range 0 to 1?

    1. Sorry for my late response.
      I suppose Dirichlet parameters greater than 1 have no advantage for topic models.
      I always choose parameters as small as possible, e.g. ALPHA=0.01/T and so on (see the arithmetic below).
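
      To see why the questioner's alpha exceeded 1, here is the arithmetic with hypothetical sizes (T = 10 topics, W = 5000 vocabulary words; the numbers are for illustration only):

      T, W = 10, 5000
      print(50.0 / T)     # the guide's heuristic: ALPHA = 5.0, which is > 1
      print(200.0 / W)    # the guide's heuristic: BETA = 0.04
      print(0.01 / T)     # the small-value style above: ALPHA = 0.001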

  3. The code works really well. Thanks.
    This code gives me the topic distribution for the whole corpus. Suppose I have done that and now want to attribute topics to a certain new document; how do I do it? It's simple, I know, but have you written code for that as well?

    1. It is troublesome to calculate the topic distribution of a new document with LDA.
      One way is to run Gibbs sampling partially, over the new document only.
      I have not written such code, as I have not needed it… (a sketch follows below)
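
      As a sketch of that partial ("fold-in") Gibbs sampling idea: keep the trained topic-word counts fixed and resample only the new document's assignments. This is an illustration under those assumptions, not code from lda.py:

      import numpy as np

      def fold_in(new_doc, n_kw, n_k, alpha, beta, iters=50, rng=np.random):
          # n_kw, n_k are topic-word and topic counts from training, held fixed
          K, V = n_kw.shape
          z = rng.randint(K, size=len(new_doc))          # random initial topics
          n_dk = np.bincount(z, minlength=K).astype(float)
          for _ in range(iters):
              for i, w in enumerate(new_doc):
                  n_dk[z[i]] -= 1
                  p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk + alpha)
                  z[i] = rng.choice(K, p=p / p.sum())
                  n_dk[z[i]] += 1
          return (n_dk + alpha) / (len(new_doc) + K * alpha)   # estimated theta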

    2. Hello Sir,
      I tried the code, but I'm facing a difficulty, as it gives the following error:
      error: need corpus filename(-f) or corpus range(-c)
      I tried to fix it, but as I'm not very familiar with optparse, I'm not able to tackle the problem. Could you please provide a solution?
      Thank you.

  4. I have not run the code yet, but looking at it, it does not seem to output the document-topic distribution for each document; I only see the topic-word distribution. Is this not included?
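
    With collapsed Gibbs sampling, the document-topic distributions can be recovered from the sampler's document-topic count table when it finishes. A sketch, assuming a count matrix n_dk like the one maintained during sampling (the name is illustrative):

    import numpy as np

    def doc_topic_dist(n_dk, alpha):
        # posterior mean: theta[d, k] = (n_dk[d, k] + alpha) / (N_d + K * alpha)
        K = n_dk.shape[1]
        return (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)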
