Latent Dirichlet Allocation in Python

Latent Dirichlet Allocation (LDA) is a topic model for text corpora.

In LDA, each document has a distribution over topics, and each topic has a distribution over words.
Each word of a document is generated by first drawing a topic from the document's topic distribution, then drawing the word from that topic's word distribution.
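
As a rough sketch of this generative story in numpy (toy sizes and illustrative names; this is the model itself, not the estimator):

import numpy as np

V, K, D, N = 1000, 10, 5, 100   # vocabulary size, topics, documents, words per document (toy values)
alpha, beta = 0.5, 0.5          # Dirichlet hyperparameters
rng = np.random.RandomState(0)

phi = rng.dirichlet([beta] * V, size=K)      # one word distribution per topic
for d in range(D):
    theta = rng.dirichlet([alpha] * K)       # this document's topic distribution
    for n in range(N):
        z = rng.choice(K, p=theta)           # draw a topic for this word slot
        w = rng.choice(V, p=phi[z])          # draw the word from that topic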

Although LDA's parameters were originally estimated with Variational Bayes (Blei et al. 2003), the Collapsed Gibbs Sampling (CGS) method is known to give more precise estimates.
So I tried implementing CGS estimation of LDA in Python.
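
The heart of CGS is resampling each word's topic from its full conditional, given all the other assignments. A minimal sketch of that update (count-array names are my own, not necessarily those in lda.py):

import numpy as np

def resample_topic(d, w, k_old, n_dk, n_kw, n_k, alpha, beta, rng):
    # Remove the word's current topic assignment from the counts.
    n_dk[d, k_old] -= 1
    n_kw[k_old, w] -= 1
    n_k[k_old] -= 1
    # Full conditional: p(z = k | rest) is proportional to
    # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
    V = n_kw.shape[1]
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    k_new = rng.choice(len(n_k), p=p / p.sum())
    # Add the new assignment back into the counts.
    n_dk[d, k_new] += 1
    n_kw[k_new, w] += 1
    n_k[k_new] += 1
    return k_new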

It requires Python 2.6, numpy and NLTK.


$ ./lda.py -h
Usage: lda.py [options]

Options:
  -h, --help      show this help message and exit
  -f FILENAME     corpus filename
  -c CORPUS       range of Brown corpus files to use (start:end)
  --alpha=ALPHA   parameter alpha
  --beta=BETA     parameter beta
  -k K            number of topics
  -i ITERATION    iteration count
  -s              smart initialization of parameters
  --stopwords     exclude stop words
  --seed=SEED     random seed
  --df=DF         threshold of document frequency to cut words

$ ./lda.py -c 0:20 -k 10 --alpha=0.5 --beta=0.5 -i 50

This command outputs the perplexity at every iteration and the estimated topic-word distribution (the top 20 words of each topic by probability).
I will explain the main points of this implementation in the next article.
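
For reference, topic-model perplexity is typically computed from the smoothed count matrices roughly as follows (a sketch with my own variable names; lda.py's internals may differ):

import numpy as np

def perplexity(docs, n_dk, n_kw, alpha, beta):
    K, V = n_kw.shape
    # Smoothed topic-word distributions.
    phi = (n_kw + beta) / (n_kw.sum(axis=1)[:, np.newaxis] + V * beta)
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        # Smoothed topic distribution of document d.
        theta = (n_dk[d] + alpha) / (n_dk[d].sum() + K * alpha)
        for w in doc:
            log_lik += np.log(np.dot(theta, phi[:, w]))
        n_words += len(doc)
    return np.exp(-log_lik / n_words)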


18 Responses to Latent Dirichlet Allocation in Python

  1. This could be very helpful for a project I'm working on. My question is: how can I test it?

  2. fukuball says:

    Do you have a sample dataset? I don't know the correct corpus file format. Thanks in advance for your reply.

  3. Thanks for this blog post, and the very useful code! I used this Python LDA in a couple of little projects and really enjoyed learning about LDA as well as using the outcomes! It was a little tough to learn how to choose initialization values, but overall this was very accessible! Also, Blei's talk on YouTube was helpful.

  4. Marc Maxson says:

    How did you choose alpha and beta?

    I found this guide:
    Appropriate values for ALPHA and BETA depend on the number of topics and the number of words in vocabulary. For most applications, good results can be obtained by setting ALPHA = 50 / T and BETA = 200 / W

    But alpha values are > 1.0 in this case. Is alpha restricted to the range 0 to 1?

    • shuyo says:

      Sorry for my late response.
      I suppose Dirichlet parameters greater than 1 have no advantage for topic models.
      I always choose parameters as small as possible, e.g. ALPHA = 0.01/T and so on.

  5. Say I need LDA to train on the Brown corpus and test on a sample file; how do I pass the parameters?

  6. fdgfd says:

    It doesn't work for datasets in other languages, such as Arabic.

  7. Mridul says:

    The code works really well. Thanks.
    This code gives me the topic distribution for the whole corpus. Supposing I have done that and now want to attribute topics to a certain (new) document, how do I do it? It's simple, I know, but have you written code for that as well?

    • shuyo says:

      It is troublesome to calculate the topic distribution of a new document with LDA.
      One way is to run Gibbs sampling partially, over the new document only.
      I have not written such code, as I have not needed it…
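
      One such fold-in would fix the trained topic-word counts (n_kw, n_k) and Gibbs-sample only the new document's assignments; a rough sketch with assumed variable names:

      import numpy as np

      def fold_in(doc, n_kw, n_k, alpha, beta, n_iter=20, rng=np.random):
          # Trained topic-word counts n_kw, n_k stay fixed; only this
          # document's topic assignments are resampled.
          K, V = n_kw.shape
          n_dk = np.zeros(K)
          z = [rng.randint(K) for _ in doc]        # random initialization
          for k in z:
              n_dk[k] += 1
          for _ in range(n_iter):
              for i, w in enumerate(doc):
                  n_dk[z[i]] -= 1
                  p = (n_dk + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                  z[i] = rng.choice(K, p=p / p.sum())
                  n_dk[z[i]] += 1
          return (n_dk + alpha) / (len(doc) + K * alpha)   # estimated theta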

    • Purvi Skumar says:

      Hello Sir,
      I tried the code but I'm facing a difficulty: it gives the following error
      error: need corpus filename(-f) or corpus range(-c)
      As I'm not very familiar with optparse, I'm not able to tackle the problem. Could you please provide a solution?
      Thank you.

  8. alex says:

    I have not run the code yet, but looking at it, it does not seem to output the document-topic distribution for each document; I only see the topic-word distribution. Is this not included?

  9. Hi,
    Do you know of any similar posts on LDA using variational Bayes inference?

    Thank you
