Python implementation of Labeled LDA (Ramage+ EMNLP2009)

Labeled LDA (D. Ramage, D. Hall, R. Nallapati and C.D. Manning; EMNLP2009) is a supervised topic model derived from LDA (Blei+ 2003).
While LDA's estimated topics often don't match human expectations because it is unsupervised, Labeled LDA can handle documents with multiple labels.

I implemented Labeled LDA in Python. The code is here:

llda.py is an LLDA estimator for arbitrary text. Its input file contains one document per line.
Each document can be given one or more labels in brackets at the head of the line, like the following:

[label1,label2,...]some text

Documents without labels are also accepted, so the model works in a semi-supervised way.
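The input format above can be parsed with a few lines of Python. This is only a minimal sketch of how such a line might be handled; the function name parse_line is my own and is not part of llda.py:

```python
import re

def parse_line(line):
    """Split an input line into (labels, text).

    A line may start with a bracketed label list, e.g.
    "[label1,label2]some text"; a line with no leading brackets
    is treated as an unlabeled document (the semi-supervised case).
    """
    m = re.match(r"\[(.+?)\](.*)", line)
    if m:
        labels = [x.strip() for x in m.group(1).split(",")]
        return labels, m.group(2)
    return [], line  # no labels given

labels, text = parse_line("[sports,baseball]The pitcher threw a fastball.")
```
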

llda_nltk.py is a sample using llda.py. It estimates an LLDA model with 100 documents sampled from the Reuters corpus in NLTK and outputs the top 20 words for each label.

(Ramage+ EMNLP2009) doesn't give LLDA's perplexity, so I derived the document-topic and topic-word distributions that it requires:

θ_dk = (n_dk + α) / (Σ_{k'∈λ(d)} n_dk' + M_d α)  for k ∈ λ(d), and θ_dk = 0 otherwise
φ_kw = (n_kw + β) / (Σ_{w'} n_kw' + V β)

where λ(d) means the topic set corresponding to the labels of the document d, M_d is the size of that set, n_dk is the count of words in document d assigned to topic k, n_kw is the count of word w assigned to topic k, V is the vocabulary size, and α, β are the Dirichlet hyperparameters.
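Given such θ and φ, per-word perplexity can be computed in the usual way. The following is a minimal sketch; the function and variable names are my own, not from llda.py:

```python
import math

def perplexity(docs, theta, phi):
    """Per-word perplexity: exp(-(1/N) * sum over every word token w in
    each document d of log sum_k theta[d][k] * phi[k][w]).
    docs is a list of documents, each a list of word ids."""
    log_likelihood = 0.0
    n_words = 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p)
            n_words += 1
    return math.exp(-log_likelihood / n_words)
```

For example, with two topics and uniform θ and φ over a two-word vocabulary, every token has probability 0.5 and the perplexity is exactly 2.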

I'd be glad if you point out any mistakes in these.

LLDA requires that labels are assigned to topics explicitly and exactly. But it is very difficult to know how many topics to assign to each label for better estimation.
Moreover, it is natural that some categories share common topics (e.g. the "baseball" category and the "soccer" category).

DP-MRM (Kim+ ICML2012), which I introduced on this blog, is a model that extends LLDA to estimate the label-topic correspondence:

https://shuyo.wordpress.com/2012/07/31/kim-icml12-dirichlet-process-with-mixed-random-measures/

Though I want to implement it too, it is very complex…

16 thoughts on “Python implementation of Labeled LDA (Ramage+ EMNLP2009)”

    1. Hi,
      I wrote it in this article.

      > It estimates LLDA model with 100 sampled documents from Reuters corpus in NLTK and outputs top 20 words for each label.

      1. Sorry for the wrong phrasing of the question. I meant how do we use the learned model to predict the category/label of any random article? Thanks in advance.

      2. I see.
        It is difficult to predict the topics of unseen documents with topic models like LDA and LLDA; it requires estimating them with Gibbs sampling or something similar.
        I might implement it if I wanted to use LLDA in some application, but I haven't, so this code is for experiments.
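        One common way to do this, sketched here purely as an assumption (the function fold_in and its parameters are my own, not part of llda.py), is "folding in": keep the trained topic-word distribution φ fixed and run Gibbs sampling only over the new document's topic assignments to estimate its θ:

        ```python
        import random

        def fold_in(doc, phi, alpha, n_iter=100, seed=0):
            """Estimate a topic distribution theta for one unseen document by
            Gibbs sampling with the trained topic-word distribution phi fixed.
            doc is a list of word ids; phi[k][w] is the trained p(w|k)."""
            rng = random.Random(seed)
            K = len(phi)
            z = [rng.randrange(K) for _ in doc]  # topic assignment per token
            n_k = [0] * K                        # topic counts in this document
            for t in z:
                n_k[t] += 1
            for _ in range(n_iter):
                for i, w in enumerate(doc):
                    n_k[z[i]] -= 1               # remove the current assignment
                    weights = [(n_k[k] + alpha) * phi[k][w] for k in range(K)]
                    r = rng.random() * sum(weights)
                    for k, wt in enumerate(weights):  # sample a new topic
                        r -= wt
                        if r <= 0:
                            break
                    z[i] = k
                    n_k[k] += 1
            denom = len(doc) + K * alpha
            return [(n_k[k] + alpha) / denom for k in range(K)]
        ```

        The label whose topic gets the largest θ component would then be the predicted category.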

  1. Hi Shuyo,

    Many thanks for this review of LLDA. I'm trying to use your approach to categorize topics from my corpus into 45 categories. Is it possible to have documents that correspond to multiple labels? E.g. my categories are more like phrases ('Generalization of economic development'), so maybe 'Generalization', 'economic', and 'development' would be the class labels for a specific document. How can I categorize my documents under such a label?

  2. Hi shuyo,

    I am working with a dataset where some of the documents are labeled and some are not. Can the corpus for LLDA be such a dataset, i.e., a mixture of labeled and unlabeled documents?

  3. Hi Shuyo,

    Thank you for the generous sharing, for both the explanation and the code. I was directed to your GitHub code from various sources, but I found it would be more user-friendly and beneficial to the community if you could put a README alongside your handy implementation. Do you think you would be able to do that?

    Great thanks!

    Yo

  4. How can we apply this code on our own dataset? Also, how should we prepare the dataset, how do we assign labels to the training set?
