Python implementation of Labeled LDA (Ramage+ EMNLP2009)

Labeled LDA (D. Ramage, D. Hall, R. Nallapati and C.D. Manning; EMNLP 2009) is a supervised topic model derived from LDA (Blei+ 2003).
Because LDA is unsupervised, its estimated topics often don't match human expectations; Labeled LDA addresses this by treating documents annotated with multiple labels.

I implemented Labeled LDA in Python. The code is here:

llda.py estimates LLDA for arbitrary text. Its input file consists of one document per line.
Each document can be given one or more labels in brackets at the head of the line, like the following:

[label1,label2,...]some text

If some documents are given without labels, the model works in a semi-supervised manner.
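The input format above can be parsed in a few lines of Python. This is just a sketch of the format as described; llda.py's actual parsing code may differ:

```python
import re

def parse_document(line):
    """Parse one input line: an optional [label1,label2,...] prefix
    followed by the document text. Unlabeled lines get an empty label
    list, which corresponds to the semi-supervised case."""
    m = re.match(r"\[(.+?)\](.*)", line)
    if m:
        labels = m.group(1).split(",")
        text = m.group(2)
    else:
        labels = []   # no bracket prefix: unlabeled document
        text = line
    return labels, text.split()

labels, words = parse_document("[label1,label2]some text")
print(labels)  # ['label1', 'label2']
print(words)   # ['some', 'text']
```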

llda_nltk.py is a sample that uses llda.py. It estimates an LLDA model on 100 documents sampled from the Reuters corpus in NLTK and outputs the top 20 words for each label.

Since (Ramage+ EMNLP2009) doesn't give LLDA's perplexity, I derived the document-topic and topic-word distributions that it requires:

θ_{dk} = (n_{dk} + α) / (Σ_{k'∈λ(d)} n_{dk'} + M_d α)   if k ∈ λ(d), otherwise 0

φ_{kv} = (n_{kv} + β) / (Σ_{v'} n_{kv'} + V β)

where λ(d) is the set of topics corresponding to the labels of document d, M_d is the size of that set, n_{dk} and n_{kv} are the topic-count matrices from Gibbs sampling, and V is the vocabulary size.
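As a small sketch of how these estimators turn Gibbs-sampling counts into distributions and a perplexity (the variable names and toy counts below are my own, not llda.py's):

```python
import numpy as np

alpha, beta = 0.5, 0.01

# Hypothetical Gibbs-sampling counts:
# n_kv[k, v] = count of vocabulary word v assigned to topic k; V = 3 here.
n_kv = np.array([[3.0, 1.0, 0.0],
                 [0.0, 2.0, 4.0]])
V = n_kv.shape[1]

def theta_d(n_dk, lam):
    """Document-topic distribution restricted to lambda(d); M_d = len(lam)."""
    theta = np.zeros_like(n_dk)
    theta[lam] = (n_dk[lam] + alpha) / (n_dk[lam].sum() + len(lam) * alpha)
    return theta

def phi():
    """Smoothed topic-word distributions, one row per topic."""
    return (n_kv + beta) / (n_kv.sum(axis=1, keepdims=True) + V * beta)

# A document whose label set lambda(d) covers both topics:
th = theta_d(np.array([5.0, 2.0]), lam=[0, 1])
ph = phi()

# Per-word perplexity over the document's word ids:
words = [0, 1, 2]
log_lik = sum(np.log(th @ ph[:, w]) for w in words)
perplexity = np.exp(-log_lik / len(words))
```

Both distributions sum to one by construction, and topics outside λ(d) get probability zero, which is exactly the restriction LLDA imposes.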

I'd be glad if you pointed out any mistakes in these.

LLDA requires labels to be assigned to topics explicitly and exactly. But it is very difficult to know how many topics should be assigned to each label for better estimation.
Moreover, it is natural for some categories to share common topics (e.g. a "baseball" category and a "soccer" category).

DP-MRM (Kim+ ICML2012), which I introduced on this blog, is a model that extends LLDA to estimate the label-topic correspondence:

https://shuyo.wordpress.com/2012/07/31/kim-icml12-dirichlet-process-with-mixed-random-measures/

I'd like to implement it too, but it is very complex…


6 Responses to Python implementation of Labeled LDA (Ramage+ EMNLP2009)

  1. Ciddhi says:

    How is the llda tested ?

    • shuyo says:

      Hi,
I wrote about it in this article.

      > It estimates LLDA model with 100 sampled documents from Reuters corpus in NLTK and outputs top 20 words for each label.

      • Ciddhi says:

        Sorry for the wrong phrasing of the question. I meant how do we use the learned model to predict the category/label of any random article? Thanks in advance.

      • shuyo says:

        I see.
Predicting the topics of unseen documents is not straightforward with topic models like LDA and LLDA: you need to estimate them with Gibbs sampling or a similar inference procedure.
I might implement that if I wanted to use LLDA in some application, but I haven't, since this code is for experiments.
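Such held-out ("fold-in") inference can be sketched as follows, keeping the trained topic-word distributions fixed and resampling only the new document's topic assignments. The names here are illustrative and not from llda.py:

```python
import numpy as np

rng = np.random.default_rng(0)

def infer_theta(doc, phi, alpha=0.5, n_iter=200):
    """Fold-in Gibbs sampling for a held-out document: the trained
    topic-word distributions phi stay fixed; only this document's
    topic assignments z are resampled."""
    n_topics = phi.shape[0]
    z = rng.integers(n_topics, size=len(doc))
    n_dk = np.bincount(z, minlength=n_topics).astype(float)
    for _ in range(n_iter):
        for i, w in enumerate(doc):
            n_dk[z[i]] -= 1
            p = (n_dk + alpha) * phi[:, w]   # conditional p(z_i = k | rest)
            z[i] = rng.choice(n_topics, p=p / p.sum())
            n_dk[z[i]] += 1
    return (n_dk + alpha) / (n_dk.sum() + n_topics * alpha)

# Toy trained phi: two topics over a 4-word vocabulary.
phi = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.1, 0.1, 0.7]])
theta = infer_theta([0, 0, 3], phi)
```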

  2. teejay says:

    Hi Shuyo,

Much thanks for this review of LLDA. I'm trying to use your approach to categorize topics from my corpus into 45 categories. Is it possible to have documents that correspond to multiple labels? E.g. my categories are more like phrases ('Generalization of economic development'), so maybe 'Generalization', 'economic', 'development' as the class labels for a specific document. How can I categorize my documents under such labels? Thanks in advance.

  3. Hi shuyo,

I am working with a dataset where some of the documents are labeled and some are not. Can the corpus for llda be such a dataset, i.e., a mixture of labeled and unlabeled documents?
