Python implementation of Labeled LDA (Ramage+ EMNLP2009)

Labeled LDA (D. Ramage, D. Hall, R. Nallapati and C.D. Manning; EMNLP2009) is a supervised topic model derived from LDA (Blei+ 2003).
While LDA's estimated topics often don't match human expectations because it is unsupervised, Labeled LDA can handle documents with multiple labels.

I implemented Labeled LDA in Python. The code is here:

llda.py is an LLDA estimator for arbitrary text. Its input file contains one document per line.
Each document can be given one or more labels in brackets at the head of the line, like the following:

[label1,label2,...]some text

Documents without labels are also accepted, so the model works in a semi-supervised way.
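The input format above can be parsed with a few lines of Python. This is only a minimal sketch of how such a line might be handled; the function name parse_line is my own and is not part of llda.py:

```python
import re

def parse_line(line):
    """Split an input line into (labels, text).

    A line may start with a bracketed label list, e.g.
    "[label1,label2]some text"; a line with no leading brackets
    is treated as an unlabeled document (the semi-supervised case).
    """
    m = re.match(r"\[(.+?)\](.*)", line)
    if m:
        labels = [x.strip() for x in m.group(1).split(",")]
        return labels, m.group(2)
    return [], line  # no labels given

labels, text = parse_line("[sports,baseball]The pitcher threw a fastball.")
```
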

llda_nltk.py is a sample using llda.py. It estimates an LLDA model with 100 documents sampled from the Reuters corpus in NLTK and outputs the top 20 words for each label.

(Ramage+ EMNLP2009) doesn't give LLDA's perplexity, so I derived the document-topic and topic-word distributions that it requires:

θ_dk = (n_dk + α) / (Σ_{k'∈λ(d)} n_dk' + M_d α)  for k ∈ λ(d), and θ_dk = 0 otherwise
φ_kw = (n_kw + β) / (Σ_{w'} n_kw' + V β)

where λ(d) means the topic set corresponding to the labels of the document d, M_d is the size of that set, n_dk is the count of words in document d assigned to topic k, n_kw is the count of word w assigned to topic k, V is the vocabulary size, and α, β are the Dirichlet hyperparameters.
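Given such θ and φ, per-word perplexity can be computed in the usual way. The following is a minimal sketch; the function and variable names are my own, not from llda.py:

```python
import math

def perplexity(docs, theta, phi):
    """Per-word perplexity: exp(-(1/N) * sum over every word token w in
    each document d of log sum_k theta[d][k] * phi[k][w]).
    docs is a list of documents, each a list of word ids."""
    log_likelihood = 0.0
    n_words = 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p)
            n_words += 1
    return math.exp(-log_likelihood / n_words)
```

For example, with two topics and uniform θ and φ over a two-word vocabulary, every token has probability 0.5 and the perplexity is exactly 2.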

I'd be glad if you point out any mistakes in these.

LLDA requires that labels are assigned to topics explicitly and exactly. But it is very difficult to know how many topics to assign to each label for better estimation.
Moreover, it is natural that some categories share common topics (e.g. the "baseball" category and the "soccer" category).

DP-MRM (Kim+ ICML2012), which I introduced on this blog, is a model that extends LLDA to estimate the label-topic correspondence:

https://shuyo.wordpress.com/2012/07/31/kim-icml12-dirichlet-process-with-mixed-random-measures/

Though I want to implement it too, it is very complex…

16 thoughts on “Python implementation of Labeled LDA (Ramage+ EMNLP2009)”

    1. Hi,
      I wrote it in this article.

      > It estimates LLDA model with 100 sampled documents from Reuters corpus in NLTK and outputs top 20 words for each label.

      1. Sorry for the wrong phrasing of the question. I meant how do we use the learned model to predict the category/label of any random article? Thanks in advance.

      2. I see.
        It is difficult to predict the topics of unseen documents with topic models like LDA and LLDA; it requires estimating them with Gibbs sampling or something similar.
        I might implement it if I wanted to use LLDA in some application, but I haven't, so this code is for experiments.
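        One common way to do this, sketched here purely as an assumption (the function fold_in and its parameters are my own, not part of llda.py), is "folding in": keep the trained topic-word distribution φ fixed and run Gibbs sampling only over the new document's topic assignments to estimate its θ:

        ```python
        import random

        def fold_in(doc, phi, alpha, n_iter=100, seed=0):
            """Estimate a topic distribution theta for one unseen document by
            Gibbs sampling with the trained topic-word distribution phi fixed.
            doc is a list of word ids; phi[k][w] is the trained p(w|k)."""
            rng = random.Random(seed)
            K = len(phi)
            z = [rng.randrange(K) for _ in doc]  # topic assignment per token
            n_k = [0] * K                        # topic counts in this document
            for t in z:
                n_k[t] += 1
            for _ in range(n_iter):
                for i, w in enumerate(doc):
                    n_k[z[i]] -= 1               # remove the current assignment
                    weights = [(n_k[k] + alpha) * phi[k][w] for k in range(K)]
                    r = rng.random() * sum(weights)
                    for k, wt in enumerate(weights):  # sample a new topic
                        r -= wt
                        if r <= 0:
                            break
                    z[i] = k
                    n_k[k] += 1
            denom = len(doc) + K * alpha
            return [(n_k[k] + alpha) / denom for k in range(K)]
        ```

        The label whose topic gets the largest θ component would then be the predicted category.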

  1. Hi Shuyo,

    Many thanks for this review of LLDA. I'm trying to use your approach to categorize topics from my corpus into 45 categories. Is it possible to have documents that correspond to multiple labels? E.g. my categories are more like phrases ('Generalization of economic development'), so maybe 'Generalization', 'economic', and 'development' would be the class labels for a specific document. How can I categorize my documents under such a label?

  2. Hi shuyo,

    I am working with a dataset where some of the documents are labeled and some are not. Can the corpus for LLDA be such a dataset, i.e., a mixture of labeled and unlabeled documents?

  3. Hi Shuyo,

    Thank you for the generous sharing, for both the explanation and the code. I was directed to your GitHub code from various sources, but I found it would be more user-friendly and beneficial to the community if you could put a README alongside your handy implementation. Do you think you would be able to do that?

    Great thanks!

    Yo

  4. How can we apply this code on our own dataset? Also, how should we prepare the dataset, how do we assign labels to the training set?
