We held a private reading meeting for ICML 2012.
I chose and presented [Kim+ ICML12] “Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data.”
Here is my presentation for it.
DP-MRM [Kim+ ICML12] is a supervised topic model like sLDA [Blei+ 2007], DiscLDA [Lacoste-Julien+ 2008] and MedLDA [Zhu+ 2009], and is regarded as a nonparametric version of Labeled LDA [Ramage+ 2009] in particular.
Although Labeled LDA is easy to implement (my implementation is here), it has the disadvantage that you must specify the label-topic correspondences explicitly and manually.
On the other hand, DP-MRM can decide the label-topic correspondences automatically, as distributions. I am very interested in it.
But it is hard to implement because it is a nonparametric Bayesian model.
Since what I want is not infinite topics but hierarchical label-topic correspondences, I guess the model would become very useful, handy and fast if its DPs were replaced with ordinary Dirichlet distributions… I am going to try it!
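As a rough illustration of the explicit correspondence that Labeled LDA needs, here is a minimal sketch, assuming the usual setup where each label maps to exactly one topic and a document may only use the topics of its own labels. The function name and the 0/1 mask representation are my own choices for this example, not taken from llda.py.

```python
def build_label_mask(doc_labels, label_set):
    """Return one 0/1 row per document; column k is 1 when topic k is allowed.

    Assumption for this sketch: topic k corresponds to label_set[k],
    i.e. a fixed one-to-one label-topic mapping given by hand.
    """
    index = {label: k for k, label in enumerate(label_set)}
    mask = []
    for labels in doc_labels:
        row = [0] * len(label_set)
        for label in labels:
            row[index[label]] = 1  # this document may use this label's topic
        mask.append(row)
    return mask


if __name__ == "__main__":
    labels = ["grain", "trade", "ship"]      # hypothetical label set
    docs = [["grain"], ["grain", "trade"]]   # labels attached to two documents
    for row in build_label_mask(docs, labels):
        print(row)
```

A sampler would then draw each word's topic only from the columns set to 1 in that document's row. In DP-MRM this hand-made mapping is what gets replaced by learned distributions.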
Hello,
Thanks a lot for all your open-source help. I have been trying to use your LLDA.py. I was hoping to find out what the dataset looks like and how we provide the preset labels.
Thanks
My llda.py requires a file that contains one document per line.
The labels of a document, if any, are given at the head of its line in the form [labels].
An example looks like the following.
[label1,label2,...]some text
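To make the format concrete, here is a minimal sketch of reading and writing such lines. It is not the parsing code from llda.py: the function names are hypothetical, and splitting the text on whitespace is an assumption made only for this example.

```python
import re

# Assumed line format, as described above: an optional "[label1,label2,...]"
# prefix followed by the document text.
LABELED_LINE = re.compile(r"^\[(.*?)\](.*)$")


def parse_line(line):
    """Split one corpus line into (labels, words)."""
    line = line.strip()
    match = LABELED_LINE.match(line)
    if match:
        labels = [s.strip() for s in match.group(1).split(",") if s.strip()]
        text = match.group(2)
    else:
        # No "[...]" prefix: treat the document as unlabeled.
        labels, text = [], line
    return labels, text.split()


def format_line(labels, text):
    """Build one corpus line from a list of labels and the document text."""
    prefix = "[" + ",".join(labels) + "]" if labels else ""
    return prefix + text


if __name__ == "__main__":
    line = format_line(["grain", "trade"], "wheat exports rose this year")
    print(line)
    print(parse_line(line))
```

Writing one `format_line` result per line to a text file gives a dataset in the shape described above.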
I’m afraid it is troublesome to prepare a dataset in this format,
so I have just added a sample script that uses an nltk corpus as the dataset for llda.
https://github.com/shuyo/iir/blob/master/lda/llda_nltk.py
This script estimates Labeled LDA with 100 sample documents from nltk.corpus.reuters and outputs the top 20 words for each label.
Thanks a lot!
Hello again,
How can I effectively estimate alpha and beta? The literature on the topic is not very helpful.
I don’t know much about it because I have had no motivation to estimate hyperparameters…
I feel that it is effective for improving perplexity, but not for improving quality, e.g. of the topic-word distributions.
If you want the latter, I guess it is much more effective to try various kinds of preprocessing.