Hierarchical Dirichlet Processes (Teh+ 2006) are a nonparametric Bayesian model that can handle an unbounded number of topics.
In particular, HDP-LDA is interesting as an extension of LDA.
Teh+ (2006) give the collapsed Gibbs sampling updates for the general HDP framework, but not for HDP-LDA specifically.
To obtain the HDP-LDA updates, we need to plug HDP-LDA's base measure H and emission distribution F(φ) into the following equation:
$$
f_k^{-x_{ji}}(x_{ji}) = \frac{\int f(x_{ji}\mid\phi_k)\prod_{j'i'\neq ji,\, z_{j'i'}=k} f(x_{j'i'}\mid\phi_k)\,h(\phi_k)\,d\phi_k}{\int \prod_{j'i'\neq ji,\, z_{j'i'}=k} f(x_{j'i'}\mid\phi_k)\,h(\phi_k)\,d\phi_k} \qquad \text{(eq. 30 in [Teh+ 2006])}
$$
where h is the probability density function of H and f is that of F.
In the case of HDP-LDA, H is a symmetric Dirichlet distribution over the vocabulary and F is the per-topic word multinomial distribution, that is

$$
h(\phi_k) = \frac{\Gamma(V\beta)}{\Gamma(\beta)^V}\prod_{w=1}^{V}\phi_{kw}^{\beta-1},
\qquad
f(x_{ji}=w\mid\phi_k) = \phi_{kw},
$$

where V is the vocabulary size, \(\beta\) is the Dirichlet hyperparameter, and \(\phi_k=(\phi_{k1},\dots,\phi_{kV})\) is the word distribution of topic k.
Substituting these into equation (30), we obtain

$$
f_k^{-x_{ji}}(x_{ji}=w) = \frac{n_{kw}^{-ji}+\beta}{n_{k\cdot}^{-ji}+V\beta},
$$

where \(n_{kw}^{-ji}\) is the number of tokens of word w assigned to topic k excluding \(x_{ji}\), and \(n_{k\cdot}^{-ji}=\sum_w n_{kw}^{-ji}\).
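This substitution is a standard Dirichlet-multinomial conjugacy calculation; a sketch of the intermediate step, using the Dirichlet integral \(\int\prod_w\phi_w^{a_w-1}\,d\phi = \prod_w\Gamma(a_w)\,/\,\Gamma(\sum_w a_w)\) (note that the normalizing constant of h cancels between numerator and denominator):

$$
f_k^{-x_{ji}}(x_{ji}=w)
= \frac{\int \phi_{kw}\prod_{w'}\phi_{kw'}^{\,n_{kw'}^{-ji}+\beta-1}\,d\phi_k}
       {\int \prod_{w'}\phi_{kw'}^{\,n_{kw'}^{-ji}+\beta-1}\,d\phi_k}
= \frac{\Gamma(n_{kw}^{-ji}+\beta+1)}{\Gamma(n_{kw}^{-ji}+\beta)}
  \cdot\frac{\Gamma(n_{k\cdot}^{-ji}+V\beta)}{\Gamma(n_{k\cdot}^{-ji}+V\beta+1)}
= \frac{n_{kw}^{-ji}+\beta}{n_{k\cdot}^{-ji}+V\beta}.
$$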
We also need \(f_{k^{new}}\) for the case where t takes a new table. Since a new topic has no other words assigned to it, this is just the prior predictive probability:

$$
f_{k^{new}}^{-x_{ji}}(x_{ji}=w) = \int f(x_{ji}=w\mid\phi)\,h(\phi)\,d\phi = \frac{1}{V}.
$$
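Here is a minimal Python sketch of these predictive probabilities as they would be used when sampling t (the names n_kw, n_k, and word_predictive are placeholders of mine, not identifiers from any actual implementation):

```python
import numpy as np

def word_predictive(w, n_kw, n_k, beta, V):
    """Posterior predictive f_k^{-x_ji}(x_ji = w) for every existing topic k,
    plus the prior predictive 1/V used for a new topic."""
    f_k = (n_kw[:, w] + beta) / (n_k + V * beta)  # existing topics
    f_new = 1.0 / V                               # new topic: integral of f*h d(phi)
    return f_k, f_new

# tiny usage example: 2 topics, 5-word vocabulary
n_kw = np.array([[3., 0., 1., 0., 2.],
                 [0., 4., 0., 1., 0.]])   # topic-word counts excluding x_ji
n_k = n_kw.sum(axis=1)                    # topic totals
print(word_predictive(2, n_kw, n_k, beta=0.5, V=5))
```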
We also need to write down \(f_k(\mathbf{x}_{jt})\), the probability of all the words seated at table t of document j, for sampling k.
For \(n_{kw}^{-jt}\) (meaning "term count of word w with topic k", excluding the words at table t of document j) and \(n_{jtw}\) (the count of word w at table t of document j),

$$
f_k^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt})
= \frac{\Gamma\!\left(n_{k\cdot}^{-jt}+V\beta\right)}{\Gamma\!\left(n_{k\cdot}^{-jt}+n_{jt\cdot}+V\beta\right)}
  \prod_{w}\frac{\Gamma\!\left(n_{kw}^{-jt}+n_{jtw}+\beta\right)}{\Gamma\!\left(n_{kw}^{-jt}+\beta\right)},
$$

where \(n_{k\cdot}^{-jt}=\sum_w n_{kw}^{-jt}\) and \(n_{jt\cdot}=\sum_w n_{jtw}\).
When implementing this in Python, it is faster to keep the Gamma functions as they are rather than to expand the Gamma ratios into products. In either case it is necessary to work with their logarithms, or \(f_k(\mathbf{x}_{jt})\) will overflow the floating-point range.
Finally, the corresponding density when the table is assigned a new topic \(k^{new}\) is obtained in the same way:

$$
f_{k^{new}}(\mathbf{x}_{jt})
= \frac{\Gamma(V\beta)}{\Gamma(n_{jt\cdot}+V\beta)}
  \prod_{w}\frac{\Gamma(n_{jtw}+\beta)}{\Gamma(\beta)}.
$$
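As a rough sketch of the log-space computation described above (the array names n_kw, n_k, n_jtw and the function log_table_likelihood are my own placeholders, not identifiers from the actual implementation):

```python
import numpy as np
from scipy.special import gammaln  # log Gamma, to avoid overflow

def log_table_likelihood(n_kw, n_k, n_jtw, beta, V):
    """Return log f_k^{-x_jt}(x_jt) for every existing topic k,
    and log f_{k^new}(x_jt) for a new topic."""
    n_jt = n_jtw.sum()
    # existing topics: Gamma(n_k + V*beta) / Gamma(n_k + n_jt + V*beta)
    #                  * prod_w Gamma(n_kw + n_jtw + beta) / Gamma(n_kw + beta)
    log_f_k = (gammaln(n_k + V * beta) - gammaln(n_k + n_jt + V * beta)
               + (gammaln(n_kw + n_jtw + beta) - gammaln(n_kw + beta)).sum(axis=1))
    # new topic: Gamma(V*beta) / Gamma(n_jt + V*beta) * prod_w Gamma(n_jtw + beta) / Gamma(beta)
    log_f_new = (gammaln(V * beta) - gammaln(n_jt + V * beta)
                 + (gammaln(n_jtw + beta) - gammaln(beta)).sum())
    return log_f_k, log_f_new

# tiny usage example: 2 topics, 5-word vocabulary, table t holding words {0, 0, 3}
n_kw = np.array([[3., 0., 1., 0., 2.],
                 [0., 4., 0., 1., 0.]])   # topic-word counts excluding table t
n_k = n_kw.sum(axis=1)
n_jtw = np.array([2., 0., 0., 1., 0.])    # word counts at table t
print(log_table_likelihood(n_kw, n_k, n_jtw, beta=0.5, V=5))
```

The sampling weights for k are then proportional to the exponentials of these values (combined with the table counts m_k and γ as in Teh+ 2006); subtracting the maximum log value before exponentiating keeps the weights in range.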
Hi, thank you so much for your explanation here. I have a question about this process. I found that in Chong Wang's code, he calls "sample_tables(d_state, q, f)" for each document after he has sampled all the words in that document. I am curious why he does this. Do you have any idea?
I cannot say for certain, but I think it is to shorten the learning time.
I'm trying to fill in the steps in your derivation of (30). Do you have any insight into the missing steps here: http://mathb.in/34749?key=f1b1b8e9c8ef6386abf89eb81f6a23347485e887. It has something to do with the conjugacy, right?
I think I got it: http://i.imgur.com/tKAe9Yr.png. I needed to see that there are two normalizing coefficients of a Dirichlet distribution hidden in there.
Dear Shuyo:
Thank you very much for deriving the formulas and implementing HDP-LDA in Python. I'm trying to learn HDP-LDA these days and have learned a lot from your code, but I am still confused by the parts of your code that implement the word distribution, the document distribution, and the perplexity. Could you write another article deriving these formulas? It would help me a lot to understand HDP.
Actually, I have been learning HDP-LDA for half a month and still don't know how to draw the graphical model of HDP-LDA with the indicator variables z and t. I don't know where to ask for help; nobody around me knows about graphical models, nor about HDP. Any help would be very much appreciated.
I hope you can send me an email at your earliest convenience; this is my Gmail: ljessons93@gmail.com.
Best wishes to you, ljessons.
Finally, by forcing myself to read Teh's paper over and over again, I figured out how to draw the graphical model of HDP-LDA. In fact, I already knew how to draw the graphical model of the hierarchical DP mixture model; it was just that I didn't know why Teh samples t first and then k, and that made me not confident enough to draw the graphical model of HDP-LDA (containing the indicator variables for table t and dish k). The reason Teh samples t first and then k is simply the collapsed Gibbs sampling scheme, nothing else.
Anyway, Shuyo, thank you very much for deriving the above formulations. It's very good work.
Thank you so much for your kind sharing. I am confused about the code that computes the "topic distribution for document" when calculating the perplexity. Could you share the mathematical equation for how to get that? I think there are two Dirichlet distributions for (t&k) to multiple a Multinomial distribution for Nm. I have some problems in solving this step.
Thanks so much!
Sorry for the typing error. It should be "for (t&k) to multiply a Multinomial distribution".
Many thanks for your excellent, detailed derivation!