Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (1)

Latent Dirichlet Allocation (LDA) is a generative model that is commonly used as a topic model for text, among other applications.

Graphical model of LDA (figure from Wikipedia)

Each random variable has the following meaning:

  • θ : document-topic distribution, a multinomial over topics for each document, drawn from a Dirichlet distribution
  • φ : topic-word distribution, a multinomial over words for each topic, drawn from a Dirichlet distribution
  • Z : word topic, the topic of each word, drawn from the multinomial θ of its document
  • W : word, drawn from the multinomial φ of its topic
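
The generative process described by these variables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: symmetric Dirichlet priors with hypothetical hyperparameters alpha and beta, a fixed document length, and helper names (sample_dirichlet, sample_categorical, generate_corpus) that are invented here and do not come from the original post.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def generate_corpus(num_docs, doc_len, num_topics, vocab_size,
                    alpha=0.5, beta=0.1, seed=0):
    """Generate word ids W and their topic assignments Z for each document.

    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    # phi[z]: topic-word distribution of topic z, drawn from Dirichlet(beta)
    phi = [sample_dirichlet([beta] * vocab_size, rng)
           for _ in range(num_topics)]
    docs, topics = [], []
    for _ in range(num_docs):
        # theta: document-topic distribution, drawn from Dirichlet(alpha)
        theta = sample_dirichlet([alpha] * num_topics, rng)
        # Z: one topic per word position, drawn from theta
        z_m = [sample_categorical(theta, rng) for _ in range(doc_len)]
        # W: one word per position, drawn from the phi of its topic
        w_m = [sample_categorical(phi[z], rng) for z in z_m]
        docs.append(w_m)
        topics.append(z_m)
    return docs, topics
```

For example, generate_corpus(3, 5, 2, 4) would return three documents of five word ids each, together with the topic that generated each word.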

There are several popular estimation methods for LDA, and collapsed Gibbs sampling (CGS) is one of them.
This method integrates out all random variables except the word topics {z_mn} and draws each z_mn from its posterior given all the other topic assignments.
The posterior of z_mn is the following:

[Equation: collapsed Gibbs sampling posterior of z_mn in LDA]

where n_mz is the number of words in document m assigned to topic z, n_tz is the number of times word t is assigned to topic z, n_z is the total number of words assigned to topic z, and -mn means “excluding z_mn.”
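
For reference, the commonly used form of this posterior can be written in LaTeX as below. It assumes symmetric Dirichlet hyperparameters α (document-topic) and β (topic-word) and a vocabulary of size V; these symbols are assumptions introduced here, since they are not defined in the text, and t denotes the word observed at position n of document m.

```latex
\[
p(z_{mn} = z \mid \mathbf{z}_{-mn}, \mathbf{w}) \;\propto\;
  \left( n_{mz}^{-mn} + \alpha \right)
  \frac{n_{tz}^{-mn} + \beta}{n_{z}^{-mn} + V\beta},
\qquad t = w_{mn}
\]
```
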
The estimation iterates until the perplexity converges or for a fixed number of iterations.
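
A minimal sketch of the sampling loop, in Python, might look like the following. It is not the implementation used for the experiments in this post. It assumes symmetric hyperparameters alpha and beta, documents given as lists of integer word ids, the standard form of the collapsed posterior, and a fixed number of sweeps rather than a convergence check. The function name gibbs_lda and its parameters are hypothetical.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.5, beta=0.1,
              num_sweeps=50, seed=0):
    """Run collapsed Gibbs sampling; return assignments and count tables."""
    rng = random.Random(seed)
    n_mz = [[0] * num_topics for _ in docs]               # document m, topic z
    n_tz = [[0] * num_topics for _ in range(vocab_size)]  # word t, topic z
    n_z = [0] * num_topics                                # words per topic
    z_mn = []

    # Initialize every word with a uniformly random topic.
    for m, doc in enumerate(docs):
        z_m = []
        for t in doc:
            z = rng.randrange(num_topics)
            z_m.append(z)
            n_mz[m][z] += 1
            n_tz[t][z] += 1
            n_z[z] += 1
        z_mn.append(z_m)

    for _ in range(num_sweeps):
        for m, doc in enumerate(docs):
            for n, t in enumerate(doc):
                # Remove the current assignment: these are the "-mn" counts.
                z = z_mn[m][n]
                n_mz[m][z] -= 1
                n_tz[t][z] -= 1
                n_z[z] -= 1
                # Unnormalized posterior over topics for this word.
                weights = [
                    (n_mz[m][k] + alpha) * (n_tz[t][k] + beta)
                    / (n_z[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Draw the new topic and put the word back into the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                z_mn[m][n] = z
                n_mz[m][z] += 1
                n_tz[t][z] += 1
                n_z[z] += 1
    return z_mn, n_mz, n_tz, n_z
```

Only the count tables need to be stored, which is the practical appeal of the collapsed sampler: θ and φ never appear explicitly during sampling.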

[Equation: perplexity of LDA]

where

[Equation: topic-word distributions φ]
[Equation: document-topic distributions θ]

and n_m is the number of words in document m.
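
In LaTeX, the usual definitions of these quantities are as follows. This is a reconstruction under assumptions rather than a copy of the original formulas: α and β are symmetric Dirichlet hyperparameters, K is the number of topics and V is the vocabulary size, none of which are defined in the text.

```latex
\[
\mathrm{perplexity}(\mathbf{w}) =
  \exp\left( - \frac{\sum_{m} \sum_{n} \log p(w_{mn})}{\sum_{m} n_m} \right),
\qquad
p(w_{mn}) = \sum_{z} \theta_{mz} \, \varphi_{z w_{mn}}
\]
\[
\varphi_{zt} = \frac{n_{tz} + \beta}{n_{z} + V\beta},
\qquad
\theta_{mz} = \frac{n_{mz} + \alpha}{n_{m} + K\alpha}
\]
```
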
Perplexity usually decreases as learning progresses, but my experiments showed somewhat different tendencies.
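
To watch how perplexity behaves from one iteration to the next, it can be computed directly from the count tables. The Python sketch below assumes the usual smoothed point estimates of φ and θ with symmetric hyperparameters alpha and beta, and it evaluates the same documents the counts were collected from. The function name and argument layout are hypothetical.

```python
import math


def perplexity(docs, n_mz, n_tz, n_z, alpha, beta):
    """Perplexity of docs under point estimates of theta and phi.

    docs: list of documents, each a list of integer word ids
    n_mz[m][z], n_tz[t][z], n_z[z]: count tables as defined in the text
    alpha, beta: assumed symmetric Dirichlet hyperparameters
    """
    num_topics = len(n_z)
    vocab_size = len(n_tz)
    log_likelihood = 0.0
    total_words = 0
    for m, doc in enumerate(docs):
        n_m = len(doc)
        # theta_mz = (n_mz + alpha) / (n_m + K * alpha)
        theta_m = [(n_mz[m][z] + alpha) / (n_m + num_topics * alpha)
                   for z in range(num_topics)]
        for t in doc:
            # p(w) = sum_z theta_mz * phi_zt,
            # with phi_zt = (n_tz + beta) / (n_z + V * beta)
            p_w = sum(
                theta_m[z] * (n_tz[t][z] + beta)
                / (n_z[z] + vocab_size * beta)
                for z in range(num_topics)
            )
            log_likelihood += math.log(p_w)
        total_words += n_m
    return math.exp(-log_likelihood / total_words)
```

Calling this after every sweep of the sampler gives one perplexity value per iteration, which can then be plotted or compared against a convergence threshold.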

Continued on the next post.
