Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (2)

Before the iterations of LDA estimation, the parameters must be initialized. Collapsed Gibbs Sampling (CGS) estimation has the following parameters:

  • z_mn : topic assigned to the n-th word of document m
  • n_mz : number of words in document m assigned to topic z
  • n_tz : number of times word t is assigned to topic z
  • n_z : total number of words assigned to topic z

The simplest initialization is to assign each word to a random topic and increment the corresponding counters n_mz, n_tz and n_z.

import numpy

# docs : documents, each given as a list of word IDs
# K : number of topics
# V : vocabulary size

z_m_n = []                  # topics of the words of each document
n_m_z = numpy.zeros((len(docs), K))     # word count of each document and topic
n_z_t = numpy.zeros((K, V)) # word count of each topic and vocabulary word
n_z = numpy.zeros(K)        # word count of each topic

for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        z = numpy.random.randint(0, K)
        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))
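
For illustration, here is a toy example of the input format assumed above (this is an assumption for the sketch, not code from the article): each document is a list of integer word IDs in the range 0 .. V-1, and a word that occurs several times in a document simply appears that many times in its list.

# Hypothetical toy input (word-ID lists); not code from this article.
docs = [
    [0, 1, 1, 3],        # word ID 1 occurs twice in document 0
    [2, 4, 0],
    [3, 3, 2, 4, 4],
]
K = 2   # number of topics
V = 5   # vocabulary size (word IDs range over 0 .. V-1)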

The iterative inference then applies the full conditional derived in the previous article, as shown below. This sweep over all words is repeated until the perplexity becomes stable (a minimal sketch of such an outer loop is given at the end of this post).

# alpha, beta : hyperparameters of the symmetric Dirichlet priors (assumed to be defined beforehand)
for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # discount the counters for the n-th word t, which is currently assigned to topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1

        # sampling new topic
        p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        # record the new topic and increment the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1

The code draws from the multinomial distribution using numpy.random.multinomial() with the number of experiments set to 1.
This call returns an array in which exactly one element, say the k-th, is 1 and the rest are 0, and argmax() retrieves that index k. This is a little wasteful…
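
As a side note, one common way to avoid building the indicator vector is to draw directly from the cumulative distribution. The following is only a sketch of that alternative, not the code used in this series.

# Sketch of an alternative draw: pick a uniform value below the total weight
# and locate it in the cumulative sum of the unnormalized probabilities p_z.
# Equivalent to numpy.random.multinomial(1, p_z / p_z.sum()).argmax(),
# but without constructing the indicator vector.
u = numpy.random.uniform(0, p_z.sum())
new_z = numpy.searchsorted(p_z.cumsum(), u)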

The next article will show a more efficient initialization.
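
To make the stopping criterion mentioned above concrete, here is a minimal sketch of an outer loop that repeats the Gibbs sweep until the perplexity becomes stable. The inference() wrapper (standing for one sweep of the code above), the cap of 100 sweeps, and the 0.01 threshold are assumptions for illustration; theta and phi are the usual point estimates computed from the counters.

import numpy

def perplexity(docs, n_m_z, n_z_t, n_z, alpha, beta, K, V):
    """Perplexity of the training documents under the current counters."""
    log_lik = 0.0
    n_words = 0
    for m, doc in enumerate(docs):
        # point estimate of the document-topic distribution of document m
        theta = (n_m_z[m] + alpha) / (len(doc) + K * alpha)
        for t in doc:
            # point estimate of p(word t | topic) for every topic
            phi_t = (n_z_t[:, t] + beta) / (n_z + V * beta)
            log_lik += numpy.log(numpy.inner(theta, phi_t))
        n_words += len(doc)
    return numpy.exp(-log_lik / n_words)

prev_ppl = float("inf")
for i in range(100):                    # cap on the number of sweeps (an assumption)
    inference(docs)                     # one Gibbs sweep as in the code above (hypothetical wrapper)
    ppl = perplexity(docs, n_m_z, n_z_t, n_z, alpha, beta, K, V)
    if abs(prev_ppl - ppl) < 0.01:      # stop when the perplexity is (nearly) stable
        break
    prev_ppl = ppl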


14 Responses to Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (2)

  1. tanu says:

    I am not convinced that your code for finding the posterior is correct.
    p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)
    You are not normalizing.

    • shuyo says:

      It is normalized at ‘p_z / p_z.sum()’ in the next line.

      • tanu says:

        Thanks Shuyo.. I realized this soon after I posted my comment 🙂
        Another confusion: I referred to Griffiths' paper on topic modeling. I saw that for calculating p_z they have an additional term in the denominator, which I do not see in your code (or maybe I am missing something): something like n_m, which denotes the document topic sum, just like you have n_z for the topic term sum.
        So I think it should be something like:
        p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / ((n_z + V * beta) * (n_m + K * alpha))

        Correct me if I am wrong, or if I am missing something.

      • shuyo says:

        Yes, the paper has that term in the denominator.
        But it does not depend on the topic z, so it is eliminated by the normalization; it is common to omit it from the beginning 😀

  2. Tien Vu Nguyen says:

    Hi, your tutorial is extremely useful.
    But I am confused by p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)
    where:
    n_z_t = numpy.zeros((K, V))
    beta is V-dimension
    n_z_t[ : , t] = get column t => k-dimension
    how can you add: n_z_t[:, t] + beta ???
    similarly,
    n_z: K-dimension
    V is scalar
    n_z + V*beta ???

  3. ellma says:

    Thank you for your great post.
    I have a question. Why do you discount the counters for the n-th word t with topic z?

  4. Kevin says:

    Hello,

    I have two questions about your implementation.

    (1) Can I understand n_z[z] as the sum of n_m_z[m, z] over the m dimension?
    (2) How is a repeated word in a document handled? For example, if the word "thank" appears twice in a document, how is this occurrence frequency reflected?

    Thanks very much.
    Kevin

  5. Lily says:

    I have a question about the input.
    Is it a list of the words in each document, i.e. docs = [['a', 'b', …], ['c', 'd', …], …],
    where each inner list contains the words of that document?
    If some words appear many times, should I include all occurrences or just one?
