Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (2)

Before the iterations of LDA estimation can start, the parameters must be initialized.
Collapsed Gibbs Sampling (CGS) estimation uses the following parameters.

  • z_mn : topic assigned to word n of document m
  • n_mz : number of words in document m assigned to topic z
  • n_tz : number of times word t is assigned to topic z
  • n_z : total number of words assigned to topic z

The simplest initialization is to assign each word to a random topic and increment the corresponding counters n_mz, n_tz and n_z.

import numpy

# docs : list of documents, each a sequence of word ids in 0..V-1
# K : number of topics
# V : vocabulary size

z_m_n = [] # topics of words of documents
n_m_z = numpy.zeros((len(docs), K))     # word count of each document and topic
n_z_t = numpy.zeros((K, V)) # word count of each topic and vocabulary
n_z = numpy.zeros(K)        # word count of each topic

for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        z = numpy.random.randint(0, K)
        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))
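
After initialization the three counters must agree with each other, which is easy to check. The following is a minimal sketch, not part of the original code: the helper names initialize and check_counters, the seed argument and the toy corpus are assumptions made for illustration.

```python
import numpy

def initialize(docs, K, V, seed=0):
    """Random initialization as in the loop above (seed is an added assumption)."""
    rng = numpy.random.RandomState(seed)
    z_m_n = []
    n_m_z = numpy.zeros((len(docs), K))
    n_z_t = numpy.zeros((K, V))
    n_z = numpy.zeros(K)
    for m, doc in enumerate(docs):
        z_n = rng.randint(0, K, size=len(doc))  # one random topic per word
        for t, z in zip(doc, z_n):
            n_m_z[m, z] += 1
            n_z_t[z, t] += 1
            n_z[z] += 1
        z_m_n.append(z_n)
    return z_m_n, n_m_z, n_z_t, n_z

def check_counters(docs, n_m_z, n_z_t, n_z):
    """Raise AssertionError if the counters are inconsistent."""
    lengths = [len(doc) for doc in docs]
    assert n_z.sum() == sum(lengths)                      # every word is counted once
    assert numpy.array_equal(n_m_z.sum(axis=1), lengths)  # per-document totals
    assert numpy.array_equal(n_m_z.sum(axis=0), n_z)      # per-topic totals
    assert numpy.array_equal(n_z_t.sum(axis=1), n_z)      # per-topic totals

# toy corpus (hypothetical): 3 documents over a vocabulary of 5 word ids
docs = [[0, 1, 2, 1], [3, 3, 0], [2, 4]]
z_m_n, n_m_z, n_z_t, n_z = initialize(docs, K=2, V=5)
check_counters(docs, n_m_z, n_z_t, n_z)
```

The same check can be run after every sweep of the sampler, since each step removes one word from the counters and adds it back.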

One sweep of the iterative inference, using the full conditional derived in the previous article, looks as follows. The sweep is repeated until the perplexity stabilizes.

for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # remove the n-th word t from the counters of its current topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1

        # sample a new topic from the full conditional
        # (alpha and beta are the Dirichlet hyperparameters, set beforehand)
        p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        # store the new topic and increment the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1
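
To decide when to stop, the perplexity can be computed after each sweep. The article does not spell out its formula, so the following is only a sketch of one common choice: the perplexity of the training words under the point estimates of theta and phi, assuming scalar (symmetric) alpha and beta. The function name perplexity and the toy values are assumptions.

```python
import numpy

def perplexity(docs, n_m_z, n_z_t, n_z, alpha, beta):
    """Training-set perplexity from the current counters (symmetric priors assumed)."""
    K, V = n_z_t.shape
    # phi[z, t]: estimated probability of word t under topic z
    phi = (n_z_t + beta) / (n_z[:, numpy.newaxis] + V * beta)
    log_likelihood = 0.0
    N = 0
    for m, doc in enumerate(docs):
        # theta[z]: estimated probability of topic z in document m
        theta = (n_m_z[m] + alpha) / (len(doc) + K * alpha)
        for t in doc:
            log_likelihood += numpy.log(numpy.dot(theta, phi[:, t]))
        N += len(doc)
    return numpy.exp(-log_likelihood / N)

# toy check (hypothetical values): one topic, four distinct words
docs = [[0, 1, 2, 3]]
n_m_z = numpy.array([[4.0]])
n_z_t = numpy.array([[1.0, 1.0, 1.0, 1.0]])
n_z = numpy.array([4.0])
print(perplexity(docs, n_m_z, n_z_t, n_z, alpha=0.5, beta=0.5))
```

In the toy check every word has probability 1/4, so the perplexity equals the vocabulary size, 4. In practice the sweeps would stop once the value changes by less than some small tolerance between iterations.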

This uses numpy.random.multinomial() with the number of experiments set to 1 to draw from the multinomial distribution.
The method therefore returns an array in which one element, the k-th, is 1 and the rest are 0, and argmax() retrieves that k. This is a little wasteful…
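
One way to avoid building the one-hot array is to draw the index directly by inverting the cumulative sum of p_z. This is a sketch of a swapped-in technique (inverse-CDF sampling), not something taken from the article; the helper name sample_index is an assumption. It also skips the explicit normalization, because the uniform number is scaled by the total instead.

```python
import numpy

def sample_index(p, rng=numpy.random):
    """Draw an index k with probability proportional to p[k].

    p holds non-negative weights and does not need to sum to 1.
    """
    cdf = numpy.cumsum(p)
    u = rng.uniform(0.0, cdf[-1])                   # uniform in [0, total)
    k = numpy.searchsorted(cdf, u, side="right")    # first k with cdf[k] > u
    return min(int(k), len(cdf) - 1)                # guard against rounding at the top end

# toy usage (hypothetical weights): index 1 should come up about 75% of the time
rng = numpy.random.RandomState(0)
draws = [sample_index(numpy.array([1.0, 3.0]), rng) for _ in range(10000)]
print(numpy.mean(draws))
```

In the sampling loop this would replace the multinomial line with new_z = sample_index(p_z). numpy.random.choice(K, p=p_z / p_z.sum()) is another option that returns the index directly.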

The next article will show another, more efficient initialization.
