Before running the iterations of LDA estimation, we need to initialize the parameters. Collapsed Gibbs Sampling (CGS) estimation maintains the following quantities:
- z_mn : topic of the n-th word in document m
- n_mz : number of words in document m assigned to topic z
- n_tz : number of times word t is assigned to topic z
- n_z : total number of words assigned to topic z
The simplest initialization is to assign each word a random topic and increment the corresponding counters n_mz, n_tz, and n_z.
```python
import numpy

# docs : documents, each of which is a list of word IDs
# K    : number of topics
# V    : vocabulary size
z_m_n = []                           # topics of the words of each document
n_m_z = numpy.zeros((len(docs), K))  # word count of each document and topic
n_z_t = numpy.zeros((K, V))          # word count of each topic and vocabulary word
n_z = numpy.zeros(K)                 # word count of each topic

for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        z = numpy.random.randint(0, K)  # assign a random topic
        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))
```
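The code above assumes `docs` is a list of documents, each represented as a list of integer word IDs in the range [0, V). As a minimal sketch of how such input might be built (the names `corpus` and `vocab` are hypothetical, not from this article):

```python
corpus = [["see", "spot", "run"], ["run", "spot", "run"]]  # toy documents
vocab = {}   # word -> integer ID
docs = []
for text in corpus:
    doc = []
    for word in text:
        if word not in vocab:
            vocab[word] = len(vocab)
        doc.append(vocab[word])
    docs.append(doc)
V = len(vocab)  # vocabulary size
K = 10          # number of topics, chosen by the user
```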
The iterative inference with the full conditional from the previous article is then the following. The sweep is repeated until the perplexity becomes stable (a sketch of the perplexity computation follows the loop below).
```python
# alpha, beta : scalar hyperparameters of the Dirichlet priors
for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # discount the n-th word t with topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1

        # sampling a new topic from the full conditional
        p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        # preserve the new topic and increase the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1
```
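For reference, here is a minimal sketch of how the perplexity could be computed from the current counts, using point estimates of the document-topic distribution theta and the topic-word distribution phi. The function name and the use of point estimates are assumptions for illustration, not code from this article:

```python
def perplexity(docs, n_m_z, n_z_t, n_z, alpha, beta, K, V):
    # phi[z, t] : point estimate of p(word t | topic z)
    phi = (n_z_t + beta) / (n_z[:, numpy.newaxis] + V * beta)
    log_per = 0.0
    n_words = 0
    for m, doc in enumerate(docs):
        # theta[z] : point estimate of p(topic z | document m)
        theta = (n_m_z[m] + alpha) / (len(doc) + K * alpha)
        for t in doc:
            log_per -= numpy.log(numpy.inner(phi[:, t], theta))
        n_words += len(doc)
    return numpy.exp(log_per / n_words)
```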
Here numpy.random.multinomial() is called with the number of experiments set to 1, which amounts to a single draw from the multinomial distribution. The method therefore returns an array in which some k-th element is 1 and the rest are 0, and argmax() retrieves that k. This is a little wasteful…
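To illustrate why it is wasteful, one common cheaper alternative (not necessarily the improvement the next article has in mind) is inverse-CDF sampling with cumsum() and searchsorted(), which skips building the indicator array entirely:

```python
# p_z does not need to be normalized: draw u uniformly in [0, p_z.sum())
u = numpy.random.uniform(0, p_z.sum())
new_z = numpy.searchsorted(p_z.cumsum(), u, side="right")
```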
The next article will show a more efficient initialization.