In the previous article, I introduced a simple implementation of collapsed Gibbs sampling estimation for Latent Dirichlet Allocation (LDA).
However, since each word topic z_mn is initialized to a random topic in that implementation, there are some troubles.
First, it needs many iterations before its perplexity begins to decrease.
Second, almost all topics have stopwords like ‘the’ and ‘of’ at high probabilities when the perplexity converges.
Moreover, there are some groups of topics which share similar word distributions, so the number of substantially distinct topics is smaller than the topic size parameter K.
Instead of the random initialization, we can draw each initial topic from the posterior probability, in the same form as when sampling a new topic, updating the counters incrementally.
Furthermore, since n_zt is only ever used with beta added to it, n_mz with alpha, and n_z with V * beta, it is efficient to add the corresponding parameters to these counters beforehand.
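For reference, the full conditional of collapsed Gibbs sampling for LDA has the well-known form below, where n_kt, n_mk and n_k denote the counters before the parameters are added:

\[ p(z_{mn} = k \mid \mathbf{z}_{\setminus mn}, \mathbf{w}) \;\propto\; \frac{n_{kt} + \beta}{n_{k} + V\beta} \,(n_{mk} + \alpha) \]

With the parameters folded into the counters beforehand, this reduces to the simple expression n_z_t[:, t] * n_m_z[m] / n_z that appears in the code below.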
Then, a sample initialization code is the following,
import numpy

# docs : documents, each of which is an array of word IDs
# K : number of topics
# V : vocabulary size
z_m_n = []  # topics of words of documents
n_m_z = numpy.zeros((len(docs), K)) + alpha  # word count of each document and topic
n_z_t = numpy.zeros((K, V)) + beta           # word count of each topic and vocabulary
n_z = numpy.zeros(K) + V * beta              # word count of each topic
for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        # draw from the posterior
        p_z = n_z_t[:, t] * n_m_z[m] / n_z
        z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()
        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))
and a sample inference code is as follows.
for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # discount the n-th word t with topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1

        # sampling new topic (this line is the only change)
        p_z = n_z_t[:, t] * n_m_z[m] / n_z
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        # preserve the new topic and increase the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1
(The inference code is almost the same as the previous version; only the posterior calculation is simplified.)
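To see the effect, the sampling loop above can be run for a number of sweeps while watching the perplexity. Here is a minimal sketch, assuming docs, n_m_z, n_z_t, n_z, K and alpha are defined as in the initialization code, that the sampling loop is wrapped in a hypothetical function inference(), and that 20 sweeps is an arbitrary choice:

def perplexity():
    # beta is already folded into n_z_t and n_z, so this is the smoothed phi
    phi = n_z_t / n_z[:, numpy.newaxis]
    log_per = 0.0
    N = 0
    for m, doc in enumerate(docs):
        # alpha is already folded into n_m_z; only K * alpha is added to the denominator
        theta = n_m_z[m] / (len(doc) + K * alpha)
        for t in doc:
            log_per -= numpy.log(numpy.inner(phi[:, t], theta))
        N += len(doc)
    return numpy.exp(log_per / N)

for i in range(20):  # arbitrary number of Gibbs sweeps
    inference()      # one pass of the sampling loop above (hypothetical wrapper)
    print(i, perplexity())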