Before the iterations of LDA estimation can start, the parameters must be initialized.
Collapsed Gibbs Sampling (CGS) estimation maintains the following variables.
- z_mn : topic of word n of document m
- n_mz : number of words in document m assigned to topic z
- n_tz : number of times word t is assigned to topic z (stored as n_z_t in the code)
- n_z : total number of words assigned to topic z
The simplest initialization is to assign each word to a random topic and increment the corresponding counters n_mz, n_tz and n_z.
import numpy

# docs : list of documents, each an array of word IDs in [0, V)
# K : number of topics
# V : vocabulary size
z_m_n = [] # topics of words of documents
n_m_z = numpy.zeros((len(docs), K)) # word count of each document and topic
n_z_t = numpy.zeros((K, V)) # word count of each topic and vocabulary
n_z = numpy.zeros(K) # word count of each topic
for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        z = numpy.random.randint(0, K) # random topic in [0, K)
        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))
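A quick way to catch indexing mistakes in the initialization is to check that the counters agree with each other. The following is only a sketch: the function names initialize and counters_are_consistent and the toy corpus are made up for illustration, and the body of initialize simply repeats the loop above.

```python
import numpy

def initialize(docs, K, V):
    """Random initialization, same logic as the loop in the article."""
    z_m_n = []
    n_m_z = numpy.zeros((len(docs), K))
    n_z_t = numpy.zeros((K, V))
    n_z = numpy.zeros(K)
    for m, doc in enumerate(docs):
        z_n = []
        for t in doc:
            z = numpy.random.randint(0, K)
            z_n.append(z)
            n_m_z[m, z] += 1
            n_z_t[z, t] += 1
            n_z[z] += 1
        z_m_n.append(numpy.array(z_n))
    return z_m_n, n_m_z, n_z_t, n_z

def counters_are_consistent(docs, z_m_n, n_m_z, n_z_t, n_z):
    """Return True if all counters describe the same set of assignments."""
    doc_lengths = numpy.array([len(doc) for doc in docs])
    return bool(
        # every word has exactly one topic
        all(len(z_n) == len(doc) for z_n, doc in zip(z_m_n, docs))
        # each row of n_m_z sums to the document length
        and numpy.array_equal(n_m_z.sum(axis=1), doc_lengths)
        # summing over documents or over vocabulary gives n_z
        and numpy.array_equal(n_m_z.sum(axis=0), n_z)
        and numpy.array_equal(n_z_t.sum(axis=1), n_z)
        # n_z sums to the corpus size
        and n_z.sum() == doc_lengths.sum()
    )

# hypothetical toy corpus: 3 documents over a vocabulary of 4 word IDs
toy_docs = [[0, 1, 2, 1], [3, 3, 0], [2]]
state = initialize(toy_docs, K=2, V=4)
print(counters_are_consistent(toy_docs, *state))
```

The same check can also be run after each sampling sweep, since every update decrements and increments the three counters together.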
The iterative inference, using the full conditional from the previous article, is then as follows. This sweep over all words is repeated until the perplexity becomes stable.
# alpha, beta : Dirichlet hyperparameters (must be defined beforehand)
for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # discount for n-th word t with topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1
        # sampling new topic from the (unnormalized) full conditional
        p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()
        # preserve the new topic and increase the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1
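To decide when the perplexity has become stable, it has to be computed from the counters after each sweep. The following is a minimal sketch of one common way to do that, assuming scalar (symmetric) alpha and beta and evaluating on the training documents themselves. The function name perplexity is made up for illustration, and this is not necessarily the exact formula used elsewhere in this series.

```python
import numpy

def perplexity(docs, n_m_z, n_z_t, n_z, alpha, beta):
    """Per-word perplexity of docs under the current point estimates."""
    K, V = n_z_t.shape
    # phi[z, t]: estimated probability of word t under topic z
    phi = (n_z_t + beta) / (n_z[:, numpy.newaxis] + V * beta)
    log_likelihood = 0.0
    n_words = 0
    for m, doc in enumerate(docs):
        # theta[z]: estimated probability of topic z in document m
        theta = (n_m_z[m] + alpha) / (len(doc) + K * alpha)
        # p(word) = sum_z theta[z] * phi[z, word], for every word of the doc
        word_probs = theta @ phi[:, doc]
        log_likelihood += numpy.log(word_probs).sum()
        n_words += len(doc)
    return numpy.exp(-log_likelihood / n_words)
```

One possible stopping rule is to call this once per sweep and stop when the value changes by less than a small tolerance between two consecutive sweeps. The check only reads the counters, so the sampling loop above stays exactly as it is.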
The new topic is drawn from the multinomial distribution with numpy.random.multinomial(), with the number of experiments set to 1.
This method therefore returns an array in which a certain k-th element is 1 and the rest are 0, and argmax() retrieves that k. This is a little wasteful…
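One possible way to avoid the intermediate one-hot array is to draw the index directly by inverting the cumulative distribution. This is a sketch of a swapped-in technique, not what the code above does, and the function name sample_index is made up for illustration. It takes the unnormalized p_z as is, so the division by p_z.sum() is not needed either.

```python
import numpy

def sample_index(p_z):
    """Draw an index k with probability proportional to p_z[k]."""
    cumulative = numpy.cumsum(p_z)
    # uniform draw in [0, total weight)
    u = numpy.random.random() * cumulative[-1]
    # first index whose cumulative weight exceeds u
    return numpy.searchsorted(cumulative, u, side="right")

# usage in place of multinomial(...).argmax():
# new_z = sample_index(p_z)
```

numpy.random.choice(len(p_z), p=p_z / p_z.sum()) is another option that returns the index directly. Whether either one is actually faster for a given K is best checked by measuring.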
The next article will show another, more efficient initialization.