Before iterating the LDA estimation, we need to initialize its parameters.

Collapsed Gibbs Sampling (CGS) estimation has the following parameters.

- z_mn : topic of word n of document m
- n_mz : word count of document m with topic z
- n_tz : count of word t with topic z
- n_z : word count with topic z

The simplest initialization is to assign each word a random topic and increment the corresponding counters n_mz, n_tz, and n_z.

```python
import numpy

# docs : documents, each of which is an array of words
# K : number of topics
# V : vocabulary size
z_m_n = []                           # topics of words of documents
n_m_z = numpy.zeros((len(docs), K))  # word count of each document and topic
n_z_t = numpy.zeros((K, V))          # word count of each topic and vocabulary word
n_z = numpy.zeros(K)                 # word count of each topic
for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        z = numpy.random.randint(0, K)  # assign a random topic
        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))
```

The iterative inference with the full conditional from the previous article then looks like the following; it is repeated until the perplexity becomes stable.

```python
for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # discount for the n-th word t with its current topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1
        # sampling a new topic
        p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()
        # preserve the new topic and increase the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1
```
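The perplexity used as the stopping criterion can be estimated from the current counts. A minimal sketch, assuming the scalar alpha and beta and the count arrays defined above (this helper is an illustration, not the code used in this article series):

```python
def perplexity(docs, n_m_z, n_z_t, n_z, alpha, beta, K, V):
    log_likelihood = 0.0
    n_words = 0
    for m, doc in enumerate(docs):
        # document-topic distribution theta_m estimated from the counts
        theta = (n_m_z[m] + alpha) / (len(doc) + K * alpha)
        for t in doc:
            # topic-word probabilities phi_{z,t}, mixed by theta
            phi_t = (n_z_t[:, t] + beta) / (n_z + V * beta)
            log_likelihood += numpy.log(numpy.inner(phi_t, theta))
        n_words += len(doc)
    return numpy.exp(-log_likelihood / n_words)
```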

The sampling of the new topic uses the numpy.random.multinomial() method with the number of experiments set to 1 in order to draw from the multinomial distribution.

Since this method returns an array in which some k-th element is 1 and the rest are 0, argmax() retrieves that k. This is a little wasteful…
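For illustration only, the same draw can be done without the one-hot detour, e.g. with a cumulative sum and a single uniform random number (an alternative sketch, not the code this article uses):

```python
# draw a topic index directly from the unnormalized weights p_z
cumulative = numpy.cumsum(p_z)
new_z = numpy.searchsorted(cumulative, numpy.random.uniform(0, cumulative[-1]))
```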

The next article will show a more efficient initialization.

### Comments

I am not convinced that your code for finding the posterior is correct.

p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)

You are not normalizing.

It is normalized by ‘p_z / p_z.sum()’ in the next line.

Thanks Shuyo… I realized this soon after I posted my comment 🙂

Another confusion: I referred to Griffiths’ paper on topic modeling. I saw that for calculating p_z they have an additional term in the denominator, which I do not see in your code (or maybe I am missing something): something like n_m, which denotes the document-topic sum, just like you have n_z for the topic-term sum.

So I think it should be something like:

p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / ((n_z + V * beta) * (n_m + K * alpha ))

Correct me if I am wrong, or if I am missing something.

Yes, the paper has the (n_m + K * alpha) term in the denominator.

But that term does not depend on the topic, so it is eliminated by the normalization; it is popular to omit it from the beginning 😀
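To make the cancellation explicit, the full conditional being sampled is (with notation matching the count variables above; the ¬mn superscript means the word currently being resampled is excluded from the counts):

$$p(z_{mn}=k \mid \mathbf{z}^{\neg mn}, \mathbf{w}) \;\propto\; \frac{n_{kt}^{\neg mn} + \beta}{n_{k}^{\neg mn} + V\beta} \cdot \frac{n_{mk}^{\neg mn} + \alpha}{n_{m}^{\neg mn} + K\alpha}$$

and the factor $n_{m}^{\neg mn} + K\alpha$ is the same for every topic $k$, so it disappears when p_z is divided by its sum.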

Hi, your tutorial is extremely useful.

But I am confused by p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)

where:

n_z_t = numpy.zeros((K, V))

beta is V-dimensional

n_z_t[:, t] = column t => K-dimensional

how can you add n_z_t[:, t] + beta ???

similarly,

n_z: K-dimensional

V is a scalar

n_z + V*beta ???

You should learn about numpy’s broadcasting. See http://onlamp.com/pub/a/python/2000/09/27/numerically.html for an example.
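For instance, alpha and beta are scalars in this code, so broadcasting expands them against the K-dimensional count vectors; a tiny illustration with made-up numbers (not taken from the article):

```python
import numpy

K = 3
n_z_t_col = numpy.array([3., 0., 5.])  # e.g. n_z_t[:, t] for some word t
beta = 0.5                             # scalar prior
print(n_z_t_col + beta)                # broadcast: [3.5, 0.5, 5.5]
```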

Thank you for your great post.

I have a question. Why do you discount the counters for the n-th word t with topic z?

It is because the target random variable itself must be removed from the full-conditional posterior in Gibbs sampling.

Hello,

I have two questions about your implementation.

(1) Can I understand n_z[z] as being the same as the sum of n_m_z[m, z] over the m dimension?

(2) How are repeated words in each document handled? For example, if the word “thank” appears twice in a document, how is this occurrence frequency reflected?

Thanks very much.

Kevin

Hi,

(1) Almost, yes, but they differ in their prior parameters.

(2) Each occurrence of “thank” has its own individual variable z_m_n.
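With the zero-initialized counters from the listing above, the relation in (1) holds exactly and can be checked directly; the “almost” presumably refers to implementations that fold the prior parameters into the counters. A small sketch:

```python
# summing n_m_z over documents reproduces n_z
assert numpy.allclose(n_m_z.sum(axis=0), n_z)
# summing n_z_t over the vocabulary gives the same vector
assert numpy.allclose(n_z_t.sum(axis=1), n_z)
```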

I have a question about the input.

Is it the list of words in each document? i.e. doc=[[‘a’,’b’,…],[‘c’,’d’,…],…]

The inner […] holds the words of that document.

If some word appears many times, should I include all occurrences or just one?

Hi,

It takes doc=[[‘a’, ‘a’, ..], ..] as input, doesn’t it?

I tried running it and I think the words need to be indices (numbers)? Did I misunderstand something?

Of course, you need to replace them with some ID numbers.
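For example, a minimal way to convert word strings into such ID numbers could look like this (a hypothetical helper, not part of the original code):

```python
def words_to_ids(raw_docs):
    # assign a fresh integer ID to each distinct word
    vocab = {}
    docs = []
    for raw_doc in raw_docs:
        docs.append([vocab.setdefault(w, len(vocab)) for w in raw_doc])
    return docs, vocab

docs, vocab = words_to_ids([["a", "b", "a"], ["c", "b"]])
# docs  -> [[0, 1, 0], [2, 1]]
# vocab -> {"a": 0, "b": 1, "c": 2}
V = len(vocab)
```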