In the previous article, I introduced a simple implementation of collapsed Gibbs sampling estimation for Latent Dirichlet Allocation (LDA).

However, because that implementation initializes each word's topic z_mn to a random topic, it has some problems.

First, it needs many iterations before its perplexity begins to decrease.

Second, almost all topics assign high probabilities to stopwords like 'the' and 'of' by the time the perplexity converges.

Moreover, some groups of topics share similar word distributions, so the number of substantially distinct topics is smaller than the topic count parameter K.

Instead of random initialization, draw each word's initial topic from the posterior, built up incrementally from the words assigned so far, in the same way as sampling a new topic.

Furthermore, since n_zt is only ever used together with beta, n_mz with alpha, and n_z with V * beta, it is efficient to add the corresponding parameters to these variables beforehand.
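As a small check of that claim, the sketch below compares the explicit posterior, with alpha and beta written out, against the version where they are added to the counters beforehand. The count values, the `raw_` prefix and the seed are made up for illustration.

```python
import numpy

rng = numpy.random.default_rng(0)  # fixed seed, only for reproducibility
K, V = 3, 5                        # hypothetical topic count and vocabulary size
alpha, beta = 0.5, 0.1

# hypothetical raw counts (no hyperparameters added yet)
raw_n_m_z = rng.integers(0, 10, size=K).astype(float)       # topic counts of one document m
raw_n_z_t = rng.integers(0, 10, size=(K, V)).astype(float)  # word counts of each topic
raw_n_z = raw_n_z_t.sum(axis=1)                             # total word count of each topic
t = 2                                                       # the word being sampled

# explicit form: hyperparameters appear in every evaluation
p_explicit = (raw_n_z_t[:, t] + beta) * (raw_n_m_z + alpha) / (raw_n_z + V * beta)

# pre-added form: hyperparameters are added to the counters once
n_m_z = raw_n_m_z + alpha
n_z_t = raw_n_z_t + beta
n_z = raw_n_z + V * beta
p_folded = n_z_t[:, t] * n_m_z / n_z

same = numpy.allclose(p_explicit, p_folded)
print(same)
```

Both forms give the same unnormalized posterior, so the sampler only has to keep incrementing and decrementing the pre-added counters.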

Sample initialization code is as follows:

    import numpy

    # docs : list of documents, each an array of word ids
    # K : number of topics
    # V : vocabulary size
    # alpha, beta : Dirichlet hyperparameters
    z_m_n = [] # topic of each word in each document
    n_m_z = numpy.zeros((len(docs), K)) + alpha # word count of each document and topic
    n_z_t = numpy.zeros((K, V)) + beta # word count of each topic and vocabulary
    n_z = numpy.zeros(K) + V * beta # word count of each topic
    for m, doc in enumerate(docs):
        z_n = []
        for t in doc:
            # draw from the posterior given the words assigned so far
            p_z = n_z_t[:, t] * n_m_z[m] / n_z
            z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()
            z_n.append(z)
            n_m_z[m, z] += 1
            n_z_t[z, t] += 1
            n_z[z] += 1
        z_m_n.append(numpy.array(z_n))

Sample inference code looks much the same:

    for m, doc in enumerate(docs):
        for n, t in enumerate(doc):
            # discount the n-th word t with its current topic z
            z = z_m_n[m][n]
            n_m_z[m, z] -= 1
            n_z_t[z, t] -= 1
            n_z[z] -= 1
            # sample a new topic
            p_z = n_z_t[:, t] * n_m_z[m] / n_z # only this line has changed
            new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()
            # store the new topic and increase the counters
            z_m_n[m][n] = new_z
            n_m_z[m, new_z] += 1
            n_z_t[new_z, t] += 1
            n_z[new_z] += 1

(The inference code is similar to the previous version, but the posterior calculation is simplified.)
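To try the two snippets end to end, here is a minimal self-contained sketch that wraps them in two functions and runs a few sweeps on a toy corpus. The function names `initialize` and `sweep`, the toy documents, the hyperparameter values and the seeded `numpy.random.default_rng` generator are all assumptions for illustration, not part of the original implementation.

```python
import numpy

def initialize(docs, K, V, alpha, beta, rng):
    """Assign initial topics by drawing from the incrementally built posterior."""
    z_m_n = []
    n_m_z = numpy.zeros((len(docs), K)) + alpha  # document-topic counts (+ alpha)
    n_z_t = numpy.zeros((K, V)) + beta           # topic-word counts (+ beta)
    n_z = numpy.zeros(K) + V * beta              # topic totals (+ V * beta)
    for m, doc in enumerate(docs):
        z_n = []
        for t in doc:
            p_z = n_z_t[:, t] * n_m_z[m] / n_z
            z = rng.multinomial(1, p_z / p_z.sum()).argmax()
            z_n.append(z)
            n_m_z[m, z] += 1
            n_z_t[z, t] += 1
            n_z[z] += 1
        z_m_n.append(numpy.array(z_n))
    return z_m_n, n_m_z, n_z_t, n_z

def sweep(docs, z_m_n, n_m_z, n_z_t, n_z, rng):
    """One collapsed Gibbs sweep over every word; updates the counters in place."""
    for m, doc in enumerate(docs):
        for n, t in enumerate(doc):
            z = z_m_n[m][n]
            n_m_z[m, z] -= 1
            n_z_t[z, t] -= 1
            n_z[z] -= 1
            p_z = n_z_t[:, t] * n_m_z[m] / n_z
            new_z = rng.multinomial(1, p_z / p_z.sum()).argmax()
            z_m_n[m][n] = new_z
            n_m_z[m, new_z] += 1
            n_z_t[new_z, t] += 1
            n_z[new_z] += 1

# toy corpus: each document is a list of word ids in range(V)
docs = [[0, 1, 2, 1], [3, 4, 3], [0, 2, 4, 4, 1]]
K, V, alpha, beta = 2, 5, 0.5, 0.5
rng = numpy.random.default_rng(0)

z_m_n, n_m_z, n_z_t, n_z = initialize(docs, K, V, alpha, beta, rng)
for _ in range(10):
    sweep(docs, z_m_n, n_m_z, n_z_t, n_z, rng)

total_words = sum(len(doc) for doc in docs)
print(n_z.sum() - K * V * beta, total_words)  # counters still account for every word
```

Because the hyperparameters live inside the counters, subtracting them again recovers the raw counts, which is a handy sanity check after any number of sweeps.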


Hi Shuyo,

Your blog is extremely useful. The usual advice in topic modeling is to train the model on, say, 70% of the data and to use the remaining 30% as a test set for inference, and then to plot the perplexity of the held-out set. I didn't find any separation of training and test sets in your LDA implementation. Can you advise how I can use the output of the trained model on new test data?

It is difficult for LDA to calculate the document-topic distribution of unknown documents.

So a popular approach is to split the words of each document into training data and test data by some ratio, and to use the document-topic distribution estimated from the training data to calculate the perplexity of the test data.

In my repository, lda_test2.py calculates the perplexity for such a test set.
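As a rough sketch of that idea (not taken from lda_test2.py), the perplexity of held-out words could be computed from the trained counters like this. The function name `perplexity` and the toy numbers are hypothetical, and the counters are assumed to already include alpha, beta and V * beta as in the post above.

```python
import numpy

def perplexity(test_docs, n_m_z, n_z_t, n_z):
    """Perplexity of held-out words; test_docs[m] holds the held-out words of document m."""
    theta = n_m_z / n_m_z.sum(axis=1, keepdims=True)  # document-topic distribution
    phi = n_z_t / n_z[:, numpy.newaxis]               # topic-word distribution
    log_likelihood = 0.0
    n_words = 0
    for m, doc in enumerate(test_docs):
        for t in doc:
            # p(t | m) = sum over topics of theta[m, k] * phi[k, t]
            log_likelihood += numpy.log(numpy.dot(theta[m], phi[:, t]))
            n_words += 1
    return numpy.exp(-log_likelihood / n_words)

# toy check: with uniform topic-word distributions over V = 4 words,
# every word has probability 1/4, so the perplexity equals V
n_m_z = numpy.array([[2.0, 1.0]]) + 0.5
n_z_t = numpy.ones((2, 4))
n_z = n_z_t.sum(axis=1)
pp = perplexity([[0, 3, 1]], n_m_z, n_z_t, n_z)
print(pp)
```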

Thanks Shuyo for your reply. Do you have any code in your repository which outputs the document_topic distribution as well? I see that you do output topic_word distribution, but I was wondering if you also output or print the document-topic distribution of the training data?

Although my code does not have a method for the document-topic distribution, it can be calculated as self.n_m_z / self.n_m_z.sum(axis=1)[:, numpy.newaxis] .
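A minimal sketch of that normalization, using a made-up `n_m_z` for 2 documents and 3 topics, is shown below. The row sums have to keep their axis (here via `keepdims=True`) so that each document's row is divided by its own total.

```python
import numpy

# hypothetical counters for 2 documents and 3 topics, with alpha = 0.5 already added
alpha = 0.5
n_m_z = numpy.array([[3.0, 0.0, 1.0],
                     [0.0, 2.0, 2.0]]) + alpha

# keepdims=True keeps the sums as a column, so the division broadcasts row by row
theta = n_m_z / n_m_z.sum(axis=1, keepdims=True)
print(theta.sum(axis=1))  # each row sums to 1
```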

Thanks Very much for the useful code,

I have tried your implementation and it works fine for me.

However, I have a question.

As I understand it, your code generates the topic-word distribution.

If I have new documents and want to assign topic probabilities to them, does your code support this?

If so, could you please guide me or give me some hints?

Thanks a lot,

Hi,

My code doesn't support estimating the document-topic distribution of a new document.

That needs Gibbs sampling with some approximation.

I might write about it if I feel like it 😀
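One common approximation, sketched here as an assumption rather than as the author's method, is to keep the trained topic-word counters fixed and run Gibbs sampling only over the new document's own topic assignments. The function name `infer_new_document`, the number of iterations and the toy counters are hypothetical.

```python
import numpy

def infer_new_document(doc, n_z_t, n_z, alpha, iterations=50, rng=None):
    """Estimate the topic distribution of a new document with n_z_t and n_z held fixed."""
    if rng is None:
        rng = numpy.random.default_rng()
    K = n_z_t.shape[0]
    n_d_z = numpy.zeros(K) + alpha         # topic counts of the new document (+ alpha)
    z_n = numpy.zeros(len(doc), dtype=int)
    # incremental initialization, as in the post
    for n, t in enumerate(doc):
        p_z = n_z_t[:, t] * n_d_z / n_z
        z_n[n] = rng.multinomial(1, p_z / p_z.sum()).argmax()
        n_d_z[z_n[n]] += 1
    # resample each word's topic; the trained counters are never modified
    for _ in range(iterations):
        for n, t in enumerate(doc):
            n_d_z[z_n[n]] -= 1
            p_z = n_z_t[:, t] * n_d_z / n_z
            z_n[n] = rng.multinomial(1, p_z / p_z.sum()).argmax()
            n_d_z[z_n[n]] += 1
    return n_d_z / n_d_z.sum()

# hypothetical trained counters: topic 0 prefers words 0 and 1, topic 1 prefers words 2 and 3
beta = 0.1
n_z_t = numpy.array([[50.0, 50.0, 0.0, 0.0],
                     [0.0, 0.0, 50.0, 50.0]]) + beta
n_z = n_z_t.sum(axis=1)

theta_new = infer_new_document([0, 1, 0, 1, 1], n_z_t, n_z, alpha=0.5,
                               rng=numpy.random.default_rng(0))
print(theta_new)
```

The approximation is that the new document's words do not update the topic-word distribution, which is reasonable when the new document is small compared to the training corpus.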