Python implementation of Labeled LDA (Ramage+ EMNLP2009)

Labeled LDA (D. Ramage, D. Hall, R. Nallapati and C.D. Manning; EMNLP2009) is a supervised topic model derived from LDA (Blei+ 2003).
While LDA's estimated topics often do not match human expectations because it is unsupervised, Labeled LDA is a model that handles documents annotated with multiple labels.

I implemented Labeled LDA in Python. The code is here; it performs LLDA estimation on arbitrary text. Its input file treats each line as one document.
Each document can be given one or more labels by putting them in brackets at the head of the line, like the following:

[label1,label2,...]some text

Documents given without labels are handled in a semi-supervised way. A sample script is also included: it estimates an LLDA model on 100 documents sampled from the Reuters corpus in NLTK and outputs the top 20 words for each label.
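For illustration, here is a minimal sketch of how such an input line might be parsed (the function name and regular expression are my own, not taken from the actual code):

```python
import re

def parse_document(line):
    """Parse one input line: an optional [label1,label2,...] prefix, then text.
    Returns (labels, words); labels is empty for an unlabeled document."""
    m = re.match(r'\[(.+?)\](.*)', line)
    if m:
        labels = m.group(1).split(',')
        text = m.group(2)
    else:
        labels, text = [], line
    return labels, text.split()

# example with Reuters-style category labels
print(parse_document("[grain,ship]some text about wheat exports"))
# -> (['grain', 'ship'], ['some', 'text', 'about', 'wheat', 'exports'])
```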

(Ramage+ EMNLP2009) does not describe how to compute LLDA's perplexity, so I derived the document-topic and topic-word distributions that it requires:

where λ(d) means the topic set corresponding to the labels of document d, and M_d is the size of that set.
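As a concrete sketch in the standard collapsed Gibbs notation (the counts $n_{dk}$, $n_{kw}$, the hyperparameters $\alpha$, $\beta$, and the vocabulary size $V$ below are my own symbols), such estimators take the form

$$\hat\theta_{dk} = \frac{n_{dk} + \alpha}{n_{d\cdot} + M_d\,\alpha} \quad (k \in \lambda(d);\ \hat\theta_{dk} = 0 \text{ otherwise}), \qquad \hat\phi_{kw} = \frac{n_{kw} + \beta}{n_{k\cdot} + V\,\beta},$$

where $n_{dk}$ is the number of words in document $d$ assigned to topic $k$, $n_{kw}$ is the number of times word $w$ is assigned to topic $k$, and a dot subscript denotes summation.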

I would be glad if you could point out any mistakes in these.

LLDA requires that labels be assigned to topics explicitly and exactly, but it is very difficult to know how many topics each label should get for better estimation.
Moreover, it is natural that some categories share common topics (e.g. the “baseball” and “soccer” categories).

DP-MRM (Kim+ ICML2012), which I introduced on this blog, is a model that extends LLDA to estimate the label-topic correspondence.

I want to implement it too, but it is very complex…

Posted in Uncategorized | 6 Comments

HDP-LDA updates

Hierarchical Dirichlet Processes (Teh+ 2006) form a nonparametric Bayesian topic modeling framework that can handle an infinite number of topics.
In particular, HDP-LDA is interesting as an extension of LDA.

(Teh+ 2006) introduces collapsed Gibbs sampling updates for the general HDP framework, but not for HDP-LDA specifically.
To obtain the updates for HDP-LDA, it is necessary to plug HDP-LDA's base measure H and emission distribution F(φ) into the equation below:

$$f_k^{-x_{ji}}(x_{ji}) = \frac{\int f(x_{ji}\mid\phi_k)\prod_{j'i' \neq ji,\, z_{j'i'}=k} f(x_{j'i'}\mid\phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{j'i' \neq ji,\, z_{j'i'}=k} f(x_{j'i'}\mid\phi_k)\, h(\phi_k)\, d\phi_k} \qquad \text{(eq. 30 in [Teh+ 2006])}$$

where h is the probability density function of H and f is that of F.
In the case of HDP-LDA, H is a Dirichlet distribution over the vocabulary and F is a topic-word multinomial distribution, that is,

$$h(\phi) = \frac{\Gamma(V\beta)}{\Gamma(\beta)^{V}} \prod_{w=1}^{V} \phi_w^{\beta - 1}, \qquad f(x = w \mid \phi) = \phi_w,$$

where $V$ is the vocabulary size and $\beta$ is the parameter of the symmetric Dirichlet base measure.

Substituting these into equation (30), we obtain

$$f_k^{-x_{ji}}(x_{ji} = w) = \frac{n_{kw}^{-ji} + \beta}{n_{k\cdot}^{-ji} + V\beta},$$

where $n_{kw}^{-ji}$ is the number of times word $w$ is assigned to topic $k$ excluding $x_{ji}$, and $n_{k\cdot}^{-ji} = \sum_{w} n_{kw}^{-ji}$.

We also need $f_{k^{\mathrm{new}}}$ for the case where the sampled $t$ is a new table. It is obtained as follows:

$$f_{k^{\mathrm{new}}}^{-x_{ji}}(x_{ji}) = \int f(x_{ji}\mid\phi)\, h(\phi)\, d\phi = \frac{1}{V}.$$
It is also necessary to write down $f_k(\mathbf{x}_{jt})$ for sampling $k$:

$$f_k^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt}) = \frac{\Gamma\!\left(n_{k\cdot}^{-jt} + V\beta\right)}{\Gamma\!\left(n_{k\cdot}^{-jt} + n_{jt\cdot} + V\beta\right)} \prod_{w} \frac{\Gamma\!\left(n_{kw}^{-jt} + n_{jtw} + \beta\right)}{\Gamma\!\left(n_{kw}^{-jt} + \beta\right)},$$

where $n_{kw}^{-jt}$ is the term count of word $w$ with topic $k$ excluding the words at table $t$ of document $j$, $n_{jtw}$ is the count of word $w$ at that table, and a dot subscript denotes summation over the vocabulary.

When implementing this in Python, it is faster to keep the Gamma functions as they are than to unfold them into products. In either case it is necessary to work with their logarithms, or $f_k(\mathbf{x}_{jt})$ will overflow the floating-point range.
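For example, here is a minimal sketch of that log-space computation using scipy's gammaln (the function name and the count dictionaries are my own assumptions, not the actual implementation):

```python
from scipy.special import gammaln

def log_f_k(n_kw, n_k, n_jtw, n_jt, V, beta):
    """log f_k(x_jt): likelihood of the words at table t of document j
    under topic k. n_kw / n_k are the topic-word counts / topic total
    excluding that table; n_jtw / n_jt are the word counts / word total
    at the table. Keeping the Gamma ratios as gammaln differences stays
    in log space, so nothing overflows."""
    log_f = gammaln(n_k + V * beta) - gammaln(n_k + n_jt + V * beta)
    for w, c in n_jtw.items():
        # words not occurring at the table contribute a factor of 1
        log_f += gammaln(n_kw.get(w, 0) + c + beta) - gammaln(n_kw.get(w, 0) + beta)
    return log_f
```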


Posted in LDA, Machine Learning, Nonparametric Bayesian | 4 Comments

[Kim+ ICML12] Dirichlet Process with Mixed Random Measures

We held a private reading meeting for ICML 2012.
I presented [Kim+ ICML12], “Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data.”
This is the presentation for it.

DP-MRM [Kim+ ICML12] is a supervised topic model like sLDA [Blei+ 2007], DiscLDA [Lacoste-Julien+ 2008] and MedLDA [Zhu+ 2009], and is regarded as a nonparametric version of Labeled LDA [Ramage+ 2009] in particular.

Although Labeled LDA is easy to implement (my implementation is here), it has the disadvantage that you must specify the label-topic correspondence explicitly and manually.
On the other hand, DP-MRM can decide the label-topic correspondence automatically, as distributions. I am very interested in that.
But it is hard to implement because it is a nonparametric Bayesian model.
Since I do not need infinite topics but rather a hierarchical label-topic correspondence, I guess that replacing the DPs with ordinary Dirichlet distributions in this model would make it very useful, handy, and fast… I am going to try it! 😀

Posted in LDA, Machine Learning, Nonparametric Bayesian | 6 Comments

Short Text Language Detection with Infinity-Gram

I gave a talk about language detection (language identification) for Twitter at NAIST (Nara Institute of Science and Technology).
Here is the slide.

Tweets are too short to detect their language precisely. I guess one reason is that the features extracted from such a short text are not sufficient for detection.
Another reason is that tweets contain unique representations, for example ‘u’ for ‘you’, ‘4’ for ‘for’, LOL, F4F, various emoticons, and so on.

I developed ldig, a prototype of short-text language detection that solves those problems.
ldig can detect the languages of tweets with over 99% accuracy across 19 languages.

The above slide explains how ldig solves those problems.

Posted in Language Detection, NLP | 24 Comments

[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems

In April 2012, we held a private reading meeting for NIPS 2011.
I read “Iterative Learning for Reliable Crowdsourcing Systems” [Karger+ NIPS11].

This paper targets Amazon Mechanical Turk (AMT), which splits a large task into microtasks. Each worker on AMT may be a spammer (who answers randomly just to earn the fee) or a hammer (who answers correctly).
The paper's model simply assumes that each microtask has a consistent binary answer and that each worker has a probability of answering correctly that is independent of the task. Under this assumption, it estimates the average error rate when the number of tasks is large enough.
I do not mind that it relies on a simple but strong assumption; what bothers me is that the true value of the model parameter q cannot be known, so a practical problem cannot be fitted to the model even if the assumption is accepted.

Posted in NLP | Leave a comment

[Freedman+ EMNLP11] Extreme Extraction – Machine Reading in a Week

In December 2011, we held a private reading meeting for EMNLP 2011.
I read “Extreme Extraction – Machine Reading in a Week” [Freedman+ EMNLP11].

This paper is about constructing a concept and relation extraction system rapidly.
I am afraid it gives no concrete construction methods (they might be confidential…), only the length of time each task took.
Since I have not studied this field much yet, it was still useful to learn which tasks information extraction generally consists of.

Posted in NLP | Leave a comment

Why is Norwegian and Danish identification difficult?

I am re-posting the evaluation table of ldig (Twitter language detection).

lang size detected correct precision recall
cs 5329 5330 5319 0.9979 0.9981
da 5478 5483 5311 0.9686 0.9695
de 10065 10076 10014 0.9938 0.9949
en 9701 9670 9569 0.9896 0.9864
es 10066 10075 9989 0.9915 0.9924
fi 4490 4472 4459 0.9971 0.9931
fr 10098 10097 10048 0.9951 0.9950
id 10181 10233 10167 0.9936 0.9986
it 10150 10191 10109 0.9920 0.9960
nl 9671 9579 9521 0.9939 0.9845
no 8560 8442 8219 0.9736 0.9602
pl 10070 10079 10054 0.9975 0.9984
pt 9422 9441 9354 0.9908 0.9928
ro 5914 5831 5822 0.9985 0.9844
sv 9990 10034 9866 0.9833 0.9876
tr 10310 10321 10300 0.9980 0.9990
vi 10494 10486 10479 0.9993 0.9986
total 149989 148600 0.9907 (size, correct, overall accuracy)
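For reference, the precision and recall columns are derived from the size / detected / correct counts as follows (checked here against the ‘cs’ row and the total):

```python
# per-language metrics, illustrated with the "cs" row above
size, detected, correct = 5329, 5330, 5319
precision = correct / detected    # 5319 / 5330 = 0.9979
recall    = correct / size        # 5319 / 5329 = 0.9981
accuracy  = 148600 / 149989       # total row:    0.9907
```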

This shows that the accuracy for Norwegian and Danish is lower than for the other languages.
It is because Norwegian and Danish are very similar.

Here are the top 25 words of the word distributions of Norwegian and Danish.

rank word Danish Norwegian max
1 er 0.0311 0.0238 0.0311
2 det 0.0287 0.0228 0.0287
3 i 0.0253 0.0275 0.0275
4 0.0165 0.0263 0.0263
5 jeg 0.0185 0.0202 0.0202
6 og 0.0188 0.0202 0.0202
7 at 0.0183 0.0083 0.0183
8 å 0.0001 0.0167 0.0167
9 til 0.0157 0.0140 0.0157
10 en 0.0149 0.0120 0.0149
11 ikke 0.0119 0.0146 0.0146
12 har 0.0122 0.0132 0.0132
13 med 0.0120 0.0117 0.0120
14 som 0.0044 0.0116 0.0116
15 for 0.0097 0.0115 0.0115
16 du 0.0111 0.0093 0.0111
17 0.0110 0.0072 0.0110
18 der 0.0101 0.0016 0.0101
19 av 0.0000 0.0093 0.0093
20 den 0.0089 0.0056 0.0089
21 af 0.0089 0.0000 0.0089
22 om 0.0052 0.0073 0.0073
23 kan 0.0073 0.0044 0.0073
24 men 0.0062 0.0072 0.0072
25 de 0.0066 0.0054 0.0066

Most of the high-frequency words, such as ‘er’ (which nearly corresponds to ‘is’ in English; glosses below follow the same style), ‘det’ (‘it’), and ‘i’ (‘in’), are function words common to both languages, so it is very difficult to tell them apart.
Words useful for identification are pretty few. For example, ‘of’ in English corresponds to ‘af’ in Danish and ‘av’ in Norwegian, ‘me’ to ‘mig’ (da) and ‘meg’ (no), ‘now’ to ‘nu’ (da) and ‘nå’ (no), ‘just’ to ‘lige’ (da) and ‘like’ (no), ‘what’ to ‘hvad’ (da) and ‘hva’ (no), ‘no’ to ‘nej’ (da) and ‘nei’ (no), and so on.
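As a rough sketch of how one might look for such discriminative words automatically, the following hypothetical helper scans two word-frequency distributions for words that are frequent in one language but rare in the other (the function name, thresholds, and data structures are my own choices, not part of ldig):

```python
def discriminative_words(freq_da, freq_no, ratio=10.0, min_freq=0.001):
    """Return words frequent in one language but rare in the other.
    freq_da / freq_no map word -> relative frequency."""
    result = []
    for w in set(freq_da) | set(freq_no):
        fd, fn = freq_da.get(w, 0.0), freq_no.get(w, 0.0)
        hi, lo = max(fd, fn), min(fd, fn)
        if hi >= min_freq and hi >= ratio * max(lo, 1e-6):
            result.append((w, fd, fn))
    return sorted(result, key=lambda x: -max(x[1], x[2]))

# with the table above, words like 'af', 'av' and 'å' would be returned,
# while 'er', 'det' and 'i' would not.
```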

The most serious problem is my own ability to identify Danish and Norwegian…
I collected tens of thousands of tweets in these languages, but I am afraid they contain errors at a level of several percent.
Of course, that causes detection errors.

If you know native speakers of Norwegian or Danish who are interested in language detection and corpus annotation, I would be glad if you could introduce them to me! 😀

Posted in Language Detection | 8 Comments