HDP-LDA updates

Hierarchical Dirichlet Processes (Teh+ 2006) are a nonparametric Bayesian topic model that can handle an unbounded (potentially infinite) number of topics.
In particular, HDP-LDA is interesting as an extension of LDA.

(Teh+ 2006) introduced collapsed Gibbs sampling updates for the general HDP framework, but not for HDP-LDA specifically.
To obtain the updates for HDP-LDA, it is necessary to plug the base measure H and the emission distribution F(phi) of the HDP-LDA setting into the equation below:

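$$ f_k^{-x_{ji}}(x_{ji}) = \frac{\int f(x_{ji}|\phi_k) \prod_{j'i' \neq ji,\, z_{j'i'}=k} f(x_{j'i'}|\phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{j'i' \neq ji,\, z_{j'i'}=k} f(x_{j'i'}|\phi_k)\, h(\phi_k)\, d\phi_k} $$ (eq. 30 in [Teh+ 2006])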

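where h is the probability density function of H and f is that of F.
In the case of HDP-LDA, H is a Dirichlet distribution over the vocabulary and F is a topic-word multinomial distribution, that is

$$ h(\phi_k) = \frac{\Gamma(V\beta)}{\Gamma(\beta)^V} \prod_{w} \phi_{k,w}^{\beta-1}, \quad f(x = w|\phi_k) = \phi_{k,w}, $$

where $V$ is the vocabulary size, $\beta$ is the (symmetric) Dirichlet parameter, and $\sum_w \phi_{k,w} = 1$.

Substituting these into equation (30), we obtain

$$ \begin{aligned}
f_k^{-x_{ji}}(x_{ji} = w)
&= \frac{\int \phi_{k,w} \prod_{w'} \phi_{k,w'}^{n_{k,w'}^{-ji}+\beta-1}\, d\phi_k}{\int \prod_{w'} \phi_{k,w'}^{n_{k,w'}^{-ji}+\beta-1}\, d\phi_k} \\
&= \frac{\Gamma(n_{k,w}^{-ji}+\beta+1)}{\Gamma(n_{k,w}^{-ji}+\beta)} \cdot \frac{\Gamma(n_k^{-ji}+V\beta)}{\Gamma(n_k^{-ji}+V\beta+1)} \\
&= \frac{n_{k,w}^{-ji}+\beta}{n_k^{-ji}+V\beta},
\end{aligned} $$

where $n_{k,w}^{-ji}$ is the number of times word $w$ is assigned to topic $k$, excluding $x_{ji}$, and $n_k^{-ji} = \sum_w n_{k,w}^{-ji}$.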

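We also need f_k^new for the case where t takes a new table. It is obtained as follows (the last step holds for a symmetric Dirichlet prior over a vocabulary of V words):

$$ f_{k^{new}}(x_{ji} = w) = \int f(x_{ji} = w|\phi)\, h(\phi)\, d\phi = \frac{1}{V} $$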

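It is also necessary to write down f_k(x_jt), the likelihood of all the words at table t of document j, for sampling k.

$$ f_k^{-x_{jt}}(x_{jt}) = \frac{\int \prod_{x_{ji} \in x_{jt}} f(x_{ji}|\phi_k) \prod_{x_{j'i'} \notin x_{jt},\, z_{j'i'}=k} f(x_{j'i'}|\phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{x_{j'i'} \notin x_{jt},\, z_{j'i'}=k} f(x_{j'i'}|\phi_k)\, h(\phi_k)\, d\phi_k} $$

For $n_{k,w}^{-jt}$ (it means “term count of word w with topic k”, excluding $x_{jt}$), $n_k^{-jt} = \sum_w n_{k,w}^{-jt}$, $n_{jt,w}$ (the count of word w in $x_{jt}$) and $n_{jt} = \sum_w n_{jt,w}$, with V the vocabulary size and $\beta$ the Dirichlet parameter of H,

$$ f_k^{-x_{jt}}(x_{jt}) = \frac{\Gamma(n_k^{-jt}+V\beta)}{\Gamma(n_k^{-jt}+n_{jt}+V\beta)} \prod_w \frac{\Gamma(n_{k,w}^{-jt}+n_{jt,w}+\beta)}{\Gamma(n_{k,w}^{-jt}+\beta)} $$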

When implementing this in Python, it is faster to evaluate the Gamma functions directly than to unfold them into products. In either case it is necessary to work with the logarithms, or f_k(x_jt) will fall outside the float range.
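As a minimal sketch of this point (the function and variable names are my own for illustration, and a symmetric Dirichlet parameter beta over V words is assumed), log f_k(x_jt) can be computed with math.lgamma. The second function unfolds the Gamma functions into a product of per-word factors (n_kw + beta) / (n_k + V * beta), updating the counts one word at a time, and should give the same value:

```python
from collections import Counter
from math import lgamma, log


def log_f_k_table(n_kw, n_k, table_words, V, beta):
    """log f_k(x_jt) using log-Gamma functions directly.

    n_kw and n_k are the counts for topic k with the words of the
    table already excluded; table_words is the list of words x_jt.
    """
    n_jt_w = Counter(table_words)
    n_jt = len(table_words)
    result = lgamma(n_k + V * beta) - lgamma(n_k + n_jt + V * beta)
    for w, c in n_jt_w.items():
        base = n_kw.get(w, 0) + beta
        result += lgamma(base + c) - lgamma(base)
    return result


def log_f_k_table_unfolded(n_kw, n_k, table_words, V, beta):
    """The same quantity with the Gamma ratios unfolded into products."""
    counts = dict(n_kw)  # copy, so the caller's counts are not modified
    total = n_k
    result = 0.0
    for w in table_words:
        result += log((counts.get(w, 0) + beta) / (total + V * beta))
        counts[w] = counts.get(w, 0) + 1
        total += 1
    return result


# Toy counts (hypothetical): topic k holds "cat" x3 and "dog" x1.
n_kw = {"cat": 3, "dog": 1}
words = ["cat", "cat", "fish"]
print(log_f_k_table(n_kw, 4, words, V=4, beta=0.5))
print(log_f_k_table_unfolded(n_kw, 4, words, V=4, beta=0.5))
```

Working in log space matters because the raw Gamma values leave the double range very quickly (math.gamma already fails around an argument of 172), while their logarithms stay small.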

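Finally, for a new topic,

$$ f_{k^{new}}(x_{jt}) = \frac{\Gamma(V\beta)}{\Gamma(n_{jt}+V\beta)} \prod_w \frac{\Gamma(n_{jt,w}+\beta)}{\Gamma(\beta)} $$

where $n_{jt,w}$ is the count of word w at table t of document j, $n_{jt} = \sum_w n_{jt,w}$, V is the vocabulary size and $\beta$ is the Dirichlet parameter.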

Posted in LDA, Machine Learning, Nonparametric Bayesian

[Kim+ ICML12] Dirichlet Process with Mixed Random Measures

We held a private reading meeting for ICML 2012.
I picked and presented [Kim+ ICML12] “Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data.”
This is the presentation for it.

DP-MRM [Kim+ ICML12] is a supervised topic model like sLDA [Blei+ 2007], DiscLDA [Lacoste-Julien+ 2008] and MedLDA [Zhu+ 2009], and is regarded as a nonparametric version of Labeled LDA [Ramage+ 2009] in particular.

Although Labeled LDA is easy to implement (my implementation is here), it has the disadvantage that you must specify the label-topic correspondences explicitly and manually.
On the other hand, DP-MRM can automatically decide the label-topic correspondences as distributions. I am very interested in it.
But it is hard to implement because it is a nonparametric Bayesian model.
Since I don’t need infinite topics but do want hierarchical label-topic correspondences, I guess that replacing the DPs in this model with ordinary Dirichlet distributions would make it very useful, handy and fast… I am going to try it! :D

Posted in LDA, Machine Learning, Nonparametric Bayesian

Short Text Language Detection with Infinity-Gram

I gave a talk about language detection (language identification) for Twitter at NAIST (Nara Institute of Science and Technology).
Here are the slides.


Tweets are too short for their languages to be detected precisely. I guess one reason is that the features extracted from a short text are not enough for detection.
Another reason is that tweets contain some unique expressions, for example, “u” for “you”, “4” for “for”, LOL, F4F, various emoticons and so on.

I developed ldig, a prototype short-text language detector that solves those problems.
ldig can detect the languages of tweets with over 99% accuracy for 19 languages.

The slides above explain how ldig solves those problems.

Posted in Language Detection, NLP

[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems

In April 2012, we held a private reading meeting for NIPS 2011.
I read “Iterative Learning for Reliable Crowdsourcing Systems” [Karger+ NIPS11].

This paper targets Amazon Mechanical Turk (AMT), which splits a large task into microtasks. Each worker in AMT may be a spammer (who answers randomly to earn the fee) or a hammer (who answers correctly).
The paper’s model simply assumes that each microtask has a consistent binary answer and that each worker answers correctly with a probability that is independent of the task. Under this assumption, it estimates the average error rate when the number of tasks is large enough.
I don’t mind that it needs a simple, strong assumption, but it is a pity that the true value of the model parameter q cannot be known, so a practical problem cannot be fitted to the model even if the assumption is accepted.
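To make the assumption concrete, here is a minimal simulation sketch. It is not the paper's iterative algorithm: it uses plain majority voting as a baseline, every worker answers every task, and all names are my own. The quantity q is taken here to be the average of (2p - 1)^2 over workers, which is my reading of the paper's definition.

```python
import random


def crowd_quality(accuracies):
    """Average of (2p - 1)^2 over workers (assumed definition of q)."""
    return sum((2 * p - 1) ** 2 for p in accuracies) / len(accuracies)


def majority_vote_error(accuracies, num_tasks=1000, seed=0):
    """Fraction of binary tasks that majority voting gets wrong.

    Each worker answers correctly with his or her own probability p,
    independently of the task, as the model assumes.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(num_tasks):
        truth = rng.choice((-1, 1))
        total = sum(truth if rng.random() < p else -truth for p in accuracies)
        if total == 0:
            estimate = rng.choice((-1, 1))  # break ties at random
        else:
            estimate = 1 if total > 0 else -1
        errors += estimate != truth
    return errors / num_tasks


hammers = [1.0] * 5                 # always correct
spammers = [0.5] * 5                # answer at random
mixed = [1.0, 1.0, 0.5, 0.5, 0.5]   # two hammers, three spammers

for name, crowd in (("hammers", hammers), ("spammers", spammers), ("mixed", mixed)):
    print(name, crowd_quality(crowd), majority_vote_error(crowd))
```

In the simulation the worker accuracies are given, so q can be computed; with real workers they are hidden, which is the difficulty mentioned above.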

Posted in NLP

[Freedman+ EMNLP11] Extreme Extraction – Machine Reading in a Week

In December 2011, we held a private reading meeting for EMNLP 2011.
I read “Extreme Extraction – Machine Reading in a Week” [Freedman+ EMNLP11].

This paper is about rapidly constructing a concept and relation extraction system.
Unfortunately, it gives no concrete construction methods (they might be confidential…), only the length of time spent on each task.
Since I have not studied this field much yet, it at least taught me what tasks information extraction generally consists of.

Posted in NLP

Why is Norwegian and Danish identification difficult?

I re-post the evaluation table of ldig (Twitter language detection).

lang     size  detected  correct  precision  recall
cs       5329      5330     5319     0.9979  0.9981
da       5478      5483     5311     0.9686  0.9695
de      10065     10076    10014     0.9938  0.9949
en       9701      9670     9569     0.9896  0.9864
es      10066     10075     9989     0.9915  0.9924
fi       4490      4472     4459     0.9971  0.9931
fr      10098     10097    10048     0.9951  0.9950
id      10181     10233    10167     0.9936  0.9986
it      10150     10191    10109     0.9920  0.9960
nl       9671      9579     9521     0.9939  0.9845
no       8560      8442     8219     0.9736  0.9602
pl      10070     10079    10054     0.9975  0.9984
pt       9422      9441     9354     0.9908  0.9928
ro       5914      5831     5822     0.9985  0.9844
sv       9990     10034     9866     0.9833  0.9876
tr      10310     10321    10300     0.9980  0.9990
vi      10494     10486    10479     0.9993  0.9986
total  149989             148600     0.9907 (accuracy)

This shows that the accuracies for Norwegian and Danish are lower than those for the other languages.
This is because Norwegian and Danish are very similar.
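For reference, the precision and recall columns follow from the three counts: precision is correct / detected and recall is correct / size. A small sketch (the function name is mine) that recomputes the Danish and Norwegian rows and the overall accuracy from the numbers in the table:

```python
def precision_recall(size, detected, correct):
    """Return (precision, recall) rounded to 4 digits, as in the table."""
    precision = correct / detected  # share of detections that are right
    recall = correct / size         # share of the true tweets that are found
    return round(precision, 4), round(recall, 4)


# (size, detected, correct) copied from the table
rows = {
    "da": (5478, 5483, 5311),
    "no": (8560, 8442, 8219),
}

for lang, counts in rows.items():
    print(lang, precision_recall(*counts))

# Overall accuracy: all correct detections over all tweets
print("total", round(148600 / 149989, 4))
```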

Here are the top 25 words of the word distributions of Norwegian and Danish.

rank  word    Danish  Norwegian  amount
   1  er      0.0311     0.0238  0.0311
   2  det     0.0287     0.0228  0.0287
   3  i       0.0253     0.0275  0.0275
   4          0.0165     0.0263  0.0263
   5  jeg     0.0185     0.0202  0.0202
   6  og      0.0188     0.0202  0.0202
   7  at      0.0183     0.0083  0.0183
   8  å       0.0001     0.0167  0.0167
   9  til     0.0157     0.0140  0.0157
  10  en      0.0149     0.0120  0.0149
  11  ikke    0.0119     0.0146  0.0146
  12  har     0.0122     0.0132  0.0132
  13  med     0.0120     0.0117  0.0120
  14  som     0.0044     0.0116  0.0116
  15  for     0.0097     0.0115  0.0115
  16  du      0.0111     0.0093  0.0111
  17          0.0110     0.0072  0.0110
  18  der     0.0101     0.0016  0.0101
  19  av      0.0000     0.0093  0.0093
  20  den     0.0089     0.0056  0.0089
  21  af      0.0089     0.0000  0.0089
  22  om      0.0052     0.0073  0.0073
  23  kan     0.0073     0.0044  0.0073
  24  men     0.0062     0.0072  0.0072
  25  de      0.0066     0.0054  0.0066

Most of the high-frequency words, such as ‘er’ (roughly ‘is’ in English; likewise below), ‘det’ (‘it’) and ‘i’ (‘in’), are function words common to the two languages, so it is very difficult to tell the languages apart.
Only a few words are useful for the identification. For example, ‘of’ in English corresponds to ‘af’ in Danish and ‘av’ in Norwegian, ‘me’ to ‘mig’ (da) and ‘meg’ (no), ‘now’ to ‘nu’ (da) and ‘nå’ (no), ‘just’ to ‘lige’ (da) and ‘like’ (no), ‘what’ to ‘hvad’ (da) and ‘hva’ (no), ‘no’ to ‘nej’ (da) and ‘nei’ (no), and so on.
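To make “useful words” a bit more concrete, here is a small sketch that ranks some of the words from the table above by the absolute log ratio of their Danish and Norwegian frequencies. The smoothing constant eps and the log-ratio score are my own illustrative choices, not how ldig works internally.

```python
from math import log

# word: (Danish frequency, Norwegian frequency), copied from the table above
freq = {
    "er": (0.0311, 0.0238),
    "det": (0.0287, 0.0228),
    "i": (0.0253, 0.0275),
    "at": (0.0183, 0.0083),
    "å": (0.0001, 0.0167),
    "som": (0.0044, 0.0116),
    "der": (0.0101, 0.0016),
    "av": (0.0000, 0.0093),
    "af": (0.0089, 0.0000),
}


def log_ratio(da, no, eps=1e-4):
    """Positive means typical of Danish, negative typical of Norwegian.

    eps avoids log(0) for words that never occur in one language.
    """
    return log((da + eps) / (no + eps))


# Most discriminative words first
ranked = sorted(freq, key=lambda w: abs(log_ratio(*freq[w])), reverse=True)
for w in ranked:
    print(w, round(log_ratio(*freq[w]), 2))
```

The words that occur in only one of the two languages (‘av’, ‘af’, ‘å’) come out on top, while the most frequent words (‘er’, ‘det’, ‘i’) end up at the bottom of the ranking.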

The most serious problem is my own ability to tell Danish from Norwegian…
I collected tens of thousands of tweets in these languages, but I’m afraid that several percent of them are mislabelled.
Of course, that causes detection errors.

If you know native Norwegian or Danish speakers who are interested in language detection and corpus annotation, I would be glad if you could introduce them to me! :D

Posted in Uncategorized

Evaluation of ldig (Twitter Language Detection) on the LIGA dataset

LIGA [Tromp+ 11] is a Twitter language detection method for 6 languages (German, English, Spanish, French, Italian and Dutch).
It uses a graph of 3-grams to capture long-distance features and achieves 95-98% accuracy.
They publish their dataset here, which has 9066 tweets, so it is possible to compare ldig (Language Detection with Infinity-Gram: site, blog) with their results.
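As a rough illustration of the “graph with 3-grams” idea, here is a simplified sketch under my own assumptions: nodes are character 3-grams, edges link consecutive 3-grams, both carry per-language counts, and a text is scored by summing the counts of the nodes and edges it matches. The paper's actual scoring formula is not reproduced here, and the two training samples are made up.

```python
from collections import defaultdict


def trigrams(text):
    """Overlapping character 3-grams of a text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_graph(samples):
    """Count 3-grams (nodes) and 3-gram transitions (edges) per language."""
    nodes = defaultdict(lambda: defaultdict(int))
    edges = defaultdict(lambda: defaultdict(int))
    for lang, text in samples:
        grams = trigrams(text)
        for g in grams:
            nodes[g][lang] += 1
        for pair in zip(grams, grams[1:]):
            edges[pair][lang] += 1
    return nodes, edges


def score(text, nodes, edges):
    """Sum, per language, the counts of all matching nodes and edges."""
    grams = trigrams(text)
    scores = defaultdict(int)
    for g in grams:
        for lang, count in nodes.get(g, {}).items():
            scores[lang] += count
    for pair in zip(grams, grams[1:]):
        for lang, count in edges.get(pair, {}).items():
            scores[lang] += count
    return dict(scores)


# Made-up training samples, for illustration only
samples = [("en", "this is a test"), ("nl", "dit is een test")]
nodes, edges = build_graph(samples)
print(score("this", nodes, edges))
```

The edges are what bring in context beyond a single 3-gram: two consecutive 3-grams together cover four characters.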

First, their dataset needs to be converted into a format that ldig accepts.
Here is a Ruby script to convert it.

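#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

# Convert the LIGA dataset (one tweet per file, one directory per language)
# into one "<lang>\t<text>" line per tweet.
File.open("liga_dataset.txt", "wb:UTF-8") do |out|
  ["de_DE","en_UK","es_ES","fr_FR","it_IT","nl_NL"].each do |dir|
    lang = dir[0,2]  # "de_DE" -> "de"
    Dir.glob("#{dir}/*.txt") do |file|
      # Replace control characters (including newlines) with spaces
      # so that each tweet stays on a single line.
      line = File.open(file, "rb:UTF-8") {|f| f.read}.gsub(/[\u0000-\u001f]/, " ").strip
      out.puts "#{lang}\t#{line}"
    end
  end
end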

Here is the evaluation of ldig on the generated dataset, liga_dataset.txt.

lang     size  detected  correct  precision  recall
de       1479      1469     1463     0.9959  0.9892
en       1505      1504     1489     0.9900  0.9894
es       1562      1550     1541     0.9942  0.9866
fr       1551      1545     1539     0.9961  0.9923
it       1539      1532     1526     0.9961  0.9916
nl       1430      1429     1425     0.9972  0.9965
total    9066               8983     0.9908 (accuracy)

It shows that ldig achieves over 99% accuracy on their dataset.

Reference

  • [Erik Tromp and Mykola Pechenizkiy 11] “Graph-Based N-gram Language Identification on Short Texts”
Posted in Language Detection, NLP, twitter