Bayesian inference of Randomized Response: Python implementation

In the previous article, I introduced three estimation methods for Randomized Response: Maximum Likelihood, Gibbs Sampling, and Collapsed Variational Bayes.

I implemented and experimented with these methods in Python.

rr-mle.py and rr-gibbs.py draw the results as charts after 10000 trials.
rr-vb.py separates the inference trials from the result drawing because of its very long execution time.

$ python rr-vb.py 100 # 1000 inference trials (parallel execution)
$ python rr-vb.py     # result report

For a survey with D(=4) choices, the scripts estimate the distribution of true answers from N(=100, 1000, 10000) answers whose true ratio is 1:2:3:4 (i.e. \boldsymbol{\pi}=(0.1, 0.2, 0.3, 0.4)), randomized by the Randomized Response mechanism with a design matrix with p=1/5 (see the previous article for notation).
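As a rough sketch of this setup (not the actual rr-*.py scripts; the function names and structure here are my own), generating the randomized answers could look like this:

import numpy as np

def design_matrix(D, p):
    # Y keeps X with probability p, otherwise becomes a uniformly random choice
    return p * np.eye(D) + (1 - p) / D * np.ones((D, D))

def randomize(X, P, rng):
    # draw each randomized answer Y_n from the row of P selected by X_n
    return np.array([rng.choice(len(P), p=P[x]) for x in X])

rng = np.random.default_rng(0)
D, p, N = 4, 0.2, 1000
pi = np.array([0.1, 0.2, 0.3, 0.4])        # true answer distribution
X = rng.choice(D, size=N, p=pi)            # true answers (hidden from the curator)
Y = randomize(X, design_matrix(D, p), rng) # randomized answers the curator observes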

Maximum Likelihood Estimation (MLE)

These figures are histograms of the 10000 estimates from randomized answers.

The MLE has the same variance regardless of the true value because the estimates are solutions of a linear equation. The smaller N is, the larger the variance, so an estimate may be greater than 1 or negative.

Gibbs Sampling (GS)

These figures are histograms of 10000 Gibbs Sampling estimates (average of last 200 samples) with a prior parameter \alpha=1.0.

There is almost no difference between MLE and GS when N is large enough.
On the other hand, the Bayesian model is effective when N is small and the MLE estimate may be negative: while MLE estimates can fall below 0, the Gibbs Sampling estimates stay within the positive range, and their variance is also smaller than MLE's.

\hat\pi_1 for N=100      MLE   GS 1.0  GS 0.1  GS 0.01
\mathbb{E}[\hat\pi_1]    0.10  0.18    0.13    0.11
{\rm stddev}[\hat\pi_1]  0.21  0.09    0.18    0.26

In the Bayesian models, the estimates tend to be leveled away from the true value due to the regularization effect of the prior. The above table shows the mean and standard deviation of the estimate \hat\pi_1 for MLE and Gibbs Sampling (“GS [num]” means Gibbs Sampling estimation with prior parameter \alpha=[num]).

In Gibbs Sampling with \alpha=1.0, the estimate is pulled toward 0.25(=1/D). The leveling effect can be weakened by reducing the prior parameter \alpha, at the cost of a larger variance. But Gibbs Sampling estimation with a lower prior parameter has another problem.

These figures are histograms with \alpha=0.01. Because of the strong influence of the Dirichlet prior, only one of the estimates is close to 1 and the others collapse to 0. Though the mean of the estimates is close to the true value, the median is 0 (useless!!).
This behavior remains even at N=1000.

Collapsed Variational Bayesian (VB)

These figures are histograms of 10000 estimates of Collapsed Variational Bayes with a prior parameter \alpha=1.0.
The results of GS and VB look the same in the above figures, but VB has a smaller variance. The boxplots of the estimates \hat\pi_1, \hat\pi_4 tell us that VB's estimates are better than GS's.

In all cases, VB (each left) has a smaller variance than GS (each right). The standard deviations are shown in the horizontal axis labels.

In GS, there was a problem that the estimates tend to degenerate when the prior parameter is small. What about VB?

The effect of prior remains for N=100, but “ordinary estimates” are obtained. The boxplot makes this clearer.

With GS, \hat\pi_1 for N=100 collapses to 0 or becomes an outlier, and \hat\pi_4 spreads almost evenly from 0 to 1 (its variance is also extremely large!). VB has ordinary mountain-shaped distributions.
VB also has smaller variances than GS for large N, so it is possible to balance leveling and variance by choosing an appropriate prior parameter.

Bayesian inference of Randomized Response with Gibbs Sampling and Collapsed Variational Bayes

Randomized Response [Warner 65] is one of the methods to obtain statistics (such as averages) from a survey while hiding sensitive information by randomizing the answers.

Let X\in\{1,\cdots,D\} be a random variable representing the answer to a survey question Q with D choices, and Y\in\{1,\cdots,D\} be a randomized version of X. Then the Randomized Response model is the following:

\begin{cases}P(Y=j|X=i)=p_{ij} & i,j=1,\cdots,D, \; \sum_j p_{ij}=1 \\ P(X=i)=\pi_i & i=1,\cdots,D,\; \sum_i\pi_i=1 \end{cases}

In Bayesian modeling, suppose \boldsymbol\pi=(\pi_i)_i has a Dirichlet conjugate prior with a hyperparameter \boldsymbol{\alpha}=(\alpha_i)_i.

P(\boldsymbol{\pi}) = \rm{Dir}( \boldsymbol{\pi}| \boldsymbol{\alpha} )

N individuals answer X_1,\cdots,X_N for the survey and send their randomized values Y_1,\cdots,Y_N to a curator. The curator estimates some statistics like \mathbb{E}[X] from Y_1,\cdots,Y_N.

Let the model’s design matrix be \boldsymbol{P}=(p_{ij})_{ij},\; p_{ij}=P(Y=j|X=i).

Graphical Model of Randomized Response

I now introduce three ways to estimate \boldsymbol{\pi}: Maximum Likelihood Estimation, Gibbs Sampling, and Collapsed Variational Bayes.

Maximum Likelihood Estimation

Let \boldsymbol{c}=\left(c_i\right)_i be counts of observation Y_1,\cdots,Y_N.

c_i = \left| \left\{ n | y_n=i \right\} \right|

In maximum likelihood estimation, \hat{\boldsymbol{\pi}}, the estimator of \boldsymbol{\pi}, is calculated as the following with the design matrix \boldsymbol{P}.

\displaystyle \hat{\boldsymbol{\pi}}=\frac{1}{N}\left(\boldsymbol{P}^{\mathsf{T}}\right)^{-1}\boldsymbol{c}
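As a minimal sketch of this estimator (my own code, not the author's rr-mle.py; it assumes the observations Y and design matrix P defined above), this is a single linear solve in Python:

import numpy as np

def mle_estimate(Y, P):
    # counts c_i = |{n : y_n = i}| of the observed randomized answers
    c = np.bincount(Y, minlength=len(P))
    # solve P^T pi = c / N; the solution may fall outside [0, 1]
    return np.linalg.solve(P.T, c / len(Y))

With the symmetric design matrix used in the experiments, P^T equals P, so the transpose makes no difference there.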

Gibbs Sampling

The maximum likelihood estimate may be greater than 1 or negative even though each element of \hat{\boldsymbol{\pi}} is a probability. Bayesian estimation using Gibbs Sampling [Oh 1994] solves this problem.

Let \boldsymbol{X}=(X_1,\cdots,X_N),\; \boldsymbol{Y}=(Y_1,\cdots,Y_N), and let \boldsymbol{X}^{-n} be \boldsymbol{X} with X_n removed:

\boldsymbol{X}^{-n}=\left(X_1,\cdots,X_{n-1},X_{n+1},\cdots,X_N\right) .

Then the full conditional distribution for Gibbs sampling is the following:

\displaystyle \begin{aligned}& P(X_n=i | \boldsymbol{X}^{-n}, \boldsymbol{\pi},\boldsymbol{Y}) \\ \propto\;& P(X_n=i, Y_n=y_n | \boldsymbol{\pi}) \\ =\;& P(X_n=i | \boldsymbol{\pi}) P(Y_n =y_n | X_n=i) = \pi_i p_{iy_n} \hspace{1em}(1),\end{aligned}

\displaystyle  \begin{aligned} &P(\boldsymbol{\pi}|X_1,\cdots,X_N,\boldsymbol{Y}) \\ \propto\;& P(\boldsymbol{\pi},X_1,\cdots,X_N,\boldsymbol{Y}) \\ =\;& P(\boldsymbol{\pi})\prod_{n=1}^N \left\{ P(X_n|\boldsymbol{\pi})P(Y_n|X_n) \right\}\propto \prod_{i=1}^D \pi_i^{\alpha_i-1+d_i} \hspace{1em}(2), \end{aligned}

where d_i = \left|\{n|X_n=i\}\right|.

\boldsymbol{X} and \boldsymbol{\pi} are randomly initialized and then sampled iteratively from distributions (1) and (2). The estimator \hat{\boldsymbol{\pi}} is obtained from the resulting sample sequence of \boldsymbol{\pi} (e.g. by averaging the last samples).
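A minimal sketch of this sampler in NumPy (my own code, not the rr-gibbs.py used in the experiments), assuming the observations Y and design matrix P from the model above:

import numpy as np

def gibbs_estimate(Y, P, alpha=1.0, n_iter=1000, n_avg=200, rng=None):
    rng = rng or np.random.default_rng()
    D, N = len(P), len(Y)
    pi = rng.dirichlet(np.full(D, alpha))   # random initialization of pi
    samples = []
    for it in range(n_iter):
        # (1) sample each X_n with probability proportional to pi_i * p_{i,y_n}
        prob = pi[:, None] * P[:, Y]        # shape (D, N)
        prob /= prob.sum(axis=0)
        X = np.array([rng.choice(D, p=prob[:, n]) for n in range(N)])
        # (2) sample pi from Dirichlet(alpha_i + d_i), where d_i = |{n : X_n = i}|
        d = np.bincount(X, minlength=D)
        pi = rng.dirichlet(alpha + d)
        samples.append(pi)
    # estimate = average of the last samples, as in the experiments
    return np.mean(samples[-n_avg:], axis=0)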

Collapsed Variational Bayes

I derived the collapsed variational Bayesian estimation of Randomized Response in the same way as LDA-CVB0 [Teh+ 07, Asuncion+ 09].

Suppose the collapsed variational approximation of the posterior q(\boldsymbol{X},\boldsymbol{\pi}) is the following:

q(\boldsymbol{X},\boldsymbol{\pi}) = q(\boldsymbol{\pi}|\boldsymbol{X})q(\boldsymbol{X})\approx q(\boldsymbol{\pi}|\boldsymbol{X})\prod_{n=1}^Nq(X_n).

Based on this assumption, the variational free energy of q(\boldsymbol{X},\boldsymbol{\pi}) is decomposed as the following:

\begin{aligned} & \mathcal{F}\left(q(\boldsymbol{X},\boldsymbol{\pi})\right)\\ =\;& \mathbb{E}_{q(\boldsymbol{X},\boldsymbol{\pi})} [ -\log P(\boldsymbol{Y},\boldsymbol{X},\boldsymbol{\pi}) ] - \mathcal{H}(q(\boldsymbol{X},\boldsymbol{\pi})) \\ =\;& \mathbb{E}_{q(\boldsymbol{X})} [ \mathbb{E}_{q(\boldsymbol{\pi}|\boldsymbol{X})} [ -\log P(\boldsymbol{Y},\boldsymbol{X},\boldsymbol{\pi}) ] - \mathcal{H}(q(\boldsymbol{\pi}|\boldsymbol{X})) ] - \mathcal{H}(q(\boldsymbol{X}))\end{aligned}

Instead of minimizing the free energy \mathcal{F}\left(q(\boldsymbol{X},\boldsymbol{\pi})\right) with q(\boldsymbol{X},\boldsymbol{\pi}), it is minimized with q(\boldsymbol{\pi}|\boldsymbol{X}) and then with q(\boldsymbol{X}). Let \mathcal{F}\left(q(\boldsymbol{X})\right) = \min_{q(\boldsymbol{\pi}|\boldsymbol{X})} \mathcal{F}\left(q(\boldsymbol{X},\boldsymbol{\pi})\right).

\mathcal{F}\left(q(\boldsymbol{X})\right) = \mathbb{E}_{q(\boldsymbol{X})} [ -\log P(\boldsymbol{Y},\boldsymbol{X}) ] - \mathcal{H}(q(\boldsymbol{X}))

Then the parameter update formula is obtained by iterative minimization of \mathcal{F}\left(q(\boldsymbol{X})\right) with each q(X_n).

Let \gamma _ {ni}=q(X_n=i) be the parameter of the multinomial q(X_n) = {\rm Multi}(X_n|(\gamma_{ni})_i), and let c_k^{-n} = \sum_{m\neq n}I(X_m=k), \;\gamma_k^{-n} = \sum_{m\neq n} \gamma_{mk} (where I(\cdot) is an indicator function). Then the update is the following:

\begin{aligned}\gamma_{ni} \propto\;& \exp\left( \mathbb{E} _ {q(\boldsymbol{X}^{-n})} [ \log P(\boldsymbol{Y},\boldsymbol{X}^{-n},X _ n=i) ]\right) \\ \propto\;& \exp\left( \log P(Y_n|X_n=i) + \mathbb{E} _ {q(\boldsymbol{X}^{-n})} \left[ \log \int \pi_i \prod_k \pi_k^{\alpha_k+c_k^{-n}-1}d\boldsymbol{\pi} \right]\right) \\ \propto\;& \exp\left( \log p_{iY_n} + \mathbb{E}_{q(\boldsymbol{X}^{-n})} \left[ \log (\alpha_i+c_i^{-n}) \right]\right) \\ \approx\;& p _ {iY_n}(\alpha_i+\gamma_i^{-n})\end{aligned}

The approximation in the last line is a zeroth-order Taylor expansion, as in LDA-CVB0 [Asuncion+ 09].
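A minimal sketch of this CVB0 update in NumPy (my own code under the notation above, not the rr-vb.py used in the experiments):

import numpy as np

def cvb0_estimate(Y, P, alpha=1.0, n_iter=100):
    D, N = len(P), len(Y)
    # gamma[n, i] = q(X_n = i); initialize with the normalized likelihood p_{i,y_n}
    gamma = P[:, Y].T.copy()
    gamma /= gamma.sum(axis=1, keepdims=True)
    for it in range(n_iter):
        for n in range(N):
            g_minus = gamma.sum(axis=0) - gamma[n]   # gamma_i^{-n}
            new = P[:, Y[n]] * (alpha + g_minus)     # p_{i,y_n} * (alpha_i + gamma_i^{-n})
            gamma[n] = new / new.sum()
    # one way to read off the estimate: the posterior mean of pi given the soft counts
    return (alpha + gamma.sum(axis=0)) / (D * alpha + N)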

Local Differential Privacy

Randomized Response satisfies the local differential privacy property [Wang+ 16].

A randomized mechanism M:X\rightarrow Y satisfies \epsilon-local differential privacy if P(M(x)\in S)\leq e^\epsilon\cdot P(M(x')\in S) for all x, x'\in X and any S\subset Y. This \epsilon is called the privacy budget.

For a design matrix \boldsymbol{P}=(p_{ij})_{ij},\; p_{ij}=P(Y=j|X=i), let \displaystyle \epsilon=\log \max_j \frac{ \max_i p_{ij} }{ \min_i p_{ij} } . Then Randomized Response with the design matrix \boldsymbol{P} has \epsilon-local differential privacy.

Among design matrices with the same privacy budget, the variance of the estimator is minimized by the following parameterization with a single parameter p.

\displaystyle P(Y=j|X=i)=p_{ij}=\begin{cases}p + \frac1D(1-p)& (i=j) \\ \frac1D(1-p) & (i\neq j) \end{cases}

This means that Y equals X with probability p and is a uniformly random choice among the D options with probability 1-p.
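As a small sanity check (my own sketch), the privacy budget of this design matrix can be computed directly from its columns:

import numpy as np

def privacy_budget(P):
    # epsilon = log max_j ( max_i p_ij / min_i p_ij )
    return float(np.log((P.max(axis=0) / P.min(axis=0)).max()))

D, p = 4, 0.2
P = p * np.eye(D) + (1 - p) / D * np.ones((D, D))  # the design matrix above
print(privacy_budget(P))  # = log(1 + D*p/(1-p)) = log 2 for D=4, p=1/5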

In the next article, I will report on the implementations and experiments of these estimation methods.

Python implementation of Labeled LDA (Ramage+ EMNLP2009)

Labeled LDA (D. Ramage, D. Hall, R. Nallapati and C.D. Manning; EMNLP2009) is a supervised topic model derived from LDA (Blei+ 2003).
While the topics estimated by LDA often don't match human expectations because it is unsupervised, Labeled LDA handles documents with multiple labels so that topics correspond to the given labels.

I implemented Labeled LDA in Python. The code is here:

llda.py performs LLDA estimation on arbitrary text. Each line of its input file is one document.
Each document can be given label(s) by putting them in brackets at the head of the line, like the following:

[label1,label2,...]some text

If documents are given without labels, it works in a semi-supervised way.
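For illustration, a hypothetical parser for this line format could look like the following (my own sketch; the actual llda.py may read its input differently):

import re

def parse_line(line):
    # "[label1,label2,...]text" -> (labels, text); unlabeled lines get an empty label list
    m = re.match(r"\[(.+?)\](.*)", line)
    if m:
        return m.group(1).split(","), m.group(2).strip()
    return [], line.strip()

print(parse_line("[grain,wheat]export figures rose sharply"))
# (['grain', 'wheat'], 'export figures rose sharply')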

llda_nltk.py is a sample that uses llda.py. It estimates an LLDA model on 100 documents sampled from the Reuters corpus in NLTK and outputs the top 20 words for each label.

(Ramage+ EMNLP2009) doesn’t figure out LLDA’s perplexity, then I derived document-topic distributions and topic-word ones that it requires:


where λ(d) means a topic set corresponding to labels of the document d, and Md is a size of the set.
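Given these distributions, the per-word perplexity can then be computed in the usual way; here is a small sketch (my own code, with theta as a documents-by-topics array and phi as a topics-by-vocabulary array):

import numpy as np

def perplexity(docs, theta, phi):
    # docs: list of word-id lists; theta: D x K; phi: K x V
    log_like, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_like += np.log(theta[d] @ phi[:, w])
            n_words += 1
    return np.exp(-log_like / n_words)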

I’m glad you to point out if these are wrong.

LLDA requires that labels be assigned to topics explicitly and exactly. But it is very difficult to know how many topics should be assigned to each label for better estimation.
Moreover, it is natural that some categories share common topics (e.g. a “baseball” category and a “soccer” category).

DP-MRM (Kim+ ICML2012), which I introduced on this blog, is a model that extends LLDA to estimate the label-topic correspondence.

https://shuyo.wordpress.com/2012/07/31/kim-icml12-dirichlet-process-with-mixed-random-measures/

Though I want to implement it too, it is very complex…

HDP-LDA updates

Hierarchical Dirichlet Processes (Teh+ 2006) are a nonparametric Bayesian topic model that can handle an infinite number of topics.
In particular, HDP-LDA is interesting as an extension of LDA.

(Teh+ 2006) gives the collapsed Gibbs sampling updates for the general HDP framework, but not for HDP-LDA specifically.
To obtain the updates for HDP-LDA, it is necessary to plug the base measure H and the emission distribution F(\phi) of the HDP-LDA setting into the equation below:

\displaystyle f_k^{-x_{ji}}(x_{ji}) = \frac{\int f(x_{ji}|\phi_k)\prod_{j'i'\neq ji,\, z_{j'i'}=k} f(x_{j'i'}|\phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{j'i'\neq ji,\, z_{j'i'}=k} f(x_{j'i'}|\phi_k)\, h(\phi_k)\, d\phi_k} , (eq. 30 on [Teh+ 2006])

where h is the probability density function of H and f is that of F.
In the case of HDP-LDA, H is a Dirichlet distribution over the vocabulary and F is a topic-word multinomial distribution, that is

\displaystyle h(\phi) = {\rm Dir}(\phi|\beta) = \frac{\Gamma(V\beta)}{\Gamma(\beta)^V}\prod_{w=1}^V \phi_w^{\beta-1}, \qquad f(x=w|\phi) = \phi_w ,

where V is the vocabulary size and \beta is the hyperparameter of the symmetric Dirichlet prior.

Substituting these into equation (30), we obtain

\displaystyle f_k^{-x_{ji}}(x_{ji}=w) = \frac{n_{kw}^{-ji} + \beta}{n_{k\cdot}^{-ji} + V\beta} ,

where n_{kw}^{-ji} is the term count of word w with topic k excluding x_{ji}, and n_{k\cdot}^{-ji} = \sum_w n_{kw}^{-ji}.

We also need f_{k^{\rm new}}(x_{ji}) for the case that t takes a new table. It is obtained as the following:

\displaystyle f_{k^{\rm new}}(x_{ji}) = \int f(x_{ji}|\phi)\,h(\phi)\,d\phi = \frac{1}{V} .

And it is necessary to write down f_k(\boldsymbol{x}_{jt}) also for sampling k.

For n_{kw}^{-jt}, the term count of word w with topic k excluding the words \boldsymbol{x}_{jt} at table t of document j, and n_{jtw}, the count of word w at that table (with n_{k\cdot}^{-jt}=\sum_w n_{kw}^{-jt} and n_{jt\cdot}=\sum_w n_{jtw}),

\displaystyle f_k^{-\boldsymbol{x}_{jt}}(\boldsymbol{x}_{jt}) = \frac{\Gamma\left(n_{k\cdot}^{-jt}+V\beta\right)}{\Gamma\left(n_{k\cdot}^{-jt}+n_{jt\cdot}+V\beta\right)} \prod_w \frac{\Gamma\left(n_{kw}^{-jt}+n_{jtw}+\beta\right)}{\Gamma\left(n_{kw}^{-jt}+\beta\right)} .
When implementing this in Python, it is faster not to expand the Gamma functions into products. In either case it is necessary to compute them in log space; otherwise f_k(x_jt) will overflow the float range.
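For example, a log-space version of f_k(x_jt) with scipy's gammaln could look like this (a sketch under my own notation: n_kw is the length-V count vector of topic k excluding table t, n_jtw the word counts at that table, beta the Dirichlet hyperparameter):

import numpy as np
from scipy.special import gammaln

def log_f_k(n_kw, n_jtw, beta):
    # log of the Gamma-function form of f_k(x_jt), computed in log space to avoid overflow
    V = len(n_kw)
    n_k, n_jt = n_kw.sum(), n_jtw.sum()
    return (gammaln(n_k + V * beta) - gammaln(n_k + n_jt + V * beta)
            + np.sum(gammaln(n_kw + n_jtw + beta) - gammaln(n_kw + beta)))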

Finally, plugging these densities into the conditional distributions for sampling t and k in [Teh+ 2006] gives the collapsed Gibbs sampler for HDP-LDA.

[Kim+ ICML12] Dirichlet Process with Mixed Random Measures

We held a private reading meeting for ICML 2012.
I took charge of and introduced [Kim+ ICML12] “Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data.”
This is the presentation for it.

DP-MRM [Kim+ ICML12] is a supervised topic model like sLDA [Blei+ 2007], DiscLDA [Lacoste-Julien+ 2008] and MedLDA [Zhu+ 2009], and is regarded as a nonparametric version of Labeled LDA [Ramage+ 2009] in particular.

Although Labeled LDA is easy to implement (my implementation is here), it has the disadvantage that you must specify the label-topic correspondence explicitly and manually.
On the other hand, DP-MRM can automatically decide the label-topic correspondence as distributions. I am very interested in it.
But it is hard to implement because it is a nonparametric Bayesian model.
Since I don't need infinite topics but rather hierarchical label-topic correspondences, I guess that replacing the DPs with ordinary Dirichlet distributions in this model would make it very useful, handy and fast… I am going to try it! 😀

Short Text Language Detection with Infinity-Gram

I talked about language detection (language identification) for Twitter at NAIST (Nara Institute of Science and Technology).
This is its slide.


Tweets are too short to detect their languages precisely. I guess one reason is that the features extracted from a short text are not enough for detection.
Another reason is that tweets contain some unique expressions, for example, u for you, 4 for for, LOL, F4F, various emoticons and so on.

I developed ldig, a prototype of short-text language detection, which solves those problems.
ldig can detect the languages of tweets with over 99% accuracy for 19 languages.

The above slide explains how ldig solves those problems.

[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems

In April 2012, we held a private reading meeting for NIPS 2011.
I read “Iterative Learning for Reliable Crowdsourcing Systems” [Karger+ NIPS11].

This paper targets Amazon Mechanical Turk (AMT), which splits a large task into microtasks. Each worker on AMT may be a spammer (who answers randomly to earn the fee) or a hammer (who answers correctly).
The paper's model simply assumes that each microtask has a consistent binary answer and that each worker has a probability of answering correctly that is independent of the task. Under that assumption, it estimates the average error rate when the task size is large enough.
I don't mind that it needs a simple, strong assumption, but unfortunately the true value of the model parameter q cannot be known, so a practical problem can't be fitted to the model even if the assumption were accepted.

[Freedman+ EMNLP11] Extreme Extraction – Machine Reading in a Week

In December 2011, we held a private reading meeting for EMNLP 2011.
I read “Extreme Extraction – Machine Reading in a Week” [Freedman+ EMNLP11].

This paper is about constructing a concept and relation extraction system rapidly.
I'm afraid it gives no definite construction methods (they might be confidential…), only the length of time needed for each task.
Since I have not yet studied this field very much, I at least learned what tasks information extraction generally consists of.

Why is Norwegian and Danish identification difficult?

I re-post the evaluation table of ldig (Twitter language detection).

lang size detected correct precision recall
cs 5329 5330 5319 0.9979 0.9981
da 5478 5483 5311 0.9686 0.9695
de 10065 10076 10014 0.9938 0.9949
en 9701 9670 9569 0.9896 0.9864
es 10066 10075 9989 0.9915 0.9924
fi 4490 4472 4459 0.9971 0.9931
fr 10098 10097 10048 0.9951 0.9950
id 10181 10233 10167 0.9936 0.9986
it 10150 10191 10109 0.9920 0.9960
nl 9671 9579 9521 0.9939 0.9845
no 8560 8442 8219 0.9736 0.9602
pl 10070 10079 10054 0.9975 0.9984
pt 9422 9441 9354 0.9908 0.9928
ro 5914 5831 5822 0.9985 0.9844
sv 9990 10034 9866 0.9833 0.9876
tr 10310 10321 10300 0.9980 0.9990
vi 10494 10486 10479 0.9993 0.9986
total 149989 148600 0.9907

This shows that the accuracies of Norwegian and Danish are lower than others.
It is because Norwegian and Danish are very similar.

Here are the top 25 words of the word distributions of Norwegian and Danish.

word Danish Norwegian amount
1 er 0.0311 0.0238 0.0311
2 det 0.0287 0.0228 0.0287
3 i 0.0253 0.0275 0.0275
4 0.0165 0.0263 0.0263
5 jeg 0.0185 0.0202 0.0202
6 og 0.0188 0.0202 0.0202
7 at 0.0183 0.0083 0.0183
8 å 0.0001 0.0167 0.0167
9 til 0.0157 0.0140 0.0157
10 en 0.0149 0.0120 0.0149
11 ikke 0.0119 0.0146 0.0146
12 har 0.0122 0.0132 0.0132
13 med 0.0120 0.0117 0.0120
14 som 0.0044 0.0116 0.0116
15 for 0.0097 0.0115 0.0115
16 du 0.0111 0.0093 0.0111
17 0.0110 0.0072 0.0110
18 der 0.0101 0.0016 0.0101
19 av 0.0000 0.0093 0.0093
20 den 0.0089 0.0056 0.0089
21 af 0.0089 0.0000 0.0089
22 om 0.0052 0.0073 0.0073
23 kan 0.0073 0.0044 0.0073
24 men 0.0062 0.0072 0.0072
25 de 0.0066 0.0054 0.0066

Most of the high-frequency words, such as ‘er’ (which roughly corresponds to ‘is’ in English; likewise below), ‘det’ (‘it’) and ‘i’ (‘in’), are function words common to the two languages, so it is very difficult to distinguish them.
Words useful for identification are pretty few. For example, ‘of’ in English corresponds to ‘af’ in Danish and ‘av’ in Norwegian, ‘me’ to ‘mig’ (da) and ‘meg’ (no), ‘now’ to ‘nu’ (da) and ‘nå’ (no), ‘just’ to ‘lige’ (da) and ‘like’ (no), ‘what’ to ‘hvad’ (da) and ‘hva’ (no), ‘no’ to ‘nej’ (da) and ‘nei’ (no), and so on.

The most serious problem is my own language identification skill for Danish and Norwegian…
I collected tens of thousands of tweets in these languages, but I'm afraid they contain several percent errors.
Of course, that causes detection errors.

If you know a native Norwegian or Danish speaker who is interested in language detection and corpus annotation, I'd be glad if you introduced me! 😀

Estimation of ldig (twitter Language Detection) for LIGA dataset

LIGA [Tromp+ 11] is a Twitter language detection method for 6 languages (German, English, Spanish, French, Italian and Dutch).
It uses a graph of 3-grams to capture long-distance features and achieves 95–98% accuracy.
They published their dataset here, which has 9066 tweets, so it is possible to compare ldig (Language Detection with Infinity-Gram: site, blog) to their result.

First, their dataset needs to be converted into ldig's input format.
Here is a Ruby script to convert it.

#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

# convert the LIGA dataset (one tweet per file, grouped by language directory)
# into ldig's "lang<TAB>text" format, one tweet per line
open("liga_dataset.txt", "wb:UTF-8") do |out|
  ["de_DE","en_UK","es_ES","fr_FR","it_IT","nl_NL"].each do |dir|
    lang = dir[0,2]
    Dir.glob("#{dir}/*.txt") do |file|
      # replace control characters (including newlines) with spaces
      line = open(file, "rb:UTF-8") {|f| f.read}.gsub(/[\u0000-\u001f]/, " ").strip
      out.puts "#{lang}\t#{line}"
    end
  end
end

Here is the evaluation of ldig on the generated dataset, liga_dataset.txt.

lang size detect correct precision recall
de 1479 1469 1463 0.9959 0.9892
en 1505 1504 1489 0.9900 0.9894
es 1562 1550 1541 0.9942 0.9866
fr 1551 1545 1539 0.9961 0.9923
it 1539 1532 1526 0.9961 0.9916
nl 1430 1429 1425 0.9972 0.9965
total 9066 8983 0.9908

It shows that ldig achieves over 99% accuracy on their dataset.

Reference

  • [Tromp+ 11] Erik Tromp and Mykola Pechenizkiy, “Graph-Based N-gram Language Identification on Short Texts”