Bayesian inference of Randomized Response: Python implementation

In the previous article, I introduced three estimation methods for Randomized Response: Maximum Likelihood, Gibbs Sampling, and Collapsed Variational Bayes.

I implemented and experimented with these methods in Python.

rr-mle.py and rr-gibbs.py draw the results as charts after 10000 trials.
rr-vb.py separates the inference trials from the result drawing because of its very long execution time.

$ python rr-vb.py 100 # 1000 inference trials (parallel execution)
$ python rr-vb.py     # result report

For a survey with D(=4) choices, the scripts estimate the distribution of true answers from N(=100, 1000, 10000) answers whose true ratio is 1:2:3:4 (i.e. \boldsymbol{\pi}=(0.1, 0.2, 0.3, 0.4)), randomized by the Randomized Response mechanism with a design matrix with p=1/5 (see the previous article for notation).
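As a rough sketch of this setup (not the actual rr-*.py scripts; the function names and structure here are my own), generating the randomized answers could look like this:

import numpy as np

def design_matrix(D, p):
    # Y keeps X with probability p, otherwise becomes a uniformly random choice
    return p * np.eye(D) + (1 - p) / D * np.ones((D, D))

def randomize(X, P, rng):
    # draw each randomized answer Y_n from the row of P selected by X_n
    return np.array([rng.choice(len(P), p=P[x]) for x in X])

rng = np.random.default_rng(0)
D, p, N = 4, 0.2, 1000
pi = np.array([0.1, 0.2, 0.3, 0.4])        # true answer distribution
X = rng.choice(D, size=N, p=pi)            # true answers (hidden from the curator)
Y = randomize(X, design_matrix(D, p), rng) # randomized answers the curator observes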

Maximum Likelihood Estimation (MLE)

These figures are histograms of the 10000 estimates from randomized answers.

The MLE has the same variance regardless of the true value because the estimates are solutions of a linear equation. The smaller N is, the larger the variance, so an estimate may be greater than 1 or negative.

Gibbs Sampling (GS)

These figures are histograms of 10000 Gibbs Sampling estimates (average of last 200 samples) with a prior parameter \alpha=1.0.

There is almost no difference between MLE and GS when N is large enough.
On the other hand, the Bayesian model is effective when N is small and the MLE estimate may be negative: while MLE estimates can fall below 0, the Gibbs Sampling estimates stay within the positive range, and their variance is also smaller than MLE's.

\hat\pi_1 for N=100      MLE   GS 1.0  GS 0.1  GS 0.01
\mathbb{E}[\hat\pi_1]    0.10  0.18    0.13    0.11
{\rm stddev}[\hat\pi_1]  0.21  0.09    0.18    0.26

In the Bayesian models, the estimates tend to be leveled away from the true value due to the regularization effect of the prior. The above table shows the mean and standard deviation of the estimate \hat\pi_1 for MLE and Gibbs Sampling (“GS [num]” means Gibbs Sampling estimation with prior parameter \alpha=[num]).

In Gibbs Sampling with \alpha=1.0, the estimate is pulled toward 0.25(=1/D). The leveling effect can be weakened by reducing the prior parameter \alpha, at the cost of a larger variance. But Gibbs Sampling estimation with a lower prior parameter has another problem.

These figures are histograms with \alpha=0.01. Because of the strong influence of the Dirichlet prior, only one of the estimates is close to 1 and the others collapse to 0. Though the mean of the estimates is close to the true value, the median is 0 (useless!!).
This behavior remains even at N=1000.

Collapsed Variational Bayesian (VB)

These figures are histograms of 10000 estimates of Collapsed Variational Bayes with a prior parameter \alpha=1.0.
The results of GS and VB look the same in the above figures, but VB has a smaller variance. The boxplots of the estimates \hat\pi_1, \hat\pi_4 tell us that VB's estimates are better than GS's.

In all cases, VB (each left) has a smaller variance than GS (each right). The standard deviations are shown in the horizontal axis labels.

In GS, there was a problem that the estimates tend to degenerate when the prior parameter is small. What about VB?

The effect of prior remains for N=100, but “ordinary estimates” are obtained. The boxplot makes this clearer.

With GS, \hat\pi_1 for N=100 collapses to 0 or becomes an outlier, and \hat\pi_4 spreads almost evenly from 0 to 1 (its variance is also extremely large!). VB has ordinary mountain-shaped distributions.
VB also has smaller variances than GS for large N, so it is possible to balance leveling and variance by choosing an appropriate prior parameter.

Bayesian inference of Randomized Response with Gibbs Sampling and Collapsed Variational Bayes

Randomized Response [Warner 65] is one of the methods to obtain statistics (such as averages) from a survey while hiding sensitive information by randomizing the answers.

Let X\in\{1,\cdots,D\} be a random variable representing the answer to a survey question Q with D choices, and Y\in\{1,\cdots,D\} be a randomized version of X. Then the Randomized Response model is the following:

\begin{cases}P(Y=j|X=i)=p_{ij} & i,j=1,\cdots,D, \; \sum_j p_{ij}=1 \\ P(X=i)=\pi_i & i=1,\cdots,D,\; \sum_i\pi_i=1 \end{cases}

In Bayesian modeling, suppose \boldsymbol\pi=(\pi_i)_i has a Dirichlet conjugate prior with a hyperparameter \boldsymbol{\alpha}=(\alpha_i)_i.

P(\boldsymbol{\pi}) = \rm{Dir}( \boldsymbol{\pi}| \boldsymbol{\alpha} )

N individuals answer X_1,\cdots,X_N for the survey and send their randomized values Y_1,\cdots,Y_N to a curator. The curator estimates some statistics like \mathbb{E}[X] from Y_1,\cdots,Y_N.

Let the model’s design matrix be \boldsymbol{P}=(p_{ij})_{ij},\; p_{ij}=P(Y=j|X=i).

Graphical Model of Randomized Response

I now introduce three ways to estimate \boldsymbol{\pi}: Maximum Likelihood Estimation, Gibbs Sampling, and Collapsed Variational Bayes.

Maximum Likelihood Estimation

Let \boldsymbol{c}=\left(c_i\right)_i be counts of observation Y_1,\cdots,Y_N.

c_i = \left| \left\{ n | y_n=i \right\} \right|

In maximum likelihood estimation, \hat{\boldsymbol{\pi}}, the estimator of \boldsymbol{\pi}, is calculated as the following with the design matrix \boldsymbol{P}.

\displaystyle \hat{\boldsymbol{\pi}}=\frac{1}{N}\left(\boldsymbol{P}^{\mathsf{T}}\right)^{-1}\boldsymbol{c}
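As a minimal sketch of this estimator (my own code, not the author's rr-mle.py; it assumes the observations Y and design matrix P defined above), this is a single linear solve in Python:

import numpy as np

def mle_estimate(Y, P):
    # counts c_i = |{n : y_n = i}| of the observed randomized answers
    c = np.bincount(Y, minlength=len(P))
    # solve P^T pi = c / N; the solution may fall outside [0, 1]
    return np.linalg.solve(P.T, c / len(Y))

With the symmetric design matrix used in the experiments, P^T equals P, so the transpose makes no difference there.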

Gibbs Sampling

The maximum likelihood estimate may be greater than 1 or negative even though each element of \hat{\boldsymbol{\pi}} is a probability. Bayesian estimation using Gibbs Sampling [Oh 1994] solves this problem.

Let \boldsymbol{X}=(X_1,\cdots,X_N),\; \boldsymbol{Y}=(Y_1,\cdots,Y_N), and let \boldsymbol{X}^{-n} be \boldsymbol{X} with X_n removed:

\boldsymbol{X}^{-n}=\left(X_1,\cdots,X_{n-1},X_{n+1},\cdots,X_N\right) .

Then the full conditional distribution for Gibbs sampling is the following:

\displaystyle \begin{aligned}& P(X_n=i | \boldsymbol{X}^{-n}, \boldsymbol{\pi},\boldsymbol{Y}) \\ \propto\;& P(X_n=i, Y_n=y_n | \boldsymbol{\pi}) \\ =\;& P(X_n=i | \boldsymbol{\pi}) P(Y_n =y_n | X_n=i) = \pi_i p_{iy_n} \hspace{1em}(1),\end{aligned}

\displaystyle  \begin{aligned} &P(\boldsymbol{\pi}|X_1,\cdots,X_N,\boldsymbol{Y}) \\ \propto\;& P(\boldsymbol{\pi},X_1,\cdots,X_N,\boldsymbol{Y}) \\ =\;& P(\boldsymbol{\pi})\prod_{n=1}^N \left\{ P(X_n|\boldsymbol{\pi})P(Y_n|X_n) \right\}\propto \prod_{i=1}^D \pi_i^{\alpha_i-1+d_i} \hspace{1em}(2), \end{aligned}

where d_i = \left|\{n|X_n=i\}\right|.

\boldsymbol{X} and \boldsymbol{\pi} are randomly initialized and then sampled iteratively from distributions (1) and (2). The estimator \hat{\boldsymbol{\pi}} is obtained from the resulting sample sequence of \boldsymbol{\pi} (e.g. by averaging the last samples).
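A minimal sketch of this sampler in NumPy (my own code, not the rr-gibbs.py used in the experiments), assuming the observations Y and design matrix P from the model above:

import numpy as np

def gibbs_estimate(Y, P, alpha=1.0, n_iter=1000, n_avg=200, rng=None):
    rng = rng or np.random.default_rng()
    D, N = len(P), len(Y)
    pi = rng.dirichlet(np.full(D, alpha))   # random initialization of pi
    samples = []
    for it in range(n_iter):
        # (1) sample each X_n with probability proportional to pi_i * p_{i,y_n}
        prob = pi[:, None] * P[:, Y]        # shape (D, N)
        prob /= prob.sum(axis=0)
        X = np.array([rng.choice(D, p=prob[:, n]) for n in range(N)])
        # (2) sample pi from Dirichlet(alpha_i + d_i), where d_i = |{n : X_n = i}|
        d = np.bincount(X, minlength=D)
        pi = rng.dirichlet(alpha + d)
        samples.append(pi)
    # estimate = average of the last samples, as in the experiments
    return np.mean(samples[-n_avg:], axis=0)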

Collapsed Variational Bayes

I derived the collapsed variational Bayesian estimation of Randomized Response in the same way as LDA-CVB0 [Teh+ 07, Asuncion+ 09].

Suppose the collapsed variational approximation of the posterior q(\boldsymbol{X},\boldsymbol{\pi}) is the following:

q(\boldsymbol{X},\boldsymbol{\pi}) = q(\boldsymbol{\pi}|\boldsymbol{X})q(\boldsymbol{X})\approx q(\boldsymbol{\pi}|\boldsymbol{X})\prod_{n=1}^Nq(X_n).

Based on this assumption, the variational free energy of q(\boldsymbol{X},\boldsymbol{\pi}) is decomposed as the following:

\begin{aligned} & \mathcal{F}\left(q(\boldsymbol{X},\boldsymbol{\pi})\right)\\ =\;& \mathbb{E}_{q(\boldsymbol{X},\boldsymbol{\pi})} [ -\log P(\boldsymbol{Y},\boldsymbol{X},\boldsymbol{\pi}) ] - \mathcal{H}(q(\boldsymbol{X},\boldsymbol{\pi})) \\ =\;& \mathbb{E}_{q(\boldsymbol{X})} [ \mathbb{E}_{q(\boldsymbol{\pi}|\boldsymbol{X})} [ -\log P(\boldsymbol{Y},\boldsymbol{X},\boldsymbol{\pi}) ] - \mathcal{H}(q(\boldsymbol{\pi}|\boldsymbol{X})) ] - \mathcal{H}(q(\boldsymbol{X}))\end{aligned}

Instead of minimizing the free energy \mathcal{F}\left(q(\boldsymbol{X},\boldsymbol{\pi})\right) with q(\boldsymbol{X},\boldsymbol{\pi}), it is minimized with q(\boldsymbol{\pi}|\boldsymbol{X}) and then with q(\boldsymbol{X}). Let \mathcal{F}\left(q(\boldsymbol{X})\right) = \min_{q(\boldsymbol{\pi}|\boldsymbol{X})} \mathcal{F}\left(q(\boldsymbol{X},\boldsymbol{\pi})\right).

\mathcal{F}\left(q(\boldsymbol{X})\right) = \mathbb{E}_{q(\boldsymbol{X})} [ -\log P(\boldsymbol{Y},\boldsymbol{X}) ] - \mathcal{H}(q(\boldsymbol{X}))

Then the parameter update formula is obtained by iterative minimization of \mathcal{F}\left(q(\boldsymbol{X})\right) with each q(X_n).

Let \gamma _ {ni}=q(X_n=i) be the parameter of the multinomial q(X_n) = {\rm Multi}(X_n|(\gamma_{ni})_i), and let c_k^{-n} = \sum_{m\neq n}I(X_m=k), \;\gamma_k^{-n} = \sum_{m\neq n} \gamma_{mk} (where I(\cdot) is an indicator function). Then the update is the following:

\begin{aligned}\gamma_{ni} \propto\;& \exp\left( \mathbb{E} _ {q(\boldsymbol{X}^{-n})} [ \log P(\boldsymbol{Y},\boldsymbol{X}^{-n},X _ n=i) ]\right) \\ \propto\;& \exp\left( \log P(Y_n|X_n=i) + \mathbb{E} _ {q(\boldsymbol{X}^{-n})} \left[ \log \int \pi_i \prod_k \pi_k^{\alpha_k+c_k^{-n}-1}d\boldsymbol{\pi} \right]\right) \\ \propto\;& \exp\left( \log p_{iY_n} + \mathbb{E}_{q(\boldsymbol{X}^{-n})} \left[ \log (\alpha_i+c_i^{-n}) \right]\right) \\ \approx\;& p _ {iY_n}(\alpha_i+\gamma_i^{-n})\end{aligned}

The approximation in the last line is a zeroth-order Taylor expansion, as in LDA-CVB0 [Asuncion+ 09].
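A minimal sketch of this CVB0 update in NumPy (my own code under the notation above, not the rr-vb.py used in the experiments):

import numpy as np

def cvb0_estimate(Y, P, alpha=1.0, n_iter=100):
    D, N = len(P), len(Y)
    # gamma[n, i] = q(X_n = i); initialize with the normalized likelihood p_{i,y_n}
    gamma = P[:, Y].T.copy()
    gamma /= gamma.sum(axis=1, keepdims=True)
    for it in range(n_iter):
        for n in range(N):
            g_minus = gamma.sum(axis=0) - gamma[n]   # gamma_i^{-n}
            new = P[:, Y[n]] * (alpha + g_minus)     # p_{i,y_n} * (alpha_i + gamma_i^{-n})
            gamma[n] = new / new.sum()
    # one way to read off the estimate: the posterior mean of pi given the soft counts
    return (alpha + gamma.sum(axis=0)) / (D * alpha + N)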

Local Differential Privacy

Randomized Response satisfies the local differential privacy property [Wang+ 16].

A randomized mechanism M:X\rightarrow Y satisfies \epsilon-local differential privacy if P(M(x)\in S)\leq e^\epsilon\cdot P(M(x')\in S) for all x, x'\in X and any S\subset Y. This \epsilon is called the privacy budget.

For a design matrix \boldsymbol{P}=(p_{ij})_{ij},\; p_{ij}=P(Y=j|X=i), let \displaystyle \epsilon=\log \max_j \frac{ \max_i p_{ij} }{ \min_i p_{ij} } . Then Randomized Response with the design matrix \boldsymbol{P} has \epsilon-local differential privacy.

Among design matrices with the same privacy budget, the variance of the estimator is minimized by the following parameterization with a single parameter p.

\displaystyle P(Y=j|X=i)=p_{ij}=\begin{cases}p + \frac1D(1-p)& (i=j) \\ \frac1D(1-p) & (i\neq j) \end{cases}

This means that Y equals X with probability p and is a uniformly random choice among the D options with probability 1-p.
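As a small sanity check (my own sketch), the privacy budget of this design matrix can be computed directly from its columns:

import numpy as np

def privacy_budget(P):
    # epsilon = log max_j ( max_i p_ij / min_i p_ij )
    return float(np.log((P.max(axis=0) / P.min(axis=0)).max()))

D, p = 4, 0.2
P = p * np.eye(D) + (1 - p) / D * np.ones((D, D))  # the design matrix above
print(privacy_budget(P))  # = log(1 + D*p/(1-p)) = log 2 for D=4, p=1/5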

In the next article, I will report on the implementations and experiments of these estimation methods.

Python implementation of Labeled LDA (Ramage+ EMNLP2009)

Labeled LDA (D. Ramage, D. Hall, R. Nallapati and C.D. Manning; EMNLP2009) is a supervised topic model derived from LDA (Blei+ 2003).
While the topics estimated by LDA often don't match human expectations because it is unsupervised, Labeled LDA handles documents with multiple labels so that topics correspond to the given labels.

I implemented Labeled LDA in Python. The code is here:

llda.py performs LLDA estimation on arbitrary text. Each line of its input file is one document.
Each document can be given label(s) by putting them in brackets at the head of the line, like the following:

[label1,label2,...]some text

If documents are given without labels, it works in a semi-supervised way.
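For illustration, a hypothetical parser for this line format could look like the following (my own sketch; the actual llda.py may read its input differently):

import re

def parse_line(line):
    # "[label1,label2,...]text" -> (labels, text); unlabeled lines get an empty label list
    m = re.match(r"\[(.+?)\](.*)", line)
    if m:
        return m.group(1).split(","), m.group(2).strip()
    return [], line.strip()

print(parse_line("[grain,wheat]export figures rose sharply"))
# (['grain', 'wheat'], 'export figures rose sharply')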

llda_nltk.py is a sample that uses llda.py. It estimates an LLDA model on 100 documents sampled from the Reuters corpus in NLTK and outputs the top 20 words for each label.

(Ramage+ EMNLP2009) doesn’t figure out LLDA’s perplexity, then I derived document-topic distributions and topic-word ones that it requires:


where λ(d) means a topic set corresponding to labels of the document d, and Md is a size of the set.
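Given these distributions, the per-word perplexity can then be computed in the usual way; here is a small sketch (my own code, with theta as a documents-by-topics array and phi as a topics-by-vocabulary array):

import numpy as np

def perplexity(docs, theta, phi):
    # docs: list of word-id lists; theta: D x K; phi: K x V
    log_like, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_like += np.log(theta[d] @ phi[:, w])
            n_words += 1
    return np.exp(-log_like / n_words)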

I’m glad you to point out if these are wrong.

LLDA requires that labels be assigned to topics explicitly and exactly. But it is very difficult to know how many topics should be assigned to each label for better estimation.
Moreover, it is natural that some categories share common topics (e.g. a “baseball” category and a “soccer” category).

DP-MRM (Kim+ ICML2012), which I introduced on this blog, is a model that extends LLDA to estimate the label-topic correspondence.

https://shuyo.wordpress.com/2012/07/31/kim-icml12-dirichlet-process-with-mixed-random-measures/

Though I want to implement it too, it is very complex…

HDP-LDA updates

Hierarchical Dirichlet Processes (Teh+ 2006) are a nonparametric Bayesian topic model that can handle an infinite number of topics.
In particular, HDP-LDA is interesting as an extension of LDA.

(Teh+ 2006) gives the collapsed Gibbs sampling updates for the general HDP framework, but not for HDP-LDA specifically.
To obtain the updates for HDP-LDA, it is necessary to plug the base measure H and the emission distribution F(\phi) of the HDP-LDA setting into the equation below:

\displaystyle f_k^{-x_{ji}}(x_{ji}) = \frac{\int f(x_{ji}|\phi_k)\prod_{j'i'\neq ji,\, z_{j'i'}=k} f(x_{j'i'}|\phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{j'i'\neq ji,\, z_{j'i'}=k} f(x_{j'i'}|\phi_k)\, h(\phi_k)\, d\phi_k} , (eq. 30 on [Teh+ 2006])

where h is the probability density function of H and f is that of F.
In the case of HDP-LDA, H is a Dirichlet distribution over the vocabulary and F is a topic-word multinomial distribution, that is

\displaystyle h(\phi) = {\rm Dir}(\phi|\beta) = \frac{\Gamma(V\beta)}{\Gamma(\beta)^V}\prod_{w=1}^V \phi_w^{\beta-1}, \qquad f(x=w|\phi) = \phi_w ,

where V is the vocabulary size and \beta is the hyperparameter of the symmetric Dirichlet prior.

Substituting these into equation (30), we obtain

\displaystyle f_k^{-x_{ji}}(x_{ji}=w) = \frac{n_{kw}^{-ji} + \beta}{n_{k\cdot}^{-ji} + V\beta} ,

where n_{kw}^{-ji} is the term count of word w with topic k excluding x_{ji}, and n_{k\cdot}^{-ji} = \sum_w n_{kw}^{-ji}.

We also need f_{k^{\rm new}}(x_{ji}) for the case that t takes a new table. It is obtained as the following:

\displaystyle f_{k^{\rm new}}(x_{ji}) = \int f(x_{ji}|\phi)\,h(\phi)\,d\phi = \frac{1}{V} .

And it is necessary to write down f_k(\boldsymbol{x}_{jt}) also for sampling k.

For n_{kw}^{-jt}, the term count of word w with topic k excluding the words \boldsymbol{x}_{jt} at table t of document j, and n_{jtw}, the count of word w at that table (with n_{k\cdot}^{-jt}=\sum_w n_{kw}^{-jt} and n_{jt\cdot}=\sum_w n_{jtw}),

\displaystyle f_k^{-\boldsymbol{x}_{jt}}(\boldsymbol{x}_{jt}) = \frac{\Gamma\left(n_{k\cdot}^{-jt}+V\beta\right)}{\Gamma\left(n_{k\cdot}^{-jt}+n_{jt\cdot}+V\beta\right)} \prod_w \frac{\Gamma\left(n_{kw}^{-jt}+n_{jtw}+\beta\right)}{\Gamma\left(n_{kw}^{-jt}+\beta\right)} .
When implementing this in Python, it is faster not to expand the Gamma functions into products. In either case it is necessary to compute them in log space; otherwise f_k(x_jt) will overflow the float range.
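For example, a log-space version of f_k(x_jt) with scipy's gammaln could look like this (a sketch under my own notation: n_kw is the length-V count vector of topic k excluding table t, n_jtw the word counts at that table, beta the Dirichlet hyperparameter):

import numpy as np
from scipy.special import gammaln

def log_f_k(n_kw, n_jtw, beta):
    # log of the Gamma-function form of f_k(x_jt), computed in log space to avoid overflow
    V = len(n_kw)
    n_k, n_jt = n_kw.sum(), n_jtw.sum()
    return (gammaln(n_k + V * beta) - gammaln(n_k + n_jt + V * beta)
            + np.sum(gammaln(n_kw + n_jtw + beta) - gammaln(n_kw + beta)))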

Finally, plugging these densities into the conditional distributions for sampling t and k in [Teh+ 2006] gives the collapsed Gibbs sampler for HDP-LDA.

[Kim+ ICML12] Dirichlet Process with Mixed Random Measures

We held a private reading meeting for ICML 2012.
I took charge of and introduced [Kim+ ICML12] “Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data.”
This is the presentation for it.

DP-MRM [Kim+ ICML12] is a supervised topic model like sLDA [Blei+ 2007], DiscLDA [Lacoste-Julien+ 2008] and MedLDA [Zhu+ 2009], and is regarded as a nonparametric version of Labeled LDA [Ramage+ 2009] in particular.

Although Labeled LDA is easy to implement (my implementation is here), it has the disadvantage that you must specify the label-topic correspondence explicitly and manually.
On the other hand, DP-MRM can automatically decide the label-topic correspondence as distributions. I am very interested in it.
But it is hard to implement because it is a nonparametric Bayesian model.
Since I don't need infinite topics but rather hierarchical label-topic correspondences, I guess that replacing the DPs with ordinary Dirichlet distributions in this model would make it very useful, handy and fast… I am going to try it! 😀

Short Text Language Detection with Infinity-Gram

I talked about language detection (language identification) for Twitter at NAIST (Nara Institute of Science and Technology).
This is its slide.


Tweets are too short to detect their languages precisely. I guess one reason is that the features extracted from a short text are not enough for detection.
Another reason is that tweets contain some unique expressions, for example, u for you, 4 for for, LOL, F4F, various emoticons and so on.

I developed ldig, a prototype of short-text language detection, which solves those problems.
ldig can detect the languages of tweets with over 99% accuracy for 19 languages.

The above slide explains how ldig solves those problems.

[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems

In April 2012, we held a private reading meeting for NIPS 2011.
I read “Iterative Learning for Reliable Crowdsourcing Systems” [Karger+ NIPS11].

This paper targets Amazon Mechanical Turk (AMT), which splits a large task into microtasks. Each worker on AMT may be a spammer (who answers randomly to earn the fee) or a hammer (who answers correctly).
The paper's model simply assumes that each microtask has a consistent binary answer and that each worker has a probability of answering correctly that is independent of the task. Under that assumption, it estimates the average error rate when the task size is large enough.
I don't mind that it needs a simple, strong assumption, but unfortunately the true value of the model parameter q cannot be known, so a practical problem can't be fitted to the model even if the assumption were accepted.

[Freedman+ EMNLP11] Extreme Extraction – Machine Reading in a Week

In December 2011, we held a private reading meeting for EMNLP 2011.
I read “Extreme Extraction – Machine Reading in a Week” [Freedman+ EMNLP11].

This paper is about constructing a concept and relation extraction system rapidly.
I'm afraid it gives no definite construction methods (they might be confidential…), only the length of time needed for each task.
Since I have not yet studied this field very much, I at least learned what tasks information extraction generally consists of.

Why is Norwegian and Danish identification difficult?

I re-post the evaluation table of ldig (Twitter language detection).

lang size detected correct precision recall
cs 5329 5330 5319 0.9979 0.9981
da 5478 5483 5311 0.9686 0.9695
de 10065 10076 10014 0.9938 0.9949
en 9701 9670 9569 0.9896 0.9864
es 10066 10075 9989 0.9915 0.9924
fi 4490 4472 4459 0.9971 0.9931
fr 10098 10097 10048 0.9951 0.9950
id 10181 10233 10167 0.9936 0.9986
it 10150 10191 10109 0.9920 0.9960
nl 9671 9579 9521 0.9939 0.9845
no 8560 8442 8219 0.9736 0.9602
pl 10070 10079 10054 0.9975 0.9984
pt 9422 9441 9354 0.9908 0.9928
ro 5914 5831 5822 0.9985 0.9844
sv 9990 10034 9866 0.9833 0.9876
tr 10310 10321 10300 0.9980 0.9990
vi 10494 10486 10479 0.9993 0.9986
total 149989 148600 0.9907

This shows that the accuracies of Norwegian and Danish are lower than others.
It is because Norwegian and Danish are very similar.

Here are the top 25 words of the word distributions of Norwegian and Danish.

word Danish Norwegian amount
1 er 0.0311 0.0238 0.0311
2 det 0.0287 0.0228 0.0287
3 i 0.0253 0.0275 0.0275
4 0.0165 0.0263 0.0263
5 jeg 0.0185 0.0202 0.0202
6 og 0.0188 0.0202 0.0202
7 at 0.0183 0.0083 0.0183
8 å 0.0001 0.0167 0.0167
9 til 0.0157 0.0140 0.0157
10 en 0.0149 0.0120 0.0149
11 ikke 0.0119 0.0146 0.0146
12 har 0.0122 0.0132 0.0132
13 med 0.0120 0.0117 0.0120
14 som 0.0044 0.0116 0.0116
15 for 0.0097 0.0115 0.0115
16 du 0.0111 0.0093 0.0111
17 0.0110 0.0072 0.0110
18 der 0.0101 0.0016 0.0101
19 av 0.0000 0.0093 0.0093
20 den 0.0089 0.0056 0.0089
21 af 0.0089 0.0000 0.0089
22 om 0.0052 0.0073 0.0073
23 kan 0.0073 0.0044 0.0073
24 men 0.0062 0.0072 0.0072
25 de 0.0066 0.0054 0.0066

Most of the high-frequency words, such as ‘er’ (which roughly corresponds to ‘is’ in English; likewise below), ‘det’ (‘it’) and ‘i’ (‘in’), are function words common to the two languages, so it is very difficult to distinguish them.
Words useful for identification are pretty few. For example, ‘of’ in English corresponds to ‘af’ in Danish and ‘av’ in Norwegian, ‘me’ to ‘mig’ (da) and ‘meg’ (no), ‘now’ to ‘nu’ (da) and ‘nå’ (no), ‘just’ to ‘lige’ (da) and ‘like’ (no), ‘what’ to ‘hvad’ (da) and ‘hva’ (no), ‘no’ to ‘nej’ (da) and ‘nei’ (no), and so on.

The most serious problem is my own language identification skill for Danish and Norwegian…
I collected tens of thousands of tweets in these languages, but I'm afraid they contain several percent errors.
Of course, that causes detection errors.

If you know a native Norwegian or Danish speaker who is interested in language detection and corpus annotation, I'd be glad if you introduced me! 😀

Estimation of ldig (twitter Language Detection) for LIGA dataset

LIGA [Tromp+ 11] is a Twitter language detection method for 6 languages (German, English, Spanish, French, Italian and Dutch).
It uses a graph of 3-grams to capture long-distance features and achieves 95–98% accuracy.
They published their dataset here, which has 9066 tweets, so it is possible to compare ldig (Language Detection with Infinity-Gram: site, blog) to their result.

First, their dataset needs to be converted into ldig's input format.
Here is a Ruby script to convert it.

#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

# convert the LIGA dataset (one tweet per file, grouped by language directory)
# into ldig's "lang<TAB>text" format, one tweet per line
open("liga_dataset.txt", "wb:UTF-8") do |out|
  ["de_DE","en_UK","es_ES","fr_FR","it_IT","nl_NL"].each do |dir|
    lang = dir[0,2]
    Dir.glob("#{dir}/*.txt") do |file|
      # replace control characters (including newlines) with spaces
      line = open(file, "rb:UTF-8") {|f| f.read}.gsub(/[\u0000-\u001f]/, " ").strip
      out.puts "#{lang}\t#{line}"
    end
  end
end

Here is the evaluation of ldig on the generated dataset, liga_dataset.txt.

lang size detect correct precision recall
de 1479 1469 1463 0.9959 0.9892
en 1505 1504 1489 0.9900 0.9894
es 1562 1550 1541 0.9942 0.9866
fr 1551 1545 1539 0.9961 0.9923
it 1539 1532 1526 0.9961 0.9916
nl 1430 1429 1425 0.9972 0.9965
total 9066 8983 0.9908

It shows that ldig achieves over 99% accuracy on their dataset.

Reference

  • [Tromp+ 11] Erik Tromp and Mykola Pechenizkiy, “Graph-Based N-gram Language Identification on Short Texts”