<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Shuyo&#039;s Weblog</title>
	<atom:link href="http://shuyo.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://shuyo.wordpress.com</link>
	<description>About my favorite technical subjects</description>
	<lastBuildDate>Mon, 13 May 2013 07:19:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='shuyo.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Shuyo&#039;s Weblog</title>
		<link>http://shuyo.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://shuyo.wordpress.com/osd.xml" title="Shuyo&#039;s Weblog" />
	<atom:link rel='hub' href='http://shuyo.wordpress.com/?pushpress=hub'/>
		<item>
		<title>HDP-LDA updates</title>
		<link>http://shuyo.wordpress.com/2012/08/15/hdp-lda-updates/</link>
		<comments>http://shuyo.wordpress.com/2012/08/15/hdp-lda-updates/#comments</comments>
		<pubDate>Wed, 15 Aug 2012 10:44:28 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[LDA]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Nonparametric Bayesian]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=306</guid>
		<description><![CDATA[The Hierarchical Dirichlet Process (Teh+ 2006) is a nonparametric Bayesian topic model that can handle an unbounded number of topics. In particular, HDP-LDA is interesting as an extension of LDA. (Teh+ 2006) derives the collapsed Gibbs sampling updates for the general HDP framework, &#8230; <a href="http://shuyo.wordpress.com/2012/08/15/hdp-lda-updates/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>The Hierarchical Dirichlet Process (Teh+ 2006) is a nonparametric Bayesian topic model that can handle an unbounded number of topics.<br />
In particular, HDP-LDA is interesting as an extension of LDA.</p>
<p>(Teh+ 2006) derives the collapsed Gibbs sampling updates for the general HDP framework, but not for HDP-LDA specifically.<br />
To obtain the updates for HDP-LDA, we need to substitute the base measure H and the emission distribution F(phi) of the HDP-LDA setting into the equation below:</p>
<p><img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=f_k%5E%7B-x_%7Bji%7D%7D(x_%7Bji%7D)%20%3D%20%5Cfrac%7B%5Cint%20f(x_%7Bji%7D%7C%5Cphi_k)%5Cprod_%7Bj'i'%5Cneq%20ji%2Cz_%7Bj'i'%7D%3Dk%7Df(x_%7Bj'i'%7D%7C%5Cphi_k)h(%5Cphi_k)d%5Cphi_k%7D%7B%5Cint%20%5Cprod_%7Bj'i'%5Cneq%20ji%2Cz_%7Bj'i'%7D%3Dk%7Df(x_%7Bj'i'%7D%7C%5Cphi_k)h(%5Cphi_k)d%5Cphi_k%7D%0A"> (eq. 30 in [Teh+ 2006]),</p>
<p>where h is the probability density function of H and f is that of F.<br />
In the case of HDP-LDA, H is a symmetric Dirichlet distribution over the vocabulary and F is a topic-word multinomial distribution, that is</p>
<p><img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=h(%5Cphi)%3D%5Crm%7BDir%7D(%5Cbeta)%3D%5Cfrac%201Z%20%5Cprod_v%20%5Cphi_v%5E%7B%5Cbeta-1%7D%0A"> where <img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=Z%3D%5Cfrac%7B%5Cprod_v%20%5CGamma(%5Cbeta)%7D%7B%5CGamma(V%5Cbeta)%7D">,<br />
<img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=f(x_%7Bji%7D%3Dv%7C%5Cphi_k)%3D%5Cphi_%7Bkv%7D">.</p>
<p>Substituting these into equation (30), we obtain</p>
<p><img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=f_k%5E%7B-x_%7Bji%7D%7D(x_%7Bji%7D)%3D%5Cfrac%7B%5Cint%5Cphi_%7Bkv%7D%5Cprod_%7Bw%7D%5Cphi_%7Bkw%7D%5E%7Bn_%7Bkw%7D%5E%7B-ji%7D%7D%5Cprod_%7Bw%7D%5Cphi_%7Bkw%7D%5E%7B%5Cbeta-1%7Dd%5Cphi_%7Bk%7D%7D%7B%5Cint%5Cprod_%7Bw%7D%5Cphi_%7Bkw%7D%5E%7Bn_%7Bkw%7D%5E%7B-ji%7D%7D%5Cprod_%7Bw%7D%5Cphi_%7Bkw%7D%5E%7B%5Cbeta-1%7Dd%5Cphi_%7Bk%7D%7D"><br />
<img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=%3D%5Cfrac%7B%5Cint%5Cphi_%7Bkv%7D%5E%7Bn_%7Bkv%7D%5E%7B-ji%7D%2B%5Cbeta%7D%5Cprod_%7Bw%5Cneq%20v%7D%5Cphi_%7Bkw%7D%5E%7Bn_%7Bkw%7D%5E%7B-ji%7D%2B%5Cbeta-1%7Dd%5Cphi_%7Bk%7D%7D%7B%5Cint%5Cprod_%7Bw%7D%5Cphi_%7Bkw%7D%5E%7Bn_%7Bkw%7D%5E%7B-ji%7D%2B%5Cbeta-1%7Dd%5Cphi_%7Bk%7D%7D"><br />
<img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=%3D%5Cfrac%7B%5CGamma(%5Cbeta%2Bn_%7Bkv%7D%5E%7B-ji%7D%2B1)%5Cprod_%7Bw%5Cneq%20v%7D%5CGamma(%5Cbeta%2Bn_%7Bkw%7D%5E%7B-ji%7D)%7D%7B%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-ji%7D%2B1)%7D%5Ccdot%5Cfrac%7B%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-ji%7D)%7D%7B%5Cprod_%7Bw%7D%5CGamma(%5Cbeta%2Bn_%7Bkw%7D%5E%7B-ji%7D)%7D"><br />
<img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=%3D%5Cfrac%7B%5CGamma(%5Cbeta%2Bn_%7Bkv%7D%5E%7B-ji%7D%2B1)%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-ji%7D)%7D%7B%5CGamma(%5Cbeta%2Bn_%7Bkv%7D%5E%7B-ji%7D)%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-ji%7D%2B1)%7D%0A%3D%20%5Cfrac%7B%5Cbeta%2Bn_%7Bkv%7D%5E%7B-ji%7D%7D%7BV%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-ji%7D%7D">,</p>
<p>where <img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=v%3Dx_%7Bji%7D%2C%5C%3B%0Ak%3Dk_%7Bjt_%7Bji%7D%7D%2C%5C%3B%0An_%7Bkv%7D%5E%7B-ji%7D%3D%5Csharp%5C%7Bx_%7Bmn%7D%7Ck_%7Bmt_%7Bmn%7D%7D%3Dk%2Cx_%7Bmn%7D%3Dv%2C(m%2Cn)%5Cneq(j%2Ci)%5C%7D"></p>
<p>We also need f_{k^new} for the case where t is a new table. It is obtained as follows:</p>
<p><img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=%0Af_%7Bk%5E%7B%5Ctext%7Bnew%7D%7D%7D%5E%7B-x_%7Bji%7D%7D(x_%7Bji%7D)%3D%0A%5Cint%20p(x_%7Bji%7D%3Dv%7C%5Cphi)p(%5Cphi)d%5Cbf%7B%5Cphi%7D%3D%0A%5Cint%20%5Cphi_v%5Ccdot%5Cfrac%7B%5CGamma(V%5Cbeta)%7D%7B%5Cprod_w%5CGamma(%5Cbeta)%7D%5Cprod_w%20%5Cphi_w%5E%7B%5Cbeta-1%7Dd%5Cbf%7B%5Cphi%7D%0A%0A"><br />
<img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=%3D%5Cfrac%7B%5CGamma(V%5Cbeta)%7D%7B%5Cprod_w%5CGamma(%5Cbeta)%7D%5Ccdot%5Cfrac%7B%5CGamma(%5Cbeta%2B1)%5Cprod_%7Bw%5Cneq%20v%7D%5CGamma(%5Cbeta)%7D%7B%5CGamma(V%5Cbeta%2B1)%7D%3D%5Cfrac%201V"></p>
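<p>To make the two results above concrete, here is a minimal Python sketch of both formulas. The function name and the nested-dict count table <code>n_kv</code> are my own illustrative assumptions, not taken from any particular implementation, and the counts are assumed to already exclude the word x_ji that is being resampled.</p>

```python
def word_likelihood(n_kv, k, v, V, beta):
    """Return f_k^{-x_ji}(x_ji = v) = (beta + n_kv) / (V * beta + n_k.).

    n_kv is a hypothetical count table {topic: {word: count}} that already
    excludes the word being resampled. A topic that is missing from the
    table has all counts zero, so the result is 1/V (the new-topic case).
    """
    counts = n_kv.get(k, {})
    n_k = sum(counts.values())  # n_k. : total word count of topic k
    return (beta + counts.get(v, 0)) / (V * beta + n_k)


# Toy example: vocabulary size 4 and one existing topic (id 0).
n_kv = {0: {1: 3, 2: 1}}
print(word_likelihood(n_kv, 0, 1, V=4, beta=0.5))  # existing topic
print(word_likelihood(n_kv, 7, 1, V=4, beta=0.5))  # topic id with no counts
```

<p>For a topic without any counts the same expression reduces to beta / (V * beta) = 1/V, which agrees with the value of f_{k^new} derived above.</p>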
<p>It is also necessary to write down f_k(x_jt) for sampling k.</p>
<p><img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=%0Af_k%5E%7B-%5Cbf%7Bx%7D_%7Bjt%7D%7D(%5Cbf%7Bx%7D_%7Bjt%7D)"><br />
<img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=%3D%5Cfrac%7B%5Cint%5Cprod_%7Bx_%7Bji%7D%5Cin%5Cbf%7Bx%7D_%7Bjt%7D%7Df(x_%7Bji%7D%7C%5Cphi_k)%5Cprod_%7Bx_%7Bmn%7D%5Cnotin%5Cbf%7Bx%7D_%7Bjt%7D%2Cz_%7Bmn%7D%3Dk%7Df(x_%7Bmn%7D%7C%5Cphi_k)h(%5Cphi_k)d%5Cphi_k%7D%7B%5Cint%5Cprod_%7Bx_%7Bmn%7D%5Cnotin%5Cbf%7Bx%7D_%7Bjt%7D%2Cz_%7Bmn%7D%3Dk%7Df(x_%7Bmn%7D%7C%5Cphi_k)h(%5Cphi_k)d%5Cphi_k%7D"></p>
<p>Using the counts</p>
<p><img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=n_%7Bkw%7D%3D%5Csharp%5C%7Bx_%7Bmn%7D%7Ck_%7Bmt_%7Bmn%7D%7D%3Dk%2C%5C%3Bx_%7Bmn%7D%3Dw%5C%7D"> (that is, the number of occurrences of word w assigned to topic k)<br />
<img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=n_%7Bkw%7D%5E%7B-jt%7D%3D%5Csharp%5C%7Bx_%7Bmn%7D%7Ck_%7Bmt_%7Bmn%7D%7D%3Dk%2C%5C%3Bx_%7Bmn%7D%3Dw%2C%5C%3B(m%2Ct_%7Bmn%7D)%5Cneq(j%2Ct)%5C%7D"> (excluding <img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=%5Cbf%7Bx%7D_%7Bjt%7D%3D%5C%7Bx_%7Bji%7D%7Ct_%7Bji%7D%3Dt%5C%7D">),</p>
<p><img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=f_%7Bk%7D%5E%7B-%5Cbf%7Bx%7D_%7Bjt%7D%7D(%5Cbf%7Bx%7D_%7Bjt%7D)%3D%20%5Cfrac%7B%5Cprod_w%5CGamma(%5Cbeta%2Bn_%7Bkw%7D)%7D%7B%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D)%7D%20%5Ccdot%20%5Cfrac%7B%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-jt%7D)%7D%7B%5Cprod_w%5CGamma(%5Cbeta%2Bn_%7Bkw%7D%5E%7B-jt%7D)%7D%0A"><br />
<img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=%0A%3D%20%5Cfrac%7B%5Cprod_w(%5Cbeta%2Bn_%7Bkw%7D-1)%5Ccdots(%5Cbeta%2Bn_%7Bkw%7D%5E%7B-jt%7D)%7D%7B(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D-1)%5Ccdots(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-jt%7D)%7D"></p>
<p>When implementing this in Python, it is faster to keep the Gamma functions than to expand them into products. In either case it is necessary to work with logarithms, otherwise f_k(x_jt) will fall outside the range of floating-point numbers.</p>
<p>Finally, for a new topic, where n_{.v}^{jt} denotes the number of occurrences of word v in x_jt,<br />
<img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=f_%7Bk%5E%7B%5Ctext%7Bnew%7D%7D%7D%5E%7B-%5Cbf%7Bx%7D_%7Bjt%7D%7D(%5Cbf%7Bx%7D_%7Bjt%7D)%20%3D%20%5Cfrac%7B%5CGamma(V%5Cbeta)%5Cprod_v%5CGamma(%5Cbeta%2Bn_%7B%5Ccdot%20v%7D%5E%7Bjt%7D)%7D%7B%5CGamma(V%5Cbeta%2Bn_%7B%5Ccdot%5Ccdot%7D%5E%7Bjt%7D)%5Cprod_v%5CGamma(%5Cbeta)%7D"></p>
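<p>The remark above about logarithms and Gamma functions can be sketched as follows. This is only an illustration under my own assumptions: the function names and the dict-based counts (<code>n_excl</code> for n_kw^{-jt} and <code>table</code> for the word counts of x_jt) are hypothetical, and I use <code>math.lgamma</code> from the standard library. Words that do not occur in x_jt contribute a factor of 1, so only the words of the table are visited.</p>

```python
import math


def log_f_k_gamma(n_excl, table, V, beta):
    """log f_k^{-x_jt}(x_jt), keeping the Gamma functions (via lgamma)."""
    n_dot_excl = sum(n_excl.values())   # n_k.^{-jt}
    n_dot_table = sum(table.values())   # number of words in x_jt
    result = (math.lgamma(V * beta + n_dot_excl)
              - math.lgamma(V * beta + n_dot_excl + n_dot_table))
    for w, c in table.items():
        base = beta + n_excl.get(w, 0)
        result += math.lgamma(base + c) - math.lgamma(base)
    return result


def log_f_k_product(n_excl, table, V, beta):
    """The same quantity with the Gamma ratios expanded into products."""
    n_dot_excl = sum(n_excl.values())
    n_dot_table = sum(table.values())
    result = 0.0
    for w, c in table.items():
        base = beta + n_excl.get(w, 0)
        for i in range(c):
            result += math.log(base + i)
    for i in range(n_dot_table):
        result -= math.log(V * beta + n_dot_excl + i)
    return result


table = {0: 2, 1: 1}    # word counts of the table x_jt (toy values)
n_excl = {0: 3, 2: 4}   # counts of topic k without that table (toy values)
print(log_f_k_gamma(n_excl, table, V=5, beta=0.5))
print(log_f_k_product(n_excl, table, V=5, beta=0.5))
print(log_f_k_gamma({}, table, V=5, beta=0.5))  # new topic: no counts yet
```

<p>Both functions return the same log value up to rounding, and passing empty counts gives the logarithm of the new-topic expression f_{k^new}(x_jt).</p>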
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2012/08/15/hdp-lda-updates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=f_k%5E%7B-x_%7Bji%7D%7D(x_%7Bji%7D)%20%3D%20%5Cfrac%7B%5Cint%20f(x_%7Bji%7D%7C%5Cphi_k)%5Cprod_%7Bj&#039;i&#039;%5Cneq%20ji%2Cz_%7Bj&#039;i&#039;%7D%3Dk%7Df(x_%7Bj&#039;i&#039;%7D%7C%5Cphi_k)h(%5Cphi_k)d%5Cphi_k%7D%7B%5Cint%20%5Cprod_%7Bj&#039;i&#039;%5Cneq%20ji%2Cz_%7Bj&#039;i&#039;%7D%3Dk%7Df(x_%7Bj&#039;i&#039;%7D%7C%5Cphi_k)h(%5Cphi_k)d%5Cphi_k%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=h(%5Cphi)%3D%5Crm%7BDir%7D(%5Cbeta)%3D%5Cfrac%201Z%20%5Cprod_v%20%5Cphi_v%5E%7B%5Cbeta-1%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=Z%3D%5Cfrac%7B%5Cprod_v%20%5CGamma(%5Cbeta)%7D%7B%5CGamma(V%5Cbeta)%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=f(x_%7Bji%7D%3Dv%7C%5Cphi_k)%3D%5Cphi_%7Bkv%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=f_k%5E%7B-x_%7Bji%7D%7D(x_%7Bji%7D)%3D%5Cfrac%7B%5Cint%5Cphi_%7Bkv%7D%5Cprod_%7Bw%7D%5Cphi_%7Bkw%7D%5E%7Bn_%7Bkw%7D%5E%7B-ji%7D%7D%5Cprod_%7Bw%7D%5Cphi_%7Bkw%7D%5E%7B%5Cbeta-1%7Dd%5Cphi_%7Bk%7D%7D%7B%5Cint%5Cprod_%7Bw%7D%5Cphi_%7Bkw%7D%5E%7Bn_%7Bkw%7D%5E%7B-ji%7D%7D%5Cprod_%7Bw%7D%5Cphi_%7Bkw%7D%5E%7B%5Cbeta-1%7Dd%5Cphi_%7Bk%7D%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=%3D%5Cfrac%7B%5Cint%5Cphi_%7Bkv%7D%5E%7Bn_%7Bkv%7D%5E%7B-ji%7D%2B%5Cbeta%7D%5Cprod_%7Bw%5Cneq%20v%7D%5Cphi_%7Bkw%7D%5E%7Bn_%7Bkw%7D%5E%7B-ji%7D%2B%5Cbeta-1%7Dd%5Cphi_%7Bk%7D%7D%7B%5Cint%5Cprod_%7Bw%7D%5Cphi_%7Bkw%7D%5E%7Bn_%7Bkw%7D%5E%7B-ji%7D%2B%5Cbeta-1%7Dd%5Cphi_%7Bk%7D%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=%3D%5Cfrac%7B%5CGamma(%5Cbeta%2Bn_%7Bkv%7D%5E%7B-ji%7D%2B1)%5Cprod_%7Bw%5Cneq%20v%7D%5CGamma(%5Cbeta%2Bn_%7Bkw%7D%5E%7B-ji%7D)%7D%7B%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-ji%7D%2B1)%7D%5Ccdot%5Cfrac%7B%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-ji%7D)%7D%7B%5Cprod_%7Bw%7D%5CGamma(%5Cbeta%2Bn_%7Bkw%7D%5E%7B-ji%7D)%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=%3D%5Cfrac%7B%5CGamma(%5Cbeta%2Bn_%7Bkv%7D%5E%7B-ji%7D%2B1)%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-ji%7D)%7D%7B%5CGamma(%5Cbeta%2Bn_%7Bkv%7D%5E%7B-ji%7D)%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-ji%7D%2B1)%7D%3D%20%5Cfrac%7B%5Cbeta%2Bn_%7Bkv%7D%5E%7B-ji%7D%7D%7BV%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-ji%7D%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=v%3Dx_%7Bji%7D%2C%5C%3Bk%3Dk_%7Bjt_%7Bji%7D%7D%2C%5C%3Bn_%7Bkv%7D%5E%7B-ji%7D%3D%5Csharp%5C%7Bx_%7Bmn%7D%7Ck_%7Bmt_%7Bmn%7D%7D%3Dk%2Cx_%7Bmn%7D%3Dv%2C(m%2Cn)%5Cneq(j%2Ci)%5C%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=f_%7Bk%5E%7B%5Ctext%7Bnew%7D%7D%7D%5E%7B-x_%7Bji%7D%7D(x_%7Bji%7D)%3D%5Cint%20p(x_%7Bji%7D%3Dv%7C%5Cphi)p(%5Cphi)d%5Cbf%7B%5Cphi%7D%3D%5Cint%20%5Cphi_v%5Ccdot%5Cfrac%7B%5CGamma(V%5Cbeta)%7D%7B%5Cprod_w%5CGamma(%5Cbeta)%7D%5Cprod_w%20%5Cphi_w%5E%7B%5Cbeta-1%7Dd%5Cbf%7B%5Cphi%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=%3D%5Cfrac%7B%5CGamma(V%5Cbeta)%7D%7B%5Cprod_w%5CGamma(%5Cbeta)%7D%5Ccdot%5Cfrac%7B%5CGamma(%5Cbeta%2B1)%5Cprod_%7Bw%5Cneq%20v%7D%5CGamma(%5Cbeta)%7D%7B%5CGamma(V%5Cbeta%2B1)%7D%3D%5Cfrac%201V" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=f_k%5E%7B-%5Cbf%7Bx%7D_%7Bjt%7D%7D(%5Cbf%7Bx%7D_%7Bjt%7D)" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=%3D%5Cfrac%7B%5Cint%5Cprod_%7Bx_%7Bji%7D%5Cin%5Cbf%7Bx%7D_%7Bjt%7D%7Df(x_%7Bji%7D%7C%5Cphi_k)%5Cprod_%7Bx_%7Bmn%7D%5Cnotin%5Cbf%7Bx%7D_%7Bjt%7D%2Cz_%7Bmn%7D%3Dk%7Df(x_%7Bmn%7D%7C%5Cphi_k)h(%5Cphi_k)d%5Cphi_k%7D%7B%5Cint%5Cprod_%7Bx_%7Bmn%7D%5Cnotin%5Cbf%7Bx%7D_%7Bjt%7D%2Cz_%7Bmn%7D%3Dk%7Df(x_%7Bmn%7D%7C%5Cphi_k)h(%5Cphi_k)d%5Cphi_k%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=n_%7Bkw%7D%3D%5Csharp%5C%7Bx_%7Bmn%7D%7Ck_%7Bmt_%7Bmn%7D%7D%3Dk%2C%5C%3Bx_%7Bmn%7D%3Dw%5C%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=n_%7Bkw%7D%5E%7B-jt%7D%3D%5Csharp%5C%7Bx_%7Bmn%7D%7Ck_%7Bmt_%7Bmn%7D%7D%3Dk%2C%5C%3Bx_%7Bmn%7D%3Dw%2C%5C%3B(m%2Ct_%7Bmn%7D)%5Cneq(j%2Ct)%5C%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=%5Cbf%7Bx%7D_%7Bjt%7D%3D%5C%7Bx_%7Bji%7D%7Ct_%7Bji%7D%3Dt%5C%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=f_%7Bk%7D%5E%7B-%5Cbf%7Bx%7D_%7Bjt%7D%7D(%5Cbf%7Bx%7D_%7Bjt%7D)%3D%20%5Cfrac%7B%5Cprod_w%5CGamma(%5Cbeta%2Bn_%7Bkw%7D)%7D%7B%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D)%7D%20%5Ccdot%20%5Cfrac%7B%5CGamma(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-jt%7D)%7D%7B%5Cprod_w%5CGamma(%5Cbeta%2Bn_%7Bkw%7D%5E%7B-jt%7D)%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=%3D%20%5Cfrac%7B%5Cprod_w(%5Cbeta%2Bn_%7Bkw%7D-1)%5Ccdots(%5Cbeta%2Bn_%7Bkw%7D%5E%7B-jt%7D)%7D%7B(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D-1)%5Ccdots(V%5Cbeta%2Bn_%7Bk%5Ccdot%7D%5E%7B-jt%7D)%7D" medium="image" />

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=f_%7Bk%5E%7B%5Ctext%7Bnew%7D%7D%7D%5E%7B-%5Cbf%7Bx%7D_%7Bjt%7D%7D(%5Cbf%7Bx%7D_%7Bjt%7D)%20%3D%20%5Cfrac%7B%5CGamma(V%5Cbeta)%5Cprod_v%5CGamma(%5Cbeta%2Bn_%7B%5Ccdot%20v%7D%5E%7Bjt%7D)%7D%7B%5CGamma(V%5Cbeta%2Bn_%7B%5Ccdot%5Ccdot%7D%5E%7Bjt%7D)%5Cprod_v%5CGamma(%5Cbeta)%7D" medium="image" />
	</item>
		<item>
		<title>[Kim+ ICML12] Dirichlet Process with Mixed Random Measures</title>
		<link>http://shuyo.wordpress.com/2012/07/31/kim-icml12-dirichlet-process-with-mixed-random-measures/</link>
		<comments>http://shuyo.wordpress.com/2012/07/31/kim-icml12-dirichlet-process-with-mixed-random-measures/#comments</comments>
		<pubDate>Tue, 31 Jul 2012 04:28:26 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[LDA]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Nonparametric Bayesian]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=295</guid>
		<description><![CDATA[We held a private reading group for ICML 2012. I picked and presented [Kim+ ICML12] &#8220;Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data.&#8221; Here are my slides. DP-MRM [Kim+ ICML12] is a &#8230; <a href="http://shuyo.wordpress.com/2012/07/31/kim-icml12-dirichlet-process-with-mixed-random-measures/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>We held a private reading group for ICML 2012.<br />
I picked and presented [Kim+ ICML12] &#8220;Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data.&#8221;<br />
Here are my slides.</p>
<iframe src='http://www.slideshare.net/slideshow/embed_code/13783045' width='640' height='525'></iframe>
<p>DP-MRM [Kim+ ICML12] is a supervised topic model like sLDA [Blei+ 2007], DiscLDA [Lacoste-Julien+ 2008] and MedLDA [Zhu+ 2009], and is regarded as a nonparametric version of Labeled LDA [Ramage+ 2009] in particular.</p>
<p>Although Labeled LDA is easy to implement (my implementation is <a href="https://github.com/shuyo/iir/blob/master/lda/llda.py">here</a>), it has the disadvantage that you must specify the label-topic correspondences explicitly and manually.<br />
DP-MRM, on the other hand, can determine the label-topic correspondences automatically, as distributions. I am very interested in it.<br />
But it is hard to implement because it is a nonparametric Bayesian model.<br />
Since what I want is not infinite topics but hierarchical label-topic correspondences, I guess that replacing the DPs in this model with ordinary Dirichlet distributions would make it very useful, handy and fast&#8230; I am going to try it! <img src='http://s0.wp.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2012/07/31/kim-icml12-dirichlet-process-with-mixed-random-measures/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>Short Text Language Detection with Infinity-Gram</title>
		<link>http://shuyo.wordpress.com/2012/05/17/short-text-language-detection-with-infinity-gram/</link>
		<comments>http://shuyo.wordpress.com/2012/05/17/short-text-language-detection-with-infinity-gram/#comments</comments>
		<pubDate>Thu, 17 May 2012 10:31:45 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Language Detection]]></category>
		<category><![CDATA[NLP]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=286</guid>
		<description><![CDATA[I gave a talk on language detection (language identification) for Twitter at NAIST (Nara Institute of Science and Technology). Here are the slides. Tweets are too short for their languages to be detected precisely. I guess that one reason is that the features extracted from a &#8230; <a href="http://shuyo.wordpress.com/2012/05/17/short-text-language-detection-with-infinity-gram/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I gave a talk on language detection (language identification) for Twitter at NAIST (Nara Institute of Science and Technology).<br />
Here are the slides.</p>
<p><iframe src='http://www.slideshare.net/slideshow/embed_code/12949447' width='640' height='525'></iframe><br />
</p>
<p>Tweets are too short for their languages to be detected precisely. I guess that one reason is that the features extracted from a short text are not sufficient for detection.<br />
Another reason is that tweets use some unique expressions, for example "u" for "you", "4" for "for", LOL, F4F, various emoticons and so on.</p>
<p>I developed <em>ldig</em>, a prototype short-text language detector that addresses those problems.<br />
ldig can detect the languages of tweets with over 99% accuracy for 19 languages.</p>
<ul>
<li><a href="https://github.com/shuyo/ldig">https://github.com/shuyo/ldig</a></li>
</ul>
<p>The slides above explain how ldig solves those problems.</p>
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2012/05/17/short-text-language-detection-with-infinity-gram/feed/</wfw:commentRss>
		<slash:comments>22</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems</title>
		<link>http://shuyo.wordpress.com/2012/05/01/karger-nips11-iterative-learning-for-reliable-crowdsourcing-systems/</link>
		<comments>http://shuyo.wordpress.com/2012/05/01/karger-nips11-iterative-learning-for-reliable-crowdsourcing-systems/#comments</comments>
		<pubDate>Tue, 01 May 2012 07:36:07 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[NLP]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=282</guid>
		<description><![CDATA[In April 2012, we held a private reading group for NIPS 2011. I read &#8220;Iterative Learning for Reliable Crowdsourcing Systems&#8221; [Karger+ NIPS11]. This paper targets Amazon &#8230; <a href="http://shuyo.wordpress.com/2012/05/01/karger-nips11-iterative-learning-for-reliable-crowdsourcing-systems/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>In April 2012, we held a private reading group for NIPS 2011.<br />
I read &#8220;Iterative Learning for Reliable Crowdsourcing Systems&#8221; [Karger+ NIPS11].</p>
<div style="width:425px;" id="__ss_12312278"> <strong><a href="http://www.slideshare.net/shuyo/karger-croudsourcingnips11" title="[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems" target="_blank">[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems</a></strong> <iframe src='http://www.slideshare.net/slideshow/embed_code/12312278' width='425' height='348' scrolling='no'></iframe>
</div>
<p>This paper targets Amazon Mechanical Turk (AMT), which splits a large task into microtasks. Each worker on AMT may be a spammer (who answers randomly to earn the fee) or a hammer (who answers correctly).<br />
The paper&#8217;s model simply assumes that each microtask has a single correct binary answer and that each worker answers correctly with a probability that is independent of the task. Under these assumptions, it estimates the average error rate when the number of tasks is sufficiently large.<br />
I don&#8217;t mind that it needs simple, strong assumptions, but unfortunately the true value of the model parameter q cannot be known, so a practical problem cannot be fitted to the model even if the assumptions are accepted.</p>
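<p>To get a feeling for the spammer/hammer setting, here is a small simulation sketch. It is not the iterative algorithm of the paper: it uses plain majority voting as a simple baseline, lets every worker answer every task, and takes q to be the fraction of hammers, all of which are my own simplifying assumptions. The function name is hypothetical as well.</p>

```python
import random


def simulate_majority_vote(num_tasks, num_workers, q, seed=0):
    """Toy spammer/hammer model with majority voting.

    Each worker is a hammer with probability q (always answers correctly)
    and otherwise a spammer (answers +1 or -1 uniformly at random).
    Returns the fraction of tasks for which the majority vote is wrong.
    """
    rng = random.Random(seed)
    is_hammer = [q > rng.random() for _ in range(num_workers)]
    errors = 0
    for _ in range(num_tasks):
        truth = rng.choice([-1, 1])
        votes = 0
        for hammer in is_hammer:
            votes += truth if hammer else rng.choice([-1, 1])
        # Ties are broken at random.
        if votes == 0:
            guess = rng.choice([-1, 1])
        else:
            guess = 1 if votes > 0 else -1
        if guess != truth:
            errors += 1
    return errors / num_tasks


for q in (0.1, 0.3, 0.6):
    print(q, simulate_majority_vote(num_tasks=1000, num_workers=7, q=q))
```

<p>In such a simulation q is chosen by hand, whereas in a real crowdsourcing problem it is unknown, which is exactly the difficulty mentioned above.</p>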
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2012/05/01/karger-nips11-iterative-learning-for-reliable-crowdsourcing-systems/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>[Freedman+ EMNLP11] Extreme Extraction – Machine Reading in a Week</title>
		<link>http://shuyo.wordpress.com/2012/05/01/freedman-emnlp11-extreme-extraction-machine-reading-in-a-week/</link>
		<comments>http://shuyo.wordpress.com/2012/05/01/freedman-emnlp11-extreme-extraction-machine-reading-in-a-week/#comments</comments>
		<pubDate>Tue, 01 May 2012 05:54:42 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[NLP]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=277</guid>
		<description><![CDATA[In December 2011, we held a private reading group for EMNLP 2011. I read &#8220;Extreme Extraction – Machine Reading in a Week&#8221; [Freedman+ EMNLP11]. This paper &#8230; <a href="http://shuyo.wordpress.com/2012/05/01/freedman-emnlp11-extreme-extraction-machine-reading-in-a-week/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>In December 2011, we held a private reading group for EMNLP 2011.<br />
I read &#8220;Extreme Extraction – Machine Reading in a Week&#8221; [Freedman+ EMNLP11].</p>
<div style="width:425px;" id="__ss_10681653"> <strong><a href="http://www.slideshare.net/shuyo/machine-reading-emnlp11" title="Extreme Extraction - Machine Reading in a Week" target="_blank">Extreme Extraction &#8211; Machine Reading in a Week</a></strong> <iframe src='http://www.slideshare.net/slideshow/embed_code/10681653' width='425' height='348' scrolling='no'></iframe>
</div>
<p>This paper is about rapidly constructing a concept and relation extraction system.<br />
Unfortunately, it gives no concrete construction methods (they might be confidential&#8230;), only the length of time spent on each task.<br />
Since I have not yet learned much about this field, I at least learned what tasks information extraction generally consists of.</p>
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2012/05/01/freedman-emnlp11-extreme-extraction-machine-reading-in-a-week/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>Why is Norwegian and Danish identification difficult?</title>
		<link>http://shuyo.wordpress.com/2012/03/07/why-is-norwegian-and-danish-identification-difficult/</link>
		<comments>http://shuyo.wordpress.com/2012/03/07/why-is-norwegian-and-danish-identification-difficult/#comments</comments>
		<pubDate>Tue, 06 Mar 2012 17:28:00 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=272</guid>
		<description><![CDATA[Here again is the evaluation table for ldig (language detection for Twitter), listing size, detected, correct, precision and recall for each language. &#8230; <a href="http://shuyo.wordpress.com/2012/03/07/why-is-norwegian-and-danish-identification-difficult/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Here again is the evaluation table for ldig (language detection for Twitter).</p>
<table>
<tr>
<th>lang</th>
<th>size</th>
<th>detected</th>
<th>correct</th>
<th>precision</th>
<th>recall</th>
</tr>
<tr>
<td align="right">cs</td>
<td align="right">5329</td>
<td align="right">5330</td>
<td align="right">5319</td>
<td align="right">0.9979</td>
<td align="right">0.9981</td>
</tr>
<tr>
<td align="right">da</td>
<td align="right">5478</td>
<td align="right">5483</td>
<td align="right">5311</td>
<td align="right">0.9686</td>
<td align="right">0.9695</td>
</tr>
<tr>
<td align="right">de</td>
<td align="right">10065</td>
<td align="right">10076</td>
<td align="right">10014</td>
<td align="right">0.9938</td>
<td align="right">0.9949</td>
</tr>
<tr>
<td align="right">en</td>
<td align="right">9701</td>
<td align="right">9670</td>
<td align="right">9569</td>
<td align="right">0.9896</td>
<td align="right">0.9864</td>
</tr>
<tr>
<td align="right">es</td>
<td align="right">10066</td>
<td align="right">10075</td>
<td align="right">9989</td>
<td align="right">0.9915</td>
<td align="right">0.9924</td>
</tr>
<tr>
<td align="right">fi</td>
<td align="right">4490</td>
<td align="right">4472</td>
<td align="right">4459</td>
<td align="right">0.9971</td>
<td align="right">0.9931</td>
</tr>
<tr>
<td align="right">fr</td>
<td align="right">10098</td>
<td align="right">10097</td>
<td align="right">10048</td>
<td align="right">0.9951</td>
<td align="right">0.9950</td>
</tr>
<tr>
<td align="right">id</td>
<td align="right">10181</td>
<td align="right">10233</td>
<td align="right">10167</td>
<td align="right">0.9936</td>
<td align="right">0.9986</td>
</tr>
<tr>
<td align="right">it</td>
<td align="right">10150</td>
<td align="right">10191</td>
<td align="right">10109</td>
<td align="right">0.9920</td>
<td align="right">0.9960</td>
</tr>
<tr>
<td align="right">nl</td>
<td align="right">9671</td>
<td align="right">9579</td>
<td align="right">9521</td>
<td align="right">0.9939</td>
<td align="right">0.9845</td>
</tr>
<tr>
<td align="right">no</td>
<td align="right">8560</td>
<td align="right">8442</td>
<td align="right">8219</td>
<td align="right">0.9736</td>
<td align="right">0.9602</td>
</tr>
<tr>
<td align="right">pl</td>
<td align="right">10070</td>
<td align="right">10079</td>
<td align="right">10054</td>
<td align="right">0.9975</td>
<td align="right">0.9984</td>
</tr>
<tr>
<td align="right">pt</td>
<td align="right">9422</td>
<td align="right">9441</td>
<td align="right">9354</td>
<td align="right">0.9908</td>
<td align="right">0.9928</td>
</tr>
<tr>
<td align="right">ro</td>
<td align="right">5914</td>
<td align="right">5831</td>
<td align="right">5822</td>
<td align="right">0.9985</td>
<td align="right">0.9844</td>
</tr>
<tr>
<td align="right">sv</td>
<td align="right">9990</td>
<td align="right">10034</td>
<td align="right">9866</td>
<td align="right">0.9833</td>
<td align="right">0.9876</td>
</tr>
<tr>
<td align="right">tr</td>
<td align="right">10310</td>
<td align="right">10321</td>
<td align="right">10300</td>
<td align="right">0.9980</td>
<td align="right">0.9990</td>
</tr>
<tr>
<td align="right">vi</td>
<td align="right">10494</td>
<td align="right">10486</td>
<td align="right">10479</td>
<td align="right">0.9993</td>
<td align="right">0.9986</td>
</tr>
<tr>
<td align="right">total</td>
<td align="right">149989</td>
<td align="right"> </td>
<td align="right">148600</td>
<td align="right"></td>
<td align="right">0.9907</td>
</tr>
</table>
<p>This shows that the accuracy for Norwegian and Danish is lower than for the other languages.<br />
This is because Norwegian and Danish are very similar.</p>
<p>Here are the top 25 words in the Norwegian and Danish word distributions.</p>
<table>
<tr>
<td></td>
<td>word</td>
<td>Danish</td>
<td>Norwegian</td>
<td>max</td>
</tr>
<tr>
<td>1</td>
<td>er</td>
<td>0.0311</td>
<td>0.0238</td>
<td>0.0311</td>
</tr>
<tr>
<td>2</td>
<td>det</td>
<td>0.0287</td>
<td>0.0228</td>
<td>0.0287</td>
</tr>
<tr>
<td>3</td>
<td>i</td>
<td>0.0253</td>
<td>0.0275</td>
<td>0.0275</td>
</tr>
<tr>
<td>4</td>
<td>på</td>
<td>0.0165</td>
<td>0.0263</td>
<td>0.0263</td>
</tr>
<tr>
<td>5</td>
<td>jeg</td>
<td>0.0185</td>
<td>0.0202</td>
<td>0.0202</td>
</tr>
<tr>
<td>6</td>
<td>og</td>
<td>0.0188</td>
<td>0.0202</td>
<td>0.0202</td>
</tr>
<tr>
<td>7</td>
<td>at</td>
<td>0.0183</td>
<td>0.0083</td>
<td>0.0183</td>
</tr>
<tr>
<td>8</td>
<td>å</td>
<td>0.0001</td>
<td>0.0167</td>
<td>0.0167</td>
</tr>
<tr>
<td>9</td>
<td>til</td>
<td>0.0157</td>
<td>0.0140</td>
<td>0.0157</td>
</tr>
<tr>
<td>10</td>
<td>en</td>
<td>0.0149</td>
<td>0.0120</td>
<td>0.0149</td>
</tr>
<tr>
<td>11</td>
<td>ikke</td>
<td>0.0119</td>
<td>0.0146</td>
<td>0.0146</td>
</tr>
<tr>
<td>12</td>
<td>har</td>
<td>0.0122</td>
<td>0.0132</td>
<td>0.0132</td>
</tr>
<tr>
<td>13</td>
<td>med</td>
<td>0.0120</td>
<td>0.0117</td>
<td>0.0120</td>
</tr>
<tr>
<td>14</td>
<td>som</td>
<td>0.0044</td>
<td>0.0116</td>
<td>0.0116</td>
</tr>
<tr>
<td>15</td>
<td>for</td>
<td>0.0097</td>
<td>0.0115</td>
<td>0.0115</td>
</tr>
<tr>
<td>16</td>
<td>du</td>
<td>0.0111</td>
<td>0.0093</td>
<td>0.0111</td>
</tr>
<tr>
<td>17</td>
<td>så</td>
<td>0.0110</td>
<td>0.0072</td>
<td>0.0110</td>
</tr>
<tr>
<td>18</td>
<td>der</td>
<td>0.0101</td>
<td>0.0016</td>
<td>0.0101</td>
</tr>
<tr>
<td>19</td>
<td>av</td>
<td>0.0000</td>
<td>0.0093</td>
<td>0.0093</td>
</tr>
<tr>
<td>20</td>
<td>den</td>
<td>0.0089</td>
<td>0.0056</td>
<td>0.0089</td>
</tr>
<tr>
<td>21</td>
<td>af</td>
<td>0.0089</td>
<td>0.0000</td>
<td>0.0089</td>
</tr>
<tr>
<td>22</td>
<td>om</td>
<td>0.0052</td>
<td>0.0073</td>
<td>0.0073</td>
</tr>
<tr>
<td>23</td>
<td>kan</td>
<td>0.0073</td>
<td>0.0044</td>
<td>0.0073</td>
</tr>
<tr>
<td>24</td>
<td>men</td>
<td>0.0062</td>
<td>0.0072</td>
<td>0.0072</td>
</tr>
<tr>
<td>25</td>
<td>de</td>
<td>0.0066</td>
<td>0.0054</td>
<td>0.0066</td>
</tr>
</table>
<p>Most of the high-frequency words, such as &#8216;er&#8217; (roughly &#8216;is&#8217; in English; the glosses that follow are likewise English), &#8216;det&#8217; (&#8216;it&#8217;) and &#8216;i&#8217; (&#8216;in&#8217;), are function words shared by the 2 languages, so it is very difficult to tell the languages apart.<br />
Only a few words are useful for identification. For example, &#8216;of&#8217; in English corresponds to &#8216;af&#8217; in Danish or &#8216;av&#8217; in Norwegian, &#8216;me&#8217; corresponds to &#8216;mig&#8217; (da) or &#8216;meg&#8217; (no), &#8216;now&#8217; corresponds to &#8216;nu&#8217; (da) or &#8216;nå&#8217; (no), &#8216;just&#8217; corresponds to &#8216;lige&#8217; (da) or &#8216;like&#8217; (no), &#8216;what&#8217; corresponds to &#8216;hvad&#8217; (da) or &#8216;hva&#8217; (no), &#8216;no&#8217; corresponds to &#8216;nej&#8217; (da) or &#8216;nei&#8217; (no), and so on.</p>
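<p>As a rough illustration of how little signal these few words carry, here is a minimal Python sketch that votes using only the word pairs listed above. The function name and the voting rule are assumptions made for this example; this is not how ldig works.</p>

```python
# Toy illustration only: NOT ldig's method.
# The word pairs are the ones listed in the paragraph above.
DANISH_HINTS = {"af", "mig", "nu", "lige", "hvad", "nej"}
NORWEGIAN_HINTS = {"av", "meg", "nå", "like", "hva", "nei"}


def guess_da_or_no(text):
    """Return "da", "no", or None when the hint words cannot decide."""
    words = text.lower().split()
    da_votes = sum(1 for w in words if w in DANISH_HINTS)
    no_votes = sum(1 for w in words if w in NORWEGIAN_HINTS)
    if da_votes == no_votes:
        return None  # only shared words (or a tie): cannot decide
    return "da" if da_votes > no_votes else "no"


print(guess_da_or_no("hvad er det nu"))  # da
print(guess_da_or_no("hva er det nå"))   # no
print(guess_da_or_no("er det i"))        # None
```

<p>A text made only of shared function words such as &#8216;er&#8217;, &#8216;det&#8217; and &#8216;i&#8217; gives no vote at all, which is the difficulty described above.</p>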
<p>The most serious problem is my own ability to tell Danish and Norwegian apart&#8230;<br />
I collected tens of thousands of tweets in these languages, but I&#8217;m afraid that several percent of them are mislabeled.<br />
Of course, that causes detection errors.</p>
<p>If you know a native Norwegian or Danish speaker who is interested in language detection and corpus annotation, I&#8217;d be glad if you introduced me! <img src='http://s0.wp.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/shuyo.wordpress.com/272/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/shuyo.wordpress.com/272/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&#038;blog=4249505&#038;post=272&#038;subd=shuyo&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2012/03/07/why-is-norwegian-and-danish-identification-difficult/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>Estimation of ldig (twitter Language Detection) for LIGA dataset</title>
		<link>http://shuyo.wordpress.com/2012/03/02/estimation-of-ldig-twitter-language-detection-for-liga-dataset/</link>
		<comments>http://shuyo.wordpress.com/2012/03/02/estimation-of-ldig-twitter-language-detection-for-liga-dataset/#comments</comments>
		<pubDate>Fri, 02 Mar 2012 10:14:53 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Language Detection]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=263</guid>
		<description><![CDATA[LIGA[Tromp+ 11] is a twitter language detection for 6 languages (German, English, Spanish, French, Italian and Dutch). It uses a graph with 3-grams for long distance features and detects 95-98% accuracy. They open their dataset here which has 9066 tweets, &#8230; <a href="http://shuyo.wordpress.com/2012/03/02/estimation-of-ldig-twitter-language-detection-for-liga-dataset/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&#038;blog=4249505&#038;post=263&#038;subd=shuyo&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>LIGA [Tromp+ 11] is a language detector for Twitter that supports 6 languages (German, English, Spanish, French, Italian and Dutch).<br />
It uses a graph over 3-grams to capture long-distance features and achieves 95-98% accuracy.<br />
They publish their dataset <a href="http://www.win.tue.nl/~mpechen/projects/smm/">here</a>, which has 9066 tweets, so it is possible to compare ldig (Language Detection with Infinity-Gram: <a href="https://github.com/shuyo/ldig">site</a>, <a href="http://shuyo.wordpress.com/2012/02/21/language-detection-for-twitter-with-99-1-accuracy/">blog</a>) against their results.</p>
<p>First, their dataset needs to be converted into a format that ldig can read.<br />
Here is a Ruby script that does the conversion.</p>
<pre>
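#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

# Convert the LIGA dataset (one directory per language) into a single file
# with one "lang TAB text" line per input .txt file.
File.open("liga_dataset.txt", "wb:UTF-8") do |out|
  ["de_DE","en_UK","es_ES","fr_FR","it_IT","nl_NL"].each do |dir|
    lang = dir[0,2]  # e.g. "de_DE" becomes "de"
    Dir.glob("#{dir}/*.txt") do |file|
      # Replace control characters (including newlines) with spaces
      # so that each text stays on a single line.
      line = File.open(file, "rb:UTF-8") {|f| f.read}.gsub(/[\u0000-\u001f]/, " ").strip
      out.puts "#{lang}\t#{line}"
    end
  end
end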
</pre>
<p>Here is the evaluation of ldig on the generated dataset, liga_dataset.txt.</p>
<table>
<tr>
<th>lang</th>
<th>size</th>
<th>detect</th>
<th>correct</th>
<th>precision</th>
<th>recall</th>
</tr>
<tr>
<td>de</td>
<td>1479</td>
<td>1469</td>
<td>1463</td>
<td>0.9959</td>
<td>0.9892</td>
</tr>
<tr>
<td>en</td>
<td>1505</td>
<td>1504</td>
<td>1489</td>
<td>0.9900</td>
<td>0.9894</td>
</tr>
<tr>
<td>es</td>
<td>1562</td>
<td>1550</td>
<td>1541</td>
<td>0.9942</td>
<td>0.9866</td>
</tr>
<tr>
<td>fr</td>
<td>1551</td>
<td>1545</td>
<td>1539</td>
<td>0.9961</td>
<td>0.9923</td>
</tr>
<tr>
<td>it</td>
<td>1539</td>
<td>1532</td>
<td>1526</td>
<td>0.9961</td>
<td>0.9916</td>
</tr>
<tr>
<td>nl</td>
<td>1430</td>
<td>1429</td>
<td>1425</td>
<td>0.9972</td>
<td>0.9965</td>
</tr>
<tr>
<td>total</td>
<td>9066</td>
<td> </td>
<td>8983</td>
<td></td>
<td>0.9908</td>
</tr>
</table>
<p>This shows that ldig achieves over 99% accuracy on their dataset.</p>
<h3>Reference</h3>
<ul>
<li>[Erik Tromp and Mykola Pechenizkiy 11] “Graph-Based N-gram Language Identification on Short Texts”</li>
</ul>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/shuyo.wordpress.com/263/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/shuyo.wordpress.com/263/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&#038;blog=4249505&#038;post=263&#038;subd=shuyo&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2012/03/02/estimation-of-ldig-twitter-language-detection-for-liga-dataset/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>Precision and Recall of ldig (twitter language detection)</title>
		<link>http://shuyo.wordpress.com/2012/03/02/precision-and-recall-of-ldig-twitter-language-detection/</link>
		<comments>http://shuyo.wordpress.com/2012/03/02/precision-and-recall-of-ldig-twitter-language-detection/#comments</comments>
		<pubDate>Fri, 02 Mar 2012 06:51:21 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Language Detection]]></category>
		<category><![CDATA[NLP]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=254</guid>
		<description><![CDATA[In the previous article, I introduced ldig (Language Detection with Infinity-Gram: site, blog), which detects tweet languages. There are some requests to tell ldig&#8217;s precision and recall, so I calculated them. lang size detected correct precision recall cs 5329 5330 &#8230; <a href="http://shuyo.wordpress.com/2012/03/02/precision-and-recall-of-ldig-twitter-language-detection/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&#038;blog=4249505&#038;post=254&#038;subd=shuyo&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>In the previous article, I introduced ldig (Language Detection with Infinity-Gram: <a href="https://github.com/shuyo/ldig">site</a>, <a href="http://shuyo.wordpress.com/2012/02/21/language-detection-for-twitter-with-99-1-accuracy/">blog</a>), which detects the language of tweets.<br />
Some readers asked for ldig&#8217;s precision and recall, so I calculated them. </p>
<table>
<tr>
<th>lang</th>
<th>size</th>
<th>detected</th>
<th>correct</th>
<th>precision</th>
<th>recall</th>
</tr>
<tr>
<td align="right">cs</td>
<td align="right">5329</td>
<td align="right">5330</td>
<td align="right">5319</td>
<td align="right">0.9979</td>
<td align="right">0.9981</td>
</tr>
<tr>
<td align="right">da</td>
<td align="right">5478</td>
<td align="right">5483</td>
<td align="right">5311</td>
<td align="right">0.9686</td>
<td align="right">0.9695</td>
</tr>
<tr>
<td align="right">de</td>
<td align="right">10065</td>
<td align="right">10076</td>
<td align="right">10014</td>
<td align="right">0.9938</td>
<td align="right">0.9949</td>
</tr>
<tr>
<td align="right">en</td>
<td align="right">9701</td>
<td align="right">9670</td>
<td align="right">9569</td>
<td align="right">0.9896</td>
<td align="right">0.9864</td>
</tr>
<tr>
<td align="right">es</td>
<td align="right">10066</td>
<td align="right">10075</td>
<td align="right">9989</td>
<td align="right">0.9915</td>
<td align="right">0.9924</td>
</tr>
<tr>
<td align="right">fi</td>
<td align="right">4490</td>
<td align="right">4472</td>
<td align="right">4459</td>
<td align="right">0.9971</td>
<td align="right">0.9931</td>
</tr>
<tr>
<td align="right">fr</td>
<td align="right">10098</td>
<td align="right">10097</td>
<td align="right">10048</td>
<td align="right">0.9951</td>
<td align="right">0.9950</td>
</tr>
<tr>
<td align="right">id</td>
<td align="right">10181</td>
<td align="right">10233</td>
<td align="right">10167</td>
<td align="right">0.9936</td>
<td align="right">0.9986</td>
</tr>
<tr>
<td align="right">it</td>
<td align="right">10150</td>
<td align="right">10191</td>
<td align="right">10109</td>
<td align="right">0.9920</td>
<td align="right">0.9960</td>
</tr>
<tr>
<td align="right">nl</td>
<td align="right">9671</td>
<td align="right">9579</td>
<td align="right">9521</td>
<td align="right">0.9939</td>
<td align="right">0.9845</td>
</tr>
<tr>
<td align="right">no</td>
<td align="right">8560</td>
<td align="right">8442</td>
<td align="right">8219</td>
<td align="right">0.9736</td>
<td align="right">0.9602</td>
</tr>
<tr>
<td align="right">pl</td>
<td align="right">10070</td>
<td align="right">10079</td>
<td align="right">10054</td>
<td align="right">0.9975</td>
<td align="right">0.9984</td>
</tr>
<tr>
<td align="right">pt</td>
<td align="right">9422</td>
<td align="right">9441</td>
<td align="right">9354</td>
<td align="right">0.9908</td>
<td align="right">0.9928</td>
</tr>
<tr>
<td align="right">ro</td>
<td align="right">5914</td>
<td align="right">5831</td>
<td align="right">5822</td>
<td align="right">0.9985</td>
<td align="right">0.9844</td>
</tr>
<tr>
<td align="right">sv</td>
<td align="right">9990</td>
<td align="right">10034</td>
<td align="right">9866</td>
<td align="right">0.9833</td>
<td align="right">0.9876</td>
</tr>
<tr>
<td align="right">tr</td>
<td align="right">10310</td>
<td align="right">10321</td>
<td align="right">10300</td>
<td align="right">0.9980</td>
<td align="right">0.9990</td>
</tr>
<tr>
<td align="right">vi</td>
<td align="right">10494</td>
<td align="right">10486</td>
<td align="right">10479</td>
<td align="right">0.9993</td>
<td align="right">0.9986</td>
</tr>
<tr>
<td align="right">total</td>
<td align="right">149989</td>
<td align="right"> </td>
<td align="right">148600</td>
<td align="right"></td>
<td align="right">0.9907</td>
</tr>
</table>
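<p>The precision and recall columns can be reproduced from the counts in the table: precision is correct / detected and recall is correct / size. A minimal Python sketch follows, using three rows copied from the table; the function name is made up for this example.</p>

```python
# (lang, size, detected, correct) copied from three rows of the table above
ROWS = [
    ("cs", 5329, 5330, 5319),
    ("da", 5478, 5483, 5311),
    ("no", 8560, 8442, 8219),
]


def precision_recall(size, detected, correct):
    """Return (precision, recall) rounded to 4 decimal places."""
    precision = correct / detected  # share of detections that were right
    recall = correct / size         # share of the language's tweets that were found
    return round(precision, 4), round(recall, 4)


for lang, size, detected, correct in ROWS:
    print(lang, *precision_recall(size, detected, correct))
```

<p>The printed values agree with the corresponding rows of the table.</p>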
<p>The total data size is not equal to the total number of detected tweets because ldig outputs &#8220;&#8221; as the language when the maximum probability is lower than 0.6.<br />
Also, the data sizes differ from those in the previous article because the dataset has been updated.</p>
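<p>The rejection rule described above can be sketched in Python as follows. This is only an illustration under the stated 0.6 threshold; the function name and the dictionary input are assumptions, not ldig&#8217;s actual interface.</p>

```python
def pick_language(probabilities, threshold=0.6):
    """Return the most probable language, or "" if it is below the threshold.

    `probabilities` is a hypothetical dict mapping language code to probability.
    """
    if not probabilities:
        return ""
    lang, prob = max(probabilities.items(), key=lambda item: item[1])
    return lang if prob >= threshold else ""


print(repr(pick_language({"da": 0.55, "no": 0.45})))  # ''
print(repr(pick_language({"da": 0.90, "no": 0.10})))  # 'da'
```

<p>Tweets that get &#8220;&#8221; count toward the size of their true language but toward no language&#8217;s detected count, which is why the two totals differ.</p>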
<p>I reckon that differences stop being meaningful once accuracy is over 99%, but what do you think?</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/shuyo.wordpress.com/254/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/shuyo.wordpress.com/254/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&#038;blog=4249505&#038;post=254&#038;subd=shuyo&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2012/03/02/precision-and-recall-of-ldig-twitter-language-detection/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>Language Detection for twitter with 99.1% Accuracy</title>
		<link>http://shuyo.wordpress.com/2012/02/21/language-detection-for-twitter-with-99-1-accuracy/</link>
		<comments>http://shuyo.wordpress.com/2012/02/21/language-detection-for-twitter-with-99-1-accuracy/#comments</comments>
		<pubDate>Tue, 21 Feb 2012 14:23:33 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Language Detection]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=249</guid>
		<description><![CDATA[I released a newer prototype of Language Detection ( Language Identification ) with Infinity-Gram (ldig), a language detection prototype for twitter. https://github.com/shuyo/ldig It detects tweets in 17 languages with 99.1% accuracy (Czech, Danish, German, English, Spanish, Finnish, French, Indonesian, Italian, &#8230; <a href="http://shuyo.wordpress.com/2012/02/21/language-detection-for-twitter-with-99-1-accuracy/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&#038;blog=4249505&#038;post=249&#038;subd=shuyo&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I released a new prototype of Language Detection (Language Identification) with Infinity-Gram (ldig), a language detector for Twitter.</p>
<ul>
<li><a href="https://github.com/shuyo/ldig">https://github.com/shuyo/ldig</a></li>
</ul>
<p>It detects the language of tweets in 17 languages with 99.1% accuracy (Czech, Danish, German, English, Spanish, Finnish, French, Indonesian, Italian, Dutch, Norwegian, Polish, Portuguese, Romanian, Swedish, Turkish and Vietnamese).<br />
ldig is specialized for noisy short text (more than 3 words) and is limited to languages written in the Latin alphabet, because input text can be split into blocks by character type and the Latin-alphabet block is the most difficult to identify.</p>
<p>My language-detection library (langdetect) is not good at short text, so most users seem to have trouble using it to detect the language of tweets.<br />
langdetect uses character 3-grams as features, which is insufficient for short text.<br />
I supposed that maximal substrings [Okanohara+ 09] would make sufficient features for short text detection, and prepared a Twitter corpus in 17 languages.<br />
The training/test corpus sizes and the evaluation of the ldig prototype are shown below.</p>
<table border>
<tr>
<th>lang</th>
<th>training</th>
<th>test</th>
<th>correct</th>
<th>accuracy</th>
</tr>
<tr>
<th>cs</th>
<td align="right">4581</td>
<td align="right">5329</td>
<td align="right">5319</td>
<td align="right">99.81</td>
</tr>
<tr>
<th>da</th>
<td align="right">5480</td>
<td align="right">5476</td>
<td align="right">5308</td>
<td align="right">96.93</td>
</tr>
<tr>
<th>de</th>
<td align="right">43930</td>
<td align="right">9659</td>
<td align="right">9611</td>
<td align="right">99.50</td>
</tr>
<tr>
<th>en</th>
<td align="right">44912</td>
<td align="right">9612</td>
<td align="right">9497</td>
<td align="right">98.80</td>
</tr>
<tr>
<th>es</th>
<td align="right">44921</td>
<td align="right">10127</td>
<td align="right">10050</td>
<td align="right">99.24</td>
</tr>
<tr>
<th>fi</th>
<td align="right">4576</td>
<td align="right">4490</td>
<td align="right">4464</td>
<td align="right">99.42</td>
</tr>
<tr>
<th>fr</th>
<td align="right">44142</td>
<td align="right">10063</td>
<td align="right">10014</td>
<td align="right">99.51</td>
</tr>
<tr>
<th>id</th>
<td align="right">44873</td>
<td align="right">10183</td>
<td align="right">10163</td>
<td align="right">99.80</td>
</tr>
<tr>
<th>it</th>
<td align="right">44045</td>
<td align="right">10152</td>
<td align="right">10110</td>
<td align="right">99.59</td>
</tr>
<tr>
<th>nl</th>
<td align="right">44933</td>
<td align="right">9677</td>
<td align="right">9532</td>
<td align="right">98.50</td>
</tr>
<tr>
<th>no</th>
<td align="right">7525</td>
<td align="right">8513</td>
<td align="right">8192</td>
<td align="right">96.23</td>
</tr>
<tr>
<th>pl</th>
<td align="right">12854</td>
<td align="right">10070</td>
<td align="right">10059</td>
<td align="right">99.89</td>
</tr>
<tr>
<th>pt</th>
<td align="right">44464</td>
<td align="right">9459</td>
<td align="right">9359</td>
<td align="right">98.94</td>
</tr>
<tr>
<th>ro</th>
<td align="right">6114</td>
<td align="right">5902</td>
<td align="right">5812</td>
<td align="right">98.48</td>
</tr>
<tr>
<th>sv</th>
<td align="right">44339</td>
<td align="right">9952</td>
<td align="right">9870</td>
<td align="right">99.18</td>
</tr>
<tr>
<th>tr</th>
<td align="right">44787</td>
<td align="right">10309</td>
<td align="right">10301</td>
<td align="right">99.92</td>
</tr>
<tr>
<th>vi</th>
<td align="right">10413</td>
<td align="right">10494</td>
<td align="right">10481</td>
<td align="right">99.88</td>
</tr>
<tr>
<th>total</th>
<td align="right">496889</td>
<td align="right">149467</td>
<td align="right">148142</td>
<td align="right">99.11</td>
</tr>
</table>
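<p>To see why character 3-grams give a detector little to work with on short text, as noted before the table, the following Python sketch extracts them from a three-word string. It is a generic n-gram helper written for this example and does not reproduce langdetect&#8217;s actual feature extraction.</p>

```python
def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


tweet = "hva er det"
grams = char_ngrams(tweet)
print(len(grams))  # 8
print(grams[:3])   # ['hva', 'va ', 'a e']
```

<p>A 10-character text yields only 8 features of fixed length 3, while substring features of arbitrary length can also cover whole words such as &#8216;hva&#8217; together with their context.</p>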
<p>I&#8217;m preparing a Catalan corpus with some help (THANKS!), so the number of supported languages will increase soon.</p>
<p>ldig currently writes out its model in numpy binary format, but I will change it to a more portable format such as MessagePack, so that the ldig detector can probably be ported to other platforms easily.</p>
<p>The presentation on ldig is here (but it is written in Japanese! <img src='http://s0.wp.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' />  )</p>
<ul>
<li>
<a href="http://www.slideshare.net/shuyo/gram-10286133">Language Detection with Infinity-Gram</a> (in Japanese)
</li>
</ul>
<p>I&#8217;ll present the paper at <a href="http://www.anlp.jp/nlp2012/">the annual conference of The Association for Natural Language Processing in Japan (NLP2012)</a>.<br />
I&#8217;ll publish the paper on this blog after the conference (but it&#8217;s in Japanese! <img src='http://s2.wp.com/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> ).</p>
<h3>P.S.</h3>
<p>I published slides about Twitter language detection in English.</p>
<div style="width:425px;" id="__ss_12949447"> <strong><a href="http://www.slideshare.net/shuyo/short-text-language-detection-with-infinitygram-12949447" title="Short Text Language Detection with Infinity-Gram" target="_blank">Short Text Language Detection with Infinity-Gram</a></strong> <iframe src='http://www.slideshare.net/slideshow/embed_code/12949447' width='425' height='348' scrolling='no'></iframe>
<div style="padding:5px 0 12px;"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/shuyo" target="_blank">Shuyo Nakatani</a> </div>
</p></div>
<h3>Reference</h3>
<ul>
<li>[Daisuke Okanohara and Jun'ichi Tsujii 09] “Text Categorization with All Substring Features”</li>
</ul>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/shuyo.wordpress.com/249/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/shuyo.wordpress.com/249/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&#038;blog=4249505&#038;post=249&#038;subd=shuyo&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2012/02/21/language-detection-for-twitter-with-99-1-accuracy/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>Repository Migration from subversion into git on Google Code Project</title>
		<link>http://shuyo.wordpress.com/2012/01/16/repository-migration-from-subversion-into-git-on-google-code-project/</link>
		<comments>http://shuyo.wordpress.com/2012/01/16/repository-migration-from-subversion-into-git-on-google-code-project/#comments</comments>
		<pubDate>Mon, 16 Jan 2012 12:57:16 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Development]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=245</guid>
		<description><![CDATA[I migrated language-detection&#8217;s repository on Google Code Project from subversion into git. It is because its directory layout must be changed much for maven-support . (I HATE the branching of subversion! ) Google Code Project supports subversion, git and Mercurial &#8230; <a href="http://shuyo.wordpress.com/2012/01/16/repository-migration-from-subversion-into-git-on-google-code-project/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&#038;blog=4249505&#038;post=245&#038;subd=shuyo&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I migrated language-detection&#8217;s repository on Google Code Project from subversion to git.<br />
I did this because its directory layout had to change a lot for Maven support. (I HATE the branching of subversion! <img src='http://s0.wp.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> )</p>
<p>Google Code Project supports subversion, git and Mercurial as version control systems. Each repository is independent and exclusive, so migration between them has to be done manually.<br />
So I wrote down the migration process from subversion to git step by step.</p>
<p>1. Prepare git and git-svn</p>
<p>I used them on Cygwin. But I reckon it is easier on Linux <img src='http://s0.wp.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
<p>2. Migrate into local git repositories</p>
<p>First, migrate the subversion repository into local git repositories (because the repositories are exclusive).<br />
The subversion repository of Google Code Project holds the history of not only the code (trunk/branch/tag) but also the wiki. With git, the code and the wiki are stored in separate repositories.<br />
So you need to migrate 2 repositories if you didn&#8217;t use branches/tags.</p>
<p>Execute the commands below in an appropriate directory.</p>
<pre>
$ cd your-working-directory
$ git svn clone -s http://your-project-name.googlecode.com/svn/ 
$ git svn clone http://your-project-name.googlecode.com/svn/wiki/
</pre>
<p>Replace &#8220;your-working-directory&#8221; and &#8220;your-project-name&#8221; with your own names. Note that the second command does not have the -s option (-s assumes the standard trunk/branches/tags layout, which the wiki directory does not have).<br />
If you use branches and tags, you might need several extra steps to recreate the tags in the git repository, but I didn&#8217;t look into how because I didn&#8217;t use them <img src='http://s2.wp.com/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> </p>
<p>If you run into an error like the one below with git-svn on Cygwin (it may occur often on Windows 7 x64???), </p>
<pre>
10192702 [main] perl 9904 C:\cygwin\bin\perl.exe: *** fatal error - unable to re 
map C:\cygwin\bin\cygsvn_client-1-0.dll to same address as parent: 0xAB0000 != 0xAE0000
</pre>
<p>then execute rebaseall from cmd.exe as shown below.</p>
<pre>
$ cd c:\cygwin\bin
$ ash rebaseall -v
</pre>
<p>3. Switch the project&#8217;s VCS into git</p>
<p>You can switch to git on the source tab of the project&#8217;s administration console.<br />
This operation does not affect the subversion repository; it only restricts access to the selected system. So you can switch back to subversion at any time.</p>
<p>4. Push the local repositories into new ones</p>
<p>Execute the commands below.</p>
<pre>
$ cd svn
$ git push https://code.google.com/p/your-project-name/ master
$ cd ..
$ cd wiki 
$ git push https://code.google.com/p/your-project-name.wiki/ master
</pre>
<p>Then the git repositories will be available. You should check that the project&#8217;s wiki looks the same as before.<br />
I like that the code and wiki repositories are separated.</p>
<p>Reference:<br />
- <a href="http://d.hatena.ne.jp/kkobayashi_a/20090615/p1" rel="nofollow">http://d.hatena.ne.jp/kkobayashi_a/20090615/p1</a> (in Japanese)</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/shuyo.wordpress.com/245/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/shuyo.wordpress.com/245/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&#038;blog=4249505&#038;post=245&#038;subd=shuyo&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2012/01/16/repository-migration-from-subversion-into-git-on-google-code-project/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
	</channel>
</rss>
