<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Shuyo&#039;s Weblog</title>
	<atom:link href="http://shuyo.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://shuyo.wordpress.com</link>
	<description>About my favorite technical subjects</description>
	<lastBuildDate>Thu, 26 Jan 2012 02:06:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='shuyo.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Shuyo&#039;s Weblog</title>
		<link>http://shuyo.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://shuyo.wordpress.com/osd.xml" title="Shuyo&#039;s Weblog" />
	<atom:link rel='hub' href='http://shuyo.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Repository Migration from subversion into git on Google Code Project</title>
		<link>http://shuyo.wordpress.com/2012/01/16/repository-migration-from-subversion-into-git-on-google-code-project/</link>
		<comments>http://shuyo.wordpress.com/2012/01/16/repository-migration-from-subversion-into-git-on-google-code-project/#comments</comments>
		<pubDate>Mon, 16 Jan 2012 12:57:16 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Development]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=245</guid>
		<description><![CDATA[I migrated the language-detection repository on Google Code from Subversion to git. The reason is that its directory layout has to change substantially for Maven support. (I HATE branching in Subversion! ) Google Code supports Subversion, git and Mercurial &#8230; <a href="http://shuyo.wordpress.com/2012/01/16/repository-migration-from-subversion-into-git-on-google-code-project/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I migrated the language-detection repository on Google Code from Subversion to git.<br />
The reason is that its directory layout has to change substantially for Maven support. (I HATE branching in Subversion! <img src='http://s0.wp.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> )</p>
<p>Google Code supports Subversion, git and Mercurial as version control systems. A project's repositories are independent and mutually exclusive, so migrating between them has to be done manually.<br />
So I wrote down the migration from Subversion to git step by step.</p>
<p>1. Prepare git and git-svn</p>
<p>I used them on Cygwin. But I reckon it is easier on Linux <img src='http://s0.wp.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
<p>2. Migrate to local git repositories</p>
<p>First, migrate the Subversion repository into local git repositories (this is needed because the project can expose only one version control system at a time).<br />
A Subversion repository on Google Code holds the history of not only the code (trunk/branches/tags) but also the wiki. With git, the code and the wiki are stored in separate repositories.<br />
So you need to migrate 2 repositories if you didn&#8217;t use branches or tags.</p>
<p>Run the commands below in a suitable working directory.</p>
<pre>
$ cd your-working-directory
$ git svn clone -s http://your-project-name.googlecode.com/svn/
$ git svn clone http://your-project-name.googlecode.com/svn/wiki/
</pre>
<p>Replace &#8220;your-working-directory&#8221; and &#8220;your-project-name&#8221; with your own names. Note that the second command must not have the -s option (-s tells git-svn to assume the standard trunk/branches/tags layout, which the wiki directory does not have).<br />
If you use branches and tags, you might need several extra steps for tagging in the git repository, but I didn&#8217;t look into how to do that because I didn&#8217;t use them <img src='http://s2.wp.com/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> </p>
<p>If you run into an error like the one below with git-svn on Cygwin (it seems to occur often on Windows 7 x64?), </p>
<pre>
10192702 [main] perl 9904 C:\cygwin\bin\perl.exe: *** fatal error - unable to re
map C:\cygwin\bin\cygsvn_client-1-0.dll to same address as parent: 0xAB0000 != 0xAE0000
</pre>
<p>then run rebaseall from cmd.exe as follows.</p>
<pre>
$ cd c:\cygwin\bin
$ ash rebaseall -v
</pre>
<p>3. Switch the project&#8217;s VCS to git</p>
<p>You can switch to git on the Source tab of the project&#8217;s administration console.<br />
This operation does not affect the Subversion repository itself; it only restricts access to the selected system. So you can switch back to Subversion at any time.</p>
<p>4. Push the local repositories to the new ones</p>
<p>Run the commands below.</p>
<pre>
$ cd svn
$ git push https://code.google.com/p/your-project-name/ master
$ cd ..
$ cd wiki
$ git push https://code.google.com/p/your-project-name.wiki/ master
</pre>
<p>The git repositories are then available. You should check that the project&#8217;s wiki looks the same as before.<br />
I like that the code and wiki repositories are separated.</p>
<p>Reference:<br />
- http://d.hatena.ne.jp/kkobayashi_a/20090615/p1 (in Japanese)</p>
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2012/01/16/repository-migration-from-subversion-into-git-on-google-code-project/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>Whether language profile should be bundled or not?</title>
		<link>http://shuyo.wordpress.com/2011/12/15/whether-language-profile-should-be-bundled-or-not/</link>
		<comments>http://shuyo.wordpress.com/2011/12/15/whether-language-profile-should-be-bundled-or-not/#comments</comments>
		<pubDate>Thu, 15 Dec 2011 08:47:53 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Language Detection]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=239</guid>
		<description><![CDATA[I&#8217;m going to add Maven support to language-detection, but I&#8217;m having some trouble with the language profiles&#8230; language-detection keeps its language profiles separate from the jar file because there are so many of them (the jar without profiles is only around 100KB, while the jar with them is &#8230; <a href="http://shuyo.wordpress.com/2011/12/15/whether-language-profile-should-be-bundled-or-not/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m going to add Maven support to language-detection, but I&#8217;m having some trouble with the language profiles&#8230;</p>
<p>language-detection keeps its language profiles separate from the jar file because there are so many of them (the jar without profiles is only around 100KB, while the jar with them is over 1MB).<br />
However, there are some reasonable requests to bundle them in the jar file.</p>
<ul>
<li>With separate profiles, the application has to know the directory where they are installed.</li>
<li>Hadoop can&#8217;t distribute language profiles that are outside the jar file.</li>
<li>I want to register the library in Maven Central, and a profile-bundled jar is more useful there.</li>
</ul>
<p>Also, a user told me how to generate a jar file with selected profiles using a Maven template, so a developer who wants a slim library can generate a jar file with only the necessary profiles (or even none!).</p>
<p>So I&#8217;m beginning to think that the jar file packaged with the language-detection library could bundle all language profiles. What do you think?</p>
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2011/12/15/whether-language-profile-should-be-bundled-or-not/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>language-detection supported 17 language profiles for short messages</title>
		<link>http://shuyo.wordpress.com/2011/11/28/language-detection-supported-17-language-profiles-for-short-messages/</link>
		<comments>http://shuyo.wordpress.com/2011/11/28/language-detection-supported-17-language-profiles-for-short-messages/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 10:49:51 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Language Detection]]></category>
		<category><![CDATA[NLP]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=221</guid>
		<description><![CDATA[language-detection ( http://code.google.com/p/language-detection/ , langdetect) now supports 17 new language profiles generated from a Twitter corpus. They are published in the trunk of the langdetect repository (and will be packaged sooner or later). http://code.google.com/p/language-detection/source/browse/#svn%2Ftrunk%2Fprofiles.sm The 17 languages are as follows. cs : Czech da &#8230; <a href="http://shuyo.wordpress.com/2011/11/28/language-detection-supported-17-language-profiles-for-short-messages/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>language-detection ( http://code.google.com/p/language-detection/ , langdetect) now supports 17 new language profiles generated from a Twitter corpus.<br />
They are published in the trunk of the langdetect repository (and will be packaged sooner or later).</p>
<ul>
<li>
<a href="http://code.google.com/p/language-detection/source/browse/#svn%2Ftrunk%2Fprofiles.sm">http://code.google.com/p/language-detection/source/browse/#svn%2Ftrunk%2Fprofiles.sm</a>
</li>
</ul>
<p>The 17 languages are as follows.</p>
<ul>
<li>cs : Czech</li>
<li>da : Danish</li>
<li>de : German</li>
<li>en : English</li>
<li>es : Spanish</li>
<li>fi : Finnish</li>
<li>fr : French</li>
<li>id : Indonesian</li>
<li>it : Italian</li>
<li>nl : Dutch</li>
<li>no : Norwegian</li>
<li>pl : Polish</li>
<li>pt : Portuguese</li>
<li>ro : Romanian</li>
<li>sv : Swedish</li>
<li>tr : Turkish</li>
<li>vi : Vietnamese</li>
</ul>
<p>For short-message detection (tweets and so on), these profiles score about 2 points higher in accuracy than the bundled profiles generated from Wikipedia abstracts (97.6% versus 95.4% in total).</p>
<table>
<tr>
<th rowspan="2">language</th>
<th rowspan="2">test size</th>
<th colspan="2">original profiles</th>
<th colspan="2">new profiles</th>
</tr>
<tr>
<th>correct</th>
<th>accuracy</th>
<th>correct</th>
<th>accuracy</th>
</tr>
<tr>
<td>cs</td>
<td align="right">4269</td>
<td align="right">4261</td>
<td align="right">100.0%</td>
<td align="right">4258</td>
<td align="right">100.0%</td>
</tr>
<tr>
<td>da</td>
<td align="right">5484</td>
<td align="right">5255</td>
<td align="right">96.0%</td>
<td align="right">5188</td>
<td align="right">95.0%</td>
</tr>
<tr>
<td>de</td>
<td align="right">9608</td>
<td align="right">8495</td>
<td align="right">88.0%</td>
<td align="right">9020</td>
<td align="right">94.0%</td>
</tr>
<tr>
<td>en</td>
<td align="right">9630</td>
<td align="right">8796</td>
<td align="right">91.0%</td>
<td align="right">9188</td>
<td align="right">95.0%</td>
</tr>
<tr>
<td>es</td>
<td align="right">10133</td>
<td align="right">9721</td>
<td align="right">96.0%</td>
<td align="right">9943</td>
<td align="right">98.0%</td>
</tr>
<tr>
<td>fi</td>
<td align="right">2241</td>
<td align="right">2238</td>
<td align="right">100.0%</td>
<td align="right">2236</td>
<td align="right">100.0%</td>
</tr>
<tr>
<td>fr</td>
<td align="right">10067</td>
<td align="right">9719</td>
<td align="right">97.0%</td>
<td align="right">9906</td>
<td align="right">98.0%</td>
</tr>
<tr>
<td>id</td>
<td align="right">10184</td>
<td align="right">9869</td>
<td align="right">97.0%</td>
<td align="right">10061</td>
<td align="right">99.0%</td>
</tr>
<tr>
<td>it</td>
<td align="right">10167</td>
<td align="right">9844</td>
<td align="right">97.0%</td>
<td align="right">9960</td>
<td align="right">98.0%</td>
</tr>
<tr>
<td>nl</td>
<td align="right">9680</td>
<td align="right">8449</td>
<td align="right">87.0%</td>
<td align="right">9399</td>
<td align="right">97.0%</td>
</tr>
<tr>
<td>no</td>
<td align="right">10505</td>
<td align="right">10148</td>
<td align="right">97.0%</td>
<td align="right">10015</td>
<td align="right">95.0%</td>
</tr>
<tr>
<td>pl</td>
<td align="right">9886</td>
<td align="right">9833</td>
<td align="right">99.0%</td>
<td align="right">9852</td>
<td align="right">100.0%</td>
</tr>
<tr>
<td>pt</td>
<td align="right">9456</td>
<td align="right">8720</td>
<td align="right">92.0%</td>
<td align="right">9170</td>
<td align="right">97.0%</td>
</tr>
<tr>
<td>ro</td>
<td align="right">4057</td>
<td align="right">3791</td>
<td align="right">93.0%</td>
<td align="right">3993</td>
<td align="right">98.0%</td>
</tr>
<tr>
<td>sv</td>
<td align="right">9932</td>
<td align="right">9670</td>
<td align="right">97.0%</td>
<td align="right">9762</td>
<td align="right">98.0%</td>
</tr>
<tr>
<td>tr</td>
<td align="right">10309</td>
<td align="right">10145</td>
<td align="right">98.0%</td>
<td align="right">10251</td>
<td align="right">99.0%</td>
</tr>
<tr>
<td>vi</td>
<td align="right">10932</td>
<td align="right">10832</td>
<td align="right">99.0%</td>
<td align="right">10832</td>
<td align="right">99.0%</td>
</tr>
<tr>
<td>total</td>
<td align="right">146540</td>
<td align="right">139786</td>
<td align="right">95.4%</td>
<td align="right">143034</td>
<td align="right">97.6%</td>
</tr>
</table>
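<p>As a quick check of the totals row: accuracy here is simply the number of correct detections divided by the test size. The following Python sketch recomputes the two overall accuracies and their difference from the totals in the table above. The variable and function names are mine, chosen only for illustration.</p>

```python
# Totals copied from the last row of the table above.
TEST_SIZE = 146540
CORRECT_ORIGINAL = 139786  # bundled profiles generated from Wikipedia abstracts
CORRECT_NEW = 143034       # new profiles generated from the Twitter corpus


def accuracy_percent(correct, total):
    """Return the accuracy as a percentage rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


original = accuracy_percent(CORRECT_ORIGINAL, TEST_SIZE)
new = accuracy_percent(CORRECT_NEW, TEST_SIZE)
print(original, new, round(new - original, 1))
```

<p>The same function can be applied to any single row, for example the Dutch (nl) row, where the gain is the largest.</p>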
<p>The Twitter corpora (training and test) used to generate the profiles are based on tweets collected via the &#8216;sample&#8217; method of the Twitter Streaming API, all of which I annotated myself.<br />
The corpus sizes are as follows.</p>
<table>
<tr>
<th>language</th>
<th>training</th>
<th>test</th>
</tr>
<tr>
<td>cs : Czech</td>
<td align="right">3514</td>
<td align="right">4342</td>
</tr>
<tr>
<td>da : Danish</td>
<td align="right">4007</td>
<td align="right">5645</td>
</tr>
<tr>
<td>de : German</td>
<td align="right">44115</td>
<td align="right">9998</td>
</tr>
<tr>
<td>en : English</td>
<td align="right">44335</td>
<td align="right">10167</td>
</tr>
<tr>
<td>es : Spanish</td>
<td align="right">44976</td>
<td align="right">10296</td>
</tr>
<tr>
<td>fi : Finnish</td>
<td align="right">3400</td>
<td align="right">2310</td>
</tr>
<tr>
<td>fr : French</td>
<td align="right">44279</td>
<td align="right">10410</td>
</tr>
<tr>
<td>id : Indonesian</td>
<td align="right">44912</td>
<td align="right">10395</td>
</tr>
<tr>
<td>it : Italian</td>
<td align="right">44183</td>
<td align="right">10562</td>
</tr>
<tr>
<td>nl : Dutch</td>
<td align="right">45033</td>
<td align="right">10109</td>
</tr>
<tr>
<td>no : Norwegian</td>
<td align="right">4272</td>
<td align="right">10721</td>
</tr>
<tr>
<td>pl : Polish</td>
<td align="right">5040</td>
<td align="right">10282</td>
</tr>
<tr>
<td>pt : Portuguese</td>
<td align="right">44505</td>
<td align="right">9749</td>
</tr>
<tr>
<td>ro : Romanian</td>
<td align="right">3490</td>
<td align="right">4151</td>
</tr>
<tr>
<td>sv : Swedish</td>
<td align="right">44330</td>
<td align="right">10232</td>
</tr>
<tr>
<td>tr : Turkish</td>
<td align="right">45024</td>
<td align="right">10828</td>
</tr>
<tr>
<td>vi : Vietnamese</td>
<td align="right">5029</td>
<td align="right">11065</td>
</tr>
</table>
<p>These corpora were originally built for a new short-text detector, but I confirmed that, used as profiles for langdetect, they detect short messages better than the bundled profiles. So I published the new profiles too.<br />
When using these profiles, convert the text to lower case before detection. langdetect tends to discard all-upper-case words as acronyms, whereas Twitter-like short messages are often written entirely in upper case for emphasis.</p>
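<p>The lower-casing step can be as small as one call. Here is a minimal Python sketch, assuming the text is already a Unicode string. The helper name is hypothetical and not part of langdetect, and the Java library itself is not called here.</p>

```python
def normalize_for_short_message_profiles(text):
    """Lower-case the text so that all-caps words are not treated as acronyms."""
    return text.lower()


tweet = "I LOVE THIS SONG so much"
print(normalize_for_short_message_profiles(tweet))
```

<p>Note that Python&#8217;s str.lower() is not locale-aware, so the Turkish capital I is mapped to the dotted i; whether that matters for these profiles is something I have not checked.</p>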
<p>Of course, the existing profiles are still bundled as before. They have higher accuracy for news text and so on! <img src='http://s0.wp.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
<p>A prototype of the language detector for short texts is published at https://github.com/shuyo/ldig .<br />
The name is short for &#8220;Language Detection with Infinity-Gram&#8221;.<br />
I made a presentation about ldig, but so far it is in Japanese only. I&#8217;ll translate it into English later&#8230;</p>
<ul>
<li><a href="http://www.slideshare.net/shuyo/gram-10286133">The presentation of &#8220;Language Detection with Infinity-Gram&#8221;</a> (in Japanese)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2011/11/28/language-detection-supported-17-language-profiles-for-short-messages/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>langdetect is updated(added profiles of Estonian / Lithuanian / Latvian / Slovene, and so on)</title>
		<link>http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/</link>
		<comments>http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/#comments</comments>
		<pubDate>Thu, 29 Sep 2011 03:36:32 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=212</guid>
		<description><![CDATA[My language detection library &#8220;langdetect&#8221; was updated. http://code.google.com/p/language-detection/ The added features are the following: Added 4 language profiles: Estonian, Lithuanian, Latvian and Slovene. Added getLangList() to retrieve the list of loaded language profiles. Added support for generating a language profile from &#8230; <a href="http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>My language detection library &#8220;langdetect&#8221; was updated.</p>
<ul>
<li><a href="http://code.google.com/p/language-detection/">http://code.google.com/p/language-detection/</a></li>
</ul>
<p>The added features are the following:</p>
<ul>
<li>Added 4 language profiles: Estonian, Lithuanian, Latvian and Slovene.</li>
<li>Added getLangList() to retrieve the list of loaded language profiles.</li>
<li>Added support for generating a language profile from plain text.</li>
</ul>
<p>and fixed some bugs.</p>
<p>I also published a test data set for 21 languages based on the Europarl Parallel Corpus ( http://www.statmt.org/europarl/ ) so that anyone can verify the library under the same conditions.</p>
<ul>
<li><a href="http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip">http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip</a></li>
</ul>
<p>It consists of 1000 sentences (lines) randomly sampled for each language from the Europarl corpus.<br />
Each line has the form &#8220;[language code]\t[plain text in UTF-8]&#8221; so that it can be fed to the batch test tool of langdetect.<br />
The Europarl corpus contains some very short sentences (e.g. only 2 words!) that langdetect is not very good at. I left them in for fairness! <img src='http://s0.wp.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
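<p>For readers who want to process the test file outside of langdetect, the line format is easy to parse. The following Python sketch splits one line into its language code and text. The function name and the sample line are made up for illustration and are not taken from the data set.</p>

```python
def parse_test_line(line):
    """Split one 'language code, TAB, text' line into (language code, text)."""
    # Split on the first tab only, in case the text itself contains tabs.
    lang, text = line.rstrip("\n").split("\t", 1)
    return lang, text


sample = "en\tThis is a made-up sample sentence.\n"
print(parse_test_line(sample))
```

<p>When reading the real file, open it with encoding="utf-8" so that the text is decoded as the format specifies.</p>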
<p>The following is the result with this test data.</p>
<pre>
bg (985/1000=0.99): {bg=985, ru=4, mk=11}
cs (993/1000=0.99): {sk=6, en=1, cs=993}
da (971/1000=0.97): {da=971, no=28, en=1}
de (998/1000=1.00): {de=998, da=1, af=1}
el (1000/1000=1.00): {el=1000}
en (997/1000=1.00): {fr=1, en=997, nl=1, af=1}
es (995/1000=1.00): {pt=4, en=1, es=995}
et (996/1000=1.00): {de=1, fi=2, et=996, af=1}
fi (998/1000=1.00): {fi=998, et=2}
fr (998/1000=1.00): {it=1, sv=1, fr=998}
hu (999/1000=1.00): {id=1, hu=999}
it (998/1000=1.00): {it=998, es=2}
lt (998/1000=1.00): {lv=2, lt=998}
lv (999/1000=1.00): {pt=1, lv=999}
nl (977/1000=0.98): {de=1, sv=1, nl=977, af=21}
pl (999/1000=1.00): {pl=999, nl=1}
pt (994/1000=0.99): {it=1, hu=1, pt=994, en=1, es=3}
ro (999/1000=1.00): {ro=999, fr=1}
sk (987/1000=0.99): {sl=2, sk=987, ro=1, lt=1, et=1, cs=8}
sl (972/1000=0.97): {hr=27, sl=972, en=1}
sv (990/1000=0.99): {da=2, no=8, sv=990}
total: 20843/21000 = 0.993
</pre>
<p>This was obtained with the batchtest tool.</p>
<pre>
java -jar lib/langdetect.jar -d profiles -s 0 --batchtest europarl.test
</pre>
<p>The random seed (-s) is set to 0 for reproducibility.</p>
<p>Since Danish (da), Dutch (nl) and Slovene (sl) are very similar to Norwegian (no), Afrikaans (af) and Croatian (hr) respectively, their detection accuracies are lower than the others.<br />
(Norwegian, however, has some letters of its own that are not used in Dutch, so its accuracy stays higher.)</p>
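<p>To see which language a given language is most often confused with, the result lines shown above can be parsed mechanically. This is a rough Python sketch. The regular expression is my own reading of the output format printed above, not something provided by the batchtest tool, and the function name is hypothetical.</p>

```python
import re

# Matches lines such as: nl (977/1000=0.98): {de=1, sv=1, nl=977, af=21}
LINE_PATTERN = re.compile(r"^(\w+) \((\d+)/(\d+)=[\d.]+\): \{(.*)\}$")


def parse_result_line(line):
    """Parse one result line into (language, correct, total, detected counts)."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unexpected line format: " + line)
    lang, correct, total, body = match.groups()
    counts = {}
    for pair in body.split(", "):
        detected, count = pair.split("=")
        counts[detected] = int(count)
    return lang, int(correct), int(total), counts


lang, correct, total, counts = parse_result_line(
    "nl (977/1000=0.98): {de=1, sv=1, nl=977, af=21}"
)
# The most frequent wrong answer shows which language nl is confused with.
confused_with = max((c, d) for d, c in counts.items() if d != lang)[1]
print(lang, correct / total, confused_with)
```

<p>For the Dutch line this picks out Afrikaans, which matches the observation above.</p>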
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>Twitter replaces the string &#8216;\u2028&#8217; with &#8216;\u2070&#8217;</title>
		<link>http://shuyo.wordpress.com/2011/08/31/twitter-replaces-a-string-u2028-into-u2070/</link>
		<comments>http://shuyo.wordpress.com/2011/08/31/twitter-replaces-a-string-u2028-into-u2070/#comments</comments>
		<pubDate>Wed, 31 Aug 2011 08:03:50 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=208</guid>
		<description><![CDATA[I posted a tweet about Unicode&#8217;s line break characters that included the string &#8216;\u2028&#8217;. It was then replaced with &#8216;\u2070&#8217;! Since this happens not only to &#8216;\u2028&#8217; (LINE SEPARATOR) but also to &#8216;\u2029&#8217; (PARAGRAPH SEPARATOR), Twitter apparently intends to do something (awful? ) with line break &#8230; <a href="http://shuyo.wordpress.com/2011/08/31/twitter-replaces-a-string-u2028-into-u2070/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I posted a tweet about Unicode&#8217;s line break characters that included the string &#8216;\u2028&#8217;.<br />
It was then replaced with &#8216;\u2070&#8217;!</p>
<p>Since this happens not only to &#8216;\u2028&#8217; (LINE SEPARATOR) but also to &#8216;\u2029&#8217; (PARAGRAPH SEPARATOR), Twitter apparently intends to do something (awful? <img src='http://s2.wp.com/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> ) with line break characters.<br />
But the &#8216;\u2028&#8217; I typed is merely a string of ASCII characters&#8230;</p>
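<p>The difference between the escaped text and the actual character can be shown in a few lines of Python. This is only an illustration of the distinction. It says nothing about how Twitter processes tweets internally, and the variable names are mine.</p>

```python
import unicodedata

escaped = r"\u2028"  # six ASCII characters: backslash, u, 2, 0, 2, 8
actual = "\u2028"    # the single character U+2028

print(len(escaped), escaped.isascii())
print(len(actual), actual.isascii(), unicodedata.name(actual))
# U+2070, the code point named in the replaced text, is a different character.
print(unicodedata.name("\u2070"))
```

<p>Only the second value is a real line break character; the first is plain text that merely looks like an escape sequence.</p>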
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2011/08/31/twitter-replaces-a-string-u2028-into-u2070/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (3)</title>
		<link>http://shuyo.wordpress.com/2011/06/27/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-3/</link>
		<comments>http://shuyo.wordpress.com/2011/06/27/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-3/#comments</comments>
		<pubDate>Mon, 27 Jun 2011 10:56:34 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[LDA]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=204</guid>
		<description><![CDATA[In the previous article, I introduced a simple implementation of collapsed Gibbs sampling estimation for Latent Dirichlet Allocation (LDA). However, because each word topic z_mn is initialized to a random topic in that implementation, there are some problems. First, it needs &#8230; <a href="http://shuyo.wordpress.com/2011/06/27/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-3/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&amp;blog=4249505&amp;post=204&amp;subd=shuyo&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://shuyo.wordpress.com/2011/05/31/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-2/">In the previous article</a>, I introduced a simple implementation of collapsed Gibbs sampling estimation for Latent Dirichlet Allocation (LDA).<br />
However, because each word topic z_mn is initialized to a random topic in that implementation, there are some problems.</p>
<p>First, it needs many iterations before the perplexity begins to decrease.<br />
Second, by the time the perplexity converges, almost all topics assign high probabilities to stopwords like &#8216;the&#8217; and &#8216;of&#8217;.<br />
Moreover, some groups of topics share similar word distributions, so the number of substantially distinct topics is smaller than the topic-count parameter K.</p>
<p>Instead of initializing randomly, draw each initial topic from the posterior below, updating the counters incrementally after each word, in the same way as when sampling a new topic during inference.</p>
<p><img src='http://chart.apis.google.com/chart?cht=tx&amp;chl=P(z_{mn}=z|\bf{z}^{-mn},\bf{w})\;\prop\;(n_{mz}^{-mn}%2B\alpha)\cdot\frac{n_{tz}^{-mn}%2B\beta}{n_{z}^{-mn}%2BV\beta},' /></p>
<p>Furthermore, since n_zt is only ever used with beta added, n_mz with alpha added, and n_z with V * beta added, it is more efficient to add the corresponding parameters to these variables beforehand.</p>
<p>Sample initialization code then looks like this:</p>
<pre>
import numpy

# docs : list of documents, each an array of word ids
# K : number of topics
# V : vocabulary size
# alpha, beta : Dirichlet hyperparameters

z_m_n = [] # topics of words of documents
n_m_z = numpy.zeros((len(docs), K)) + alpha  # word count of each document and topic (plus alpha)
n_z_t = numpy.zeros((K, V)) + beta # word count of each topic and vocabulary (plus beta)
n_z = numpy.zeros(K) + V * beta    # word count of each topic (plus V * beta)

for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        # draw from the posterior
        p_z = n_z_t[:, t] * n_m_z[m] / n_z
        z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))
</pre>
<p>and sample inference code looks like this:</p>
<pre>
for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # discount for n-th word t with topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1

        # sampling new topic
        p_z = n_z_t[:, t] * n_m_z[m] / n_z  # the only line that changed from the previous version
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        # preserve the new topic and increase the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1
</pre>
<p>(The inference code is the same as the previous version except that the posterior calculation is simpler.)</p>
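<p>The simplification can be checked numerically. The following is a small sketch with made-up sizes and counts (K, V, alpha, beta and the random counts are arbitrary assumptions, not values from my experiments) showing that the pre-smoothed counters give the same unnormalized posterior as the explicit formula.</p>

```python
import numpy

K, V = 3, 5              # toy sizes, chosen arbitrarily
alpha, beta = 0.5, 0.1   # toy hyperparameters

rng = numpy.random.RandomState(0)
# raw counts for all topics and for one document m (hypothetical data)
raw_n_z_t = rng.randint(0, 4, size=(K, V)).astype(float)
raw_n_z = raw_n_z_t.sum(axis=1)
raw_n_m_z = rng.randint(0, 4, size=K).astype(float)

# counters with the parameters added beforehand, as in the code above
n_z_t = raw_n_z_t + beta
n_z = raw_n_z + V * beta
n_m_z = raw_n_m_z + alpha

t = 2  # an arbitrary word id
explicit = (raw_n_z_t[:, t] + beta) * (raw_n_m_z + alpha) / (raw_n_z + V * beta)
simplified = n_z_t[:, t] * n_m_z / n_z

# True when both forms agree for every topic
print(numpy.allclose(explicit, simplified))
```

<p>Because adding or subtracting 1 for a word shifts the smoothed and the raw counters by the same amount, the agreement holds throughout the iterations.</p>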
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2011/06/27/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=P(z_mn=z&#124;bfz-mn,bfw);prop;(n_mz-mn%2Balpha)cdotfracn_tz-mn%2Bbetan_z-mn%2BVbeta," medium="image" />
	</item>
		<item>
		<title>Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (2)</title>
		<link>http://shuyo.wordpress.com/2011/05/31/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-2/</link>
		<comments>http://shuyo.wordpress.com/2011/05/31/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-2/#comments</comments>
		<pubDate>Tue, 31 May 2011 09:41:45 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[LDA]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=192</guid>
		<description><![CDATA[Before iterations of LDA estimation, it is necessary to initialize parameters. Collapsed Gibbs Sampling (CGS) estimation has the following parameters. z_mn : topic of word n of document m n_mz : word count of document m with topic z n_tz &#8230; <a href="http://shuyo.wordpress.com/2011/05/31/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-2/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&amp;blog=4249505&amp;post=192&amp;subd=shuyo&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Before iterations of LDA estimation, it is necessary to initialize parameters.<br />
Collapsed Gibbs Sampling (CGS) estimation has the following parameters.</p>
<ul>
<li>z_mn : topic of word n of document m</li>
<li>n_mz : word count of document m with topic z</li>
<li>n_tz : count of word t with topic z</li>
<li>n_z : word count with topic z</li>
</ul>
<p>The simplest initialization is to assign each word to a random topic and increment the corresponding counters n_mz, n_tz and n_z.</p>
<pre>
import numpy

# docs : list of documents, each an array of word ids
# K : number of topics
# V : vocabulary size

z_m_n = [] # topics of words of documents
n_m_z = numpy.zeros((len(docs), K))     # word count of each document and topic
n_z_t = numpy.zeros((K, V)) # word count of each topic and vocabulary
n_z = numpy.zeros(K)        # word count of each topic

for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        z = numpy.random.randint(0, K)
        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))
</pre>
<p>The iterative inference using the full conditional from the previous article is then as follows. It is repeated until the perplexity stabilizes.</p>
<pre>
for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # discount for n-th word t with topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1

        # sampling new topic
        p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        # preserve the new topic and increase the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1
</pre>
<p>This code draws from the multinomial distribution by calling numpy.random.multinomial() with the number of experiments set to 1.<br />
Since this method returns an array in which one element, the k-th, is 1 and the rest are 0, argmax() retrieves the value of k. This is a little wasteful&#8230;</p>
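<p>One way to avoid building the one-hot array is to draw the topic index directly. The sketch below is only a possible alternative, not what lda.py does: it inverts the cumulative distribution with numpy.cumsum() and numpy.searchsorted(). The helper name draw_topic and the toy p_z are hypothetical.</p>

```python
import numpy

def draw_topic(p_z, rng=numpy.random):
    """Draw one index k with probability p_z[k] / p_z.sum() (p_z need not be normalized)."""
    cdf = numpy.cumsum(p_z)
    u = rng.uniform(0.0, cdf[-1])
    # first index whose cumulative weight exceeds u; min() guards against
    # floating-point edge cases at the upper end
    return min(int(numpy.searchsorted(cdf, u, side="right")), len(p_z) - 1)

p_z = numpy.array([0.2, 3.0, 0.8])   # toy unnormalized posterior
rng = numpy.random.RandomState(0)
draws = [draw_topic(p_z, rng) for _ in range(10000)]
# empirical frequencies, which should be close to p_z / p_z.sum()
print(numpy.bincount(draws, minlength=len(p_z)) / 10000.0)
```

<p>In the inference loop, new_z = draw_topic(p_z) would take the place of the multinomial() and argmax() pair, and normalizing p_z is no longer required.</p>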
<p>The next article will show a more efficient initialization.</p>
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2011/05/31/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (1)</title>
		<link>http://shuyo.wordpress.com/2011/05/24/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-1/</link>
		<comments>http://shuyo.wordpress.com/2011/05/24/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-1/#comments</comments>
		<pubDate>Tue, 24 May 2011 11:49:21 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[LDA]]></category>
		<category><![CDATA[Machine Learning]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=181</guid>
		<description><![CDATA[Latent Dirichlet Allocation (LDA) is a generative model which is used as a language topic model and so on. Each random variable means the following θ : document-topic distribution, φ : topic-word distribution, Z : word topic, W : word, &#8230; <a href="http://shuyo.wordpress.com/2011/05/24/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-1/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&amp;blog=4249505&amp;post=181&amp;subd=shuyo&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Latent Dirichlet Allocation (LDA) is a generative model that is used, among other things, as a topic model for language.</p>
<div id="attachment_182" class="wp-caption alignnone" style="width: 504px"><a href="http://en.wikipedia.org/wiki/File:Smoothed_LDA.png"><img src="http://shuyo.files.wordpress.com/2011/05/smoothed_lda.png?w=640" alt="Graphical model of LDA" title="Graphical model of LDA"   class="size-full wp-image-182" /></a><p class="wp-caption-text">Graphical model of LDA (this figure from Wikipedia)</p></div>
<p>The random variables have the following meanings:</p>
<ul>
<li>θ : document-topic distribution, <img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=\theta_m\sim\text{Dir}(\alpha)" alt="document-topic multinomial drawn from Dirichlet distribution" /></li>
<li>φ : topic-word distribution, <img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=\varphi_z\sim\text{Dir}(\beta)" alt="topic-word multinomial drawn from Dirichlet distribution" /></li>
<li>Z : word topic, <img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=z_{mn}\sim\text{Multi}(\theta_m)" alt="word topic drawn from multinomial" /></li>
<li>W : word, <img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=w_{mn}\sim\text{Multi}(\varphi_{z_{mn}})" alt="word drawn from multinomial" /></li>
</ul>
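<p>As an illustration, the generative process described by these variables can be sketched in a few lines of Python with numpy. The sizes, hyperparameters and the function name generate_corpus are assumptions made only for this example.</p>

```python
import numpy

def generate_corpus(M, N, K, V, alpha, beta, seed=0):
    """Sample M documents of N word ids each from the LDA generative process."""
    rng = numpy.random.RandomState(seed)
    # phi[z] : topic-word distribution, drawn from Dir(beta)
    phi = rng.dirichlet([beta] * V, size=K)
    docs = []
    for m in range(M):
        # theta_m : document-topic distribution, drawn from Dir(alpha)
        theta_m = rng.dirichlet([alpha] * K)
        # z_mn : topic of each word, drawn from Multi(theta_m)
        z_m = rng.choice(K, size=N, p=theta_m)
        # w_mn : each word, drawn from Multi(phi[z_mn])
        w_m = numpy.array([rng.choice(V, p=phi[z]) for z in z_m])
        docs.append(w_m)
    return docs

docs = generate_corpus(M=4, N=10, K=3, V=8, alpha=0.5, beta=0.5)
print(docs[0])
```

<p>Estimation goes in the opposite direction: only the words are observed, and the topics and distributions have to be inferred.</p>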
<p>There are several popular estimation methods for LDA, and collapsed Gibbs sampling (CGS) is one of them.<br />
This method integrates out the random variables other than the word topics {z_mn} and draws each z_mn from its posterior.<br />
The posterior of z_mn is the following:</p>
<p><img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=P(z_{mn}=z|\bf{z}^{-mn},\bf{w})\;\prop\;(n_{mz}^{-mn}%2B\alpha)\cdot\frac{n_{tz}^{-mn}%2B\beta}{n_{z}^{-mn}%2BV\beta}," alt="Collapsed Gibbs sampling of LDA" /></p>
<p>where n_mz is the number of words in document m with topic z, n_tz is the number of times word t has topic z, n_z is the total number of words with topic z, and -mn means &#8220;excluding z_mn.&#8221;<br />
The estimation iterates until the perplexity converges or for a fixed number of iterations.</p>
<p><img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=\text{perplexity}(\bf{w})=\exp\left\{-\frac1N\sum_{mn}\log(\sum_z\theta_{mz}\varphi_{zw_{mn}})\right\}," alt="Perplexity of LDA" /></p>
<p>where </p>
<p><img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=\varphi_{zt}=\frac{n_{tz}%2B\beta}{n_z%2BV\beta}" alt="Topic-word distributions" /><br />
<img src="http://chart.apis.google.com/chart?cht=tx&amp;chl=\theta_{mz}=\frac{n_{mz}%2B\alpha}{n_m%2BK\alpha}," alt="Document-topic distributions" /></p>
<p>and n_m is the word count of document m.<br />
Although perplexity usually decreases as learning progresses, my experiments showed somewhat different tendencies.</p>
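<p>For reference, the perplexity can be computed directly from the counters. This is only a sketch under my own assumptions: the function name perplexity and the toy data are hypothetical, n_z_t is indexed as [topic, word], and the counters hold raw counts without alpha or beta added.</p>

```python
import numpy

def perplexity(docs, n_m_z, n_z_t, n_z, alpha, beta):
    """exp of minus the average log-likelihood per word, from raw count arrays."""
    K, V = n_z_t.shape
    # phi[z, t] = (n_tz + beta) / (n_z + V * beta)
    phi = (n_z_t + beta) / (n_z[:, numpy.newaxis] + V * beta)
    log_likelihood = 0.0
    N = 0
    for m, doc in enumerate(docs):
        # theta[m, z] = (n_mz + alpha) / (n_m + K * alpha), with n_m the sum of n_mz over z
        theta_m = (n_m_z[m] + alpha) / (n_m_z[m].sum() + K * alpha)
        for t in doc:
            log_likelihood += numpy.log(numpy.dot(theta_m, phi[:, t]))
        N += len(doc)
    return numpy.exp(-log_likelihood / N)

# toy data: two documents over V = 4 words, every word assigned to topic 0
docs = [[0, 1, 2], [2, 3]]
n_m_z = numpy.array([[3.0, 0.0], [2.0, 0.0]])
n_z_t = numpy.array([[1.0, 1.0, 2.0, 1.0], [0.0, 0.0, 0.0, 0.0]])
n_z = n_z_t.sum(axis=1)
print(perplexity(docs, n_m_z, n_z_t, n_z, alpha=0.5, beta=0.5))
```

<p>A lower value means the model assigns higher probability to the observed words; a model that knows nothing and predicts every word uniformly has perplexity V.</p>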
<p>Continued on the next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2011/05/24/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>

		<media:content url="http://shuyo.files.wordpress.com/2011/05/smoothed_lda.png" medium="image">
			<media:title type="html">Graphical model of LDA</media:title>
		</media:content>

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=theta_msimtextDir(alpha)" medium="image">
			<media:title type="html">document-topic multinomial drawn from Dirichlet distribution</media:title>
		</media:content>

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=varphi_zsimtextDir(beta)" medium="image">
			<media:title type="html">topic-word multinomial drawn from Dirichlet distribution</media:title>
		</media:content>

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=z_mnsimtextMulti(theta_m)" medium="image">
			<media:title type="html">word topic drawn from multinomial</media:title>
		</media:content>

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=w_mnsimtextMulti(phi_z_mn)" medium="image">
			<media:title type="html">word drawn from multinomial</media:title>
		</media:content>

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=P(z_mn=z&#124;bfz-mn,bfw);prop;(n_mz-mn%2Balpha)cdotfracn_tz-mn%2Bbetan_z-mn%2BVbeta," medium="image">
			<media:title type="html">Collapsed Gibbs sampling of LDA</media:title>
		</media:content>

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=textperplexity(bfw)=expleftfrac1Nsum_mnlog(sum_ztheta_mzvarphi_zw_mn)right," medium="image">
			<media:title type="html">Perplexity of LDA</media:title>
		</media:content>

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=varphi_zt=fracn_tz%2Bbetan_z%2BVbeta" medium="image">
			<media:title type="html">Topic-word distributions</media:title>
		</media:content>

		<media:content url="http://chart.apis.google.com/chart?cht=tx&#38;chl=theta_mz=fracn_mz%2Balphan_m%2BKalpha," medium="image">
			<media:title type="html">Document-topic distributions</media:title>
		</media:content>
	</item>
		<item>
		<title>Latent Dirichlet Allocation in Python</title>
		<link>http://shuyo.wordpress.com/2011/05/18/latent-dirichlet-allocation-in-python/</link>
		<comments>http://shuyo.wordpress.com/2011/05/18/latent-dirichlet-allocation-in-python/#comments</comments>
		<pubDate>Wed, 18 May 2011 10:21:15 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[LDA]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[text analysis]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/?p=170</guid>
		<description><![CDATA[Latent Dirichlet Allocation (LDA) is a language topic model. In LDA, each document has a topic distribution and each topic has a word distribution. Words are generated from topic-word distribution with respect to the drawn topics in the document. However &#8230; <a href="http://shuyo.wordpress.com/2011/05/18/latent-dirichlet-allocation-in-python/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&amp;blog=4249505&amp;post=170&amp;subd=shuyo&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Latent Dirichlet Allocation (LDA) is a topic model for language.</p>
<p>In LDA, each document has a topic distribution and each topic has a word distribution.<br />
Each word is generated from the word distribution of the topic drawn for it in the document.</p>
<p>Although LDA was originally estimated with variational Bayes (Blei+ 2003), collapsed Gibbs sampling (CGS) is known to give more precise estimates.<br />
So I tried implementing the CGS estimation of LDA in Python.</p>
<ul>
<li> <a href="https://github.com/shuyo/iir/blob/master/lda/lda.py">https://github.com/shuyo/iir/blob/master/lda/lda.py</a></li>
<li> <a href="https://github.com/shuyo/iir/blob/master/lda/vocabulary.py">https://github.com/shuyo/iir/blob/master/lda/vocabulary.py</a></li>
</ul>
<p>It requires Python 2.6, numpy and NLTK.</p>
<p><code><br />
$ ./lda.py -h<br />
Usage: lda.py [options]</p>
<p>Options:<br />
  -h, --help     show this help message and exit<br />
  -f FILENAME    corpus filename<br />
  -c CORPUS      using range of Brown corpus' files(start:end)<br />
  --alpha=ALPHA  parameter alpha<br />
  --beta=BETA    parameter beta<br />
  -k K           number of topics<br />
  -i ITERATION   iteration count<br />
  -s             smart initialize of parameters<br />
  --stopwords    exclude stop words<br />
  --seed=SEED    random seed<br />
  --df=DF        threshold of document freaquency to cut words</p>
<p>$ ./lda.py -c 0:20 -k 10 --alpha=0.5 --beta=0.5 -i 50<br />
</code></p>
<p>This command prints the perplexity at every iteration and the estimated topic-word distributions (the 20 most probable words of each topic).<br />
I will explain the main points of this implementation in the next article.</p>
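<p>As a rough illustration of the second part of that output, the sketch below picks the most probable words of each topic from a topic-word count matrix. It is not taken from lda.py; the function name top_words, the array layout ([topic, word]) and the toy data are assumptions for this example.</p>

```python
import numpy

def top_words(n_z_t, n_z, vocabulary, top_n=20):
    """Return, for each topic, the top_n (word, probability) pairs."""
    # probabilities of words within each topic (counts are assumed to be already smoothed)
    phi = n_z_t / n_z[:, numpy.newaxis]
    result = []
    for z in range(phi.shape[0]):
        best = numpy.argsort(-phi[z])[:top_n]   # word ids in descending probability
        result.append([(vocabulary[t], phi[z, t]) for t in best])
    return result

vocabulary = ["the", "of", "topic", "model"]   # toy vocabulary
n_z_t = numpy.array([[5.0, 1.0, 3.0, 1.0], [1.0, 4.0, 1.0, 3.0]])
n_z = n_z_t.sum(axis=1)
for z, words in enumerate(top_words(n_z_t, n_z, vocabulary, top_n=2)):
    print(z, [w for w, p in words])
```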
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2011/05/18/latent-dirichlet-allocation-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
		<item>
		<title>Minimalist Program respects to Erlangen Program?</title>
		<link>http://shuyo.wordpress.com/2011/04/26/minimalist-program-respects-to-erlangen-program/</link>
		<comments>http://shuyo.wordpress.com/2011/04/26/minimalist-program-respects-to-erlangen-program/#comments</comments>
		<pubDate>Tue, 26 Apr 2011 03:36:44 +0000</pubDate>
		<dc:creator>shuyo</dc:creator>
				<category><![CDATA[Linguistics]]></category>

		<guid isPermaLink="false">http://shuyo.wordpress.com/2011/04/26/minimalist-program-respects-to-erlangen-program/</guid>
		<description><![CDATA[I&#8217;m learning Linguistics, mainly Syntax and generative grammars. Minimalist Program puts me in mind of Klein&#8217;s Erlangen Program. Had Chomsky taken it into account?<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=shuyo.wordpress.com&amp;blog=4249505&amp;post=166&amp;subd=shuyo&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m studying linguistics, mainly syntax and generative grammar.<br />
The Minimalist Program reminds me of Klein&#8217;s Erlangen Program.<br />
Did Chomsky take it into account?</p>
]]></content:encoded>
			<wfw:commentRss>http://shuyo.wordpress.com/2011/04/26/minimalist-program-respects-to-erlangen-program/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/1f4bcc4ca602d28da9fb29d79fffcae4?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">shuyo</media:title>
		</media:content>
	</item>
	</channel>
</rss>
